Hello everyone, I am PaperAgent, not an Agent!
Today I'm sharing two recently published papers on agent learning from Meta Superintelligence Labs:
- 2025.11 "Scaling Agent Learning via Experience Synthesis"
- 2025.10 "Agent Learning via Early Experience"
These two papers explore how to obtain high-quality experience at low cost, and together they form a complete technical chain: offline expert data → early-experience augmentation → synthetic-environment acceleration → sim-to-real fine-tuning, giving language agents a reproducible roadmap into the era of scalable RL.
1. Three Major Challenges of Agent RL
- Rollouts are too expensive: a single complete interaction in WebArena takes about 30 seconds, so one pass over all 812 tasks takes roughly 6.8 hours (quick check below); GRPO training typically consumes on the order of 80k transitions.
- Rewards are sparse or even missing: web scenarios lack ground-truth rewards, and a successfully submitted form does not guarantee that every field is correct.
- Task diversity is insufficient: manually writing 800 instructions is already the ceiling, which is too few to support curriculum training.
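A quick back-of-the-envelope check of the rollout-cost figure above, assuming sequential execution at about 30 s per task:

```latex
812 \ \text{tasks} \times 30 \ \text{s/task} = 24{,}360 \ \text{s} \approx 6.8 \ \text{h}
```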
2. Early Experience: Turning "Expert Demonstrations" into an "Ocean of Experience"
2.1 Core Idea
Instead of waiting for environment rewards, let the agent try alternative actions itself and use the future states those actions produce as the supervision signal.
2.2 Two Major Technical Approaches
| Approach | Input | Output | Goal |
|---|---|---|---|
| Implicit World Modeling (IWM) | (s, a′) | s′ | Learn to predict the next world state |
| Self-Reflection (SR) | (s, a_expert, a′, s′) | Natural-language reflection c | Learn why the expert action is better |
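One plausible way to write the two objectives in the table as language-modeling losses (a sketch; the paper's exact formulation may differ):

```latex
% IWM: predict the state s' reached after taking an alternative action a' in state s
\mathcal{L}_{\mathrm{IWM}}(\theta) = -\,\mathbb{E}_{(s,\,a',\,s')}\big[\log p_\theta(s' \mid s, a')\big]

% SR: generate a reflection c on why the expert action is preferable, then the expert action itself
\mathcal{L}_{\mathrm{SR}}(\theta) = -\,\mathbb{E}_{(s,\,c,\,a_{\mathrm{expert}})}\big[\log p_\theta(c, a_{\mathrm{expert}} \mid s)\big]
```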
Fig 1: From human data era → early experience era → full experience era
2.3 Data Flywheel
1. Sample states s_i from the expert trajectories D_expert.
2. Generate K alternative actions a_i^j with the initial policy π_θ.
3. Execute them in the real environment and collect (s_i, a_i^j, s_i^j) to form D_rollout.
4. Use D_rollout for IWM- or SR-augmented training.
5. The fine-tuned π_θ keeps producing more D_rollout → a positive loop (sketched in code below).
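A minimal Python sketch of this flywheel, assuming hypothetical `policy`, `env`, `train_iwm`, and `train_sr` helpers (none of these are the authors' code):

```python
def early_experience_flywheel(expert_trajs, policy, env, train_iwm, train_sr,
                              K=4, rounds=2):
    """Sketch of the Early Experience data flywheel; all helpers are hypothetical."""
    for _ in range(rounds):
        rollout_data = []  # D_rollout
        for traj in expert_trajs:              # D_expert: lists of (state, expert_action) pairs
            for state, expert_action in traj:
                # Propose K alternative actions from the current policy
                alt_actions = [policy.sample_action(state) for _ in range(K)]
                # Execute each alternative in the real environment, record the future state
                for a in alt_actions:
                    next_state = env.step_from(state, a)
                    rollout_data.append((state, a, next_state))
        # Augmented training: IWM predicts s', SR explains why the expert action is better
        policy = train_iwm(policy, rollout_data)
        policy = train_sr(policy, rollout_data, expert_trajs)
        # The fine-tuned policy produces the next round of D_rollout
    return policy
```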
2.4 Results Overview
Table 2: Results on 8 benchmarks
OOD results
Conclusion: with only 1/8 of the expert data, Early Experience matches full imitation-learning performance, and its OOD generalization gains are even larger.
3. DreamGym: Going a Step Further, Eliminating "Real Interaction"
3.1 Core Insight
Agent training does not need perfect simulation, only "sufficiently diverse, causally consistent, and explainable" experience.
Thus, the authors use an LLM to act as an Experience Model, directly "inferring" the next state and reward, forming an RL training ground with "zero real rollouts."
Fig 2: Experience Model alternately interacts with Agent, Replay Buffer updates continuously, Task Generator dynamically produces high-entropy tasks
3.2 Three Major Components
| Component | Role | Key Technique |
|---|---|---|
| Reasoning Experience Model | Given (s, a, task τ, interaction history, similar trajectories), produce (s′, r) plus a CoT explanation | Abstract text state space that filters out HTML noise |
| Experience Replay Buffer | Seeded offline, extended with newly generated experience; top-k similarity retrieval curbs hallucination | Co-evolves continuously with the policy |
| Curriculum Task Generator | Select high-entropy tasks with success rate ≈ 50% and generate variants of them | Maximize information gain |
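A minimal sketch of the first and third components, assuming a generic `llm(prompt) -> str` text-completion callable; the prompt wording, JSON schema, and `select_curriculum_tasks` helper are illustrative, not DreamGym's actual implementation:

```python
import json

def synthetic_step(llm, task, state, action, similar_trajs):
    """One experience-model step: reason about the action's effect, return (s', r, CoT)."""
    prompt = (
        "You simulate an abstract, text-based web environment.\n"
        f"Task: {task}\nCurrent state: {state}\nAgent action: {action}\n"
        f"Similar past trajectories: {similar_trajs}\n"
        "Reason step by step about the causal effect of this action, then output JSON with "
        'keys "reasoning", "next_state", and "reward" (0 or 1).'
    )
    out = json.loads(llm(prompt))            # assumes the model returns valid JSON
    return out["next_state"], int(out["reward"]), out["reasoning"]

def select_curriculum_tasks(task_stats, k=16):
    """Pick the k tasks whose success rate is closest to 0.5 (highest outcome entropy)."""
    return sorted(task_stats, key=lambda t: abs(t["success_rate"] - 0.5))[:k]
```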
3.3 Experimental Highlights
DreamGym results with different agent training algorithms
With zero real interaction, DreamGym matches or even surpasses traditional RL; adding 5k real rollouts (DreamGym-S2R) brings a further +8-10% absolute gain.
Case analysis
4. Technical Comparison: Early Experience vs DreamGym
| Dimension | Early Experience | DreamGym |
|---|---|---|
| Real environment interaction | ✅ Requires executing alternative actions | ❌ Fully synthetic |
| Reward signal | No reward needed; uses s′ as supervision | Self-generated reward r ∈ {0, 1} |
| Data efficiency | 10× compression of expert data | 2k-10k transitions are enough for training |
| Integration with RL | Provides a warm start, followed by GRPO | PPO/GRPO built in directly |
| Biggest bottleneck | Still needs real rollout collection | Relies on LLM reasoning; risk of hallucination |
5. Experience is Data, Inference is Environment
From Early Experience to DreamGym, both works point to the same core trend:
"Experience" is no longer an expensive, scarce commodity that must be collected; it is raw material that large models can synthesize on demand.
When experience can be generated without limit and rewards can be inferred on the fly, language agents truly enter the flywheel era of scalable RL. For industry, this means that "a small set of expert trajectories + large-model synthesis" will become the new standard paradigm, with the real environment reserved for a final ~5% of calibration: lightweight, low-cost, and scalable. The next wave of agent breakthroughs may well start here.