Hello everyone, I am PaperAgent, not an Agent!
Today I'm sharing two recently published papers on agent learning from Meta Superintelligence Labs:
- 2025.11 "Scaling Agent Learning via Experience Synthesis"
- 2025.10 "Agent Learning via Early Experience"
These two papers explore how to obtain high-quality experience at low cost, and together they form a complete technical chain: offline expert data → early-experience augmentation → synthetic-environment acceleration → sim-to-real fine-tuning, giving language agents a reproducible roadmap into the era of scalable RL.
1. Three Major Challenges of Agent RL
- Rollouts are too expensive: a single complete interaction in WebArena takes about 30 seconds, so one pass over all 812 tasks takes roughly 6.8 hours (quick check below); GRPO training typically consumes on the order of 80k transitions.
- Rewards are sparse or even missing: web scenarios lack ground-truth rewards, and a successfully submitted form does not guarantee that every field is correct.
- Task diversity is insufficient: manually writing 800 instructions is already the ceiling, which is too few to support curriculum training.
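A quick back-of-the-envelope check of the rollout-cost figure above, assuming sequential execution at about 30 s per task:

```latex
812 \ \text{tasks} \times 30 \ \text{s/task} = 24{,}360 \ \text{s} \approx 6.8 \ \text{h}
```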
2. Early Experience: Turning "Expert Demonstrations" into an "Ocean of Experience"
2.1 Core Idea
Instead of waiting for environment rewards, let the agent try alternative actions itself and use the future states those actions produce as the supervision signal.
2.2 Two Major Technical Approaches
| Approach | Input | Output | Goal |
|---|---|---|---|
| Implicit World Modeling (IWM) | (s, a′) | s′ | Learn to predict the next world state |
| Self-Reflection (SR) | (s, a_expert, a′, s′) | Natural-language reflection c | Learn why the expert action is better |
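One plausible way to write the two objectives in the table as language-modeling losses (a sketch; the paper's exact formulation may differ):

```latex
% IWM: predict the state s' reached after taking an alternative action a' in state s
\mathcal{L}_{\mathrm{IWM}}(\theta) = -\,\mathbb{E}_{(s,\,a',\,s')}\big[\log p_\theta(s' \mid s, a')\big]

% SR: generate a reflection c on why the expert action is preferable, then the expert action itself
\mathcal{L}_{\mathrm{SR}}(\theta) = -\,\mathbb{E}_{(s,\,c,\,a_{\mathrm{expert}})}\big[\log p_\theta(c, a_{\mathrm{expert}} \mid s)\big]
```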
Fig 1: From human data era → early experience era → full experience era
2.3 Data Flywheel
1. Sample states s_i from the expert trajectories D_expert.
2. Generate K alternative actions a_i^j with the initial policy π_θ.
3. Execute them in the real environment and collect (s_i, a_i^j, s_i^j) to form D_rollout.
4. Use D_rollout for IWM- or SR-augmented training.
5. The fine-tuned π_θ keeps producing more D_rollout → a positive loop (sketched in code below).
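A minimal Python sketch of this flywheel, assuming hypothetical `policy`, `env`, `train_iwm`, and `train_sr` helpers (none of these are the authors' code):

```python
def early_experience_flywheel(expert_trajs, policy, env, train_iwm, train_sr,
                              K=4, rounds=2):
    """Sketch of the Early Experience data flywheel; all helpers are hypothetical."""
    for _ in range(rounds):
        rollout_data = []  # D_rollout
        for traj in expert_trajs:              # D_expert: lists of (state, expert_action) pairs
            for state, expert_action in traj:
                # Propose K alternative actions from the current policy
                alt_actions = [policy.sample_action(state) for _ in range(K)]
                # Execute each alternative in the real environment, record the future state
                for a in alt_actions:
                    next_state = env.step_from(state, a)
                    rollout_data.append((state, a, next_state))
        # Augmented training: IWM predicts s', SR explains why the expert action is better
        policy = train_iwm(policy, rollout_data)
        policy = train_sr(policy, rollout_data, expert_trajs)
        # The fine-tuned policy produces the next round of D_rollout
    return policy
```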
2.4 Results Overview
Table 2: Results on 8 benchmarks
OOD results
Conclusion: with only 1/8 of the expert data, Early Experience matches full imitation-learning performance, and its OOD generalization gains are even larger.
3. DreamGym: Going a Step Further, Eliminating "Real Interaction"
3.1 Core Insight
Agent training does not need perfect simulation, only "sufficiently diverse, causally consistent, and explainable" experience.
Thus, the authors use an LLM to act as an Experience Model, directly "inferring" the next state and reward, forming an RL training ground with "zero real rollouts."
Fig 2: Experience Model alternately interacts with Agent, Replay Buffer updates continuously, Task Generator dynamically produces high-entropy tasks
3.2 Three Major Components
| Component | Role | Key Technique |
|---|---|---|
| Reasoning Experience Model | Given (s, a, task τ, interaction history, similar trajectories), produce (s′, r) plus a CoT explanation | Abstract text state space that filters out HTML noise |
| Experience Replay Buffer | Seeded offline, extended with newly generated experience; top-k similarity retrieval curbs hallucination | Co-evolves continuously with the policy |
| Curriculum Task Generator | Select high-entropy tasks with success rate ≈ 50% and generate variants of them | Maximize information gain |
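A minimal sketch of the first and third components, assuming a generic `llm(prompt) -> str` text-completion callable; the prompt wording, JSON schema, and `select_curriculum_tasks` helper are illustrative, not DreamGym's actual implementation:

```python
import json

def synthetic_step(llm, task, state, action, similar_trajs):
    """One experience-model step: reason about the action's effect, return (s', r, CoT)."""
    prompt = (
        "You simulate an abstract, text-based web environment.\n"
        f"Task: {task}\nCurrent state: {state}\nAgent action: {action}\n"
        f"Similar past trajectories: {similar_trajs}\n"
        "Reason step by step about the causal effect of this action, then output JSON with "
        'keys "reasoning", "next_state", and "reward" (0 or 1).'
    )
    out = json.loads(llm(prompt))            # assumes the model returns valid JSON
    return out["next_state"], int(out["reward"]), out["reasoning"]

def select_curriculum_tasks(task_stats, k=16):
    """Pick the k tasks whose success rate is closest to 0.5 (highest outcome entropy)."""
    return sorted(task_stats, key=lambda t: abs(t["success_rate"] - 0.5))[:k]
```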
3.3 Experimental Highlights
DreamGym results with different agent training algorithms
With zero real interaction, DreamGym matches or even surpasses traditional RL; adding 5k real rollouts (DreamGym-S2R) brings a further +8-10% absolute gain.
Case analysis
4. Technical Comparison: Early Experience vs DreamGym
| Dimension | Early Experience | DreamGym |
|---|---|---|
| Real environment interaction | ✅ Requires executing alternative actions | ❌ Fully synthetic |
| Reward signal | No reward needed; uses s′ as supervision | Self-generated reward r ∈ {0, 1} |
| Data efficiency | 10× compression of expert data | 2k-10k transitions are enough for training |
| Integration with RL | Provides a warm start, followed by GRPO | PPO/GRPO built in directly |
| Biggest bottleneck | Still needs real rollout collection | Relies on LLM reasoning; risk of hallucination |
5. Experience is Data, Inference is Environment
From Early Experience to DreamGym, both works point to the same core trend:
"Experience" is no longer an expensive, scarce commodity that must be collected; it is raw material that large models can synthesize on demand.
When experience can be generated without limit and rewards can be inferred on the fly, language agents truly enter the flywheel era of scalable RL. For industry, this means that "a small set of expert trajectories + large-model synthesis" will become the new standard paradigm, with the real environment reserved for a final ~5% of calibration: lightweight, low-cost, and scalable. The next wave of agent breakthroughs may well start here.