Meta's Two Latest Agent Learning Papers Are Quite Interesting!

Hello everyone, I am PaperAgent, not an Agent!

Today, I'm sharing two recently published papers on agent learning from Meta Superintelligence Labs:

  • 2025.11 "Scaling Agent Learning via Experience Synthesis"

  • 2025.10 "Agent Learning via Early Experience"


These two papers explore how to obtain high-quality experience at low cost, and together they form a complete technical chain: offline expert data → early-experience augmentation → synthetic-environment acceleration → sim-to-real fine-tuning, providing a reproducible roadmap for language agents to enter the era of scalable RL.

1. Three Major Challenges of Agent RL

  1. Rollouts are too expensive: a complete interaction in WebArena takes about 30 seconds, so running all 812 tasks once takes roughly 6.8 hours (812 × 30 s ≈ 24,400 s), and GRPO training often consumes on the order of 80k transitions.

  2. Rewards are sparse or even missing: web scenarios lack ground-truth rewards, and a successful form submission does not guarantee that every field is correct.

  3. Insufficient task diversity: manually writing 800 instructions already hits a ceiling, making it hard to support curriculum training.


2. Early Experience: Turning "Expert Demonstrations" into an "Ocean of Experience"

2.1 Core Idea

Instead of waiting for environmental rewards, let the agent "take a shot" – use the generated future states as supervisory signals.

2.2 Two Major Technical Approaches

| Approach | Input | Output | Goal |
| --- | --- | --- | --- |
| Implicit World Modeling (IWM) | (s, a′) | s′ | Learn to predict the next world state |
| Self-Reflection (SR) | (s, a_expert, a′, s′) | Natural-language reflection c | Learn why the expert action is better |

Fig. 1: Comparison of three paradigms, from the human-data era → the early-experience era → the full-experience era

2.3 Data Flywheel


  1. Sample a state s_i from the expert trajectories D_expert.

  2. Generate K alternative actions a_i^j with the current policy π_θ.

  3. Execute them in the real environment and collect (s_i, a_i^j, s_i^j) to form D_rollout.

  4. Use D_rollout for IWM- or SR-augmented training.

  5. The fine-tuned π_θ keeps producing more D_rollout → a positive feedback loop (see the sketch below).

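A minimal sketch of one flywheel round, assuming hypothetical interfaces: policy.propose(state, k) returns K alternative actions, env.step(state, action) returns the resulting state, and finetune is a callback for the IWM/SR-augmented training. None of these names come from the paper's code.

```python
# One round of the data flywheel (sketch only; all interfaces are assumptions).

def flywheel_round(policy, env, expert_states, finetune, k=4, mode="IWM"):
    d_rollout = []
    for s_i in expert_states:                    # 1. states sampled from D_expert
        for a_ij in policy.propose(s_i, k):      # 2. K alternative actions from pi_theta
            s_ij = env.step(s_i, a_ij)           # 3. execute in the real environment
            d_rollout.append((s_i, a_ij, s_ij))  #    collect (s_i, a_i^j, s_i^j)
    policy = finetune(policy, d_rollout, mode)   # 4. IWM- or SR-augmented training
    return policy, d_rollout                     # 5. the updated pi_theta feeds the next round
```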

2.4 Results Overview

Table 2: Results on 8 benchmarks

Out-of-distribution (OOD) results

Conclusion: only 1/8 of the expert data is needed to match full imitation-learning (IL) performance, and the OOD generalization gains are even larger.


3. DreamGym: Going a Step Further, Eliminating "Real Interaction"

3.1 Core Insight

Agent training does not need perfect simulation, only "sufficiently diverse, causally consistent, and explainable" experience.

Thus, the authors use an LLM to act as an Experience Model, directly "inferring" the next state and reward, forming an RL training ground with "zero real rollouts."

Fig. 2: DreamGym framework. The Experience Model alternately interacts with the agent, the replay buffer is updated continuously, and the Task Generator dynamically produces high-entropy tasks.

3.2 Three Major Components

| Component | Role | Key Technique |
| --- | --- | --- |
| Reasoning Experience Model | Given (s, a, τ, history, similar trajectories) → (s′, r) + CoT explanation | Abstract text state space; filters out HTML noise |
| Experience Replay Buffer | Offline seed data plus newly generated online data; top-k similarity retrieval to curb hallucination | Co-evolves continuously with the policy |
| Curriculum Task Generator | Selects high-entropy tasks with success rate ≈ 50% and generates variants | Maximizes information gain |
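To show how these pieces could fit together, here is a minimal sketch of one synthetic transition plus entropy-based task selection. The llm callable, the buffer interface, and the prompt/JSON format are assumptions for illustration, not DreamGym's actual implementation.

```python
import json

# Assumed interfaces: llm(prompt) returns a JSON string, and buffer exposes
# topk_similar(...) and add(...). Everything here is a sketch, not the paper's code.

def synthetic_step(llm, buffer, task, state, action, history):
    """One zero-real-rollout transition: the LLM reasons out (s', r) with a CoT."""
    similar = buffer.topk_similar(state, action, k=3)  # retrieval grounds the prediction
    prompt = (
        f"Task: {task}\nHistory: {history}\n"
        f"Similar past transitions: {similar}\n"
        f"Current state: {state}\nAgent action: {action}\n\n"
        'Reason step by step, then answer as JSON: '
        '{"cot": "...", "next_state": "...", "reward": 0 or 1}'
    )
    out = json.loads(llm(prompt))
    buffer.add((state, action, out["next_state"], out["reward"]))  # buffer co-evolves
    return out["next_state"], out["reward"], out["cot"]


def select_curriculum_tasks(task_stats, n=8):
    """Pick the tasks whose recent success rate is closest to 50% (highest entropy)."""
    return sorted(task_stats, key=lambda t: abs(t["success_rate"] - 0.5))[:n]
```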

3.3 Experimental Highlights

DreamGym results with different agent training algorithms

With zero real interaction, DreamGym matches or even surpasses traditional RL; adding about 5k real rollouts (DreamGym-S2R) brings a further +8-10% absolute gain. A sketch of this sim-to-real schedule follows.
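Below is a rough sketch of such a sim-to-real schedule (synthetic-first RL, then a small real-rollout budget for calibration), under assumed environment and update interfaces; the names do not correspond to the paper's code.

```python
# Sketch of a sim-to-real schedule in the spirit of DreamGym-S2R.
# synthetic_env / real_env are assumed to expose collect(policy) -> list of
# transitions, and rl_update stands in for a PPO/GRPO step (all hypothetical).

def train_s2r(policy, synthetic_env, real_env, rl_update,
              synthetic_updates=10_000, real_rollout_budget=5_000):
    # Phase 1: RL entirely against the LLM experience model (zero real cost)
    for _ in range(synthetic_updates):
        batch = synthetic_env.collect(policy)
        policy = rl_update(policy, batch)
    # Phase 2: a small budget of real rollouts for final calibration
    spent = 0
    while spent < real_rollout_budget:
        batch = real_env.collect(policy)
        policy = rl_update(policy, batch)
        spent += len(batch)
    return policy
```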

Case analysis

4. Technical Comparison: Early Experience vs DreamGym

| Dimension | Early Experience | DreamGym |
| --- | --- | --- |
| Real environment interaction | ✅ Requires executing alternative actions | ❌ Fully synthetic |
| Reward signal | No reward needed; uses s′ as supervision | Self-generated reward r ∈ {0, 1} |
| Data efficiency | 10× compression of expert data | 2k-10k transitions suffice for training |
| Integration with RL | Provides a warm start, followed by GRPO | PPO/GRPO built in directly |
| Biggest bottleneck | Still requires collecting real rollouts | Relies on LLM reasoning; risk of hallucination |

5. Experience is Data, Inference is Environment

From Early Experience to DreamGym, both works point to a single core trend:

"Experience" is no longer an expensive and scarce commodity to be collected, but raw data that can be synthesized on demand by large models.

When "experience" can be infinitely generated and "rewards" can be instantly inferred, language agents truly enter the "scalable RL" flywheel era. For industry, this means "small sample expert trajectories + large model synthesis" will become the new standard paradigm, and the "real environment" will only be used for the final 5% calibration – lightweight, low-cost, scalable. The next agent explosion might just start here.

Agent Learning via Early Experience: https://arxiv.org/pdf/2510.08558

Scaling Agent Learning via Experience Synthesis: https://arxiv.org/pdf/2511.03773

Main Tag: Agent Learning

Sub Tags: Reinforcement Learning, Simulation-to-Real, Experience Synthesis, Large Language Models

