RL Remembers More Firmly, SFT Forgets More Easily? Princeton's Danqi Chen Team Rethinks the Conventional Wisdom on Post-Training


As model scale continues to grow, post-training has become a key stage shaping a model's final performance. It brings the model closer to human preferences, but it also carries a stubborn side effect: forgetting. The model becomes more natural in interaction, yet its performance on reasoning and knowledge tasks often declines.

Researchers call this phenomenon the alignment tax: the more thorough the alignment, the more fragile the memory. Among post-training methods, supervised fine-tuning (SFT) and reinforcement learning (RL) are the two most common paths. SFT relies on high-quality labeled data and is stable and reliable; RL optimizes the generation policy through rewards and is more adaptive.

Theoretical intuition suggests that SFT should be the more robust of the two, while RL's objective is more aggressive and seemingly more prone to forgetting. Recent empirical results, however, show the opposite: RL retains more of the model's original capabilities after extended training.

This phenomenon caught the interest of Princeton's Danqi Chen team. They posed a core question:

"When RL and SFT are trained under the same conditions, what causes the systematic difference in their 'memory retention'?"

To answer this, the research team designed rigorous controlled experiments and built a theoretical model to analyze the root cause of forgetting. They ultimately found that the issue stems not from the form of the algorithm, but from the mismatch between the training data distribution and the model's own behavior.

This study not only compares the differences between the two post-training paradigms but also reveals the mechanism behind memory retention. The following sections will explain from theoretical and empirical perspectives why RL can "learn longer and remember more firmly".


Paper Title: Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting

Paper Link: https://arxiv.org/pdf/2510.18874

Research Background

In the development of language models, alignment has long been a standard step. A model learns language structure from massive unsupervised corpora, but to truly follow human intent it must undergo post-training, using SFT or RLHF to bring its outputs in line with human expectations.

However, alignment comes with an equally significant side effect: catastrophic forgetting. The model performs better on new tasks but suffers performance drops on old ones.

To systematically study this phenomenon, Princeton's Danqi Chen team selected the two most representative post-training methods—SFT and RL—and conducted controlled training on Llama-3 and Qwen-2.5 series models with the same compute and data budget, covering three typical tasks: instruction following, general reasoning, and arithmetic reasoning.

The goal of this study is not to judge which method is stronger, but to explore the deeper mechanism:

When a model learns a new objective, why does its old knowledge slip away? And what allows certain methods to retain that knowledge while learning?

Driven by this question, the paper builds a complete chain of analysis from theory to experiment, gradually revealing that memory retention is determined not by the algorithm itself, but by the data distribution.


From Two KLs to the Key Mechanism of "Memory Retention"

In the post-training stage of large language models (LLMs), two mainstream methods are commonly used: SFT (supervised fine-tuning) and RL (reinforcement learning). On the surface, they differ only in optimization objectives; but in the authors' view, the core difference between these methods lies in how they handle the model's "memory".

2.1 Through the Lens of KL: Two Completely Different Learning Directions

The relationship between SFT and RL can be unified under the same mathematical framework. The former minimizes forward KL divergence, meaning the model must "cover" the entire target distribution; the latter minimizes reverse KL divergence, tending to "select" the most probable part of the target distribution.
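Written out in illustrative notation (pi_theta for the model being trained, p* for the target distribution; these symbols are ours, not necessarily the paper's), the two objectives differ only in which distribution the expectation is taken under:

```latex
% Sketch of the two objectives (illustrative notation, not the paper's exact formulation).
% SFT: forward KL -- the expectation runs over the *data* distribution (off-policy).
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = \mathrm{KL}\!\left(p^{*} \,\Vert\, \pi_\theta\right)
  = \mathbb{E}_{x \sim p^{*}}\!\left[\log \tfrac{p^{*}(x)}{\pi_\theta(x)}\right]
  \quad \text{(mode-covering)}

% RL: reverse KL -- the expectation runs over the *model's own* samples (on-policy).
\mathcal{L}_{\mathrm{RL}}(\theta)
  = \mathrm{KL}\!\left(\pi_\theta \,\Vert\, p^{*}\right)
  = \mathbb{E}_{x \sim \pi_\theta}\!\left[\log \tfrac{\pi_\theta(x)}{p^{*}(x)}\right]
  \quad \text{(mode-seeking)}
```

Note where the samples come from in each case: the forward KL is estimated from a fixed dataset, while the reverse KL is estimated from the model's own generations. This distinction returns later as the paper's central explanation.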


▲ Figure 1. Core differences between Forward KL and Reverse KL

The former is like "trying to cover all peaks," while the latter focuses on "climbing the highest peak," vividly described as "mode-covering" vs. "mode-seeking."

According to past intuition, RL's reverse KL would "discard old modes" and thus seem more prone to forgetting. Yet when the researchers ran experiments on real LLM distributions, they observed exactly the opposite.

2.2 Toy Model Derivation: Why RL "Remembers Better" in Reality

To understand this reversal, the research team designed a minimalist mixture distribution experiment, modeling the "old task" and "new task" as two probability peaks respectively:

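(The mixture itself appears only as an image in the original; a plausible reconstruction, treating the target as a two-component Gaussian mixture with illustrative symbols alpha, mu_old, mu_new, and sigma rather than the paper's exact notation, is:)

```latex
% Illustrative two-peak target: one "old-task" mode and one "new-task" mode.
p^{*}(x) \;=\; \alpha\,\mathcal{N}\!\left(x;\ \mu_{\text{old}},\ \sigma^{2}\right)
         \;+\; (1-\alpha)\,\mathcal{N}\!\left(x;\ \mu_{\text{new}},\ \sigma^{2}\right)
```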

The training goal is to make the model distribution retain the mass of the old peak as much as possible while learning the new task. Researchers measure this "memory retention" by defining the overlap area:

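(The overlap metric likewise appears only as an image; one standard definition consistent with the description, writing p_old for the old-task peak, is the shared area under the two densities:)

```latex
% Overlap between the trained model and the old-task distribution:
% 1 means the old peak is fully retained, 0 means it is completely forgotten.
\mathrm{Overlap}\!\left(\pi_\theta,\ p_{\text{old}}\right)
  \;=\; \int \min\!\big(\pi_\theta(x),\ p_{\text{old}}(x)\big)\,\mathrm{d}x
```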


▲ Figure 2. Single-peak distribution: SFT slightly superior

In simple tasks, SFT's forward KL can indeed boost the new peak while maintaining the old peak.


▲ Figure 3. Multi-peak distribution: RL surpasses

When tasks are complex and outputs diverse, SFT's forward KL must pull probability mass away in order to "cover" the new target, causing the old peak to decay significantly; RL's reverse KL, by contrast, shifts a new peak toward the target while leaving the old peak largely untouched.

This suggests that what truly makes the model forget old tasks is not the direction of the KL divergence, but whether the training data matches the model's current distribution. SFT trains on static offline data (off-policy), always facing the past; RL samples from the model's current policy (on-policy), always facing the present.
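The difference is easiest to see in how each gradient is estimated. The minimal numpy sketch below uses a toy categorical model (the logits and p_star values are made up for illustration, and this is not the paper's code); it computes Monte-Carlo gradients for both objectives and marks where the samples come from in each case:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy categorical "policy" over 6 outputs and a fixed target distribution p*.
theta  = np.array([1.5, 1.5, 0.0, 0.0, -1.0, -1.0])    # current model logits
p_star = np.array([0.05, 0.05, 0.1, 0.1, 0.35, 0.35])  # target favors the "new" outputs

def grad_forward_kl(theta, n=4096):
    """Monte-Carlo gradient of KL(p* || pi_theta): data is drawn from p* (off-policy)."""
    pi = softmax(theta)
    xs = rng.choice(len(p_star), size=n, p=p_star)      # <-- samples come from the *dataset*
    onehots = np.eye(len(p_star))[xs]
    return (pi - onehots).mean(axis=0)                  # d(-log pi[x]) / d theta, averaged

def grad_reverse_kl(theta, n=4096):
    """Monte-Carlo gradient of KL(pi_theta || p*): data is drawn from pi_theta (on-policy)."""
    pi = softmax(theta)
    xs = rng.choice(len(pi), size=n, p=pi)              # <-- samples come from the *model*
    onehots = np.eye(len(pi))[xs]
    score = onehots - pi                                # d log pi[x] / d theta
    weight = np.log(pi[xs]) - np.log(p_star[xs]) + 1.0  # log-ratio term of the reverse KL
    return (weight[:, None] * score).mean(axis=0)

print("forward-KL grad (off-policy data):", np.round(grad_forward_kl(theta), 3))
print("reverse-KL grad (on-policy data) :", np.round(grad_reverse_kl(theta), 3))
```

The point of the sketch is only the data source: the forward-KL estimator consumes samples from a fixed dataset, while the reverse-KL estimator consumes samples drawn from the current model. It does not attempt to reproduce the paper's retention results.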

The authors thus arrive at the core insight: forgetting is not an algorithm problem, but a distribution-mismatch problem.

2.3 Ablation Analysis: The Key is On-Policy, Not Regularization

To verify this further, the authors systematically ablated components of the RL objective, removing the KL regularization term and replacing GRPO's advantage estimation with plain REINFORCE, and found that the model's resistance to forgetting remained almost unchanged.


▲ Figure 4. RL maintains low forgetting even without KL regularization

The figure above compares GRPO performance at β = 0 (no regularization) vs. β = 0.05 (with regularization). Except for slight differences in Llama series on IFEval, the two are almost identical in gain-drop balance, indicating KL regularization is not the key factor.
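For reference, a simplified and commonly cited form of the GRPO objective makes the role of beta explicit (this is the standard formulation with the PPO-style clipping omitted for brevity, not necessarily the exact variant used in the paper). For each prompt q, a group of G responses is sampled and their rewards are standardized within the group:

```latex
% Simplified GRPO-style objective; beta scales the KL penalty toward the reference policy.
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)},
\qquad
\mathcal{J}(\theta)
  = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}
      \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\,\hat{A}_i\right]
  \;-\; \beta\,\mathrm{KL}\!\left(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\right)
```

Setting beta = 0 deletes the regularization term entirely, yet the gain-drop profile in the figure barely changes, which is exactly the point of the ablation.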

In other words, regardless of whether KL regularization is added, as long as the training data comes from the on-policy distribution, the model can stably retain old knowledge. Subsequent experiments further show that this stability does not depend on specific algorithmic components, but stems mainly from the on-policy sampling mechanism itself.

This finding directly overturns the previously mainstream view that "reverse KL causes forgetting".


Experimental Results

The intuitions from the toy model are backed by large-scale empirical evidence. The authors compared SFT, Self-SFT, REINFORCE, and GRPO on the Llama-3 and Qwen-2.5 model series across three typical tasks: IFEval (instruction following), MMLU (general knowledge), and Countdown (arithmetic reasoning).

For each task, they recorded the gain on the target task and the drop on non-target tasks.
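Concretely (a natural reading of the two metrics; the paper's exact definitions may differ in detail), with Acc denoting benchmark accuracy before and after post-training:

```latex
\mathrm{Gain} = \mathrm{Acc}_{\text{after}}(\text{target}) - \mathrm{Acc}_{\text{before}}(\text{target}),
\qquad
\mathrm{Drop} = \mathrm{Acc}_{\text{after}}(\text{non-target}) - \mathrm{Acc}_{\text{before}}(\text{non-target})
```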


▲ Figure 5. RL performs more stably on most tasks

Solid bars indicate target task Gain, diagonal shaded bars indicate non-target task Drop. On most models and datasets, RL (GRPO) shows smaller drops on non-target tasks while improving the target task.

In other words, RL not only "learns new things" but also "remembers old things." In contrast, SFT often pays a higher forgetting cost for high gains.

3.1 Learning Rate's "Memory Cost"

The researchers also observed a phenomenon of considerable practical relevance: in SFT training, the learning rate (LR) and forgetting trade off against each other in a typical seesaw pattern.


▲ Figure 6. Higher SFT learning rate, heavier forgetting

High LR rapidly improves IFEval metrics but causes significant drops in MMLU and Countdown; lowering LR alleviates forgetting but stalls the target task. This further confirms the toy model conclusion: SFT's problem is not poor LR selection, but updating on "outdated data."

3.2 Quantitative Results: RL's Forgetting Nearly Zero

Table 1 lists quantitative results for the three tasks: SFT typically shows clear performance drops (Drop of roughly -3 to -7 points), while REINFORCE and GRPO show drops close to 0, and even slight positive changes on some tasks.


▲ Table 1. Performance comparison of different methods on three tasks

RL exhibits stable "no-forgetting" characteristics across all tasks, while SFT shows clear degradation.

3.3 Making SFT "Learn Like RL"

The paper finally explores a practical question: since RL's stability comes from on-policy data, can SFT simulate this "dynamic update" mechanism?

The authors propose two schemes: Iterative-SFT (regenerate training samples with the current model each epoch) and RL-to-SFT (sample with RL first, then SFT on those data).
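As a schematic sketch of the two schemes (the helpers generate, is_correct, and sft_update are hypothetical placeholders standing in for a real LLM pipeline, not functions from the paper or any specific library):

```python
from typing import Dict, List

# Schematic sketch only: generate, is_correct, and sft_update are hypothetical
# placeholders for a real pipeline (LLM sampling, a task verifier, and a supervised
# fine-tuning step); they are not functions from the paper or any library.

def generate(model: Dict, prompts: List[str], n_per_prompt: int) -> List[str]:
    """Placeholder: sample n_per_prompt completions per prompt from `model`."""
    return [f"<completion of '{p}'>" for p in prompts for _ in range(n_per_prompt)]

def is_correct(sample: str) -> bool:
    """Placeholder: task-specific reward / verifier check."""
    return True

def sft_update(model: Dict, data: List[str]) -> Dict:
    """Placeholder: one supervised fine-tuning pass on `data`."""
    return {**model, "sft_passes": model.get("sft_passes", 0) + 1}

def iterative_sft(model: Dict, prompts: List[str], epochs: int = 3) -> Dict:
    # Iterative-SFT: the training set is regenerated from the *current* model every
    # epoch, so the data stays close to the model's own behavior (near on-policy).
    for _ in range(epochs):
        samples = generate(model, prompts, n_per_prompt=4)
        accepted = [s for s in samples if is_correct(s)]
        model = sft_update(model, accepted)
    return model

def rl_to_sft(model: Dict, rl_model: Dict, prompts: List[str]) -> Dict:
    # RL-to-SFT: collect samples once from an RL-trained model, then run ordinary
    # SFT on that fixed dataset; the data no longer tracks the model being trained.
    dataset = [s for s in generate(rl_model, prompts, n_per_prompt=4) if is_correct(s)]
    return sft_update(model, dataset)

if __name__ == "__main__":
    prompts = ["example instruction"]
    print(iterative_sft({"name": "base"}, prompts))
    print(rl_to_sft({"name": "base"}, {"name": "rl-trained"}, prompts))
```

The design point is the line where training data is produced: Iterative-SFT regenerates it from the current model every epoch, keeping the data close to on-policy, whereas RL-to-SFT fixes the data once, so it gradually drifts off-policy as SFT proceeds.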


▲ Figure 7. Iterative-SFT successfully replicates RL's forgetting resistance

The figure compares three SFT variants on Qwen 2.5 1.5B and 7B models for IFEval and MMLU: Iterative-SFT, Self-SFT, and traditional SFT.

Results show that Iterative-SFT matches RL (GRPO) on the target tasks while substantially reducing drops on non-target tasks, demonstrating that approximately on-policy data replicates RL's resistance to forgetting.


Summary: The Essence of Forgetting is Distribution Mismatch

This study shows that a language model's "memory" is determined not by algorithmic complexity, but by how the model learns. When it keeps training on data it generates itself, it naturally maintains continuity of capability; when training data and model behavior become disconnected, forgetting quietly sets in.

This offers a new perspective on post-training: alignment does not have to come at a cost; the key is to let the model learn through understanding and consolidate through its own actions. The work reminds us that the advantage of reinforcement learning may lie not in the reward signal, but in providing a learning rhythm closer to the model's own behavior.

For future large-model training, this may imply a simple yet profound insight: stable memory does not come from freezing parameters, but from whether the model truly "participates in its own learning process."

Main Tag: LLM Post-Training

Sub Tags: SFT vs RL, Memory Retention, On-Policy Data, Catastrophic Forgetting

