Recently, countless papers have found:
• RL has an upper limit
• SFT is essentially a special form of RL
• The effect of RL can be achieved by directly adjusting and optimizing output probability distributions
Today I came across another paper that characterizes, from a theoretical perspective, how RL and SFT differ in the parameter updates they make to LLMs. Recording my notes here.
The Path Not Taken: RLVR Provably Learns Off the Principals https://arxiv.org/pdf/2511.08567
"RLVR isn't learning new knowledge; it's learning how to use knowledge for reasoning."
Mathematical Background Supplement
In Transformer models, each layer has multiple weight matrices (e.g., Q, K, V, O matrices). You can view a d x d weight matrix W as a function that converts a d-dimensional input vector into a d-dimensional output vector.
Any matrix W can be decomposed into the product of three matrices: W = U * Σ * V^T.
• U and V are "direction" matrices; their column vectors define a set of orthogonal "input directions" and "output directions."
• Σ is a diagonal matrix, with values on the diagonal called singular values. These values are non-negative and arranged from largest to smallest.
What are principal directions: The size of a singular value indicates how important the corresponding direction is. The directions in U and V corresponding to the largest singular values are the principal directions: they are the directions along which the weight matrix stretches its input the most and carries the most information, and they store the model's core functions.
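As a concrete illustration (my own sketch, not code from the paper; the matrix, dimensions, and rank k are placeholders), here is how you could compute the SVD of a weight matrix in PyTorch and check how much of its "energy" sits in the top-k principal directions:

```python
import torch

d = 512
W = torch.randn(d, d) / d ** 0.5     # stand-in for one pretrained weight matrix

# W = U @ diag(S) @ Vh, with singular values S sorted from largest to smallest
U, S, Vh = torch.linalg.svd(W)

k = 32                               # illustrative choice of "principal" rank
U_k, V_k = U[:, :k], Vh[:k, :].T     # top-k output / input principal directions

# Share of the squared Frobenius norm carried by the top-k directions
top_k_energy = ((S[:k] ** 2).sum() / (S ** 2).sum()).item()
print(f"top-{k} directions carry {top_k_energy:.1%} of the spectral energy")
```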
What problem does the paper solve? Why is it important? What value does it bring?
Problem Solved: The paper aims to address a phenomenon called the "RLVR paradox." RLVR (Reinforcement Learning with Verifiable Rewards) is a computationally intensive training method that significantly improves large model reasoning abilities. However, its modifications to model parameters are surprisingly "sparse," changing only a very small portion of the weights.
In contrast, simpler supervised fine-tuning (SFT) leads to "dense," global parameter updates. The paper's core question is:
Why does this high-cost, high-reward training process result in such small and patterned underlying parameter changes? What is the mechanism behind this sparsity?
Why Important:
• Understanding Core Technology: RLVR is key to driving the most advanced reasoning models (e.g., DeepSeek-R1). Not understanding how it works is like driving a high-performance sports car without knowing how the engine works: targeted optimization and improvement become impossible.
• Guiding Future Research: Many efficient algorithms for RL fine-tuning (e.g., LoRA and other PEFT methods) are borrowed directly from the SFT era. Without understanding the fundamental differences in parameter update mechanisms between RL and SFT, we may be using tools designed for hammering to turn screws, leading to inefficiency or instability.
Value Brought:
• Provides "White-Box" Explanation: The paper reveals RLVR's training dynamics at the parameter level for the first time, making a "black-box" process transparent.
• Designs New Algorithms: By understanding RLVR's intrinsic preferences, it inspires researchers to design more RL-suited, "geometry-aware" parameter-efficient fine-tuning (PEFT) methods, achieving better results with fewer computational resources.
• Improves Model Training Efficiency and Stability: Explains why some SFT-era methods fail or cause training crashes in RL, providing valuable practical guidance for future RL training.
Has this problem been solved before? What are the shortcomings of previous work and how does this paper differ?
This problem has not been systematically solved before. Previous research and shortcomings:
Phenomenon Observers: Prior studies (e.g., Mukherjee et al., 2025) observed the RLVR update sparsity but failed to explain the reasons, only speculating it might relate to zero gradients. They answered "what," not "why" or "where."
Focusing on the Policy Level, Not the Parameter Level: Other works analyzed mainly at the policy level, finding that RL-trained models stay behaviorally close to the originals (small KL divergence), but they did not explain the parameter-level changes.
This Paper's Differences:
• From Observation to Mechanism: This is the first to delve from phenomenon to mechanism, proposing a complete explanatory framework, not just describing the phenomenon.
• Introduces Core Concept: Creatively proposes "model-conditioned optimization bias," stating that parameter update patterns are determined by the pretrained model's own "geometric structure," not data or RL algorithms.
• Parameter-Space and Geometric Perspective: The core distinction is that the analysis is done in parameter space (weight space) and from a geometric view (optimization geometry), directly comparing the "paths" that RLVR and SFT take through weight space.
Author's Thought Process Simulation
1. Discover Anomaly: "Huh, everyone noticed RL updates are sparse—that's weird. Is it random sparsity?"
2. Verify Consistency: "We run five different RL experiments (different data, algorithms) on the same model. (See Fig. 2) Wow! Update locations are highly consistent, like stripes! Definitely not random, not data or algorithm caused—must be something in the model itself." → Proposes "model-conditioned optimization bias."
3. Find the Reason: "Why does the model guide updates to specific areas? The pretrained parameter space isn't chaos; it has intrinsic structure. Think of a terrain map: mountains (high curvature, main functional areas, called 'Principal Directions') and plains (low curvature, secondary areas). RL's KL constraint acts like a 'rubber band' that forbids big moves. To get maximal reward at minimal cost to the stable structure, walk the plains instead of shaking the mountains." → Proposes "geometric structure guidance."
4. Build the Theory: "We summarize this into a theory. First, a 'rubber band' (Gate I: KL Anchor) limits how far each step can go. Then the terrain (Gate II: Model Geometry) decides the direction: flat, off-principal paths. Why does it look 'sparse'? Because many of the tiny steps taken on the plains fall below bfloat16 precision (Gate III: Precision) and appear unchanged. Together, the three gates produce the observed phenomenon." → Proposes the "Three Gates Theory."
Pipeline Explanation (Example: Fine-tuning Qwen3-8B with RLVR for Math Problems)
• Input: A pretrained Qwen3-8B model, batch of math problems, corresponding answer verifiers (reward signal).
• Processing Flow (One Training Step):
Gate I: KL Anchor
The model attempts to generate solution steps for a math problem, and the RL algorithm (e.g., PPO) tries to maximize the reward for correct answers.
But there's a KL divergence penalty (explicit or implicit): "You can update, but post-update behavior can't deviate much from pre-update."
This sets an upper limit on the magnitude of the parameter update ΔW: the model can only make a small "shift."
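As a rough sketch of how such a KL anchor can enter the objective (my illustration; the function name, the penalty coefficient beta, and the exact surrogate are assumptions, not the paper's or any specific library's implementation):

```python
import torch
import torch.nn.functional as F

def kl_anchored_pg_loss(logits_new, logits_ref, logprob_taken, advantage, beta=0.1):
    """Policy-gradient surrogate plus a KL penalty to the frozen (pre-update) reference policy."""
    pg_loss = -(advantage * logprob_taken).mean()                   # push up reward-weighted log-likelihood
    logp_new = F.log_softmax(logits_new, dim=-1)
    logp_ref = F.log_softmax(logits_ref, dim=-1)
    kl = (logp_new.exp() * (logp_new - logp_ref)).sum(-1).mean()    # KL(new || ref)
    return pg_loss + beta * kl                                      # the "rubber band" that keeps updates small
```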
Gate II: Model Geometry
Now, in which direction should this small "shift" go? Qwen3-8B's weights aren't random: SVD reveals a few large singular values whose principal directions store core knowledge and functions (e.g., language structure, basic arithmetic). Changing them causes drastic behavior changes; these are the "mountains."
To improve performance without disrupting the core structure, the optimizer avoids the "principal directions" and instead modifies directions with smaller singular values (off-principal, the "plains"). These changes have little impact on stability yet are effective for adjusting the strategy.
Result: ΔW concentrates on weights corresponding to these "off-principal directions."
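To make "concentrates off the principal directions" measurable, here is one simple diagnostic (my own sketch, not the paper's metric): project an update dW onto the pretrained matrix's top-k singular subspace and check what fraction of its energy lands there.

```python
import torch

def principal_energy_fraction(W, dW, k=32):
    """Fraction of dW's squared Frobenius norm lying in W's top-k singular subspace."""
    U, S, Vh = torch.linalg.svd(W)
    U_k, V_k = U[:, :k], Vh[:k, :].T        # principal output / input directions of W
    proj = U_k.T @ dW @ V_k                 # coordinates of dW inside the principal subspace
    return (proj.norm() ** 2 / dW.norm() ** 2).item()   # near 0 means the update is "off-principal"
```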
Gate III: Precision
Many of the updates along the "off-principal" directions are tiny (e.g., on the order of 1e-7).
bfloat16 precision is limited: for a weight equal to 1.0, the gap to the next representable value (the ULP, about 2^-7 ≈ 0.008) is far larger than 1e-7, so the update is "swallowed" by rounding and the stored value stays 1.0.
Result: Only updates on off-principal directions that accumulate enough to cross the rounding threshold get recorded; the tiny ones are effectively "zeroed."
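This rounding effect is easy to reproduce (a quick bfloat16 demo of my own, not the paper's code):

```python
import torch

w = torch.tensor(1.0, dtype=torch.bfloat16)
print(torch.finfo(torch.bfloat16).eps)                 # 0.0078125: gap from 1.0 to the next bfloat16 value

print(w + torch.tensor(1e-7, dtype=torch.bfloat16))    # still 1.0 -- the tiny update is swallowed
print(w + torch.tensor(1e-2, dtype=torch.bfloat16))    # ~1.0078 -- large enough to register
```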
Output:
• Fine-tuned Qwen3-8B model.
• Comparing pre- and post-training weights: only a small portion has visibly changed, the changed positions are patterned (stripe-like) and mostly lie off the core "principal directions," and the overall "spectral structure" (singular value distribution) is nearly unchanged.
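A minimal sketch of this kind of before/after check (illustrative, assuming you have the same weight matrix from two checkpoints): what fraction of stored values changed at all, and how much the leading singular values drifted.

```python
import torch

def update_stats(W_before, W_after, k=32):
    changed_frac = (W_before != W_after).float().mean().item()        # share of weights that changed at all
    S0 = torch.linalg.svdvals(W_before.float())
    S1 = torch.linalg.svdvals(W_after.float())
    top_k_drift = ((S1[:k] - S0[:k]).abs() / S0[:k]).max().item()     # worst relative drift among top-k singular values
    return changed_frac, top_k_drift
```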
Does this paper have theoretical basis explaining why the method works?
Yes, and it is quite solid. The paper provides mathematical proofs for each gate of the "Three Gates Theory":
Gate I Theoretical Basis:
Propositions 3.1 & 3.2: They prove that a single policy-gradient step keeps the policy's KL divergence bounded, and that this bound translates into a bound on the parameter update ||ΔW||. Put simply, the math shows that RL updates are tied by an "invisible rope."
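For intuition (this is the standard second-order expansion behind such bounds, not the paper's exact propositions): for a small parameter step, the policy KL is quadratic in the step, so a KL budget directly caps the step size.

```latex
\mathrm{KL}\!\left(\pi_{\theta}\,\|\,\pi_{\theta+\Delta\theta}\right)
  \;\approx\; \tfrac{1}{2}\,\Delta\theta^{\top} F(\theta)\,\Delta\theta
  \;=\; O\!\left(\lVert\Delta\theta\rVert^{2}\right),
\qquad F(\theta)\ \text{the Fisher information matrix.}
```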
Gate II Theoretical Basis:
Theorem 3.3 (based on Wedin's theorem) & Corollaries 3.4, 3.5: These come from classical matrix perturbation theory. They prove that when ||ΔW|| is small:
1. Singular subspaces (functional directions) rotate minimally.
2. Singular values (importance) change minimally.
3. Top-k energy nearly unchanged.
Put simply, the math shows that small updates naturally preserve the spectral structure and steer clear of the "principal directions."
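Schematically, the classical results behind this read as follows (stated up to constants and the exact gap definition; the paper uses its own precise versions):

```latex
% Weyl: singular values move by at most the size of the perturbation
\bigl|\sigma_i(W+\Delta W) - \sigma_i(W)\bigr| \;\le\; \lVert\Delta W\rVert_2 \quad \text{for all } i
% Wedin (sin-theta, schematic): principal subspaces rotate by at most ~ ||ΔW|| / spectral gap
\max\!\bigl(\lVert\sin\Theta(U_k,\tilde U_k)\rVert,\ \lVert\sin\Theta(V_k,\tilde V_k)\rVert\bigr)
  \;\lesssim\; \frac{\lVert\Delta W\rVert}{\sigma_k(W)-\sigma_{k+1}(W)}
```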
Gate III Theoretical Basis:
Corollary 3.6 & Lemma E.2: Based on floating-point arithmetic basics. They prove that in bfloat16, a stored weight only changes when the update exceeds the ULP at that weight's magnitude, which explains why tiny updates get filtered out.
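Concretely, for normal bfloat16 numbers (7 stored mantissa bits), the rounding threshold is roughly:

```latex
\mathrm{ULP}(w) \;=\; 2^{\lfloor \log_2 |w| \rfloor - 7},
\qquad
\operatorname{round}(w+\delta) = w \quad \text{whenever} \quad |\delta| \lesssim \tfrac{1}{2}\,\mathrm{ULP}(w)
```

so for w = 1.0 an update needs to exceed roughly 2^-8 ≈ 0.004 to be stored at all.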
What are the experimental validation conclusions?
The experiments are cleverly designed and strongly validate the theory.
Conclusion 1: RLVR Preserves Spectral Geometry, SFT Disrupts It (Fig. 4)
The comparison shows that after RLVR training, the singular value distributions and principal directions nearly match those of the pretrained model, whereas SFT changes them drastically. This confirms that RLVR takes a "spectrum-preserving," flat path.
Conclusion 2: RLVR Avoids Principal Weights, SFT Attacks Them (Fig. 5)
Defines "Principal Weights" as core function proxy. RLVR updated weights overlap less than random with principal, showing active avoidance.
Conclusion 3: Disrupt Geometry, Bias Disappears (Fig. 6)
A clever causal experiment: the authors "rotate" the weights of some layers (function unchanged, geometric basis altered), thereby disrupting the pretrained geometry. In the disrupted layers, the consistent update patterns vanish and become random. This strongly supports the claim that the pretrained geometry is the source of the bias.
Conclusion 4: SFT-Era PEFT Methods Unsuitable for RL (Sec. 5)
Sparse Fine-Tuning Experiment (Fig. 9): Updating only the "off-principal" weights yields performance and training trajectories close to full fine-tuning; updating only the "principal" weights (the SFT-style choice) is disastrous.
LoRA vs. PiSSA (Fig. 10): PiSSA is an SFT-era LoRA variant designed to target the principal directions. Under RLVR it is not only no better than vanilla LoRA but worse: by forcing updates onto the "mountain" path it crashes more easily.
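A rough sketch of the "update only off-principal weights" idea (my own, reusing the proxy above; not the paper's implementation): build a mask that excludes the principal weights, then zero out the masked gradients at every RL step.

```python
import torch

def off_principal_mask(W, k=32, top_frac=0.05):
    U, S, Vh = torch.linalg.svd(W.float())
    principal_part = U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]
    mask = torch.ones(W.numel(), dtype=torch.bool)
    n_top = int(top_frac * W.numel())
    mask[principal_part.abs().view(-1).topk(n_top).indices] = False   # freeze the principal-weight proxy
    return mask.view_as(W)

# Precompute the mask once per weight matrix; then, during RL fine-tuning,
# after loss.backward() and before optimizer.step():
#   param.grad.mul_(mask.to(param.device))
```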
My Thoughts
1. RLVR tends not to modify the model's "declarative knowledge" ("what is"), but instead optimizes its "procedural knowledge" ("how to do").
2. Is training sparsity good or bad?
I think it's a neutral phenomenon. Judging from the paper, it reflects a "good" trait: efficiency and safety. The model learns new, complex skills (reasoning) without disrupting its hard-earned knowledge system (it preserves the pretrained geometry). It's an elegant form of "minimally invasive surgery."
3. Does RL sparsity limit capability ceiling?
Depends on "ceiling" definition.
• Knowledge Ceiling: Yes, RLVR can't teach entirely new knowledge absent from pretraining. E.g., if model never saw "Aurelle," RLVR can't make it know who I am.
• Skill/Reasoning Ceiling: Opposite, RLVR massively breaks skill ceiling. Most complex problems (math, coding) need no new knowledge, but flexible multi-step combination of existing. RLVR optimizes procedural knowledge, boosting reasoning from 1 to 100. Doesn't add library books, upgrades retrieval/integration to future tech.
4. Can SFT break original knowledge?
SFT excels at "instilling" new knowledge. For 2025 events or brand-new domains, SFT (especially distillation) is the most direct route: it directly modifies the principal weights that store core knowledge, like forcibly swapping out the books in the library.
But SFT is risky. Such aggressive modifications easily cause the model to forget old knowledge (catastrophic forgetting), or to imitate the surface form without learning the underlying logic (overfitting).