In recent years, large language models (LLMs) have made rapid progress in mathematical reasoning. Leading models keep breaking records on difficult math benchmarks like MATH and AIME, in some cases surpassing the average human expert, and the leaderboards are reshuffled almost weekly 🔥.
Mathematics, as the cornerstone of science, with its well-posed problems, unique answers, and easy evaluation, has naturally become the "gold standard" for measuring LLM reasoning capabilities. But when we shift our focus from pure math competitions to broader real-world applications, a critical question emerges: do these astonishing achievements in mathematics actually translate into general problem-solving ability? Does a mathematically gifted LLM also get stronger in other fields such as scientific Q&A, code generation, conversational interaction, and instruction following, or does it merely "specialize" at the expense of other skills?
To answer this core question, a research team from Carnegie Mellon University (CMU), the University of Washington, the University of Pennsylvania, M-A-P, and The Hong Kong Polytechnic University conducted a comprehensive study. They not only evaluated over 20 open-source reasoning-tuned models, but also used carefully controlled experiments and mechanistic analysis to reveal how strongly the training paradigm affects the transferability of model capabilities.
Paper link: https://hf.co/papers/2507.00432
🧐 Surprising Discovery: Not All Math Experts Are Generalists
The research team first comprehensively assessed over 20 open-source models with strong math performance, testing them on other reasoning tasks (such as scientific Q&A, code generation, and agent planning) and on non-reasoning tasks (such as conversational Q&A and instruction following).
To quantify capability transfer, they proposed a new metric, the Transferability Index (TI). In short, a positive TI means the model's gains in mathematics carried over to other domains; a negative TI means the model improved at math at the cost of performance elsewhere, i.e., its capabilities degraded.
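The paper gives the exact definition of TI; purely as an illustration, a transferability-style score with the sign behavior described above could be computed as the ratio of relative gains, as in the sketch below (the formula and function names here are assumptions, not the paper's definition):

```python
# Illustrative sketch only: the paper's exact TI formula may differ.
# Here TI is assumed to be the relative gain in a target domain divided by
# the (absolute) relative gain in math, so TI > 0 means math gains carried
# over and TI < 0 means the target domain regressed while math improved.

def relative_gain(tuned_score: float, base_score: float) -> float:
    """Relative improvement of the fine-tuned model over its base model."""
    return (tuned_score - base_score) / base_score

def transferability_index(math_tuned: float, math_base: float,
                          other_tuned: float, other_base: float) -> float:
    math_gain = relative_gain(math_tuned, math_base)
    other_gain = relative_gain(other_tuned, other_base)
    return other_gain / abs(math_gain) if math_gain != 0 else 0.0

# Example: math 60 -> 80 (+33%), instruction following 70 -> 63 (-10%)
# yields a negative TI, i.e., the math gains did not transfer.
print(transferability_index(80, 60, 63, 70))
```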
The results were surprising:
Figure 1: Mathematical Capability Transferability Index of Different Models
This figure illustrates how well various models transfer from mathematics to other domains. The horizontal axis lists base models of different sizes, and the vertical axis is the Transferability Index (shown with a signed logarithmic transformation for readability). Models fine-tuned with Reinforcement Learning (RL) (orange dots) almost all show positive transfer, while models trained with Supervised Fine-Tuning (SFT) (blue dots) largely show negative transfer, especially on non-reasoning tasks: they got better at math, but their general capabilities declined.
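The "signed logarithmic transformation" in the caption is a standard trick for plotting values that span several orders of magnitude in both directions; one common form (an assumption here, the paper may use a variant) looks like this:

```python
import numpy as np

def signed_log(x: np.ndarray) -> np.ndarray:
    """Compress magnitude while keeping the sign, so large positive and
    negative transferability values fit on one readable axis."""
    return np.sign(x) * np.log1p(np.abs(x))
```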
This phenomenon reveals a key divergence point: the model's fine-tuning paradigm. Regardless of model size, architecture, or training data, models fine-tuned with Reinforcement Learning (RL) generally exhibit stronger generalization capabilities, whereas Supervised Fine-Tuning (SFT) models often suffer from "catastrophic forgetting" and perform poorly on non-mathematical tasks.
🔬 Controlled Experiment: SFT vs. RL Head-to-Head
To verify the hypothesis that the fine-tuning paradigm is the key factor, the research team ran a rigorous controlled experiment. They chose the strong Qwen3-14B as the base model and used the same high-quality mathematical data for both training paths.
• SFT Path: Researchers first had a stronger "teacher model" (Qwen3-32B) generate detailed problem-solving traces (chains of thought, CoT), and then used these "reference solutions" to fine-tune Qwen3-14B in a supervised manner, teaching it to imitate the solutions step by step.
• RL Path: Researchers provided no problem-solving steps; they only told Qwen3-14B whether its final answer was correct, used that verdict as the reward signal, and let the model learn to reach correct answers through exploration. A minimal sketch of both training signals follows below.
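As a rough sketch of the difference between the two signals (with hypothetical helper names; the actual data construction and reward implementation are described in the paper), the contrast comes down to what the training target looks like:

```python
import re

def extract_final_answer(text: str) -> str:
    """Naive illustrative parser: take the last number that appears."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else ""

def sft_example(problem: str, teacher_cot: str) -> dict:
    """SFT path: the student is trained to imitate the teacher's full
    chain of thought, token by token."""
    return {"prompt": problem, "target": teacher_cot}

def rl_reward(model_output: str, reference_answer: str) -> float:
    """RL path: only the correctness of the final answer is rewarded;
    how the model gets there is left to exploration."""
    correct = extract_final_answer(model_output) == extract_final_answer(reference_answer)
    return 1.0 if correct else 0.0
```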
The experimental results perfectly confirmed previous findings:
Figure 2: Impact of SFT and RL on Model General Capabilities
This figure shows the impact of SFT and RL on the performance of the same base model (Qwen3-14B-Base) after training solely with mathematical data (improvement relative to the baseline). RL-trained models (left) not only made progress in mathematical and other reasoning tasks but also demonstrated broad generalization capabilities in non-reasoning tasks. In contrast, SFT-trained models (right), while generalizing somewhat in reasoning tasks, showed very limited transferability in non-reasoning tasks, and even experienced performance degradation.
This result strongly shows that even when trained only on mathematical data, RL can improve a model's reasoning capabilities without harming, and sometimes even enhancing, its general capabilities. SFT, by contrast, more easily pushes the model toward rote imitation, leaving it "rigid" when facing tasks outside its training domain.
🧠 In-depth Exploration: Why Is RL's Generalization Stronger?
To uncover the deeper mechanistic differences behind these two training paradigms, the research team used two major "tools" to peer into the model's "inner world": latent space representation analysis and token space distribution shift analysis.
1. Latent Space: SFT's "Major Overhaul" vs. RL's "Precise Fine-Tuning"
Using Principal Component Analysis (PCA), the researchers measured how much the model's internal representations shift after training (a minimal sketch of such a drift measurement follows the list below). They found:
• SFT causes drastic representation and output drift. SFT training acts like a "major overhaul," significantly reshaping the model's original knowledge structure to fit mathematical tasks, which then hurts its handling of non-reasoning tasks.
• RL, on the other hand, better preserves the structure of general domains. RL training is more like "precise fine-tuning," strengthening reasoning-related pathways without destroying the model's original general knowledge framework.
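One way such representation drift could be measured (an assumed setup, not the paper's exact procedure) is to collect hidden states for the same prompts from the base and the fine-tuned model, fit PCA on the base model's activations, and compare the projections:

```python
import numpy as np
from sklearn.decomposition import PCA

def representation_drift(base_acts: np.ndarray, tuned_acts: np.ndarray,
                         n_components: int = 10) -> float:
    """base_acts / tuned_acts: [num_prompts, hidden_dim] hidden states of the
    base and fine-tuned model on the same prompts. Returns the mean shift of
    each prompt's representation in the base model's top principal components;
    larger values indicate a bigger 'overhaul' of the latent space."""
    pca = PCA(n_components=n_components).fit(base_acts)
    shift = pca.transform(tuned_acts) - pca.transform(base_acts)
    return float(np.linalg.norm(shift, axis=1).mean())
```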
2. Token Space: SFT's "Catch-All" vs. RL's "Focusing on the Important"
By analyzing how the probability of each token changes when the model generates text, the researchers found an even more telling pattern.
Figure 3: Significantly Changed Tokens in Math Tasks for RL and SFT Models
This word cloud shows which tokens' output probabilities changed significantly when RL models (left) and SFT models (right) handled mathematical tasks. RL models mainly shifted tokens tied to logical structure (highlighted in red, such as "But" and "So"; blue marks content-specific words), an efficient way to strengthen reasoning. SFT models, in contrast, perturbed a large number of tokens, many of them irrelevant to the task, suggesting a coarser and more superficial way of learning.
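To make the token-space analysis concrete, here is a minimal sketch (an assumed setup; the paper's exact procedure, checkpoints, and aggregation may differ) that measures how much the log-probability assigned to each observed next token shifts between a base model and its fine-tuned version:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_logprob_shift(base_name: str, tuned_name: str, text: str):
    """Per-token change in log-probability between a base and a fine-tuned
    model on the same text; large shifts flag the tokens whose usage the
    fine-tuning changed most (model names here are placeholders)."""
    tok = AutoTokenizer.from_pretrained(base_name)
    base = AutoModelForCausalLM.from_pretrained(base_name).eval()
    tuned = AutoModelForCausalLM.from_pretrained(tuned_name).eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logp_base = torch.log_softmax(base(ids).logits, dim=-1)
        logp_tuned = torch.log_softmax(tuned(ids).logits, dim=-1)
    targets = ids[:, 1:]  # the tokens actually observed at each next position
    gather = lambda lp: lp[:, :-1].gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    shift = gather(logp_tuned) - gather(logp_base)
    return [(tok.decode([int(t)]), float(s)) for t, s in zip(targets[0], shift[0])]
```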
🔥 Conclusion and Implications
This research reveals a crucial yet often overlooked point behind improving LLM reasoning capabilities: the training method is more important than we imagined.
• Reinforcement Learning (RL) is key to achieving capability transfer: RL-tuned models can enhance specific reasoning abilities, such as mathematics, while maintaining or even strengthening their general capabilities in other domains, achieving a balance between "specialist" and "generalist."
• Supervised Fine-Tuning (SFT) must guard against the "specialization trap": SFT on "perfect" data distilled from stronger models can quickly boost leaderboard scores, but it is likely to harm the model's generality and trigger "catastrophic forgetting."
• Mechanistic understanding pays off: analysis of the model's internal representations and output distributions shows that RL's advantage lies in its "precise," "minimally invasive" optimization, whereas SFT can be too heavy-handed, overwriting valuable pre-trained knowledge.
This work points the way toward building more powerful and more general AI reasoning models. Perhaps the community should rethink its reliance on SFT with distilled data and lean more on RL, pushing LLMs from "problem-solving masters" toward true "general problem solvers." 🚀