You might think you hired a postdoctoral researcher for scientific innovation, but it turns out you just hired a test-taking expert who endlessly drilled problems, becoming even more proficient at what they already knew. (Paper published on May 16, 2025, by the LEAP lab at Tsinghua University and Shanghai Jiao Tong University.)
Phase One: Core Concept Identification
Analysis of the Paper's Motivation
Currently, the AI community widely believes that Reinforcement Learning with Verifiable Rewards (RLVR) is a decisive technique for improving the performance of Large Language Models (LLMs) on complex reasoning tasks such as mathematics and programming. It is generally assumed that, just as traditional reinforcement learning (e.g., AlphaGo) can enable agents to discover novel, superhuman strategies through exploration, RLVR can make LLMs "self-evolve" and learn entirely new reasoning abilities that were not present in their original (Base) models.
However, the authors of this paper raise a sharp question: does reinforcement learning truly incentivize LLMs to develop new reasoning capabilities beyond their base models? Or does it merely make the models more efficiently "utilize" abilities already "hidden" within the base models? In other words, is RLVR "creating new knowledge" or "optimizing the retrieval efficiency of existing knowledge"? This question is crucial because it directly affects our judgment of LLMs' capacity limits and future development paths. The authors' motivation is to use rigorous experiments to peel back the surface of RLVR's success and probe its true underlying mechanisms.
Analysis of the Paper's Main Contributions
Listing the Paper's Claimed Main Innovations
Revealing the Fundamental Limitations of Current RLVR: Through extensive experiments, the paper demonstrates that current mainstream RLVR methods do not bring fundamentally new reasoning patterns to LLMs. The reasoning capacity boundary (upper limit) of RL-trained models is actually limited by their corresponding Base Models, and this boundary may even shrink after training.
Proposing and Systematically Applying a More Accurate Evaluation Paradigm: The paper points out that traditional evaluation metrics (such as pass@1, the success rate in a single attempt) can only reflect the model's “average performance” and cannot measure its “capacity upper limit.” The authors systematically use pass@k (the probability of at least one success in k attempts) as the core evaluation metric, especially when k is large, as it can more accurately probe the model's reasoning capacity “ceiling.”
Proving that the Core Function of RLVR is to Improve Sampling Efficiency: The paper finds that the reason RL-trained models perform better in conventional tests is not because they have learned to solve new problems, but because they can generate the correct answers that the Base Model could already generate with a higher probability and more quickly (within fewer attempts). This is essentially a “purification” or “focusing” of the distribution, rather than an “expansion” of capabilities.
Distinguishing the Essential Differences between RLVR and Distillation: Through comparative experiments, the paper finds that knowledge distillation (learning from a stronger teacher model) can genuinely introduce new reasoning patterns to the student model, thereby expanding its capacity boundary. This, in turn, highlights the limitations of current RLVR methods.
Identifying Key Technologies or Methods Supporting These Innovations
Core Evaluation Method pass@k (for large k values): This is the “microscope” of the entire paper. By comparing the pass@k curves of the Base Model and the RL Model at different k values, the authors were able to observe the critical phenomenon that “the RL Model leads at small k values, but is surpassed by the Base Model at large k values.”
Perplexity Analysis: To explain “why reasoning paths already exist,” the authors calculated the perplexity of the Base Model for the correct answers generated by the RL Model. The results showed very low perplexity, indicating that these “new” answers were not at all “unexpected” for the Base Model and were completely within its generation distribution.
Solvable Problem Set Coverage Analysis: This is a direct set-theoretic argument. The authors directly compared the set of problems that the Base Model could solve with the set of problems that the RL Model could solve, finding that the latter was almost a subset of the former, intuitively proving that the RL Model did not solve new problems.
Comprehensive Experimental Design: Experiments across multiple model families (Qwen, LLaMA), various tasks (mathematics, code, visual reasoning), and multiple RL algorithms (PPO, GRPO, etc.) greatly enhanced the generality and persuasiveness of the conclusion.
Significant Results of the Paper
The most significant result is undoubtedly the crossing phenomenon of the pass@k curves (see Figure 2 in the paper). The RL-trained model's curve is higher on the left side of the graph (smaller k), indicating better performance; however, as k increases, the Base Model's curve catches up with a steeper slope and eventually overtakes the RL Model on the right side of the graph (larger k). This "crossing" clearly and visually demonstrates that the RL Model "starts fast but lacks stamina," while the Base Model has "great potential but needs more attempts." This result is highly disruptive because it challenges the intuition that "RL training is always beneficial."
Identifying Understanding Difficulties
Analysis of Key Concepts/Methods for Understanding the Paper
The Deep Meaning of the pass@k Metric: Understanding why pass@1 represents “average performance” and large k values of pass@k represent “capacity boundary” is crucial.
Similarities and Differences between RLVR and Traditional RL: It is necessary to understand why RL, which can continuously explore new strategies in Atari games, seems to “fail” with LLMs. This involves understanding LLM pre-training priors and the enormous action space.
The Phenomenon of "Capacity Boundary Shrinkage": Intuitively, training should make the model stronger, so why would the capacity boundary narrow instead? This needs to be understood in conjunction with the RL algorithm's objective (maximizing the likelihood of rewarded samples).
Identifying the Most Challenging Part of These Concepts
The most challenging part is deriving the conclusion "RLVR limits the model's capacity upper limit" from the crossing of the pass@k curves. This requires readers to shift their thinking: not just viewing pass@k as a performance score, but understanding it as a "probe" into the model's "potential knowledge base." When k is sufficiently large, the value of pass@k approximates the proportion of problems the model can solve, i.e., its "capacity coverage."
Determining Core Concepts that Need Emphasis
The core concept that most needs in-depth explanation is: how pass@k functions as a capacity boundary detector, and how it reveals the two opposing effects of RLVR: sampling efficiency improvement vs. capacity boundary shrinkage.
Concept Dependencies
Starting Point: The pass@k Metric. This is the foundation of all analysis, serving as the measuring stick.
Core Argument: Based on pass@k, compare the curves of the Base Model and the RL Model, leading to the “crossing” phenomenon.
Phenomenon Interpretation: Explain why crossing occurs. When k is small, the RL Model wins, indicating sampling efficiency improvement; when k is large, the Base Model wins, indicating a broader capacity boundary (and even boundary shrinkage for the RL Model).
Cause Exploration: Why does this happen? Introduce perplexity analysis and solvable problem set analysis, proving that the solutions generated by the RL Model all originate from the Base Model.
Deepening Understanding: Consolidate and generalize this conclusion by comparing it with distillation and analyzing different RL algorithms.
Phase Two: In-depth Explanation of Core Concepts
Designing a Real-life Analogy: A Library Book-Finding Adventure
Imagine a very knowledgeable scholar (our Base Model), whose lifelong learning is stored in an enormous private library. This library is incredibly rich in books, ranging from mainstream textbooks to obscure ancient manuscripts, covering solutions to almost all problems.
However, this scholar has a small habit: he's a bit disorganized, and the books in his library are somewhat messy. When you ask him a complex question (like an Olympiad math problem), he'll go into the library and randomly pull a book from a shelf for you.
The probability that he picks the correct solution on his first try might not be high (this corresponds to the model's pass@1).
However, if you give him enough attempts (e.g., let him try 1,024 times, i.e., pass@1024), as long as the solution truly exists in the library, he will eventually find the correct book. The volume of books in this library represents the scholar's knowledge boundary, or capacity upper limit.
Now, an expert in test-oriented education arrives to give the scholar “reinforcement learning special training” (i.e., RLVR training), with the goal of enabling him to answer questions more quickly. The expert's approach is:
Give the scholar a large number of standard exam questions to practice.
The scholar answers by finding a book from the library each time.
If the answer is correct (earning a reward), the expert puts a big “Key Point” label on the book and places it on the most prominent shelf.
If the answer is incorrect, the expert makes the scholar stuff the book into a corner.
After a period of special training, the scholar (now the RL-trained Model) has changed. When you ask him questions again, he will prioritize finding books from the shelves labeled “Key Point.”
Establishing Correspondence Between Analogy and Actual Technology
Element in Analogy → Actual Technical Concept: Explanation
Knowledgeable Scholar → Base Model: a pre-trained LLM with vast but unorganized knowledge.
Enormous Private Library → Base Model's knowledge/capacity space: the set of all possible reasoning paths contained within the model's parameters.
Library's Book Volume → Base Model's capacity boundary/upper limit: the set of all problems the model can theoretically solve.
Asking the Scholar a Question → Inputting a reasoning task (prompt): giving the model a math problem, programming problem, etc.
Scholar's First Book Pick → One sampled generation (pass@1): the model generates a single answer, which is checked for correctness.
Giving the Scholar k Attempts to Pick a Book → k sampled generations (pass@k): measures whether the model can solve the problem within k attempts.
Finding the Correct Book → Generating the correct reasoning path and answer: the model's Chain-of-Thought (CoT) output is valid.
Test-Oriented Special Training → RLVR (Reinforcement Learning with Verifiable Rewards): fine-tuning the model using verifiable correct/incorrect reward signals.
"Key Point" Labels and Prominent Shelves → The RL algorithm's adjustment of the probability distribution: RL increases the generation probability of correct reasoning paths and decreases that of incorrect ones.
Specially Trained Scholar → RL-trained Model: an LLM fine-tuned using RLVR.
Delving into Technical Details
Now let's connect the analogy with the key technical aspects of the paper.
Core Evaluation Metric: pass@k
pass@k means “the probability of at least one success in k independent attempts.”
When k=1, it is the single-attempt success rate. In our analogy, it's the probability that the specially trained scholar picks the correct answer from the “Key Point” shelf on his first try. Due to the special training, this probability is high.
When k is very large (e.g., 1024), it measures “whether the problem can be solved if given enough opportunities.” This is like allowing the scholar to completely search his entire library for the answer. At this point, the competition is no longer about “how fast one finds it,” but “whether the book actually exists in the library.”
This is the essence of the pass@k curve crossing phenomenon in the paper:
Left Side (smaller k): The specially trained scholar (RL Model), having “Key Point” shelves, finds answers quickly and accurately, thus having a higher pass@k value.
Right Side (larger k): When unlimited attempts are allowed, the untrained scholar (Base Model), although slow in searching, might have some obscure but equally correct solutions hidden in his unorganized, more extensive library that the specially trained scholar overlooked during his “test-oriented education.” Therefore, his capacity upper limit (total book volume) is actually greater, and his pass@k curve will eventually overtake.
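To see how such a crossing can arise mechanically, here is a minimal sketch with hypothetical per-problem success probabilities (made-up numbers, not the paper's data). It assumes each problem is solved independently with probability p per sample, so pass@k for that problem is 1 − (1 − p)^k.

```python
# Hypothetical illustration of why pass@k curves can cross.

def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent samples."""
    return 1.0 - (1.0 - p) ** k

# RL-style model: solves 6 of 10 problems reliably (p = 0.5), the other 4 never.
# Base-style model: solves 9 of 10 problems, but each only rarely (p = 0.05).
rl_probs   = [0.5] * 6 + [0.0] * 4
base_probs = [0.05] * 9 + [0.0] * 1

for k in [1, 4, 16, 64, 256, 1024]:
    rl   = sum(pass_at_k(p, k) for p in rl_probs) / len(rl_probs)
    base = sum(pass_at_k(p, k) for p in base_probs) / len(base_probs)
    print(f"k={k:4d}  RL pass@k={rl:.3f}  Base pass@k={base:.3f}")

# Output: the RL-style model leads at small k (higher efficiency), while the
# Base-style model overtakes it at large k (broader coverage) -- the crossing above.
```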
Key Analysis: Sampling Efficiency Gap
The paper defines an interesting metric to quantify this difference:
Original Mathematical Form: Sampling Efficiency Gap Δ_SE = pass@1(RL Model) − pass@k(Base Model), where the authors usually take k = 256.
In words: the RL Model's single-attempt success rate minus the Base Model's highest success rate within k attempts.
Mapping to Analogy: This formula measures: “How much does the specially trained scholar's ability to answer correctly on the first try fall short of the total number of problems the untrained scholar can solve after thoroughly searching his entire library?”
The authors found that this value is consistently large and negative (because pass@1(RL) is much smaller than pass@k(Base)), indicating that RL training has far from fully exploited the existing potential of the Base Model. It merely transforms the model from a “knowledgeable but slow-reacting scholar” into an “efficient but narrower-minded test-taking expert.”
RL Model Single-Attempt Success Rate: The specially trained scholar's “high test score ability.”
Base Model's Highest Success Rate within k Attempts: The untrained scholar's “breadth of knowledge” (library's book volume).
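To make the sign of this gap concrete, here is a tiny worked example with made-up numbers (not taken from the paper). Suppose pass@1(RL) = 0.45 and pass@256(Base) = 0.70; then Δ_SE = 0.45 − 0.70 = −0.25. In other words, even after training, a single attempt from the RL Model still recovers 25 percentage points fewer problems than the Base Model can reach when allowed 256 attempts.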
Mapping Technical Details to Analogy
Technology: RL algorithms maximize the log-likelihood of rewarded samples.
Analogy Reflection: This is precisely the process of “labeling correct books as key points and placing them in prominent positions.” The algorithm's goal is to make the model more inclined to generate paths verified as “good.”
Technology: The pass@k curve shows the Base Model surpassing the RL Model as k increases.
Analogy Reflection: The specially trained scholar (RL Model) relies too much on “Key Point” shelves and might fail to find solutions to problems requiring searches in non-key areas for obscure solutions. The Base Model, though slow, has a complete library and will eventually find it if given enough time.
Technology: The perplexity of solutions generated by the RL Model is low under the Base Model.
Analogy Reflection: The “Key Point” books found by the specially trained scholar actually already existed in the untrained scholar's library. For the untrained scholar, seeing these books is not at all “perplexing” or “surprising,” because “I already have them here.” This proves that RL does not create new knowledge.
Limitations of the Analogy
This analogy is very apt, but there is a subtle difference. In the analogy, the special training process seems to only "move and mark" books, not "discard" them. In actual RL training, because the model's probability distribution is adjusted, the generation probability of certain reasoning paths may be suppressed to near zero, making them practically unreachable within a limited number of sampling attempts (even when k is large). This is as if the training expert not only marked key points but also packed some "non-exam-relevant" books away in the basement, so the specially trained scholar's effective knowledge base actually shrinks. This is what the paper calls "capacity boundary shrinkage."
Summary
Through the “library book-finding” analogy, we can clearly understand the paper's core viewpoint:
The Base Model is like a knowledgeable but disorganized library, with huge potential but low efficiency.
Reinforcement Learning (RLVR) is like test-oriented special training, which greatly improves the efficiency of finding answers (pass@1 improvement) by labeling “exam-relevant” books as “key points” and prioritizing them.
However, the cost of this special training is that the model may over-rely on these “key points” and overlook other valuable books in different corners of the library, leading to a limitation or even shrinkage of its knowledge breadth (pass@k at large k).
Ultimately, RLVR does not teach the scholar new knowledge that was not originally in the library; it merely transforms him into a more efficient “librarian” rather than a more learned thinker.
Phase Three: Detailed Process Steps
Introduction
In this phase, we will meticulously break down how the paper's authors designed their experimental process to incrementally verify their core hypothesis. This process itself is a significant contribution of the paper, as it provides a rigorous analytical paradigm for subsequent research. We can view the entire process as an “LLM Reasoning Capacity Boundary Detector.”
Input
An LLM family to be evaluated (e.g., Qwen-2.5 series).
Two key models from this family:
Base Model: e.g., Qwen-2.5-7B-Base.
RLVR-trained Model: e.g., Qwen-2.5-7B trained on the GSM8K dataset using the GRPO algorithm.
An evaluation dataset with verifiable answers (e.g., the AIME24 math competition problem set).
A fixed Prompt template, ensuring that the questioning method is identical for both models.
Processing Flow
Step One: Large-Scale Sample Generation (Data Generation)
Sampling from the Base Model:
Iterate through each problem in the evaluation dataset.
For each problem, input it to the Base Model using a unified Prompt.
Set a higher temperature (e.g., 0.6) and top-p (e.g., 0.95) to encourage diverse answer generation, then have the model independently generate n candidate answers (n is a large number, such as 1024 or 2048).
Store these n generated answers (including the complete reasoning process and final result), associated with the corresponding problem ID.
Sampling from the RL Model:
Repeat the above process, but this time using the RL-trained Model.
For the same problem in the evaluation dataset, use the exact same Prompt and sampling parameters, and similarly generate n candidate answers, then store them.
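As a concrete illustration of this sampling step, here is a minimal sketch that assumes vLLM as the inference engine (the paper does not prescribe a particular one); the model names, prompt template, dataset field names, and output paths are placeholders.

```python
# Large-scale sampling sketch for Step One: n candidate answers per problem.
import json
from vllm import LLM, SamplingParams

N_SAMPLES = 1024  # candidate answers per problem
PROMPT_TEMPLATE = "Solve the following problem step by step.\n\nProblem: {question}\nAnswer:"

def sample_model(model_name: str, problems: list[dict], out_path: str) -> None:
    llm = LLM(model=model_name)
    params = SamplingParams(temperature=0.6, top_p=0.95, n=N_SAMPLES, max_tokens=2048)
    prompts = [PROMPT_TEMPLATE.format(question=p["question"]) for p in problems]
    outputs = llm.generate(prompts, params)            # one request per problem, n completions each
    with open(out_path, "w") as f:
        for problem, out in zip(problems, outputs):
            for completion in out.outputs:             # the n sampled answers
                f.write(json.dumps({"id": problem["id"], "text": completion.text}) + "\n")

# Identical prompts and sampling parameters for both models, as described above:
# sample_model("Qwen/Qwen2.5-7B", problems, "base_samples.jsonl")
# sample_model("path/to/rlvr_checkpoint", problems, "rl_samples.jsonl")
```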
Step Two: Automated Verification and Result Statistics (Verification & Statistics)
Building a Verifier:
Based on the task type, design a program that can automatically determine if an answer is correct.
For math problems, the verifier extracts the final answer from the model's generated text (e.g., "73" from \boxed{73}) and compares it with the standard answer.
For programming problems, the verifier executes the generated code and uses preset Unit Tests to check its correctness.
Batch Verification:
For each group of answers generated in the previous step (e.g., 1024 answers for a certain problem from the Base Model), have the verifier check them one by one.
Count the number of correct answers, denoted as c.
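For mathematics tasks, such a verifier can be as simple as the following sketch (an illustrative minimal version, not the paper's exact implementation); for code tasks, the same role is played by executing each generation against the unit tests.

```python
# Minimal math-answer verifier: extract the final \boxed{...} answer and compare.
import re

def extract_boxed(text: str) -> str | None:
    """Return the content of the last \\boxed{...} in the generation (simple answers only)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def is_correct(generation: str, reference: str) -> bool:
    answer = extract_boxed(generation)
    return answer is not None and answer == reference.strip()

# For one problem with a list `generations` of n sampled answers:
# c = sum(is_correct(g, reference) for g in generations)
```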
Step Three: Calculating and Plotting the pass@k Curve (Calculation & Plotting)
Calculating pass@k:
For every integer k from 1 to n, compute pass@k with the standard unbiased estimator: pass@k = 1 − C(n−c, k) / C(n, k), where n is the total number of samples per problem, c is the number of correct samples, and C(·, ·) denotes the binomial coefficient. Intuitively, this is 1 minus the probability that a randomly chosen subset of k of the n generated answers contains no correct one. (A minimal implementation sketch appears after this step.)
Averaging and Plotting:
Calculate the average value of pass@k for all problems in the entire evaluation dataset.
Thus, we obtain two curves: one is the Base Model's average pass@k curve, and the other is the RL Model's average pass@k curve.
Plot these two curves on the same graph, with the x-axis representing k (usually on a logarithmic scale) and the y-axis representing pass@k.
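A minimal implementation sketch of this step is given below. It uses the standard unbiased pass@k estimator (introduced by Chen et al., 2021, for code evaluation); the variable names and the commented usage example are illustrative, not taken from the paper's code.

```python
# Unbiased pass@k estimator and dataset-level averaging for Step Three.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one problem with n samples, c of them correct."""
    if n - c < k:
        return 1.0                       # every size-k subset must contain a correct sample
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

def average_pass_at_k(counts, ks):
    """counts: list of (n, c) pairs, one per problem; returns {k: mean pass@k}."""
    return {k: float(np.mean([pass_at_k(n, c, k) for n, c in counts])) for k in ks}

# Illustrative usage (hypothetical counts from Step Two, n = 1024 samples per problem):
# base_curve = average_pass_at_k(base_counts, ks=[2**i for i in range(11)])
# rl_curve   = average_pass_at_k(rl_counts,   ks=[2**i for i in range(11)])
# Plotting both curves against a log-scaled k axis reproduces the comparison described above.
```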
Step Four: In-depth Analysis and Cause Exploration (In-depth Analysis)
Solvable Problem Coverage Analysis:
For the Base Model, find all problems where pass@n > 0, forming set A.
For the RL Model, find all problems where pass@n > 0, forming set B.
Compare these two sets to see if B is a subset of A (B ⊆ A). The paper found that they are astonishingly close to a subset relationship (see Tables 4 and 5), meaning that problems solvable by the RL Model can almost all be solved by the Base Model.
Perplexity Analysis:
Randomly sample some correct answers generated by the RL Model (denoted as R_correct).
Then, input these R_correct samples into the Base Model and calculate the Base Model's perplexity for generating these answers (P_Base(R_correct)).
The authors found that this perplexity value was very low (see Figure 6), indicating that the Base Model considered these correct answers from the RL Model to be “expected” and completely consistent with its own generation habits. This proves that RL does not create new knowledge.
Case Study:
Manually select some successful reasoning samples from difficult problems that only the Base Model could solve (or required many attempts to solve) (see Figures 19 and 20).
This intuitively shows readers that the Base Model indeed possesses intrinsic potential for solving complex problems, rather than just guessing by chance.
Final Output
A complete analysis report, including pass@k comparison graphs, coverage tables, perplexity charts, and specific case studies, which together form a strong chain of evidence leading to the conclusion: the primary role of current RLVR methods is to improve sampling efficiency, not to expand the boundaries of reasoning ability, and the capacity upper limit of RL-trained models is restricted by the Base Model. With this detailed process in hand, a researcher who has not read the paper can understand how the authors systematically and progressively validated their core argument, and could reproduce this research paradigm.
Phase Four: Experimental Design and Verification Analysis
Interpretation of Main Experimental Design: Verification of Core Argument
Core Claim
The paper's core claim is: Reinforcement Learning with Verifiable Rewards (RLVR) does not create new reasoning abilities, but rather optimizes the sampling efficiency of existing reasoning paths within the Base Model, and this process may lead to a reduction in the model's reasoning capacity “ceiling.”
Main Experimental Design
The main experiment's design is very direct and clever, with its core being the comparison of the Base Model and the RL-trained Model's pass@k performance at different sampling counts (k).
How to Verify the Claim: If RLVR were to create new capabilities, then the RL model's pass@k curve should be higher than or equal to the Base Model's for all k values, especially when k is large, indicating a broader capacity boundary. Conversely, if the paper's claim holds true, we would observe:
When k is small (e.g., k=1), the RL model's curve is higher (high sampling efficiency).
As k increases, the Base Model's curve catches up and even overtakes the RL model (broader capacity boundary). The appearance of this “crossing point” is the most crucial evidence supporting its core argument.
Analysis of Experiment Selection Rationality
Datasets: The authors selected standard benchmarks from multiple domains, such as:
Mathematical Reasoning: GSM8K, MATH500, AIME24, Olympiad, etc. These datasets incrementally increase in difficulty, covering a range from elementary school application problems to competition-level challenges.
Code Generation: LiveCodeBench, HumanEval+.
Visual Reasoning: MathVista, MathVision.
Rationality: These choices are very reasonable. First, they are all verifiable, meaning they have clear correct answers or test cases, which is crucial for calculating rewards and pass@k. Second, they exhibit strong diversity, spanning multiple domains requiring complex reasoning such as mathematics, code, and multimodal tasks, proving the generality of the conclusion rather than a coincidence on a specific task.
Evaluation Metrics:
The core metric is pass@k.
Rationality: This is the soul of the entire paper. If only pass@1 (average accuracy) were used, the authors would only arrive at the superficial conclusion that “RLVR effectively improved model performance,” failing to uncover the deeper issue. pass@k (especially for large k values) can probe the model's potential and capacity boundary, making it an ideal tool for measuring “whether the model can solve it” rather than “whether the model got it right on the first try.” The choice of this metric perfectly aligns with the core question the paper aims to investigate.
Baseline Methods:
The most crucial baseline is the model's own Base Model version.
Rationality: This constitutes the fairest and most direct “self-comparison.” Any improvement should be achieved on this basis. Furthermore, when analyzing different RL algorithms, the authors also compared various mainstream RL methods like PPO, GRPO, etc., demonstrating that this is not an issue with a specific algorithm but with the RLVR paradigm itself.
Main Experiment Conclusion
Figures 2, 4, and other main experimental results in the paper clearly demonstrate the crossing phenomenon of the pass@k curves. Across all tested models, datasets, and tasks, the RL Model performs excellently at small k values, but as k increases, its performance curve flattens out and is eventually surpassed by the continuously rising Base Model curve. This convincingly proves that: RLVR improves the model's “average performance” but sacrifices its “potential upper limit.”
Ablation Study Analysis: Contribution of Internal Components
Ablation Point 1: Different RL Algorithms (Figure 8)
Purpose: To prove that the “capacity boundary shrinkage” phenomenon is not specific to a particular RL algorithm (e.g., GRPO) but is a common problem with the RLVR paradigm.
Design: The authors used the same Base Model and trained it with various mainstream RL algorithms such as PPO, GRPO, Reinforce++, and RLOO, then compared their pass@k curves.
Conclusion: All models trained with RL algorithms exhibited a similar pass@k crossing phenomenon, and the gap between their capacity upper limit and the Base Model's (measured by the `Sampling Efficiency Gap` or related metrics) was significant. This proves that the problem lies in the paradigm itself, not in specific implementations.
Ablation Point 2: Progress of RL Training (Figure 1, right)
Purpose: To investigate how the capacity boundary changes as RL training progresses.
Design: The authors evaluated models at different checkpoints during the RL training process (e.g., steps 150, 300, 450) using pass@k.
Conclusion: Surprisingly, the experiments showed that as the training steps increased, the model's pass@1 (average performance) steadily improved, while its pass@256 (capacity boundary) continuously decreased. This quantitatively proves that "the RL training process is accompanied by capacity boundary shrinkage," providing dynamic, evolutionary evidence for the core argument.
Ablation Point 3: Key Hyperparameters (KL Divergence Constraint, Number of Rollouts) (Figure 15)
Purpose: To examine whether some common RL training techniques can alleviate this problem.
Design:
Add a KL divergence penalty term to prevent the RL model from deviating too far from the Base Model.
Increase the number of Rollouts per prompt (from 8 to 32) for broader exploration.
Conclusion: After adding the KL constraint, pass@1 was similar, but the capacity boundary (pass@128) decreased even more significantly, indicating that strictly limiting model exploration does not solve the problem. Although increasing the number of Rollouts slightly improved pass@k at high k values, it was still far below the Base Model. This proves that simple hyperparameter tuning and increased exploration cannot fundamentally reverse the situation.
In-depth/Innovative Experiment Analysis: Insights into Method's Intrinsic Properties
Exploratory Experiment 1: Perplexity Analysis (Figure 6)
Experiment Purpose: To verify whether “the correct reasoning paths generated by the RL model already existed in the Base model's knowledge base.”
Clever Design: The design of this experiment is very novel. Instead of directly searching, it asks in reverse: How “surprised” would the Base Model be when it sees a correct answer generated by the RL Model? The “degree of surprise” is measured by Perplexity. If the Base Model finds the answer very “logical” (low perplexity), it indicates that the answer was originally within its generation distribution.
Experiment Conclusion: The results showed that the perplexity value was very low, almost the same as the perplexity of high-frequency answers generated by the Base Model itself. This conclusion convincingly proves that: The RL model did not “invent” new problem-solving approaches; it merely learned to more frequently “reiterate” those approaches that the Base Model already knew and was somewhat inclined to generate.
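For readers who want to see how such a "degree of surprise" can be measured in practice, here is a minimal sketch assuming Hugging Face transformers; the model name, prompt handling, and variable names are illustrative assumptions rather than the paper's exact setup.

```python
# Conditional perplexity of an RL-generated answer under the Base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def answer_perplexity(model, tokenizer, prompt: str, answer: str) -> float:
    """Perplexity of the answer tokens under the model, conditioned on the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :prompt_len] = -100        # score only the answer tokens (prompt boundary treated approximately)
    with torch.no_grad():
        loss = model(full_ids, labels=labels).loss   # mean negative log-likelihood of answer tokens
    return torch.exp(loss).item()

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B", torch_dtype=torch.bfloat16)

# A low value for a correct answer produced by the RL model means the Base model already
# assigned that reasoning path high probability, i.e. the path was not "new" to it:
# ppl = answer_perplexity(base_model, tokenizer, prompt, rl_correct_answer)
```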
Exploratory Experiment 2: Solvable Problem Coverage Analysis (Tables 4 & 5)
Experiment Purpose: To directly verify, from a set-theoretic perspective, whether the RL model's problem-solving capacity range is “contained” within that of the Base Model.
Clever Design: The design is very direct. Through large-scale sampling, the sets of problems that each model can solve (solvable if pass@k>0) are determined separately, and then the relationship between these two sets is compared.
Experiment Conclusion: The set of problems solvable by the RL model is almost entirely a subset of the set of problems solvable by the Base Model. This provides the most intuitive evidence that the RL model has not learned to solve any new problems that the Base Model was completely unable to solve.
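As a rough illustration of this set comparison, here is a minimal sketch with hypothetical per-problem correct-sample counts (placeholder numbers, not the paper's data).

```python
# Per-problem counts: problem id -> (n_samples, n_correct); values are made up.
base_counts = {"p1": (1024, 12), "p2": (1024, 3), "p3": (1024, 0), "p4": (1024, 40)}
rl_counts   = {"p1": (1024, 800), "p2": (1024, 0), "p3": (1024, 0), "p4": (1024, 950)}

# A problem counts as solvable if at least one of its n samples is correct (pass@n > 0).
base_solvable = {pid for pid, (_, c) in base_counts.items() if c > 0}
rl_solvable   = {pid for pid, (_, c) in rl_counts.items() if c > 0}

print("RL-solvable ⊆ Base-solvable:", rl_solvable <= base_solvable)   # subset check
print("Solved only by the Base model:", base_solvable - rl_solvable)  # Base model's extra coverage
```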
Exploratory Experiment 3: Comparison with Knowledge Distillation (Figure 7)
Experiment Purpose: To answer a potential question: “Does all post-training on the Base Model lead to capacity boundary shrinkage?”
Clever Design: The authors introduced an important control group—knowledge distillation. They distilled the output (long Chain-of-Thought reasoning) from a more powerful model (Teacher) to a weaker model (Student) and then evaluated the pass@k curve of this “distilled” student.
Experiment Conclusion: The distilled model's pass@k curve was consistently higher than its Base Model's, meaning its capacity boundary was truly expanded. This comparative experiment is crucial as it successfully attributes the problem to the specific paradigm of RLVR, rather than to generalized “post-training” processes, significantly enhancing the rigor of the paper's argument.
Exploratory Experiment 4: Case Study (Figures 19 & 20)
Experiment Purpose: To provide concrete, perceptible evidence that the Base Model indeed possesses the ability to solve complex problems, rather than just stumbling upon solutions through “random guessing.”
Clever Design: The authors showcased complete, logically clear, and correct reasoning processes generated by the Base Model after multiple samplings, taken from the most difficult AIME24 competition problems.
Experiment Conclusion: These cases are convincing, showing that powerful, coherent reasoning abilities are indeed embedded within the Base Model. This makes the rise of the pass@k curve no longer an abstract statistical number, but one supported by real, complex reasoning capabilities.
Paper Title: Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?