A quick summary: when large language models learn mathematics, reinforcement learning doesn't teach them to solve problems by rote; rather, it teaches them to "gain insight" and to consult references. Providing the intermediate steps of a standard answer can sometimes be less effective than letting the model figure things out on its own. (Original paper title at the end of the article; published on arXiv on 05 Jun 2025 by the University of Wisconsin-Madison and Salesforce AI Research.)
First Phase: Identifying Core Concepts
Analysis of the Paper's Motivation
Currently, we all know that using Reinforcement Learning (RL) to "train" Large Language Models (LLMs) for mathematical reasoning yields excellent results, with models achieving astonishingly high scores on various math competition leaderboards. However, there's a "knowing how, but not why" problem here.
Therefore, the core motivation of this paper is to open the "black box" of how reinforcement learning enhances LLM reasoning capabilities, meticulously and measurably analyzing which specific capability dimensions RL has improved, and how these improvements occurred.
Analysis of the Paper's Main Contributions
• Proposing the SPARKLE Analysis Framework: This is the paper's most central contribution. It no longer settles for a single accuracy metric but "dissects" the model's reasoning process from three key dimensions:
• Plan-following and execution: Is the model good at planning its own problem-solving steps, or is it better at executing plans given by others?
• Knowledge utilization: Is the model's self-stored knowledge incomplete, or does it have knowledge (e.g., formulas, theorems) but doesn't know how to apply it?
• Problem decomposition: Can the model break down a complex large problem into a series of smaller problems and solve them one by one?
• Constructing the SPARKLE Benchmark Dataset: To operationalize the SPARKLE framework, the authors "enhanced" existing math problem datasets by manually annotating each problem with auxiliary information across the three dimensions mentioned: high-level solution plans, required background knowledge, and sequences of decomposed subproblems. This created a unique "proving ground" for fine-grained analysis.
• Proposing a Multi-stage RL Training Strategy: Based on the analysis of problem difficulty, the authors designed a more efficient RL training process. It first conducts general training on a large number of problems of varying difficulties (Stage 1), then carries out targeted intensive training on selected "challenging problems" (Stage 2), providing some "hints" (partial solution steps) to the model during training on these difficult problems to help it learn effectively from them.
• Key Technologies/Methods Supporting Innovation:
• SPARKLE Framework: By designing three different testing modes (with/without plans, with/without knowledge, solving full problems/solving subproblems) to isolate and evaluate the model's capabilities across the three dimensions.
• Multi-stage RL training: Utilizes the GRPO (Group Relative Policy Optimization) algorithm, combined with the idea of Curriculum Learning, progressing from easy to difficult, and "augmenting" difficult problems (providing partial solution plans as context).
• Significant Results and Implications: The most important results of this paper are some counter-intuitive yet highly insightful findings, which are more valuable than mere State-of-the-Art (SOTA) scores:
• "Well-intentioned but Counterproductive" External Plans: For a base model, providing it with a detailed solution plan can actually degrade its performance. This indicates that the model has its own "fixed mindset" or reasoning pattern, and forcing it to follow external logic can be counterproductive. In contrast, RL-trained models can better adapt to external plans and even benefit from them, demonstrating higher "flexibility." • The Core of RL is "Learning How to Learn": RL-trained models show significant performance improvement when given external knowledge (e.g., formulas). This suggests that RL not only makes models "remember" more knowledge but also teaches them a capability of "how to integrate and apply new information."
• "Ambitious but Incompetent" Subproblem Solving Ability: Even powerful RL models, while able to solve complex overall problems, see a sharp drop in success rate when asked to systematically solve all decomposed subproblems step-by-step. This reveals a bottleneck in current model reasoning capabilities: they may rely on an "intuitive" holistic reasoning rather than rigorous, step-by-step logical deduction.
Identifying Challenges in Understanding
• Core Concept: The design philosophy of the SPARKLE analysis framework is the key to understanding the entire paper. Readers need to understand why these three axes (planning, knowledge, decomposition) are critical, and how the authors designed experiments to independently evaluate each axis.
• Most Challenging Part: The most challenging part is not a complex mathematical formula, but understanding the logic behind the experimental design. For example, how to interpret the phenomenon where "providing a plan to the model actually decreases performance" and connect it to the role of RL. Additionally, the GRPO algorithm, as the core of the training, requires some understanding of its objective function.
• Core Concepts to Emphasize: We will focus on explaining the three analysis dimensions of the SPARKLE framework and illustrate them with a vivid analogy. At the same time, we will delve into the GRPO algorithm, as it is the "engine" driving the model's evolution behind the scenes.
Concept Dependencies
1. Starting Point: The best starting point is to explain why the SPARKLE framework is needed (i.e., the limitations of traditional accuracy evaluation).
2. Dependencies:
• Understanding the SPARKLE framework is necessary to grasp the meaning of various figures in the paper (e.g., Figure 3, 4, 5).
• The SPARKLE framework reveals specific weaknesses of the model (e.g., knowledge integration, handling difficult problems).
• These findings, in turn, inspired the design of the multi-stage RL training strategy, which aims to address these weaknesses in a targeted manner.
• The GRPO algorithm is the specific technical means to implement this training strategy. Therefore, our explanation order will be: SPARKLE Framework -> GRPO Algorithm -> Multi-stage Training Process.
Second Phase: In-depth Explanation of Core Concepts
Designing a Real-life Analogy
Imagine we are training an intern chef (Base LLM), with the goal of turning him into a Michelin-starred chef (RL-tuned LLM) capable of independently creating top-tier French haute cuisine (such as "Beef Wellington"). And we are the experienced culinary coach (RL training process).
Traditional evaluation methods are like merely tasting the final Beef Wellington and giving it a "delicious" or "not delicious" score (corresponding to accuracy). But as coaches, we want to know what areas the intern chef lacks in, so we can provide tailored instruction.
At this point, we introduce the SPARKLE Culinary Analysis Method, to "dissect" his cooking skills from three dimensions:
1. Planning and Execution Ability (Plan-following):
• Test A: Give him a very detailed recipe (external plan) and ask him to follow it strictly.
• Test B: Only tell him to make Beef Wellington, and let him use his own understanding and memory to perform (internal plan).
• Comparative Analysis: If he fumbles when following the recipe and the final product is worse, it indicates he is not yet adapted to or doesn't understand the recipe's logic, preferring his own immature process. If he can execute the recipe well, it means he has strong execution but lacks planning ability.
2. Knowledge Application Ability (Knowledge utilization):
• Place a "Dictionary of Culinary Terms" (external knowledge) nearby, explaining terms like "Maillard reaction," "puff pastry leavening principle," etc.
• Test: Observe whether he actively consults, understands, and applies this knowledge to improve his operations during cooking. For example, does he understand that high-heat searing is needed to lock in meat juices (Maillard reaction)?
• Analysis: If he has the book but the steak he makes is still tough, it indicates poor knowledge integration ability. If he can use the book well, it means he "knows how to learn."
3. Problem Decomposition Ability (Problem decomposition):
• We break down the complex dish of Beef Wellington into several independent subtasks (Subproblems): 1) prepare mushroom duxelles, 2) sear beef tenderloin, 3) roll out puff pastry, 4) wrap and bake.
• Test: Have him complete these four subtasks separately, and we taste and score each semi-finished product.
• Analysis: Perhaps his mushroom duxelles is perfect, and the steak is seared just right, but the puff pastry breaks during wrapping, or the baking temperature is wrong. This indicates that while his individual steps might be fine, his ability to seamlessly connect them and achieve the final goal is lacking.
Establishing Correspondence Between Analogy and Actual Technology
Table: Metaphorical Elements and Corresponding Technical Concepts

| Metaphorical Element | Actual Technical Concept | Reasonableness Explanation |
| --- | --- | --- |
| Intern Chef | Base Large Language Model (Base LLM) | Initial state, limited ability, requires training and guidance. |
| Michelin-Starred Chef | RL-tuned Model (RL-tuned LLM) | Significant ability improvement through extensive practice and feedback. |
| Culinary Coach | Reinforcement Learning (RL) Training Process | Guides model optimization through rewards (tasty) and penalties (bad taste). |
| Final Dish Taste | Accuracy of the Final Answer | The most direct, but also the crudest, evaluation metric. |
| Detailed Recipe | External Planning (Planning Skeleton) | Provides macro-level steps for problem-solving. |
| "Dictionary of Culinary Terms" | External Knowledge (Knowledge Components) | Provides theorems, formulas, and other background knowledge needed for problem-solving. |
| Step-by-step Production Tasks | Chain of Subproblems | Decomposes a complex problem into multiple independently solvable smaller problems. |
| Coach's Guidance Method | GRPO Algorithm | The specific, quantitative methodology the coach uses to guide the chef's progress. |
Delving into Technical Details: The GRPO Algorithm
Now, let's see how the "culinary coach" specifically guides the "intern chef." He uses the GRPO method. The coach has the chef make several attempts at a dish (generate multiple solutions for a math problem), then adjusts the teaching strategy based on the quality of these attempts.
Its core is to optimize the following objective function:
$$
\mathcal{L}(\theta) = \mathbb{E}\left[\min\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\,\hat{A}_{i,t},\;\operatorname{clip}\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)},\,1-\varepsilon,\,1+\varepsilon\right)\hat{A}_{i,t}\right)\right] - \beta\, D_{\mathrm{KL}}\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]
$$

where the KL penalty term $D_{\mathrm{KL}}\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]$ is estimated as $\mathbb{E}_{\tau \sim \pi_\theta}\left[\log \pi_\theta(a \mid s) - \log \pi_{\mathrm{ref}}(a \mid s)\right]$.
This formula looks intimidating, but substituting it with our analogy and natural language makes it clearer:
Symbol Replacement Version:
• Overall objective for model optimization = comprehensive average, over all problems and all attempts, of (the average improvement across all attempts for a single problem).
• Average improvement across all attempts for a single problem = for each step of each attempt, take the smaller of the two values ("new model's propensity" × "advantage of this step" and "propensity constrained within a small range" × "advantage of this step"), then subtract the penalty term that prevents model deviation.
Step-by-step Explanation:
• π_θ(...) / π_{θ_old}(...) (New model's propensity):
• Mathematical Meaning: The ratio of the probability of the new model π_θ generating a certain step to the probability of the old model π_{θ_old} generating that step.
• Chef Analogy: The coach observes that the intern chef in one attempt "added salt before oil." If this approach yielded excellent results, the coach wants the new generation of you (new model) to be more inclined to "add salt before oil" (probability ratio > 1).
• Â_{i,t} (Advantage of this step):
• Mathematical Meaning: Advantage estimate, measuring how much better taking a certain action (generating a word) in the current state is compared to the average. If a solution ultimately scores very high, each step it contains receives positive "advantage" credit.
• Chef Analogy: For a successful cooking session, the "high-heat searing" step is considered crucial. So, this "high-heat searing" operation receives a high advantage value. The coach will specifically praise and reinforce this behavior.
• clip(...) (Propensity constrained within a small range):
• Mathematical Meaning: Clips the probability ratio within the small interval [1-ε, 1+ε].
• Chef Analogy: While the coach encourages innovation, he is also afraid of the intern chef overreaching. If the chef suddenly jumps from French cuisine to molecular gastronomy in one attempt, even if the effect is amazing, the coach would say: "Very good, but let's not be so aggressive for now, take it slowly." This prevents the model from updating too quickly and leading to performance collapse. The purpose of min(...) is to adopt a conservative strategy; when you want to move forward significantly, the clip term pulls you back, allowing you to proceed more steadily.
• β * D_KL[...] (Penalty term to prevent model deviation):
• Mathematical Meaning: KL divergence, measuring the difference between the overall policy of the new model π_θ and a reliable reference model π_ref (usually the SFT model before training). The larger the difference, the greater the penalty.
• Chef Analogy: The coach allows the chef to develop their own style but not completely abandon the fundamental rules of French cuisine. This penalty term is like saying: "You can improvise, but your dish must still be recognizable as Beef Wellington; it cannot become something completely unrelated."
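Putting these pieces together, below is a minimal, hedged sketch of a GRPO-style update in PyTorch. It is an illustrative toy, not the paper's training code: the tensor names (`logp_new`, `logp_old`, `logp_ref`, `rewards`) and the hyperparameter values are assumptions, and a real implementation would also mask padding tokens and batch over many problems.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """Toy GRPO-style loss for one group of G sampled solutions.

    logp_new, logp_old, logp_ref: (G, T) per-token log-probs under the
    current, sampling, and reference policies (T = padded token length).
    rewards: (G,) scalar reward per solution (e.g., 1.0 if the final
    answer is correct, 0.0 otherwise).
    """
    # Group-relative advantage: compare each attempt to its group mates.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (G,)
    adv = adv.unsqueeze(1)                                      # broadcast over tokens

    # "New model's propensity": probability ratio between new and old policy.
    ratio = torch.exp(logp_new - logp_old)                      # (G, T)

    # Clipped surrogate: keep the more conservative of the two terms.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    policy_term = torch.min(unclipped, clipped).mean()

    # KL penalty keeps the policy close to the reference model.
    kl = (logp_new - logp_ref).mean()

    # Maximize the surrogate while penalizing deviation from the reference.
    return -(policy_term - beta * kl)
```

The group-relative advantage, computed by normalizing rewards within the group of sampled attempts, is what lets GRPO learn from relative comparisons without training a separate value network.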
Mapping Technical Details to the Analogy
• Technical Steps in the Analogy: The entire GRPO process is like the coach (RL algorithm) having the chef (LLM) make multiple attempts at a recipe (problem). Then, the coach tastes each finished product (calculates the reward) and identifies which steps were "strokes of genius."
• How the Analogy Helps Understand Technical Details: The analogy transforms abstract mathematical symbols, such as probability ratios, advantage functions, and KL divergence, into concrete, motivated behaviors, such as "encouraging good operations," "preventing deviation," and "maintaining fundamentals." This makes the design philosophy behind the algorithm intuitively understandable.
Summary
• Core Connection: The SPARKLE framework is like a precise diagnostic tool used to identify the intern chef's "skill gaps"; while the GRPO algorithm is the effective teaching method in the coach's hands, used to address these gaps and ultimately train the intern chef into a Michelin-starred chef.
• Summary of Key Mathematical Principles: The essence of GRPO lies in finding an optimal balance between encouraging exploration (based on Advantage) and maintaining stability (based on clip and KL divergence). It generates learning signals by comparing the quality of a set of attempts, which is more stable and efficient than single-sample learning.
Third Phase: Detailed Explanation of Process Steps
Process One: Model Capability Dissection Using the SPARKLE Framework
The goal of this process is to evaluate an existing LLM. First, input a problem from the SPARKLE benchmark dataset (including the problem itself, standard answer, plan, knowledge, subproblems) to the model.
Processing Flow:
• Benchmark Testing (without auxiliary information): Input only the problem description, allow the LLM to generate solution ideas and the final answer, obtaining the model's original problem-solving performance as a baseline.
• Axis 1 Evaluation: Plan-following and Execution Ability: Input problem description and the planning skeleton, allowing the LLM to solve the problem under guidance. Compare against the baseline; if performance improves, execution ability is strong; if it declines, external plans cause interference.
• Axis 2 Evaluation: Knowledge Utilization Ability: Input problem description and relevant knowledge points. Compare against the baseline; if performance significantly improves, the bottleneck is in knowledge; otherwise, it is in application ability.
• Axis 3 Evaluation: Problem Decomposition Ability: This is a sequential process, inputting subproblems one by one along with their preceding answers, allowing the model to solve them step by step. Finally, calculate the Subproblem Success Rate (SSR). Compare against the baseline; if SSR is much lower than the overall solution rate, it indicates the model is not adept at step-by-step logical reasoning.
Final Output: A detailed capability profile of the LLM across the planning, knowledge, and decomposition dimensions.
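To make this flow concrete, here is a rough Python sketch of how the four probing modes might be run; the helpers `generate` and `is_correct`, the field names on `item`, and the prompt format are hypothetical stand-ins rather than the paper's released evaluation code.

```python
def evaluate_sparkle(model, item, generate, is_correct, n_samples=8):
    """Probe one SPARKLE benchmark item under the four evaluation modes.

    `item` is assumed to carry: problem, answer, plan, knowledge, and a list
    of subproblems (each with its own reference answer). `generate` and
    `is_correct` are caller-supplied helpers.
    """
    def solve(prompt):
        # Fraction of sampled solutions with a correct final answer (avg@n).
        outputs = [generate(model, prompt) for _ in range(n_samples)]
        return sum(is_correct(o, item["answer"]) for o in outputs) / n_samples

    results = {}
    # Baseline: the problem statement alone.
    results["baseline"] = solve(item["problem"])
    # Axis 1 (plan-following): prepend the high-level planning skeleton.
    results["with_plan"] = solve(item["plan"] + "\n\n" + item["problem"])
    # Axis 2 (knowledge utilization): prepend the required formulas/theorems.
    results["with_knowledge"] = solve(item["knowledge"] + "\n\n" + item["problem"])

    # Axis 3 (decomposition): solve the subproblem chain step by step;
    # the chain counts as solved only if every subproblem is answered correctly.
    context, chain_ok = item["problem"], True
    for sub in item["subproblems"]:
        answer = generate(model, context + "\n" + sub["question"])
        chain_ok = chain_ok and is_correct(answer, sub["answer"])
        context += "\n" + sub["question"] + "\n" + answer  # carry earlier answers forward
    results["subproblem_chain"] = chain_ok  # averaged over items, this gives the SSR

    return results
```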
Process Two: Multi-stage RL Training Process
The goal of this process is to train a more powerful reasoning model.
• Input: A base LLM (e.g., Qwen-2.5-Math-7B), a large training set containing 40K math problems, and an augmented training set containing 5.7K challenging problems.
• Processing Flow: Divided into two stages.
• Stage One: General Capability RL Fine-tuning: Training is conducted on 40K general problems. For each problem, the model generates 8 solutions, which are scored by a reward function, and the model is updated using the GRPO algorithm. This stage aims to build strong fundamental reasoning capabilities, outputting the model SparkleRL-Stage 1.
• Stage Two: Challenging Problem RL Fine-tuning: The SparkleRL-Stage 1 model continues training on 5.7K challenging problems. These challenging problems are augmented, meaning that upon input, they are randomly accompanied by 0 to 4 solution "hint blocks." The training process is similar to Stage One, but a larger KL divergence penalty is used to prevent the model from "forgetting" general capabilities. This stage aims to specifically strengthen the ability to solve high-difficulty problems, outputting the final model SparkleRL-Stage 2-aug.
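As a data-level illustration of the Stage-2 "hint block" augmentation, here is a minimal sketch; the function name, the hint formatting, and the example problem are assumptions for illustration, not the paper's actual prompt template.

```python
import random

def augment_hard_problem(problem: str, plan_steps: list[str], max_hints: int = 4) -> str:
    """Prepend a random number of solution hint blocks to a hard problem.

    In Stage 2, each challenging problem is paired with 0 to max_hints leading
    steps of its solution plan, so the model sometimes trains with partial
    guidance and sometimes with none at all.
    """
    k = random.randint(0, min(max_hints, len(plan_steps)))
    if k == 0:
        return problem
    hints = "\n".join(f"Hint {i + 1}: {step}" for i, step in enumerate(plan_steps[:k]))
    return f"{hints}\n\n{problem}"

# Illustrative call with a made-up problem and four plan steps.
prompt = augment_hard_problem(
    "Find all primes p such that p^2 + 2 is also prime.",
    ["Consider p modulo 3.",
     "Handle the case p = 3 separately.",
     "Show that p^2 ≡ 1 (mod 3) whenever p ≠ 3.",
     "Conclude that p^2 + 2 is divisible by 3 for p ≠ 3."],
)
print(prompt)
```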
Fourth Phase: Experimental Design and Validation Analysis
Interpretation of Main Experiment Design: Validation of Core Arguments
Core Claim Validation: The paper's core claims are: 1) Their proposed multi-stage RL training is effective and significantly enhances the model's reasoning capabilities; 2) Specialized enhanced training for challenging problems (Stage 2-aug) yields additional performance improvements.
Experimental Design Analysis:
• Datasets: The authors selected AIME24, AMC23, MATH500, GSM8K, OlympiadBench. This selection is highly reasonable because these datasets cover a complete gradient of difficulty from elementary school to international Olympiads, which is crucial for validating hypotheses about "challenging problems."
• Evaluation Metric: avg@8. The model generates 8 answers per problem, and accuracy is averaged across those 8 attempts (not to be confused with pass@8, which counts a problem as solved if any one attempt is correct; see the metric sketch after this list). This measures the model's core reasoning ability more robustly than a single attempt (pass@1) and is a recognized standard in the field.
• Baseline Methods: The experiment set Qwen-2.5-Math-7B-Base (untrained) as an external baseline, and SparkleRL-Stage 1 (general training only) as an internal baseline. This design allows for a very clear isolation of the specific performance gains brought by each training stage.
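For clarity on the metric discussed above, the small sketch below contrasts avg@k (average correctness over k samples) with pass@k (credit if any sample is correct); the helpers are illustrative only, not the paper's evaluation code.

```python
def avg_at_k(correct_flags: list[bool]) -> float:
    """avg@k: mean correctness over the k sampled answers for one problem."""
    return sum(correct_flags) / len(correct_flags)

def pass_at_k(correct_flags: list[bool]) -> float:
    """pass@k: 1.0 if at least one of the k sampled answers is correct."""
    return float(any(correct_flags))

# 3 of 8 samples correct: avg@8 = 0.375, but pass@8 = 1.0.
flags = [True, False, False, True, False, True, False, False]
print(avg_at_k(flags), pass_at_k(flags))
```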
Results and Conclusion:
• The results in Table 1 clearly support the core claim. SparkleRL-Stage 1 shows significant improvement over the Base model across all datasets (average increased from 35.23% to 65.01%), demonstrating the effectiveness of general RL training.
• More critically, the SparkleRL-Stage 2-aug model achieved the best average performance (67.03%) among all models, especially an astonishing score of 50.42% on the most difficult AIME24. This directly proves the paper's second core claim: specialized training using challenging problems with partial solution hints can further unleash the model's performance potential.
Ablation Experiment Analysis: Contribution of Internal Components
The "ablation experiment" here is very clever; it is realized through the three analysis axes of the SPARKLE framework, which can be called "analytical ablation," i.e., by controlling input information to "ablate" the model's need in a certain capability dimension.
• Ablation Component 1: Autonomous Planning Ability (Figure 3)
• How to Ablate: By providing the model with a complete planning skeleton, thereby "removing" the need for the model to perform macro-level planning itself.
• Results and Proof: The experiment found that for the Base model, providing plans generally led to a decrease in performance. This demonstrates that autonomous planning is part of its inherent reasoning path, and external interference is detrimental. In contrast, the RL model's performance remained stable, indicating that RL-trained models have more flexible and powerful planning abilities, capable of accommodating and even utilizing external plans. This quantitatively proves RL's significant contribution to "planning flexibility."
• Ablation Component 2: Knowledge Retrieval Ability (Figure 4)
• How to Ablate: By providing the model with all necessary knowledge points for problem-solving, thereby "removing" the need for the model to recall or retrieve knowledge itself.
• Results and Proof: The Base model's performance still decreased (average -5.4%) after receiving knowledge, whereas the RL model's performance significantly improved (average +4.2%). This stark contrast powerfully demonstrates that one of RL training's key contributions is empowering the model with the ability to integrate and apply external knowledge, not merely to remember knowledge. This module (knowledge integration ability) is a unique and indispensable advantage of RL models.
• Ablation Component 3: Holistic Reasoning Ability vs. Step-by-step Reasoning (Figure 5)
• How to Ablate: By decomposing the problem into a chain of subproblems, forcing the model to solve them step by step, thereby "removing" its possibility of performing leapfrogging, holistic reasoning.
• Results and Proof: The success rate (SSR) of all models (including the strongest RL model) in solving all subproblems was significantly lower than their success rate in solving the original problem. This demonstrates that the model's success is not built on a perfect, decomposable logical chain. This reveals an important limitation of the model's capabilities, proving that the model's "high-level integrated reasoning" is an indispensable, yet currently mysterious, component.
In-depth/Innovative Experiment Analysis: Insights into the Intrinsic Characteristics of the Method
The Most Ingenious Experiment: Performance Gain Analysis by Difficulty Layer (Figure 6)
• Experiment Purpose: This experiment aims to answer a deeper question: For which problem difficulties are the two types of help—providing "planning" and "knowledge"—most effective? This can reveal the model's core bottlenecks at different challenge levels.
• Experiment Design: The authors divided the test set into 10 difficulty levels. Then, for each level, they calculated the change in performance (gain or loss in pass@1) when providing "planning" and "knowledge," respectively, compared to no assistance. This is like trying two different medicines on patients with different severities of illness and observing the efficacy of each.
• Experimental Conclusions and Value:
• Impact of Planning (Figure 6a): The help (or harm) from providing planning is largely unrelated to problem difficulty; the curve is relatively flat.
• Impact of Knowledge (Figure 6b): The help from providing knowledge increases sharply with problem difficulty. For problems at difficulty level 10, providing knowledge to the RL model can yield up to a 100% performance gain!
• Profound Insight: This result reveals a crucial intrinsic characteristic: For simple problems, the model might know everything; but for truly difficult problems, the model's bottleneck is not "not knowing how to do it (planning)," but rather "lacking necessary knowledge." This finding has strong guiding significance for future research directions. For example, for difficult problems, instead of optimizing the model's planning ability, it might be more effective to equip it with a powerful knowledge retrieval system (like RAG). This experiment is a brilliant stroke, elevating the paper's analysis from "what it is" to "why" and "what to do."
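As a rough illustration of how this difficulty-stratified analysis could be reproduced once per-problem pass@1 scores are available under each condition, here is a hedged sketch; the record format and field names are assumptions, not the paper's analysis code.

```python
from collections import defaultdict

def gain_by_difficulty(records, condition):
    """Average pass@1 gain of `condition` over the baseline, per difficulty level.

    Each record is assumed to look like:
      {"difficulty": 7, "baseline": 0.25, "with_plan": 0.20, "with_knowledge": 0.50}
    """
    buckets = defaultdict(list)
    for r in records:
        buckets[r["difficulty"]].append(r[condition] - r["baseline"])
    return {level: sum(gains) / len(gains) for level, gains in sorted(buckets.items())}

# Toy illustration: knowledge helps far more on the hardest problems.
records = [
    {"difficulty": 2,  "baseline": 0.90, "with_plan": 0.90, "with_knowledge": 0.90},
    {"difficulty": 10, "baseline": 0.25, "with_plan": 0.25, "with_knowledge": 0.50},
]
print(gain_by_difficulty(records, "with_knowledge"))  # {2: 0.0, 10: 0.25}
```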
Paper title: Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning