Microsoft Introduces rStar2-Agent: "Thinking Smarter" Proves Far More Effective and Efficient Than Simply "Thinking Longer"

Currently, large language models (LLMs) are making significant progress in complex reasoning tasks. A key trend is "test-time scaling," which involves having models generate longer Chains-of-Thought (CoT), essentially encouraging them to "think longer." Leading models such as OpenAI's o3 and DeepSeek-R1 have demonstrated the effectiveness of this approach.

However, "longer" doesn't always mean "smarter." For extremely complex problems that are prone to errors in intermediate steps or require creative shifts in thinking, lengthy CoTs often fall short. The internal self-reflection mechanisms that models rely on often struggle to identify their own fundamental errors.

So, can models learn to think smarter, much like humans, by utilizing external tools to aid thinking, validate ideas, and learn from tool feedback? This is the core idea behind Agentic Reinforcement Learning. It involves making the model an active agent that interacts with an external environment (like a Python interpreter) and adjusts its reasoning strategy based on environmental feedback.
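In concrete terms, such an agent alternates between writing reasoning text, emitting code for the environment to run, and reading the environment's feedback before continuing. The minimal Python sketch below illustrates that loop; the tag names and the model/tool interfaces are assumptions made for illustration, not the paper's actual API.

```python
# Minimal sketch of an agentic reasoning loop. The <tool_call>/<tool_result>/<answer>
# tags and the `model.generate` / `execute_python` interfaces are illustrative
# assumptions, not the interfaces used in the paper.
def agentic_rollout(model, execute_python, problem: str, max_turns: int = 6) -> str:
    """Alternate between model generation and Python-tool execution until the
    model emits a final answer or the turn budget runs out."""
    transcript = problem
    for _ in range(max_turns):
        step = model.generate(transcript)          # hypothetical generation call
        transcript += step
        if "<answer>" in step:                     # model committed to a final answer
            break
        if "<tool_call>" in step:                  # model asked to run code
            code = step.split("<tool_call>")[1].split("</tool_call>")[0]
            feedback = execute_python(code)        # environment feedback: output or traceback
            transcript += f"\n<tool_result>{feedback}</tool_result>\n"
    return transcript
```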


This paper from Microsoft Research is a significant achievement in this field. They successfully trained a pre-trained model with only 14 billion (14B) parameters into a "top expert" in mathematical reasoning through their innovative agentic reinforcement learning framework. Its performance rivals and even surpasses that of the 671 billion (671B) parameter DeepSeek-R1 model. Even more remarkably, this powerful capability was achieved with just one week of training on 64 GPUs and 510 RL steps, a true example of "achieving much with little effort."


Next, we will delve into how this research was accomplished and what makes it so exceptional.

Core Innovation One: GRPO-RoC Algorithm – Efficient Learning in Noisy Environments

Having a model use code tools for reasoning sounds promising, but the first obstacle encountered in practice is environmental noise. Imagine a student just starting to use a calculator to solve problems; they might press the wrong keys, and the calculator will show an error. Their attention is diverted from "solving the problem" to "figuring out how to use the calculator." The same applies to models: the code they generate might have syntax errors or logical flaws (like infinite loops), causing the Python environment to return error messages (Traceback) instead of useful results. These error feedbacks are irrelevant to problem-solving reasoning and constitute significant environmental noise.

In traditional Reinforcement Learning (RL), rewards are usually given only based on the correctness of the final answer (outcome-only reward). This leads to a serious problem: a reasoning trajectory, even if all intermediate tool calls are wrong, can receive a full reward if the final answer happens to be correct. This is equivalent to telling the model: "It doesn't matter if you make mistakes in the middle, as long as the result is correct." This can lead to the model generating many lengthy, low-quality, error-prone reasoning processes, resulting in inefficient learning.

How can the noise problem be solved without modifying the reward function, and thus without opening the door to reward hacking?

rStar2-Agent provides a concise and efficient answer: the GRPO-RoC (Group Relative Policy Optimization with Resample-on-Correct) algorithm. Its core is a strategy called "Resample-on-Correct" (RoC).


The RoC strategy works as follows:

  1. Oversample: For each problem, first generate 2G reasoning trajectories (Rollouts) using the current model, instead of the standard G trajectories.

  2. Classification and Asymmetric Downsampling:

    • These 2G trajectories are first split into positive samples (correct final answer) and negative samples (incorrect final answer).

    • For negative samples: downsample randomly and uniformly to half the quantity. The goal is to retain a variety of failure modes, so the model learns "what is wrong" and avoids repeating mistakes.

    • For positive samples: this is the key step. Instead of selecting at random, we prioritize successful trajectories of higher quality, measured by two penalties:

    • Tool Error Rate (p_err): the proportion of erroneous tool calls in a trajectory. More errors mean a higher penalty and lower quality.

    • Format Violation Rate (p_format): for example, reasoning (<reason>) appearing after the final answer (<answer>) is an invalid format. More severe violations mean a higher penalty.

    • The total penalty is p_total = p_err + p_format, and positive trajectories are downsampled with selection probability favoring low penalty scores (i.e., high quality). Successful trajectories with accurate tool use and standard formatting therefore have a higher chance of being selected to guide model updates (see the code sketch after this list).

  3. Policy Update: Finally, we use the downsampled G trajectories (containing high-quality positive samples and diverse negative samples) to calculate the Advantage function and update the model.
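To make the downsampling in steps 2 and 3 concrete, here is a minimal Python sketch of the RoC selection. The `Trajectory` fields, the seed, and the exact weighting scheme (sampling positives with probability inversely proportional to their total penalty) are assumptions made for illustration; the paper only specifies that negatives are downsampled uniformly and that positives with lower p_total are preferentially retained.

```python
import random
from dataclasses import dataclass

@dataclass
class Trajectory:
    tokens: list
    answer_correct: bool   # did the final answer match the ground truth?
    p_err: float           # fraction of tool calls that returned errors
    p_format: float        # penalty for format violations

def weighted_sample_without_replacement(items, weights, k, rng):
    """Draw k distinct items, each draw proportional to the remaining weights."""
    items, weights, chosen = list(items), list(weights), []
    for _ in range(k):
        total = sum(weights)
        r, acc = rng.uniform(0, total), 0.0
        for i, w in enumerate(weights):
            acc += w
            if r <= acc:
                chosen.append(items.pop(i))
                weights.pop(i)
                break
    return chosen

def roc_downsample(rollouts, G, rng=None):
    """Reduce 2G oversampled rollouts to G: diverse negatives, high-quality positives."""
    rng = rng or random.Random(0)
    positives = [t for t in rollouts if t.answer_correct]
    negatives = [t for t in rollouts if not t.answer_correct]

    # Negatives: uniform random downsampling to half, preserving varied failure modes.
    kept_neg = rng.sample(negatives, min(len(negatives) // 2, G))

    # Positives: fill the remaining slots, favoring low total penalty (high quality).
    n_pos = min(G - len(kept_neg), len(positives))
    weights = [1.0 / (1e-6 + t.p_err + t.p_format) for t in positives]
    kept_pos = weighted_sample_without_replacement(positives, weights, n_pos, rng)

    # The G retained trajectories then feed the usual group-relative advantage update.
    return kept_pos + kept_neg
```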


The brilliance of this algorithm is that it does not change the simple and reliable reward principle of "only reward if the final answer is correct." Instead, by filtering at the data level, it cleverly "feeds" the model more high-quality positive examples and diverse negative examples. This is akin to a teacher grading essays: not only looking at the final score, but also highlighting well-written, fluent passages from high-scoring essays for special praise, while collecting and explaining various typical mistakes. In this way, the student (the model) learns far more efficiently how to produce high-quality reasoning.


As the figure in the paper shows, under the original GRPO algorithm, the tool error rate in successful trajectories stabilized at a relatively high level (10%-15%), meaning the model consistently generated a lot of noise. After adopting GRPO-RoC, the tool error rate was continuously and significantly reduced, demonstrating the effectiveness of this strategy.

Core Innovation Two: Large-scale Agentic RL Infrastructure – Supporting Efficient Training

Good algorithms require robust infrastructure to support them. Agentic RL training is extremely costly because it involves frequent interaction between the model and the environment. The paper reveals two major engineering challenges:

  1. Massive Concurrent Tool Calls: A single training iteration (step) might generate tens of thousands of Python code execution requests. If run directly on a local Python interpreter, this would instantly overwhelm the CPU, leaving GPUs idle, causing immense resource waste and efficiency bottlenecks. More dangerously, model-generated code is uncontrollable and may contain infinite loops, malicious code, or unkillable processes, threatening the stability of the training host.
  2. Highly Imbalanced Multi-round Rollouts: In agentic RL, a complete reasoning process consists of multiple rounds of dialogue (model output -> tool execution -> model output again...). The difficulty of each problem varies, and the number of tokens generated and tool calls in each round are highly imbalanced. If tasks are statically allocated evenly to all GPUs, as in traditional RL, some GPUs will inevitably finish early and remain idle for extended periods, waiting for "slow" GPUs, leading to severe load imbalance and synchronization delays.

To address Challenge One, rStar2-Agent built a high-throughput, isolated code environment service.


The design of this service is highly sophisticated. It is isolated from the main training process and deployed independently on the CPUs of the computing cluster. A central task queue receives all code execution requests, which are then batched by multiple "sending workers" and distributed to numerous "execution workers" for actual running. The execution results are then returned to the sending workers and finally transmitted back to the RL process. This architecture acts like an efficient "code execution cloud service," specifically designed to handle massive, untrusted code tasks, ensuring the stability and high throughput of the main training flow.
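As a rough single-machine analogue (the real service is distributed across the cluster's CPU nodes with dedicated sending and execution workers), the sketch below captures the core idea: every snippet runs in its own short-lived interpreter process with a hard timeout, and a worker pool fans batches of requests out in parallel. The worker count and timeout values are illustrative assumptions.

```python
import subprocess
import sys
from concurrent.futures import ProcessPoolExecutor

def execute_untrusted(code: str, timeout: float = 5.0) -> dict:
    """Run one code snippet in a fresh interpreter process, so crashes, infinite
    loops, and hostile code cannot touch the training process."""
    try:
        proc = subprocess.run([sys.executable, "-c", code],
                              capture_output=True, text=True, timeout=timeout)
        return {"ok": proc.returncode == 0,
                "output": proc.stdout if proc.returncode == 0 else proc.stderr}
    except subprocess.TimeoutExpired:
        return {"ok": False, "output": "killed: exceeded time limit"}

def execute_batch(snippets, max_workers: int = 32):
    """Central dispatcher: fan a batch of execution requests out to a worker pool."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(execute_untrusted, snippets))

if __name__ == "__main__":
    print(execute_batch(["print(1 + 1)", "while True: pass"]))
```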


Performance data shows that this environment service can process over 45,000 tool calls per second, with an average latency as low as 0.3 seconds, perfectly meeting the demands of large-scale training.

To address Challenge Two, rStar2-Agent designed a dynamically load-balanced Rollout scheduler.


Traditional static allocation suffers from exactly these problems. rStar2-Agent's scheduler takes a smarter approach: rather than statically assigning a fixed number of tasks to each GPU, it monitors the remaining KV-cache capacity on each GPU in real time. The KV cache can be understood as the memory a GPU reserves for generating text. The scheduler estimates how many new generation tasks each GPU can safely handle and dynamically assigns tasks to it from the waiting queue. This keeps every GPU "busy but not overwhelmed," maximizing the utilization of computing resources and avoiding both idle waiting and the waste caused by KV-cache overflow.
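The dispatching idea can be pictured with the toy sketch below. The per-request token estimate and the free-KV-token bookkeeping are simplified assumptions; the real scheduler reads actual cache occupancy from the inference engine.

```python
from collections import deque

def dispatch_rollouts(pending: deque, gpus: list, est_tokens_per_request: int = 4096):
    """Assign waiting rollout requests to whichever GPU currently has enough free
    KV-cache budget, instead of splitting the batch evenly up front."""
    assignments = {gpu["id"]: [] for gpu in gpus}
    progress = True
    while pending and progress:
        progress = False
        for gpu in gpus:
            if pending and gpu["free_kv_tokens"] >= est_tokens_per_request:
                assignments[gpu["id"]].append(pending.popleft())
                gpu["free_kv_tokens"] -= est_tokens_per_request  # reserve capacity
                progress = True
    return assignments  # anything left in `pending` waits until capacity frees up

# Example: two GPUs with different amounts of free KV cache share ten requests.
gpus = [{"id": 0, "free_kv_tokens": 20_000}, {"id": 1, "free_kv_tokens": 8_000}]
print(dispatch_rollouts(deque(range(10)), gpus))
```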

Core Innovation Three: Efficient Training – Forging a Superbrain at Low Cost

With algorithms and infrastructure in place, the final step is to design the training workflow to achieve the best performance with minimal cost. rStar2-Agent's training recipe is also unique and differs significantly from mainstream methods.

Step One: "Non-Reasoning" Supervised Fine-Tuning (Non-Reasoning SFT)

Typically, before RL, models are fine-tuned on data containing detailed reasoning chains ("reasoning SFT"), which is like handing a student a problem set with fully worked solutions to imitate. rStar2-Agent takes the opposite approach and performs only "Non-Reasoning SFT." The purpose is not to teach the model how to reason, but to teach it how to follow instructions, how to use the tool interface (JSON-formatted calls), and how to output answers in a standardized way (<reason>, <answer>, \boxed{}; an illustrative example of this format appears after the list below). The SFT data therefore consists mainly of tool-call, instruction-following, and dialogue examples, with mathematical reasoning data almost entirely excluded. Benefits:

  1. It prevents the model from "overfitting" to certain fixed reasoning patterns during the SFT stage, preserving room for subsequent RL to explore better solutions.
  2. After this SFT, the model's initial response length is very short (~1K tokens), laying the groundwork for efficient RL training with shorter context lengths.
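For intuition, here is a hypothetical example of the kind of output format this stage instills. Only the <reason>, <answer>, and \boxed{} conventions come from the article; the JSON tool-call schema and the wording are assumptions for illustration.

```python
# Hypothetical illustration of the response format taught by non-reasoning SFT.
sft_style_target = (
    "<reason>The user wants the number of vowels in 'reinforcement'; "
    "a quick script settles it.</reason>\n"
    '{"tool": "python", "code": "print(sum(c in \'aeiou\' for c in \'reinforcement\'))"}\n'
    "<answer>The word contains \\boxed{5} vowels.</answer>"
)
print(sft_style_target)
```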


As the paper's table shows, after "Non-Reasoning SFT" the model significantly improved its capabilities in tool usage (BFCL), instruction following (IFEval), and dialogue (Arena-Hard), while mathematical reasoning abilities (MATH-500, AIME) showed little change compared to the base model, confirming that the objectives of this stage were met.

Step Two: Multi-stage Agentic RL Training

Next, reinforcement learning is conducted using the GRPO-RoC algorithm and infrastructure described earlier. The entire process is divided into three stages, like levels in a game:

  • Stage 1 (concise RL, 8K length): Training on all 42K math problems, but limiting the maximum response length to 8K tokens. Although more than 10% of trajectories were truncated due to length in the early stages, this forced the model to use tools more efficiently and precisely for reasoning within a limited "scope," rather than aimlessly "trial and error." The model quickly adapted, with response lengths stabilizing around 4K, and performance significantly improved.
  • Stage 2 (12K length): When the model's performance stabilized under the 8K limit, the length cap was raised to 12K, giving the model more space to handle more complex problems. The average response length increased to 6K, and performance climbed further.
  • Stage 3 (focus on difficult samples, 12K length): At this point, the model could already solve many simple problems with 100% accuracy. To keep improving, the dataset was filtered down to the difficult problems the model still struggled with (approximately 17.3K), and training continued on these challenging samples alone (see the sketch after this list). The average response length increased to 8K, ultimately pushing the model to its performance peak.
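The Stage 3 data selection can be pictured with the small sketch below. The rollout count and the strict "drop anything solved every time" criterion are assumptions for illustration; the article only states that roughly 17.3K still-difficult problems were retained.

```python
def filter_hard_problems(problems, policy_pass_rate, n_rollouts: int = 8):
    """Keep only problems the current policy does not yet solve reliably.
    `policy_pass_rate(problem, n)` is a hypothetical helper that runs n rollouts
    and returns the fraction answered correctly."""
    hard = []
    for problem in problems:
        if policy_pass_rate(problem, n_rollouts) < 1.0:  # perfectly solved items add no signal
            hard.append(problem)
    return hard
```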


A table in the paper compares rStar2-Agent's training recipe with those of other mainstream models. Its most prominent features are: no reasoning SFT, very few total RL steps (510), and extremely short training lengths (8K -> 12K). This stands in stark contrast to other methods that require tens of thousands of steps and 16K+ training lengths, making its efficiency advantage immediately apparent.

Experimental Results and Performance – Comprehensive Leadership, Strong Generalization

After the efficient training described above, the rStar2-Agent-14B model demonstrated extremely powerful performance.

Mathematical Reasoning, Surpassing Giants


On the most challenging math competition benchmarks, AIME 2024 and 2025, rStar2-Agent-14B achieved average pass@1 rates of 80.6% and 69.8%, respectively, outperforming heavyweights such as OpenAI o3-mini (medium), DeepSeek-R1 (671B), and Claude Opus 4.0. This not only proves the effectiveness of agentic RL but also sets a striking precedent for smaller models surpassing much larger ones.


Even more striking, these gains were not achieved by brute force (simply generating longer text). As shown in Table 4 of the paper, rStar2-Agent-14B's average response length (~9K-10K tokens) is significantly lower than that of the comparison models (~14K-17K tokens). This means it learned to use tools more intelligently and precisely, accomplishing harder tasks with fewer "words."

Powerful Generalization, One Solution Fits All

The most compelling evidence is its strong generalization capability. rStar2-Agent was only trained with RL on mathematical data, yet it performed astonishingly well in tests in other domains.


  • Scientific Reasoning (GPQA-Diamond): Accuracy jumped from 42.1% after SFT to 60.9%, even surpassing the specially trained DeepSeek-V3 (59.1%). This indicates that reasoning patterns learned from mathematics can transfer to general scientific reasoning.
  • Tool Usage (BFCL v3) and Alignment (IFEval, Arena-Hard): On these non-reasoning tasks, performance remained roughly at the post-SFT level, indicating that mathematical RL training did not harm the model's existing other capabilities.

Deep Analysis: How Does the Agent "Think Smarter"?

To investigate the intrinsic mechanism by which the model became "smarter," the paper conducted an analysis from the perspective of token entropy. Higher entropy indicates greater uncertainty and more choices for the model when generating that token, typically occurring at critical moments of decision-making and reflection.
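Concretely, token entropy is the Shannon entropy of the model's next-token distribution at each generation step; a minimal PyTorch sketch (tensor shapes are assumptions) looks like this:

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """logits: [sequence_length, vocab_size] -> entropy in nats for each token position."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1)   # H = -sum_v p_v * log p_v

# High-entropy positions are where the model weighs many plausible continuations,
# e.g. right after tool feedback arrives.
print(token_entropy(torch.randn(5, 32000)))
```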

Researchers discovered two key high-entropy patterns:

  1. Forking Tokens: These high-entropy tokens usually appear when the model is self-reflecting, posing questions, or planning verification, such as: "But before...", "Let me double-check...", "rerun...". This pattern is also common in traditional CoT RL; it drives the model to explore and avoid following a single path blindly.
  2. Reflection Tokens: This is unique to Agentic RL! When the model receives feedback from the code environment (whether successful output or an error message), it generates a series of high-entropy tokens to analyze, interpret, and respond to this feedback.
    • An example of successful execution: After seeing the tool's return, the model generates high-entropy tokens to plan how to verify it ("To verify"), demonstrating careful consideration.


    • A more impressive example of error handling: After the model's code execution fails, it doesn't give up or make wild guesses. Instead, it generates many high-entropy tokens to analyze the cause of the error ("The error occurred because..."), devise solutions ("an easier workaround is to...", "Alternatively"), and finally generate corrected code. This closely resembles a programmer debugging, showcasing advanced cognitive abilities.


The conclusion is: Agentic RL not only retains the self-reflection capabilities found in traditional CoT but, more importantly, it adds the ability to deeply reflect on environmental feedback and adjust behavior. This is precisely why it is inherently "smarter" than simply a "longer chain of thought."

Discussion

The paper also candidly shares some failed attempts, and these experiences are equally valuable:

  • Overlong Filtering: Directly discarding trajectories truncated due to excessive length (without negative rewards) was initially intended to avoid penalizing trajectories that were merely long but correctly reasoned. However, it was found that this instead led the model to produce lengthy, repetitive text more frequently due to the lack of negative feedback signals. Ultimately, retaining truncated trajectories and giving negative rewards proved more effective.
  • N-gram Repetition Detection: Attempts to filter out successful trajectories containing repetitive n-grams using rules to improve quality often erroneously harmed legitimate, similar tool calls made for verification purposes. This suggests that overly complex and fine-grained rule-based reward or filtering mechanisms might be more detrimental than beneficial in LLM RL.

These lessons further confirm the superiority of their simple reward design (relying only on the correctness of the final answer) and the RoC data-level filtering strategy: reducing bias, maintaining exploration, and achieving robust learning.
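Taken together, the reward design stays deliberately minimal. A hedged sketch is below; the reward magnitudes are assumptions, as the article only states that the final answer's correctness is rewarded and that truncated trajectories receive a negative reward rather than being discarded.

```python
# Outcome-only reward with truncation treated as a failure (values are illustrative).
def outcome_reward(answer_correct: bool, truncated: bool) -> float:
    if truncated:
        return -1.0                            # keep overlong trajectories but penalize them
    return 1.0 if answer_correct else -1.0     # otherwise, only the final answer matters
```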

Furthermore, experiments revealed an RL performance ceiling. In the later stages of training, continuing to train after performance peaked led to collapse, with various tuning methods proving ineffective. This indicates that RL primarily serves to activate the intrinsic potential a model has already acquired during pre-training, rather than bestowing new capabilities beyond its inherent capacity. Therefore, efficiently reaching this ceiling with minimal RL computational cost becomes crucial. rStar2-Agent successfully accomplished this.

Conclusion

The work on rStar2-Agent is an outstanding achievement integrating algorithmic innovation, system engineering, and training techniques. Its core contributions are:

  1. GRPO-RoC Algorithm: Cleverly overcomes the noise problem in code environments through the "Resample-on-Correct" strategy, guiding the model to produce high-quality reasoning while maintaining a simple reward system.
  2. High-performance Infrastructure: Built a training system capable of supporting massive concurrent tool calls and dynamic load balancing, making large-scale Agentic RL feasible and efficient.
  3. Efficient Training Recipe: The combination of "Non-Reasoning SFT" and multi-stage RL achieved top-tier mathematical reasoning performance for a small model at minimal computational cost (510 steps, one week on 64 GPUs).

This research strongly demonstrates that the agentic path of "thinking smarter" is far more effective and efficient than simply "thinking longer." It sets a precedent for smaller models surpassing larger ones and provides the AI community with valuable algorithms, systems, and insights. Its code and recipes have been open-sourced, which should drive further exploration of efficient, intelligent reasoning models across the field. Looking ahead, expanding this paradigm beyond mathematics to broader reasoning domains and tool-use scenarios holds immense promise.
