Comprehensive Summary: Reinforcement Learning Implementation Paths for Reasoning Models



Source | Zhihu

Author | zss123

Reasoning models are emerging rapidly and have recently attracted a great deal of research. This article summarizes the reinforcement learning implementation paths of recent reasoning models, offering a reference for work in related fields.

Core Reinforcement Learning Training Methodology

This chapter examines the foundational aspects of reinforcement learning training: methodologies that recur across applications, even in scenarios that do not explicitly involve external tools. It is worth noting, however, that many modern large language model reinforcement learning applications inherently involve some form of “tool,” such as a code execution environment, even if it is not an external Application Programming Interface (API).

Reinforcement Learning Data Management: The Unsung Hero

Data plays a crucial role in any machine learning paradigm, and reinforcement learning is no exception. High-quality, highly relevant data is the cornerstone for training high-performing agents.

1. Data Selection Strategies: Beyond Quantity to Quality and Relevance

The selection of reinforcement learning training data increasingly emphasizes quality and relevance over sheer quantity. Research indicates that it is vital to obtain data from domains that are diverse yet closely related to the target task. For example, in mathematical reasoning tasks, researchers tend to use datasets such as OpenThoughts, NuminaMATH, MATH, and DeepScaleR. For broader question answering tasks, Natural Questions (NQ), TriviaQA, HotpotQA, and SQuAD are common choices. In specialized tasks like CUDA kernel generation, specific datasets like KernelBench are adopted.

Selecting verifiable questions or tasks is a critical strategy, which greatly facilitates the subsequent definition and calculation of reward functions. Furthermore, balancing the difficulty distribution and diversity of datasets is also emphasized. For instance, the TORL framework uses LIMR technology to extract high-quality samples with a balanced difficulty distribution, while DeepResearcher focuses on multi-hop reasoning scenarios by adjusting the proportions of different datasets.

2. Data Cleaning and Filtering: Ensuring Signal Purity

To ensure that the signals fed to reinforcement learning algorithms are pure and effective, data cleaning and filtering are indispensable steps.

Rigorous validation processes are commonly used, often involving dual verification by human experts and powerful pre-trained models (e.g., Deepseek-R1 used in ReTool) to filter out invalid or low-quality data. The TORL framework filters out proof-based questions and those with ambiguous validation criteria. DeepResearcher filters out questions that are time-sensitive, highly subjective, or potentially harmful.

Preventing models from relying on memorized information rather than learning expected skills is a core challenge. DeepResearcher implements a “contamination detection” mechanism, which excludes questions that the base model can answer without search tools, ensuring the agent learns skills like searching rather than exploiting data leakage. This strategy effectively compels the model to learn to use tools or engage in deeper reasoning.
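
To illustrate, a minimal sketch of such a contamination check is shown below. It assumes a hypothetical generate() helper that samples tool-free answers from the base model and an is_correct() verifier; neither name comes from the DeepResearcher paper.

```python
def contamination_filter(questions, generate, is_correct, n_samples=4):
    """Keep only questions the tool-free base model fails to answer.

    A sketch in the spirit of DeepResearcher's contamination detection:
    if the base model can already answer a question without search, the
    question is likely answerable from memorized knowledge and is dropped.
    """
    kept = []
    for q in questions:
        # Sample several tool-free answers from the base model.
        answers = [generate(q["prompt"], use_tools=False) for _ in range(n_samples)]
        # Exclude the question if any sample is already correct.
        if not any(is_correct(a, q["gold_answer"]) for a in answers):
            kept.append(q)
    return kept
```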

Format standardization and validation are crucial for the efficiency and stability of subsequent reinforcement learning processes. For example, ReTool performs format validation on its code integration data to ensure that calculation tool call triggers can be efficiently detected.

3. Data Augmentation and Preparation for Reinforcement Learning Trajectories

In addition to selecting and cleaning existing data, augmenting and preparing data in specific formats for reinforcement learning needs are common practices.

For “cold start” scenarios, such as tool integration tasks, data is often augmented based on existing text-based reasoning data. The ReTool framework uses structured prompt templates to automatically convert text-based reasoning data (D_init) into code-integrated reasoning data (D_CI), where manual computation steps are replaced with corresponding code snippets and their interpreter execution results.

To simplify the calculation of reward functions, answer formats are sometimes transformed. For example, the DAPO-Math-17K dataset converts mathematical problem answers into integer form, simplifying rule-based reward calculation and minimizing errors that formula parsers might introduce. This pragmatic approach makes complex reasoning tasks easier to apply with reinforcement learning.
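
A minimal sketch of the kind of rule-based check this enables is shown below; it assumes answers have already been converted to integers, and the answer-extraction regex is purely illustrative rather than taken from DAPO.

```python
import re

def integer_answer_reward(response: str, gold: int) -> float:
    """Return 1.0 if the last integer in the response equals the gold answer, else 0.0.

    A simplified stand-in for rule-based scoring on integer-form answers; real
    pipelines typically parse a dedicated answer field rather than the raw text.
    """
    matches = re.findall(r"-?\d+", response)
    if not matches:
        return 0.0
    return 1.0 if int(matches[-1]) == gold else 0.0
```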

The meticulous work of data filtering and preparation goes far beyond simple data preprocessing. These steps actually constitute an implicit shaping of the learning environment. By carefully selecting, for instance, verifiable questions, discarding ambiguous content, or transforming data formats to simplify the identification of correct results (e.g., converting answers to integers, or generating code-integrated data), researchers are guiding the agent towards desired behavior patterns even before the reward function comes into play. Ensuring data is “verifiable” means the reward mechanism can be more reliable; converting answers to integers simplifies the reward mechanism, reducing potential noise or complexity in the learning signal. This suggests that “reinforcement learning data engineering” is becoming a highly specialized field, where data preparation is no longer just a preliminary step but an indispensable component of reinforcement learning design, subtly influencing policy learning by pre-adjusting the learning environment.

At the same time, data strategies also reflect an active avoidance of the model’s “learning shortcuts” problem. For example, DeepResearcher’s contamination detection mechanism (filtering out questions that the base model can answer without tools) and the focus on verifiable, unambiguous questions both reflect a proactive strategy. Researchers foresee that large language models, as powerful pattern matchers, will exploit any “shortcuts” if the data allows. If a model can find an answer directly from its parameterized knowledge, it might not learn to use tools. If data is not filtered for such “shortcuts,” the reinforcement learning agent might maximize rewards by simply recalling information or exploiting dataset biases, rather than learning the expected complex skills (e.g., multi-hop reasoning, tool use). This leads to poor generalization on tasks that truly require those skills. This highlights a fundamental challenge in large language model reinforcement learning: ensuring the agent learns the process, not just mimicking surface correlations in the data. Data management is the first line of defense against this challenge.

Table 1: Overview of Reinforcement Learning Training Data Strategies

Reinforcement Learning Algorithm Implementation Details: The Learning Engine

Reinforcement learning algorithms are at the core of driving agent learning. In recent years, researchers have made numerous improvements and innovations to classic algorithms, tailored to the characteristics of large language models.

1. Mainstream Algorithms: PPO and Its Variants

Proximal Policy Optimization (PPO) is currently one of the most widely applied algorithms in the field of large language model reinforcement learning. It is used as a foundational algorithm by many frameworks. PPO's objective function (e.g., given by Equation 1 in ReTool's research) aims to optimize the policy model while limiting the divergence between the new and old policies by clipping importance sampling weights or adding a KL divergence penalty, thereby improving training stability.
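
For reference, the clipped surrogate objective that these frameworks build on can be written as follows (standard PPO notation; individual papers may add a KL penalty term or use sequence-level variants):

```latex
J_{\mathrm{PPO}}(\theta) =
\mathbb{E}_t\left[
  \min\left(
    r_t(\theta)\,\hat{A}_t,\;
    \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t
  \right)
\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```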

Group Relative Policy Optimization (GRPO) is a popular variant of PPO, which typically estimates the advantage function by normalizing rewards for multiple responses generated from the same prompt, thus avoiding the training of a separate value network (critic). The DAPO algorithm also uses naive GRPO as a baseline for comparison. This method can reduce computational overhead, especially for large models.
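
The group-relative advantage that replaces the learned critic is commonly computed as below, where R_i is the reward of the i-th of G responses sampled for the same prompt (a common formulation; the standard-deviation term in the denominator is precisely what Dr. GRPO, discussed below, removes):

```latex
\hat{A}_i =
\frac{R_i - \mathrm{mean}\!\left(\{R_1, \dots, R_G\}\right)}
     {\mathrm{std}\!\left(\{R_1, \dots, R_G\}\right)},
\qquad i = 1, \dots, G
```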

Beyond PPO and GRPO, a series of specialized variants have emerged, designed for specific problems or to improve particular performance:

• DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) introduces a “Clip-Higher” mechanism to promote exploration, filters out uninformative prompts through “dynamic sampling” (a minimal sketch of this filter appears after this list), employs token-level policy gradients, and designs an “overlength reward adjustment” mechanism.

• VAPO (Value-model-based Augmented PPO) adds various techniques to PPO, such as length-adaptive Generalized Advantage Estimation (GAE), token-level policy gradient loss, value pretraining, decoupled GAE, Clip-Higher, and positive sample language model loss.

• Dr. GRPO (GRPO Done Right) is an improvement on GRPO, aiming to eliminate response-level length bias and problem-level difficulty bias by removing the normalization term in advantage calculation, thereby reverting to the standard PPO objective using Monte Carlo return estimation of advantage.

• StarPO (State-Thinking-Actions-Reward Policy Optimization) is a general trajectory-level agent reinforcement learning framework that supports PPO and GRPO, and proposes a more stable variant, StarPO-S.
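
As referenced in the DAPO bullet above, dynamic sampling can be sketched as a simple batch filter; the field names below are illustrative, not from the DAPO codebase.

```python
def dynamic_sampling_filter(prompt_groups):
    """Keep only prompts whose rollout group has a non-degenerate accuracy.

    In the spirit of DAPO's dynamic sampling: prompts for which every rollout
    is correct (accuracy 1.0) or every rollout is wrong (accuracy 0.0) yield
    zero group-relative advantage, so they are filtered out of the batch.
    """
    kept = []
    for group in prompt_groups:  # e.g. {"prompt": ..., "correct": [True, False, ...]}
        accuracy = sum(group["correct"]) / len(group["correct"])
        if 0.0 < accuracy < 1.0:
            kept.append(group)
    return kept
```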

2. Key Algorithmic Improvements and Techniques

To better apply reinforcement learning to large language models, researchers have introduced several key techniques based on core algorithms:

• Advantage Estimation: PPO typically uses Generalized Advantage Estimation (GAE). VAPO introduces length-adaptive GAE and decoupled GAE. GRPO and Dr. GRPO use group-based or Monte Carlo return estimation methods.

• Clipping Strategies: PPO’s clipping mechanism is crucial for maintaining training stability. DAPO and VAPO enhance this with “Clip-Higher” technology, which decouples the upper and lower clipping bounds of the importance sampling ratio, allowing for larger probability increases for low-probability tokens, thereby encouraging exploration. RAGEN’s StarPO-S also adopts a similar decoupled clipping strategy.

• Value Function Handling: Although GRPO typically omits a learned value function, PPO-based methods like VAPO invest resources in robust value model training, including the use of value pretraining to mitigate initialization bias. StarPO-S also reintroduces a critic-based baseline to improve stability.

• Token-level vs. Sample-level Loss: DAPO and VAPO advocate for the use of token-level policy gradient loss. This approach assigns uniform weight to every token in a training batch, addressing the issue that, under sample-level loss, tokens in longer sequences contribute less to the loss, and preventing undesired patterns (like meaningless content or repetition) in long samples from having a disproportionately low impact on the loss. A minimal sketch combining token-level aggregation with Clip-Higher clipping appears after this list.

• Exploration Enhancements: To encourage models to explore a wider policy space, researchers employ various strategies, such as omitting KL loss or setting a higher training temperature in TORL, using Clip-Higher in DAPO and VAPO, and removing the KL term in StarPO-S.
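
As referenced in the token-level loss bullet above, the following sketch combines token-level aggregation with decoupled Clip-Higher bounds. It is a simplified illustration, not the DAPO/VAPO implementation: per-token log-probabilities and advantages are assumed to be precomputed tensors, and the mask marks response tokens.

```python
import torch

def token_level_clip_higher_loss(logp_new, logp_old, advantages, mask,
                                 eps_low=0.2, eps_high=0.28):
    """Token-level PPO-style loss with decoupled (Clip-Higher) clipping bounds.

    logp_new / logp_old: per-token log-probabilities, shape [batch, seq_len].
    advantages: per-token advantages, same shape (or broadcastable).
    mask: 1.0 for response tokens, 0.0 for prompt/padding/external tokens.
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = -torch.min(ratio * advantages, clipped * advantages)
    # Token-level aggregation: every response token in the batch gets equal
    # weight, rather than averaging per sample first.
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```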

3. Reward Function Design: Guiding the Agent

The reward function is the core mechanism in reinforcement learning that guides agent behavior. Its design directly impacts learning efficiency and ultimate performance.

• Outcome-Based Rewards: A commonly adopted approach is to use simple, rule-based accuracy rewards. For example, in tasks with verifiable answers (like math problems), if the final predicted answer is equivalent to the true answer, the reward is +1; otherwise, it is -1 or 0.

• Combined Rewards: The DeepRetrieval framework employs a composite reward function, which consists of task-specific retrieval performance rewards (r_retrieval, e.g., Recall@K for literature search, NDCG@K for classic information retrieval, or SQL execution accuracy) and format adherence rewards (r_format, rewarding the model for following specific output structures, such as <think> and <answer> tags). A hedged sketch of such a combined reward appears after this list.

• Penalties: To suppress undesirable behaviors, penalty terms are introduced. Kevin-32B gives a 0 score reward for responses that use PyTorch functions or do not contain CUDA kernels (aiming to mitigate reward hacking). TORL experimented with a code executability penalty (-0.5) but found it did not improve model performance. DAPO applies a “soft overlength penalty” to truncated samples that exceed the maximum generation length. RAGEN penalizes responses that do not conform to the format.

• Discount Factors: In multi-turn interaction settings, a discount factor is used to balance the importance of immediate versus future rewards. Kevin-32B used a discount factor of 0.4 in multi-turn training, where the reward for a response was the discounted sum of the scores of the current kernel and all its subsequent kernels.

• Avoiding Neural Reward Models: SEARCH-R1 explicitly states that they avoid training neural reward models due to the sensitivity of large language models to specific reward forms and additional computational costs in large-scale reinforcement learning. This contrasts with other RLHF (Reinforcement Learning from Human Feedback) methods not detailed in these materials.
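
As referenced in the Combined Rewards bullet above, a hedged sketch of a composite reward is shown below. It loosely mixes ideas from the studies in this section (an outcome reward, a DeepRetrieval-style format bonus, and a DAPO-style overlength penalty); the tag names, weights, and thresholds are illustrative and not taken from any single paper. The check_answer verifier is a hypothetical callable.

```python
import re

def composite_reward(response: str, gold_answer: str, check_answer,
                     max_len: int = 16384) -> float:
    """Toy composite reward: outcome correctness + format adherence + length penalty."""
    reward = 0.0

    # Outcome-based component: +1 for a verified correct final answer.
    answer_match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if answer_match and check_answer(answer_match.group(1).strip(), gold_answer):
        reward += 1.0

    # Format-adherence component: small bonus for following the expected structure.
    if "<think>" in response and answer_match:
        reward += 0.1

    # Soft overlength penalty for responses exceeding the generation budget.
    if len(response) > max_len:
        reward -= 0.5

    return reward
```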

Regarding advantage estimation, the choice between “critic-full” and “critic-less” methods reflects a trade-off between simplicity/efficiency and stability/guidance.

GRPO’s popularity stems from its avoidance of training a separate value network, which simplifies implementation and reduces computational burden, especially for large language models, where training two large models (actor and critic) simultaneously is costly. However, methods like VAPO and StarPO-S deliberately reintroduce or improve the critic.

VAPO emphasizes obtaining better value estimates through “value pretraining” and “decoupled GAE.” StarPO-S utilizes critic baselines to stabilize training. A well-trained critic can significantly reduce the variance of advantage estimates, leading to more stable and efficient policy updates.

However, a poorly trained or misaligned critic can hinder learning. The choice of method depends on the specific problem, computational budget, and the perceived stability of critic-less advantage estimation for the task. This suggests that there is no one-size-fits-all solution for advantage estimation in large language model reinforcement learning.

The field is actively exploring this trade-off, leading to hybrid methods or more robust critic training techniques. Even within the “critic-less” paradigm, the evolution from GRPO to Dr. GRPO also shows improvements in baseline estimation methods.

Mitigating reward hacking is an ongoing “arms race” that requires multi-faceted solutions. Several studies acknowledge and address the reward hacking problem. ReTool uses simple outcome-based rewards to mitigate this issue. Kevin-32B imposes strict format checks on responses and penalizes undesirable shortcuts (e.g., using PyTorch fallbacks). DAPO’s overlength reward adjustment mechanism prevents “score farming” by generating excessively long, potentially correct but inefficient responses. Large language models are very adept at finding loopholes in reward functions. If the reward function is too simple or does not account for all undesirable behaviors, the agent will learn to maximize the reward signal in unexpected ways, failing to achieve the actual task objectives.

Designing robust reward functions is both an art and a science. It often requires iterative improvements based on observed failure modes. The trend is towards more nuanced reward components (e.g., combining task and format rewards as in [2]) and careful consideration of edge cases, rather than relying solely on a single, simple outcome metric, especially as tasks become more open-ended.

Table 2: Summary of RL Algorithm Implementations and Key Features

Reinforcement Learning Training Process: Meticulously Orchestrated Learning

The reinforcement learning training process is a meticulously designed system engineering, involving multiple stages and optimization techniques, aimed at efficiently and stably improving the agent's policy.

1. Key Stages in the Training Workflow

A typical reinforcement learning training workflow usually includes the following key iterative sub-stages:

• Optional Supervised Fine-tuning (SFT) / Cold Start: Some frameworks choose to perform supervised fine-tuning on carefully curated datasets before reinforcement learning. This provides a robustly initialized model for the subsequent RL stage. For example, ReTool performs SFT on code-augmented data (D_CI) to teach the model when and how to call the code interpreter. DeepRetrieval adopts SFT as a cold-start strategy for SQL database search tasks. However, some studies take different paths. TORL starts reinforcement learning directly from a base language model, without an SFT stage. VAPO explicitly states that, for fair comparison with other methods, it does not introduce any SFT data during reinforcement learning training.

• Iterative Reinforcement Learning Loop: This is the core of reinforcement learning, typically involving continuous iteration of the following sub-stages:

• Rollout/Generation: The policy model generates action sequences (i.e., trajectories) based on the current prompt or state.

• Evaluation/Reward Calculation: The generated trajectories are evaluated, and rewards are calculated based on their interaction results with the environment or final output.

• Learning/Policy Update: Based on the received rewards and generated trajectories, the selected reinforcement learning algorithm (e.g., PPO, GRPO) updates the policy model (and the value model, if present).

2. Optimization Techniques and Stability Measures

To ensure the stability and efficiency of the training process, researchers employ various optimization techniques:

• Loss Masking: When the output of external tools or retrieved information is part of the input sequence, these external tokens are typically masked out in the reinforcement learning loss calculation. This prevents external tokens from interfering with policy gradient optimization and ensures training stability (a minimal sketch appears after this list).

• KL Divergence Regularization: This is a common technique that penalizes the KL divergence between the current policy and a reference policy (usually an SFT model or the policy from a previous iteration), preventing the learned policy from deviating too far, thereby helping maintain training stability. However, in some cases, such as TORL and StarPO-S, the KL penalty term is intentionally omitted or its coefficient is set to 0 to enhance exploration.

• Gradient Clipping: To prevent gradient explosion leading to training instability, especially when dealing with large models or long sequences, aggressive gradient norm clipping strategies are sometimes employed.

• Dynamic Sampling / Trajectory Filtering: The “dynamic sampling” technique in the DAPO framework filters out prompts where all generated outputs have an accuracy of 0% or 100%, to ensure that the training batch contains valid gradient information. StarPO-S uses variance-based trajectory filtering, retaining highly uncertain prompts for training.

• Warm-up Phases: Learning rate warm-up or value model warm-up (e.g., VAPO) helps stabilize the learning process in the early stages of training.
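
As referenced in the Loss Masking bullet above, the idea can be sketched as below; the tensors and their shapes are simplifying assumptions rather than the interface of any specific framework.

```python
import torch

def masked_policy_loss(per_token_loss, response_mask, tool_output_mask):
    """Aggregate per-token policy loss while excluding externally injected tokens.

    per_token_loss: [batch, seq_len] per-token policy-gradient losses.
    response_mask: 1.0 for tokens in the model's response, 0.0 for prompt/padding.
    tool_output_mask: 1.0 for tokens copied in from tools or retrieval
        (interpreter results, search snippets), 0.0 otherwise.
    Only model-generated, non-tool tokens contribute to the final loss.
    """
    learn_mask = response_mask * (1.0 - tool_output_mask)
    return (per_token_loss * learn_mask).sum() / learn_mask.sum().clamp(min=1)
```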

3. Distributed Training and Efficiency Considerations

As model scale increases and task complexity rises, training efficiency becomes a critical issue.

• Frameworks for Scale: Researchers have developed specialized frameworks like veRL and HybridFlow to support efficient reinforcement learning training for large language models, which typically have built-in distributed training capabilities.

• Parallelism: HybridFlow uses tensor parallelism during training and mixed data-model parallelism during inference.

• KV-Cache Reuse: ReTool caches key-value (KV) states before code execution and only computes and appends the KV caches for interpreter feedback, reducing memory costs during rollout.

• Asynchronous Operations: ReTool uses an asynchronous code sandbox to accelerate the reinforcement learning training process.

• Parameter-Efficient Training: The RAGEN framework explores using LoRA (Low-Rank Adaptation) for parameter-efficient training.

The conceptual differences in initialization and skill acquisition are reflected in the choice between “SFT then RL” and “direct RL” paths. ReTool and DeepRetrieval (for SQL tasks) explicitly use SFT as a “cold start” or to provide “robust initialization.” This approach, by pre-training the model to master desired behaviors or tool interaction formats, makes the initial RL exploration phase more targeted and efficient. However, it can also bias the model towards the distribution of SFT data, potentially limiting the breadth of exploration in the RL phase.

Conversely, TORL advocates “direct RL from a base model” without SFT, and VAPO avoids using SFT data in RL for fair comparison. Direct RL on a powerful base model might discover more novel strategies but could also face more severe cold-start problems. This choice may depend on the complexity of the target behavior, the quality of available SFT data, and the capabilities of the base LLM. Currently, academia is still exploring how best to combine supervised learning and reinforcement learning—whether as sequential processes, interleaved processes, or primarily using SFT models as reference policies.

Stability in large language model reinforcement learning is a multifaceted battle that requires a combination of algorithmic adjustments, data strategies, and process management to address. Numerous techniques aim to stabilize the training process: KL regularization, PPO’s clipping mechanism (widely used), decoupled clipping, value pretraining, dynamic sampling/filtering, external token loss masking, gradient clipping, and careful hyperparameter tuning. The training of large language models itself is sensitive, and reinforcement learning adds another layer of complexity due to exploration, sparse rewards, and potentially noisy value estimates. Without these stabilization measures, training can easily diverge, leading to policy collapse or the model producing meaningless outputs.

Therefore, achieving stability in large language model reinforcement learning does not rely on a single “silver bullet” but requires systematically addressing potential points of failure throughout the training process. This holistic approach is crucial for reinforcement learning to become a reliable tool for large language model enhancement. The emergence of specialized frameworks (e.g., veRL, HybridFlow) also indicates the need for specifically designed infrastructure to handle these complexities.

Hyperparameter Deep Dive: The Tuning Knobs

Hyperparameters are crucial “knobs” in the reinforcement learning training process, and their settings directly affect learning efficiency, stability, and ultimate performance.

1. Key Hyperparameters and Their Impact

• Learning Rates (Actor & Critic Learning Rates): Typically set to be small, e.g., actor learning rate of 1×10⁻⁶, critic learning rate of 1×10⁻⁵ or 2×10⁻⁶. If a critic is used, the relative size of actor and critic learning rates can be important.

• Batch Sizes (Rollout & Mini-batch Sizes): Rollout batch sizes can be large, e.g., 128 in TORL, 512 in ReTool, SEARCH-R1, DAPO, and 8192 in VAPO. Mini-batch sizes for gradient updates are smaller, e.g., 16 in DeepRetrieval, 64 or 256 in SEARCH-R1, 512 in ReTool, DAPO, VAPO. RAGEN uses 8 prompts per batch, with each prompt generating 16 rollout trajectories.

• KL Coefficient (β): Controls the penalty for the policy diverging from the reference policy. Values vary, e.g., 0.01 in ReTool, 0.001 in DeepRetrieval, SEARCH-R1, RAGEN, and omitted in TORL. This choice reflects a trade-off between stability and exploration.

• PPO Clipping Parameter (ϵ): The standard value is usually 0.2. DAPO and VAPO use decoupled ϵ_low = 0.2 and ϵ_high = 0.28.

• GAE Parameters (λ and γ): The discount factor γ is usually set to 1.0, i.e., future rewards are not discounted. The trace-decay parameter λ is also typically set to 1.0 for PPO, but VAPO uses length-adaptive λ for the policy network and λ = 1.0 for the value network (the standard GAE recursion is given after this list).

• Maximum Sequence/Response Lengths: Very important for managing computational resources and defining the generation scope, e.g., 16384 in ReTool, task-specific settings in DeepRetrieval, 4096 in SEARCH-R1, 16384-20480 in DAPO.

• Temperature for Rollout/Generation: Higher temperatures (e.g., 0.6 in DeepRetrieval, 1.0 in TORL, SEARCH-R1, DAPO, VAPO) are used during training rollouts to encourage exploration.

• Epochs/Training Steps: ReTool trains for 2 epochs on cold-start data. SEARCH-R1 trains for 500 steps. VAPO reached state-of-the-art performance on the AIME 2024 benchmark within 5,000 training steps. RAGEN uses 200 rollout-update iterations.
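
As referenced in the GAE bullet above, the standard GAE recursion behind λ and γ is:

```latex
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),
\qquad
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l}
```

With γ = λ = 1.0, as commonly configured above, the estimate reduces to the Monte Carlo return minus the value baseline; VAPO’s length-adaptive variant instead ties λ to the response length.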

2. Tuning Strategies and Typical Ranges (Implicit)

Although the literature does not always explicitly detail hyperparameter tuning strategies, the variations in hyperparameter settings across different studies suggest that actual tuning is often empirically adjusted based on the specific model, dataset, and task. Learning rate warm-up schedules are a common practice. Monitoring key intermediate results during training, such as generated response length, reward dynamics, and model entropy, is crucial for identifying problems and guiding tuning.

The choice of hyperparameters often reflects an implicit understanding of the exploration-exploitation-stability dilemma under specific tasks and model scales. For example, setting the KL coefficient to 0.01 or removing the KL term, combined with a higher generation temperature, indicates researchers' intent to push for greater exploration, possibly because the task is complex and the initial policy is far from optimal. Conversely, when stability is paramount or the policy is already quite good, a non-zero KL coefficient and more conservative clipping strategies might be used. The “Clip-Higher” mechanism is a subtle attempt to achieve more exploration without sacrificing too much stability. Hyperparameters directly control learning dynamics.

Aggressive exploration settings may lead to faster discovery of novel solutions but also carry the risk of policy collapse. Conservative settings ensure stability but may lead to slow convergence or getting stuck in local optima. This indicates that there may not be a universal set of “best” hyperparameters; the optimal values are highly context-dependent. This also highlights the need for robust hyperparameter optimization techniques and a deeper understanding of how each hyperparameter affects the large language model reinforcement learning process. The field could benefit from more systematic research on hyperparameter sensitivity and interdependencies.

Table 3: Hyperparameter Settings in Different RL Models/Studies

Reinforcement Learning with External Tools and Knowledge Bases

As the capabilities of large language models continue to grow, enabling them to effectively utilize external tools (such as code interpreters, search engines, and databases) and knowledge bases has become an important direction in reinforcement learning research. This integration aims to compensate for the shortcomings of large language models in precise computation, real-time information acquisition, and interaction with structured data.

Data Strategies for Tool-Augmented Reinforcement Learning

When a reinforcement learning agent needs to learn to interact with external tools, data strategies need to be adjusted and optimized accordingly.

1. Data Selection for Tool Interaction Scenarios

Data selection is primarily driven by the task itself, especially those tasks that inherently require or benefit from tool use.

• For mathematical reasoning tasks, ReTool and TORL used math competition problems, which often involve complex calculations where a code interpreter can serve as an effective auxiliary tool.

• For query generation tasks, DeepRetrieval adopted information retrieval (IR) and SQL datasets, where models need to interact with search engines or databases.

• For web research tasks requiring extensive background knowledge or up-to-date information, DeepResearcher used question answering datasets that necessitate web searching and browsing.

• Similar to general reinforcement learning, in tool-augmented reinforcement learning, the verifiability of tool use results is crucial for setting up the reward function.

2. Data Cleaning and Filtering in the Context of Tool Output

In tool integration scenarios, data cleaning and filtering not only focus on the quality of raw data but also need to consider the complexity introduced by tool interactions.

• Initial Data Quality Control: Similar to general reinforcement learning, initial datasets are first cleaned. For example, ReTool ensures the quality of text reasoning data through manual management and model evaluation before augmenting it into code-integrated data.

• Verification of Augmented Data: ReTool further verifies its automatically generated code-integrated data (D_CI), including format validation (ensuring the correctness of tool call triggers) and answer validation (ensuring the final output matches the correct solution). This ensures that the “augmented data” used to train the model to learn tool use is itself of high quality.

• Filtering to Ensure Genuine Tool Need: DeepResearcher’s contamination detection mechanism is particularly crucial here. By filtering out questions that the model can answer without search tools, it ensures that the model learns to use search tools only when truly needed, rather than treating them as a “panacea.”

3. Data Augmentation for Tool Learning

To enable the model to learn how to use tools effectively, data often needs to be augmented in specific ways.

• Automated Construction of Tool-Integrated Data: ReTool’s conversion of text-based reasoning processes (D_init) into code-integrated reasoning processes (D_CI) is a typical data augmentation strategy. This process replaces manual computation steps in the original reasoning process with corresponding code snippets and their interpreter execution results, thereby providing the model with “cold start” data for learning tool use.

For tool-augmented reinforcement learning, data management often involves creating “exemplars” of desired tool interaction patterns. ReTool’s process of automatically constructing code-integrated data is not just about providing problems where tools can be used; more importantly, it actively demonstrates how tools are integrated into the reasoning chain. This augmented data, especially during the cold-start supervised fine-tuning phase, serves as initial supervised samples. Without such exemplars, large language models might struggle to discover how to format tool calls, parse outputs, or even when to call tools. Augmented data effectively guides this learning process by providing concrete interaction examples.

This indicates that for complex tool usage scenarios, starting entirely from scratch and relying solely on outcome-based reinforcement learning might be highly inefficient. A more pragmatic approach combines supervised learning using tool integration exemplars with subsequent reinforcement learning fine-tuning. “Data” itself becomes the medium guiding the tool interaction protocol.

Algorithm Implementation for External Tool Integration

Integrating external tools into the reinforcement learning loop requires adaptive adjustments at the algorithmic level and the design of appropriate reward mechanisms.

1. Reinforcement Learning Algorithm Adjustments for Tool Use

While standard algorithms like PPO and GRPO remain central, some key adjustments are needed to accommodate tool interaction:

• Structured Output for Tool Calls: Models are typically trained to generate specific tokens or structures to trigger tool use. For example, ReTool executes code once it detects the </code> token that closes a code block. SEARCH-R1 uses <search> and </search> tokens to call the search engine. DeepRetrieval uses <think> and <answer> tags, with the latter containing the augmented query. DeepResearcher similarly wraps its reasoning in <think> tags and embeds structured tool calls in its output. RAGEN adopts a similar reasoning/answer tag structure.

• Parsing Tool Output: The system needs to be able to parse output from tools (e.g., code interpreter results, search snippets) and feed it back into the model’s context. This is typically achieved through special tags, such as <interpreter> in ReTool or <information> in SEARCH-R1.
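
A minimal sketch of this detect-call-inject pattern, using the SEARCH-R1-style tags mentioned above, is shown below; search_engine is a hypothetical callable standing in for the actual retrieval backend.

```python
import re

SEARCH_PATTERN = re.compile(r"<search>(.*?)</search>", re.DOTALL)

def maybe_inject_search_result(generated_text, search_engine):
    """If the model emitted a <search>...</search> call, return the text to append.

    A simplified sketch of the SEARCH-R1-style loop: extract the query, call the
    search engine, and wrap the result in <information> tags so the model can
    continue reasoning over it. Returns None when no search call is present.
    """
    match = SEARCH_PATTERN.search(generated_text)
    if match is None:
        return None
    query = match.group(1).strip()
    snippets = search_engine(query)  # e.g. a string concatenating top-k snippets
    return f"<information>{snippets}</information>"
```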

2. Specialized Reward Mechanisms for Tool Efficacy

The design of reward mechanisms is crucial for guiding the model to use tools effectively.

• Primary Reliance on Final Results: Even with the introduction of tools, most systems still primarily rely on the final task result to provide reward signals. If the use of a tool ultimately leads to the correct solution of a problem, then that tool-using behavior is positively reinforced.

• Implicit Reward for Tool Use: If the task itself cannot be solved without using a tool, then the reward for successfully solving the task implicitly includes a reward for successful tool use.

• Explicit Tool-Related Rewards (Less Common or Ineffective): TORL experimented with providing rewards for code executability but found that it did not improve model performance. ReTool also primarily focuses on final results and does not introduce code executability rewards. This suggests that directly rewarding intermediate steps of tool use (e.g., whether code is executable) may be difficult to design or less effective than rewarding the final outcome.

• DeepRetrieval’s reward function includes a format adherence reward (rformat), which can indirectly support correct tool calls if the tool call syntax is within a specific format.

In tool-augmented reinforcement learning, outcome-based rewards dominate, implying an “outcome-oriented” strategy where the large language model’s own reasoning capabilities are relied upon to optimize tool usage. Although the tool interaction process can be very complex, most frameworks (e.g., ReTool, TORL, SEARCH-R1) still choose to give rewards based on the correctness of the final answer. Attempts to add explicit rewards for intermediate steps (e.g., code executability) are not always effective. Directly rewarding specific mechanisms of tool use (e.g., “Did the code run successfully?”) might lead the agent to learn to generate runnable but useless code.

By focusing on the final result, the reinforcement learning process compels the large language model to learn effective tool use—that is, tool use that contributes to solving the problem. The model's internal reasoning is expected to bridge the gap between tool calls and problem solving. This approach places high demands on the reasoning capabilities of large language models and the ability of reinforcement learning algorithms to appropriately assign credit in what can be long chains of tool interactions. It also highlights the challenge of designing good intermediate rewards for complex cognitive tasks; often, sparse, outcome-based rewards, while potentially less sample-efficient, are more robust.

Training Process Involving External Tools

When a reinforcement learning agent needs to interact with external tools, its training process has unique characteristics and challenges.

1. Interleaving Reasoning and Tool Execution

A core feature of tool-augmented reinforcement learning is that the model generates partial reasoning, then pauses to call an external tool, receives feedback from the tool, and continues subsequent reasoning and generation based on that feedback.

• ReTool’s process is: the large language model generates text, and when the </code> token marking the end of a code block is detected, the generated code is sent to a sandboxed code interpreter for execution. The interpreter’s output (a successful result or an error message) is then encapsulated within <interpreter> tags and fed back to the model, which then continues generating the subsequent reasoning trajectory. This creates a hybrid reasoning path interweaving text, code, and interpreter feedback.

• TORL’s model outputs reasoning content including code blocks. When the code termination identifier ```output is detected, text generation pauses, and the latest code block is extracted and handed to a code interpreter (e.g., Sandbox Fusion) for execution. The structured execution result (OBSERVATION) is inserted back into the context, and the model then continues generating subsequent natural language reasoning, potentially producing more code blocks until a final answer is given.

• When SEARCH-R1’s model produces <search> and </search> tokens during generation, the system extracts the enclosed query, calls the search engine, and injects the retrieved results into the model’s context via <information> tags, after which the model performs subsequent reasoning and answer generation.

• DeepResearcher’s agent first reasons within <think> tags, then calls web search or web browsing tools as needed. Observations obtained from these tools update the agent’s short-term memory, assisting subsequent decisions.

2. Tool Feedback and Error Management During Training

How to handle feedback from external tools, especially error messages, is an important part of the training process.

• Error Information as Learning Signal: Error information from tool execution (e.g., code compilation errors or runtime errors) is often intentionally returned to the large language model. This helps the model learn to generate syntactically correct and semantically reasonable tool inputs. For example, TORL explicitly states that error messages from failed code executions will be returned to the model to enhance its ability to generate correct code subsequently. ReTool’s description also mentions that the sandbox returns error messages, implying a similar mechanism.

• Masking Tool Output to Avoid Interfering with Loss Calculation: As discussed earlier under “Optimization Techniques and Stability Measures,” the actual content output by tools (e.g., execution results from a code interpreter, text snippets returned by a search engine) is typically masked out in the reinforcement learning loss calculation. This is done to ensure that the model learns to utilize this information for reasoning, rather than simply mimicking or copying the external information. At the same time, this also helps maintain training stability, preventing externally introduced tokens, which might be inconsistent with the model’s own generation logic, from interfering with policy gradient calculation.

3. Sandbox Environment and Security Protocols

When integrated external tools have the ability to execute arbitrary code or interact uncontrollably with the external world, security becomes paramount.

• Code Execution in Sandbox: For tools like code interpreters, their execution process is typically placed within a sandbox environment. A sandbox provides an isolated environment for executing code generated by the large language model, thereby ensuring security and controllability, preventing potential malicious code or accidental operations from harming the system. TORL chose Sandbox Fusion as its code execution environment due to its better stability.

• Asynchronous Sandbox for Efficiency: To accelerate the training process, especially in scenarios requiring frequent interaction with tools like code interpreters, ReTool designed an asynchronous code sandbox environment. In this environment, sandbox instances act as workers in a worker pool, independently pulling and executing tasks, thereby forming an efficient load balancing mechanism and supporting parallel environment interaction.

4. Controlling Tool Interaction Frequency

Unlimited tool calls can lead to inefficient training or redundant interactions. Therefore, mechanisms are needed to control the frequency of tool use.

• Maximum Tool Call Limits: The TORL framework introduces a hyperparameter C to control the maximum number of tool calls allowed during a single response generation. If this threshold is exceeded, subsequent tool execution requests will be ignored, forcing the model to switch to pure text reasoning mode. This helps maintain training speed while ensuring a certain depth of exploration. SEARCH-R1 also uses a maximum action budget B to limit the number of searches. DeepResearcher allows a maximum of 10 tool calls per rollout trajectory.
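
A minimal sketch of a rollout loop that enforces such a budget, in the spirit of TORL’s C and SEARCH-R1’s B, is shown below; generate_until_tool_call and run_tool are hypothetical helpers, not APIs from those frameworks.

```python
def rollout_with_tool_budget(prompt, generate_until_tool_call, run_tool,
                             max_tool_calls=4, max_segments=16):
    """Generate one trajectory, allowing at most `max_tool_calls` tool invocations.

    generate_until_tool_call(context) continues decoding and returns
    (text, tool_request_or_None); run_tool(request) executes the request and
    returns the observation text (e.g. wrapped in <interpreter> or
    <information> tags). Once the budget is exhausted, further requests are
    ignored and the model must finish in pure text reasoning mode.
    """
    context = prompt
    calls = 0
    for _ in range(max_segments):  # hard cap on generation segments
        text, tool_request = generate_until_tool_call(context)
        context += text
        if tool_request is None:  # the model produced its final answer
            break
        if calls < max_tool_calls:
            context += run_tool(tool_request)
            calls += 1
        else:
            # Budget exhausted: drop the request and force text-only reasoning.
            context += "\n(tool budget exhausted; continue reasoning in text)\n"
    return context
```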

“Loss masking” on tool output is a key technique whose purpose is to force large language models to learn “how to think using tools,” rather than just “what the tool will output.” Multiple studies explicitly mention masking out tokens from tool output (e.g., code interpreter results, search snippets) during reinforcement learning loss calculation. If these external tokens were included in the loss calculation for policy updates, large language models might learn to simply predict or copy these tokens, especially if they are verbose or contain strong signals. This would bypass the intended learning objective of having the model understand and utilize the information provided by the tool to guide its own subsequent reasoning.

By masking, gradients only flow through the tokens generated by the model itself, thereby strengthening its reasoning and decision-making capabilities (e.g., deciding what to do next given a tool's output). This highlights a subtle yet crucial aspect of training large language models to use tools: distinguishing between integrating information and merely reiterating it. Effective tool use requires large language models to act as intelligent consumers and integrators of external information, and the training process must be meticulously designed to foster this capability.

The iterative loop of “generate-execute-feedback-regenerate” in tool-augmented reinforcement learning, to some extent, reflects the human problem-solving process, but it also demands meticulous management of state and context. The descriptions from ReTool, TORL, SEARCH-R1, and DeepResearcher all detail such a process: the large language model generates some reasoning or tool query, the external tool executes the query, and then the result is fed back into the model's context for the next generation step. This iterative process allows large language models to decompose complex problems, incrementally gather information or perform calculations, and adjust their strategy based on intermediate results.

However, this also introduces challenges: the context window can become very large, state representation needs to effectively integrate different types of feedback (text, numbers, errors), and credit assignment becomes more difficult in long multi-step interactions. This paradigm is very powerful for solving complex multi-step tasks. However, its success depends on efficient context management (e.g., ReTool's KV-cache reuse, Kevin-32B's Chain-of-Thought summarization), robust error handling, and reinforcement learning algorithms capable of learning from delayed rewards across these extended interactions. Developing “reasoning trajectories” or “interaction trajectories” that blend natural language and tool interactions is a key research direction.

Hyperparameter Considerations for Tool-Integrated Reinforcement Learning

In tool-integrated reinforcement learning, in addition to general reinforcement learning hyperparameters, some specific hyperparameters related to tool interaction characteristics also need to be considered.

1. Tool Interaction-Specific Hyperparameters

• Maximum Tool Calls / Action Budget: As previously discussed, TORL uses hyperparameter C, SEARCH-R1 uses a maximum action budget B, and DeepResearcher limits up to 10 tool calls. These parameters are used to balance the thoroughness of exploration with training efficiency.

• Maximum Length for Retrieved Content / Tool Output: SEARCH-R1 sets a maximum length of 500 tokens for retrieved content. This affects the amount of information fed back to the model and, consequently, the management of the context window and the model's attention allocation.

2. Adjustments to General Reinforcement Learning Hyperparameters

Core reinforcement learning hyperparameters (such as learning rate, batch size, etc.) remain crucial in tool integration scenarios. However, because tool interaction changes learning dynamics (e.g., rewards might become sparser if successful tool use is complex; or trajectory lengths might change), the optimal values for these hyperparameters may shift.

The literature does not always explicitly distinguish between hyperparameter settings for tool-integrated versus non-tool-integrated reinforcement learning. Overall, however, tool interaction increases the complexity of the learning task, which may call for more careful tuning or a preference for more robust, stable settings. For example, ReTool keeps a non-zero KL coefficient of 0.01 in its tool integration framework, a comparatively conservative choice that keeps the policy close to its reference while the model learns tool use patterns.

Hyperparameters controlling the “granularity” and “volume” of tool interaction (such as maximum calls, maximum output length) are crucial for balancing learning effectiveness with computational constraints. Parameters like maximum tool calls and maximum retrieved content length directly influence the complexity of trajectories that the reinforcement learning agent explores and learns.

More tool calls or longer outputs can provide more information, but they also increase sequence length, computational cost per step, and potentially noise in the learning signal. If the maximum tool call limit is set too low, the agent may not be able to solve complex multi-step problems. If set too high, training could become very slow, or the agent might learn inefficient, lengthy strategies.

Similarly, excessively long tool outputs might exceed context window limits or dilute important signals. Therefore, optimizing these tool-specific hyperparameters is essential for practical tool-augmented reinforcement learning. This is a trade-off between giving the agent enough freedom to learn complex interactions and keeping the training process manageable and focused. This may drive the development of adaptive strategies where these limits dynamically change during training.

Table 4: External Tools/Knowledge Bases in RL Training

Synthesis, Advanced Insights, and Recommendations

Through a deep analysis of the reinforcement learning training methodologies discussed above, we can observe some converging themes and differentiated strategies, identify emerging trends and challenges, and accordingly propose some best practice recommendations and future research directions.

Comparative Analysis: Converging Themes and Differentiated Strategies

Across numerous studies applying reinforcement learning to large language models, several common trends and methodological choices have emerged:

Converging Themes:

• Dominance of PPO/GRPO: PPO and its variants like GRPO have become de facto standard algorithms for training large language models with reinforcement learning, benefiting from their balance between stability and relative implementation simplicity.

• Outcome-Oriented Reward Functions: Despite diverse tasks, most studies tend to use reward functions based on final task results (e.g., answer correctness, task completion), which is straightforward and can, to some extent, prevent overfitting to intermediate processes.

• Criticality of High-Quality, Refined Data: Research universally emphasizes the importance of high-quality, carefully filtered and managed data for successful training, including noise removal, ensuring relevance, and preventing data contamination.

• Pervasiveness of External Information Loss Masking: When integrating external tools or knowledge bases, it is common practice to mask external information (e.g., tool output, retrieved content) during loss calculation, ensuring the model learns to utilize information for reasoning rather than simply imitating.

Differentiated Strategies:

• Choice between SFT and Direct RL: For model initialization, some studies adopt a strategy of supervised fine-tuning (SFT) followed by reinforcement learning, while others choose to start reinforcement learning directly from a base model. This reflects different trade-offs between initialization efficiency and exploration freedom.

• Advantage Estimation with or Without Critics: PPO typically relies on a learned value network (critic) to estimate the advantage function, while methods like GRPO avoid the critic through techniques such as group-wise reward normalization, creating a trade-off between computational overhead and estimation accuracy.

• Specific Techniques for Exploration and Stability: Although the goal is shared, different studies employ different techniques to balance exploration and stability, such as the Clip-Higher mechanism and length-adaptive GAE introduced by DAPO and VAPO.

• Complexity of Reward Functions: The design of reward functions varies in complexity depending on the task and research objectives, ranging from simple binary rewards to composite rewards incorporating format adherence, efficiency considerations, and multiple components.

Emerging Trends and Overall Challenges

The application of reinforcement learning in the field of large language models is showing several positive trends, but also faces ongoing challenges:

Emerging Trends:

• Increasingly Complex and Customized Algorithms: Researchers are developing increasingly complex and customized reinforcement learning algorithms tailored to the characteristics of large language models and specific task needs, such as VAPO, DAPO, Dr. GRPO, and StarPO, which incorporate numerous innovations based on classic algorithms.

• Diversification and Deepening of External Tool Integration: The types of external tools integrated into models are becoming increasingly rich, expanding from initial calculators and code interpreters to search engines, database interfaces, and even complex web browsing and specialized development environments.

• Focus on Multi-Turn Interaction and Trajectory-Level Optimization: As task complexity increases, more attention is being paid to multi-turn interaction and optimization at the level of entire interaction trajectories, as shown in studies by StarPO and Kevin-32B.

• Strengthening of Data-Centric Methods: The understanding of data's role in reinforcement learning is deepening, leading to more refined data processing methods such as contamination filtering and strategic data augmentation.

Overall Challenges:

• Sample Efficiency: Especially for complex tasks with sparse rewards or high interaction costs, improving the sample efficiency of reinforcement learning remains a core challenge.

• Long-Range Credit Assignment: In long interaction trajectories involving multi-step reasoning and tool use, accurately assigning final rewards to key decisions in the sequence is a difficult problem.

• Training Scalability: As model scale and tool interaction complexity increase, efficient and scalable reinforcement learning training remains an ongoing engineering challenge.

• Generalization Capability: Ensuring that the learned tool use strategies or reasoning patterns can generalize to new tools, tasks, or unseen data distributions is crucial for measuring the true capability of the model.

• Reward Hacking and True Understanding: Designing reward functions that can effectively avoid reward hacking behavior and truly reflect the model's understanding remains an open problem.

Best Practices and Recommendations for Designing Reinforcement Learning Training Workflows

Based on current understanding, the following general recommendations can be provided for designing reinforcement learning training workflows:

• Data is King: Start with high-quality, carefully managed, and filtered data highly relevant to the target skill. Consider data diversity, difficulty distribution, and potential contamination issues.

• SFT Guidance: If high-quality supervised fine-tuning data is available, consider using SFT to guide the model in learning complex behaviors or the basic format of tool interactions, which helps accelerate subsequent reinforcement learning convergence.

• Algorithm Selection and Adaptation: Choose a mature reinforcement learning algorithm family (e.g., PPO/GRPO) and adapt it based on computational budget, stability requirements, and task characteristics. For example, consider critic-less methods when computational resources are limited, and explore more advanced value estimation algorithms like VAPO when pursuing higher performance.

• Reward Design: Reward function design should be as simple and clear as possible, while also being robust against reward hacking behavior. Prioritize rewards based on the final task outcome.

• Loss Masking: For any external information (such as tool output) integrated into the model's context, it is crucial to mask it during the reinforcement learning loss calculation.

• Stable Training: Employ multiple techniques to ensure the stability of the training process, including but not limited to KL divergence regularization, gradient clipping, and meticulous hyperparameter tuning.

• Tool Usage Guidelines: When integrating external tools, ensure execution in a secure environment (e.g., sandbox) and provide tool error feedback as a learning signal to the model. Simultaneously, control the frequency of tool interactions by setting mechanisms such as maximum call limits.

• Iterative Monitoring and Optimization: Reinforcement learning training is an iterative process. Continuously monitor training dynamics (e.g., reward curves, generated content quality, model entropy, etc.) and adjust data, reward functions, and hyperparameters based on observations.

Future Potential Research Directions

Looking ahead, the application of reinforcement learning in large language models still has vast exploration space:

• Higher Sample Efficiency Algorithms: Explore techniques such as model-based reinforcement learning and offline reinforcement learning to further improve sample efficiency.

• Hierarchical Reinforcement Learning: For scenarios requiring the handling of complex, multi-level tasks and tool use, hierarchical reinforcement learning may offer more effective solutions.

• Automated Reward Design: Research how to automatically design or learn effective reward functions to reduce the burden and bias of manual reward design.

• Improved Long Trajectory Credit Assignment: Develop more advanced credit assignment methods to address the challenges of learning in long interaction sequences.

• Standardized Benchmarks and Environments: Establish standardized benchmarks and simulation environments for tool-augmented reinforcement learning to facilitate fair comparisons and reproducible research among different methods.

• Deep Integration of Parameterized Knowledge and External Information: Further research how large language models can effectively weigh, integrate, and reason between their parameterized knowledge and external information obtained through tools.

Conclusion

Summary of Key Findings

This article systematically reviews and analyzes reinforcement learning training methods. Key findings include: Data management plays a fundamental and increasingly important role in reinforcement learning, with refined data selection, cleaning, filtering, and augmentation strategies being crucial for successful training;

Policy optimization algorithms represented by PPO and GRPO are current mainstream choices, while a series of innovative algorithms and techniques such as DAPO, VAPO, Dr. GRPO, and StarPO have emerged, addressing the characteristics of large language models and specific task needs; the training process typically involves optional supervised fine-tuning, an iterative reinforcement learning loop, and widely adopts stability measures like loss masking and KL regularization; fine-tuning hyperparameters is critical for balancing exploration, exploitation, and stability.

In particular, integrating reinforcement learning with external tools and knowledge bases has become an important way to enhance the capabilities of large language models. To this end, researchers have developed targeted data augmentation methods (e.g., automated construction of tool interaction exemplars), supported structured tool calls and feedback parsing at the algorithmic level, implemented interleaved reasoning and tool execution during training, ensured security through sandbox environments, and guided the model to learn effective tool usage strategies through mechanisms like loss masking and error feedback.

Final Thoughts on the Development Landscape of Reinforcement Learning Training

The application of reinforcement learning in large language models is rapidly evolving from early direct application of general algorithms to highly specialized techniques tailored to model characteristics and task requirements. The synergy among refined data strategies, continuous algorithmic innovation, and systematic management of the training process is key to unlocking the powerful potential of large language models in complex reasoning and tool use.

In the future, progress in this field will likely continue to rely on sustained breakthroughs in these areas, particularly in improving sample efficiency, enhancing algorithmic scalability, and enabling models to learn from increasingly complex interactions and feedback. As research continues to deepen, we have reason to believe that reinforcement learning will contribute core strength to building more intelligent and versatile artificial intelligence systems.
