ARPO: Agentic Reinforced Policy Optimization, Enabling Agents to Explore One Step Further at Critical Moments


The first author of this paper is Dong Guanting, currently a first-year Ph.D. student at the Gaoling School of Artificial Intelligence, Renmin University of China, advised by Professor Dou Zhicheng and Professor Wen Jirong. His research focuses primarily on large language model reasoning, multi-agent reinforcement learning, and deep search agents. He has published multiple papers at top international conferences such as ICLR, ACL, and AAAI, and has interned with large model teams including Kuaishou's Large Model Application Group and Alibaba's Tongyi Qianwen Group. His representative works include AUTOIF, Tool-Star, RFT, Search-o1, WebThinker, Qwen2, and Qwen2.5. The corresponding authors of this paper are Professor Dou Zhicheng of Renmin University of China and Zhou Guorui of Kuaishou Technology.

Driven by reinforcement learning with verifiable rewards (RLVR), large language models have demonstrated impressive performance on single-turn reasoning tasks. In real-world reasoning scenarios, however, LLMs often need to interact with external tools over multiple turns, and existing RL algorithms still fall short in balancing a model's long-range reasoning with its multi-turn tool-interaction capabilities.

To address this, we propose a novel Agentic Reinforced Policy Optimization (ARPO) method, specifically designed for multi-turn interactive LLM agents.

ARPO is the first to observe that a model's reasoning uncertainty (token entropy) rises significantly right after it calls an external tool. Building on this, it introduces an entropy-driven adaptive rollout strategy that strengthens exploration at high-entropy tool-calling steps. It also incorporates advantage attribution estimation so the model can better internalize the value differences between individual steps of tool interaction. Across 13 challenging benchmarks spanning computational reasoning, knowledge reasoning, and deep search, ARPO significantly outperforms existing sample-level RL methods while using only half the tool-calling budget, offering a scalable new solution for efficiently training multi-turn reasoning agents.


Paper Title: Agentic Reinforced Policy Optimization

Paper Link: https://arxiv.org/abs/2507.19849

Code Repository: https://github.com/dongguanting/ARPO

Open-source Data & Models: https://huggingface.co/collections/dongguanting/arpo-688229ff8a6143fe5b4ad8ae

This research has garnered significant attention on X (formerly Twitter) and ranked first on the Hugging Face Papers daily and weekly charts.


Research Motivation: Seizing High-Entropy Moments After Tool Calls

In recent years, large-scale reinforcement learning with verifiable rewards has unleashed the potential of frontier large language models on single-turn reasoning tasks with impressive results. In open-ended reasoning scenarios, however, LLMs need not only long-range planning and adaptive decision-making but also dynamic multi-turn interaction with external tools. This has given rise to a new paradigm, agentic RL, which shifts training from static problem solving to dynamic agent-environment reasoning. Existing agentic RL methods mostly employ sample-level algorithms (e.g., GRPO, DAPO) that independently sample complete tool-calling trajectories delimited by fixed special tokens and rely on a reward assigned only to the final output. Because rewards are sparse and excessive tool use is costly, this approach tends to undervalue multi-turn interaction and neglects fine-grained behavioral exploration at each tool-calling step.

By analyzing the token entropy distribution of LLMs in deep search tasks, our study found that the entropy value significantly increases during the initial generation phase after each tool call, indicating that external tool feedback introduces high uncertainty, which is an exploration opportunity not fully utilized by existing methods.


Figure 1: Left: the high-entropy phenomenon of large models after tool calls. Right: performance comparison of ARPO against baselines.

ARPO Framework: Training Models to Autonomously Perform Multi-Tool Calls During Reasoning

Addressing the above findings, we propose Agentic Reinforced Policy Optimization (ARPO). Its core idea is to adaptively branch sampling during high-entropy tool-calling steps, exploring more diverse reasoning paths. Specifically, our contributions are as follows:

We quantified the token entropy changes of LLMs during Agentic reasoning, revealing the inherent limitations of sample-level RL algorithms in aligning LLM agents.

We proposed the ARPO algorithm, introducing an entropy-based adaptive rollout mechanism that encourages branching sampling in high-entropy tool-calling steps while maintaining global sampling. Additionally, ARPO integrates advantage attribution estimation, helping LLMs better internalize advantage differences in step-level tool usage behaviors.

Beyond the heuristic motivation, we also provide a theoretical justification for applying the ARPO algorithm to LLM agent training.

Experiments on 13 challenging benchmarks show that ARPO consistently outperforms mainstream RL algorithms, even when using only half the tool-calling training budget, providing a feasible reference and practical insights for exploring Agentic RL.

Entropy Variation Phenomenon of Tool Calls: High-Entropy Moments and Exploration Dilemmas


Figure 2: Cross-dataset analysis of token entropy changes and token frequency distribution for LLM-based tool-using agents.

By analyzing the token entropy values of large models when combining tools to perform complex search and reasoning tasks, we found the following:

1. Entropy significantly increases within the first 10–50 tokens after each tool call.

2. Entropy often increases during the initial stages of reasoning but remains below the level observed after the large model receives tool call feedback.

3. The entropy fluctuation introduced by search engine feedback is greater than that from code compiler execution feedback.

These phenomena can be attributed to the distribution shift between external feedback tokens and the model's internal reasoning: the uncertainty introduced by tool feedback can even exceed that of the original input question. Moreover, search engines typically return rich textual content, while code interpreters return deterministic numerical results, which explains the larger entropy fluctuation in the former.
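For intuition, the token entropy measured in this analysis can be computed directly from the model's output distributions. Below is a minimal sketch (the helper names and array shapes are assumptions, not the authors' code) for measuring the mean entropy of the first k tokens generated after tool feedback is appended:

```python
import numpy as np

def token_entropy(logits: np.ndarray) -> np.ndarray:
    """Per-token entropy H = -sum(p * log p) from a (num_tokens, vocab_size) logits array."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)

def post_tool_entropy(step_logits: np.ndarray, k: int = 50) -> float:
    """Mean entropy of the first k tokens generated after tool feedback is appended."""
    return float(token_entropy(step_logits[:k]).mean())
```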

Tool Design: Diverse Tools Supporting Agentic Reasoning

This study focuses on optimizing training algorithms for LLM-based tool-using agents. After reviewing existing Agentic RL research, we selected three representative types of tools for empirical evaluation of ARPO's effectiveness:

Search Engine: Retrieves relevant information by executing web search queries, supporting both local and online modes.

Web Browsing Agent: Accesses and parses web links returned by search engines, extracting and summarizing key information in response to queries.

Code Interpreter: Automatically executes LLM-generated code, returning the result on success and the error message otherwise (a minimal sketch is given after this list).

These tools cover multiple functionalities such as information retrieval, content parsing, and program execution, providing strong support for multi-turn interaction and complex reasoning scenarios.
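As a concrete illustration of the code-interpreter behavior listed above, here is a minimal sketch (not the authors' implementation; a production setup would additionally need sandboxing and resource limits):

```python
import subprocess
import sys

def run_python_tool(code: str, timeout: int = 10) -> str:
    """Execute LLM-generated Python code in a subprocess and return its stdout,
    or the error message if execution fails, mirroring the tool described above."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
        return result.stdout if result.returncode == 0 else result.stderr
    except subprocess.TimeoutExpired:
        return "Execution timed out."
```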

ARPO Algorithm: Utilizing Entropy Signal to Guide LLM Step-by-Step Tool Optimization

Entropy-based Adaptive Rollout Mechanism

ARPO's core idea is to combine global sampling with entropy-driven local sampling, increasing exploration intensity during stages where model uncertainty rises after tool calls, thereby improving reasoning performance. Its entropy-based adaptive rollout mechanism includes four key steps:


Figure 3: ARPO's entropy-driven adaptive rollout mechanism, combining global exploration with local high-entropy node branching.

1. Rollout Initialization

Set the global rollout size M and first perform sample-level global sampling: the LLM generates N initial trajectories for the input query q and computes the entropy of each trajectory's initial generated tokens, forming the initial entropy matrix E_q(p_0). The remaining budget of M − N trajectories is reserved for local sampling.

2. Entropy Change Monitoring

After each tool-calling step t, the model appends the tool's returned result and continues generating k tokens, from which it computes the step-level entropy matrix E_q(p_t). The normalized entropy change relative to the initial state, ΔE_t = Normalize(E_q(p_t) − E_q(p_0)), quantifies the trend of the current reasoning uncertainty.

3. Entropy-based Adaptive Branching

To guide the model to explore more deeply at nodes where entropy rises significantly, the local sampling probability at tool-calling step t is defined as P_t = α + β · ΔE_t, where α is a base sampling probability and β scales the entropy-driven term.

The branching decision then compares P_t with a threshold τ: if P_t > τ, the model branches additional partial rollouts from the current tool-calling step; otherwise, it continues sampling along the current path.

This mechanism adaptively allocates exploration resources to entropy-increasing regions, which often contain higher information gain.

4. Termination Condition

The rollout process continues until the number of branched paths reaches the budget cap M − N (at which point branching stops and the remaining paths are sampled to completion) or all paths terminate early. If any budget remains, additional global sampling is performed to cover a more comprehensive reasoning space.

Through the above mechanism, ARPO achieves uncertainty-aware, efficient exploration while keeping the computational complexity within O(M), enabling large models to precisely identify and fully exploit the high-information-gain stages that follow tool calls.
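Putting the four steps above together, a highly simplified sketch of the adaptive rollout loop might look as follows. All function and parameter names here are hypothetical stand-ins rather than the released implementation: `generate_step` is assumed to sample stochastically from the policy until the next tool call or the final answer, and `call_tool` is assumed to execute the tool call and return its feedback.

```python
import numpy as np

def adaptive_rollout(query, generate_step, call_tool,
                     M=16, N=8, k=50, alpha=0.2, beta=0.5, tau=0.55):
    """Simplified sketch of the entropy-based adaptive rollout.

    generate_step(prefix) -> (text, entropies, done): run the policy from `prefix`
        until the next tool call or the final answer; `entropies` holds the
        per-token entropies of the newly generated `text`.
    call_tool(text) -> str: execute the tool call contained in `text` and return
        the feedback to be appended to the context.
    """
    # Baseline uncertainty E_q(p_0): entropy of the first k tokens for the raw query.
    _, init_entropies, _ = generate_step(query)
    init_entropy = float(np.mean(init_entropies[:k]))

    # Step 1: global sampling of N trajectories; M - N is the local branching budget.
    active = [query] * N
    finished, branch_budget = [], M - N

    while active:
        prefix = active.pop()                      # ends with the query or tool feedback
        text, entropies, done = generate_step(prefix)

        if prefix != query:
            # Step 2: entropy of the first k tokens generated after tool feedback.
            step_entropy = float(np.mean(entropies[:k]))
            delta = float(np.clip(step_entropy - init_entropy, 0.0, 1.0))

            # Step 3: branch an extra partial rollout from this node when entropy rises.
            if alpha + beta * delta > tau and branch_budget > 0:
                active.append(prefix)              # re-sample another continuation here
                branch_budget -= 1

        # Step 4: finish the path, or execute the tool call and keep going.
        if done:
            finished.append(prefix + text)
        else:
            active.append(prefix + text + call_tool(text))

    return finished                                # trajectories used for the RL update
```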

Advantage Attribution Estimation

ARPO's entropy-driven adaptive rollout generates trajectories containing shared reasoning segments and branched paths, which inspires us to optimize the policy update method to better utilize step-level tool-calling information.

Two Advantage Estimation Methods

1. Hard Advantage Estimation (Hard)

Hard advantage estimation explicitly distinguishes shared tokens from branched tokens: each branched segment receives an advantage computed from its own trajectory's reward, while tokens on a shared prefix receive the average advantage of all trajectories that pass through that prefix, as sketched below.
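A hedged sketch of these two estimates under GRPO-style group normalization (the notation below is assumed, not taken verbatim from the paper):

$$\hat{A}_i^{\text{branch}} = \frac{R_i - \operatorname{mean}\!\big(\{R_j\}_{j=1}^{G}\big)}{\operatorname{std}\!\big(\{R_j\}_{j=1}^{G}\big)}, \qquad \hat{A}^{\text{shared}} = \frac{1}{|\mathcal{S}|}\sum_{i\in\mathcal{S}} \hat{A}_i^{\text{branch}},$$

where G is the group size and $\mathcal{S}$ is the set of trajectories that share the given prefix.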

2. Soft Advantage Estimation (Soft)

Implicitly distinguishes between the tokens of shared and branched reasoning chains during policy optimization. Through GRPO (Group Relative Policy Optimization), it dynamically adjusts the token-level importance sampling ratio ρ_{i,t} to handle both types of tokens naturally:

$$\mathcal{J}_{\text{soft}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\!\Big(\rho_{i,t}\,\hat{A}_{i},\ \mathrm{clip}\big(\rho_{i,t},\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_{i}\Big)\right]$$

Where the importance sampling ratio is:

$$\rho_{i,t} = \frac{\pi_\theta\!\left(o_{i,t}\mid q,\, o_{i,<t}\right)}{\pi_{\theta_{\text{old}}}\!\left(o_{i,t}\mid q,\, o_{i,<t}\right)}$$

When two trajectories share the same token prefix before step t, their shared tokens have the same importance weight ρ_{i,t}. This update is therefore approximately equivalent to hard advantage estimation, but more elegant.

Experimental results demonstrate that soft advantage estimation consistently yields higher rewards in ARPO training, thus it is set as the default advantage estimation method.

Layered Reward Design

ARPO's reward function jointly considers answer correctness, tool-call format, and multi-tool collaboration. If the model uses multiple tools such as search (<search>) and code (<python>) during reasoning while producing a correct answer in a compliant format, it receives an additional reward, as sketched below.

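A hedged sketch of this hierarchical reward, using assumed symbols rather than the paper's exact coefficients:

$$R(\hat{y}) = r_{\text{answer}} + r_{\text{format}} + r_{\text{multi-tool}}\cdot\mathbb{1}\big[\text{answer correct}\ \wedge\ \text{format valid}\ \wedge\ \text{multiple tools used}\big]$$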

Through soft advantage estimation and the layered reward mechanism, ARPO can optimize multi-turn tool usage strategies more smoothly and efficiently during training.

Experimental Results: 10+ Comprehensive Reasoning Task Evaluations

To fully evaluate ARPO's generalization ability and efficiency, we considered the following three groups of benchmarks:

Computational Reasoning Tasks: Evaluate the model's computational reasoning capabilities, including AIME24, AIME25, MATH500, GSM8K, MATH.

Knowledge-Intensive Reasoning Tasks: Evaluate the model's ability to reason with external knowledge, including WebWalker, HotpotQA, 2WIKI, MuSiQue, Bamboogle.

Deep Search Tasks: Evaluate the model's deep search capabilities, including HLE, GAIA, SimpleQA, XBench.


From the experimental results, it can be observed that:

ARPO generally outperforms mainstream methods: ARPO achieves higher accuracy than sample-level RL methods such as GRPO and DAPO on most tasks, with more significant improvements in tool-call intensive tasks (e.g., GAIA, HLE).

Stable performance across multiple tasks: ARPO maintains good performance across computational, knowledge, and search tasks, with no obvious performance shortcomings, verifying its cross-task adaptability.

Experiments: Sampling Analysis and Tool Calling Efficiency Evaluation

Multi-turn Sampling Capability Improves Model Performance

Because deep search tasks are dynamic and involve multi-turn interaction, the Pass@1 metric alone cannot fully reflect the model's tool-calling potential. We therefore also analyzed Pass@3 and Pass@5 and found that both the 8B- and 14B-scale models improved continuously after ARPO alignment training, with good scaling behavior (the sketch after the list below shows how Pass@k is typically estimated). Notably, the 14B model performed exceptionally well on Pass@5:

GAIA reached 61.2%

HLE reached 24.0%

XBench-DR reached 59%
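For reference, Pass@k is commonly estimated with the standard unbiased estimator below (a general formula, not something specific to this paper): given n sampled rollouts per question of which c are correct,

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k samples
    drawn from n attempts (of which c are correct) solves the task."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 rollouts per question, 2 of them correct -> Pass@3
print(round(pass_at_k(n=5, c=2, k=3), 3))   # 0.9
```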

Tool Calling Efficiency Significantly Improved

In Agentic RL training, the number of tool calls directly impacts cost. Taking the Qwen2.5-7B model as an example, we compared ARPO with the GRPO method:

ARPO outperformed GRPO in overall accuracy

While using only about half the number of tool calls


This is attributed to ARPO's unique entropy-based adaptive sampling mechanism, which branches sampling only in high-entropy tool-calling steps, greatly expanding the exploration space for tool behaviors while reducing unnecessary calls.

Summary and Future Outlook

The ARPO algorithm effectively improves the performance of multi-turn tool reasoning agents, addressing the problems of insufficient exploration and lack of generalization ability in existing sample-level RL methods during multi-turn interaction. Through entropy-driven adaptive sampling and advantage attribution mechanisms, ARPO can achieve more efficient and stable outputs in tasks with frequent tool calls and complex reasoning paths. In the future, to continuously enhance the capabilities of Agentic RL models, several directions are worth exploring:

Multimodal Agentic RL: ARPO currently primarily focuses on text reasoning tasks, and still has limitations in processing multimodal information such as images and videos. Future work can extend to multimodal tasks, exploring tool calling and policy optimization in multimodal scenarios.

Tool Ecosystem Expansion: ARPO has demonstrated its potential in multi-tool collaboration tasks. In the future, more types of external tools (e.g., code debuggers, data analysis tools, real-time API calls) can be introduced, and further improvements in complex task performance can be achieved through tool usage policy optimization.

Large-scale and Real-time Deployment: ARPO has shown high training efficiency and reasoning generalization. Future work can explore its deployment and adaptation in larger-scale models and real-time dynamic environments, reducing costs while enhancing practical value.


Main Tag: Reinforcement Learning

Sub Tags: Large Language Models, Tool Use, Policy Optimization, AI Agents

