A Summary of Multi-Turn Planning Techniques in 2025 for Large Language Model Agent RL Training

Datawhale Insights

Author: Shinan (Zhihu: https://zhuanlan.zhihu.com/p/1902381952998281700), Editor: Qingke AI

After DeepSeek R1 popularized GRPO-based reinforcement learning (RL), training for agentic tool use also began to adopt algorithms such as GRPO, Reinforce++, PPO, and policy gradient (previously it was SFT+DPO, which required a large amount of annotated data to cover bad cases; annotating high-quality data back then brought me to tears). The goal is to teach large language models (LLMs) to use tools like code interpreters and web search to enhance their mathematical and reasoning capabilities. A single turn involves calling a tool once; multiple turns involve calling tools several times. Multi-turn tool use is more challenging, mainly because data is harder to obtain and the modeling approach is unclear (whether to use an MDP-like training mode that conditions only on the current state, or a full-history mode that conditions on all previous states). Tool-use RL is a new research direction with untapped potential.

Recent work has focused on designing prompt templates for multi-turn tool use, designing rule-based rewards during training (correctness reward, format reward, tool execution reward, etc.), masking tool output during training, incorporating asynchronous parallelism during sampling, integrating Megatron's pipeline parallelism, adding multimodal information, and so on. The basic training paradigm is to first collect a batch of expert trajectories for supervised fine-tuning (SFT) and then train with RL (e.g., ReTool), or to apply RL directly (e.g., TORL, ToolRL, OTC). Currently there is no training framework purpose-built for agent RL; most approaches reuse existing infrastructure (like verl, open-rlhf, trl, ms-swift) with some extensions.
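To make the tool-output masking concrete, here is a minimal sketch of how tool-returned tokens are typically excluded from the policy loss so that the model is only optimized on tokens it generated itself. The tensor names (`token_log_probs`, `tool_mask`) are my own and not tied to any specific framework.

```python
import torch

def masked_policy_loss(token_log_probs: torch.Tensor,
                       advantages: torch.Tensor,
                       tool_mask: torch.Tensor) -> torch.Tensor:
    """Policy-gradient loss that ignores tokens returned by tools.

    token_log_probs: (batch, seq_len) log-probs of the sampled tokens
    advantages:      (batch, seq_len) per-token advantage estimates
    tool_mask:       (batch, seq_len) 1 for model-generated tokens,
                     0 for tokens inside <tool_response>...</tool_response>
    """
    per_token_loss = -(token_log_probs * advantages)
    # Zero out tool-output tokens and average only over generated tokens.
    masked = per_token_loss * tool_mask
    return masked.sum() / tool_mask.sum().clamp(min=1)
```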

Recently, tool-integrated reasoning has also emerged on top of tool use. The difference from CoT is that tools are invoked during the reasoning process itself: search results, code execution output, and various custom API responses are dynamically fed back into the reasoning chain, further enhancing reasoning capabilities. There are still relatively few write-ups on this topic, so I'd like to share my experience from reading papers and from hands-on work, in the hope of jointly advancing research in this field.


Some people might wonder how to write the prompt for agent tool use. The following example is a tool prompt rendered with a chat template, covering the chat template, tool schema, tool call, tool response, and so on. The system prompt generally contains the tool schema information; in the assistant's response, the thinking process is enclosed in <think></think> tags and tool-call parameters in <tool_call></tool_call> tags; the tool execution results are returned in the user turn inside <tool_response></tool_response> tags:

<|im_start|>system
In this environment you have access to a set of tools you can use to assist with the user query. You may perform multiple rounds of function calls. In each round, you can call one or more functions. Here are available functions in JSONSchema format:
```json
tool_schema
```
In your response, you need to first think about the reasoning process in the mind and then conduct function calling to get the information or perform the actions if needed. The reasoning process and function calling are enclosed within <think> </think> and <tool_call> </tool_call> tags. The results of the function calls will be given back to you after execution, and you can continue to call functions until you get the final answer for the user's question. Finally, if you have got the answer, enclose it within \boxed{} with latex format and do not continue to call functions, i.e., <think> Based on the response from the function call, I get the weather information. </think> The weather in Beijing on 2025-04-01 is \[ \boxed{20C} \].
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call><|im_end|>
<|im_start|>user
User 1 wants to return all books by Jane Doe they have checked out. Each book is 5 days overdue, what is the total late fee?
<|im_end|>
<|im_start|>assistant
<think> First, I need to find out which books by Jane Doe the user has checked out. Then, I will calculate the late fee for each book based on the number of days it is overdue. </think>
<tool_call>
{"name": "get_user_checked_out_books", "arguments": {"user_id": 1}}
</tool_call>
<tool_call>
{"name": "search_books_by_author", "arguments": {"author": "Jane Doe"}}
</tool_call>
<|im_end|>
<|im_start|>user
<tool_response>
{"name": "get_user_checked_out_books", "arguments": {"user_id": 1}}
['Python Basics', 'Advanced Python', 'Data Structures']
</tool_response>
<tool_response>
{"name": "search_books_by_author", "arguments": {"author": "Jane Doe"}}
[{'title': 'Python Basics', 'author': 'Jane Doe', 'copies_available': 3}, {'title': 'Advanced Python', 'author': 'Jane Doe', 'copies_available': 0}]
</tool_response>
<|im_end|>
<|im_start|>assistant
<think> ... ...
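Given a template like the one above, a rollout loop has to pull the JSON out of the <tool_call> tags, execute the tools, and feed the results back as a <tool_response> user turn. The sketch below shows one way to do that; the regex and helper names are illustrative assumptions, not from any particular framework.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(assistant_text: str):
    """Return the {"name": ..., "arguments": ...} objects found in <tool_call> tags."""
    calls = []
    for match in TOOL_CALL_RE.finditer(assistant_text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            # Malformed call; a format reward would typically penalize this rollout.
            continue
    return calls

def format_tool_response(call: dict, result) -> str:
    """Wrap an executed tool result as the next user-turn <tool_response> block."""
    return f"<tool_response>\n{json.dumps(call)}\n{result}\n</tool_response>"
```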

The rise of agent RL has brought hope to many who work on agent-based applications. Let's discuss the advantages of agent RL:

• External knowledge can be directly obtained through tool interaction, further improving the model's accuracy.

• DPO is a data-driven method that requires a large amount of preference data for training, and because it borrows ideas from contrastive learning it is very difficult to train well. PPO-series methods are online RL: training samples are generated by the current policy itself and then used to improve it, which requires far less annotated data than DPO.

Having discussed the advantages of agent RL, I'd also like to talk about its shortcomings.

• Truly complex tasks may require dozens of steps to complete. Because LLM performance degrades on long sequences and long-sequence computation is inefficient, existing RL frameworks still focus on tasks that can be completed in about 10 steps. Real-world tasks often require 30-100 steps, so there is still a long way to go before truly complex problems can be solved.

• Although GRPO with rule-based rewards simplifies the pipeline, it still requires annotated data, carefully designed rewards (see the sketch after this list), and finally hyperparameter tuning and data adjustment to achieve good results.

• RL requires an environment for training, typically a simulated one, and environment steps are certainly not as fast as GPU computation. Accelerating the environment so that it keeps pace with RL training is another consideration.

• Agent RL research mostly focuses on single tools, such as code interpreter-only, web search-only, etc., with less research on mixed multi-tool multi-turn calls.
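To make "carefully designed rewards" concrete, the sketch below combines a format reward and a correctness reward in the spirit of the rule-based rewards mentioned earlier. The weights (0.1 / 1.0) and the exact-match check are illustrative assumptions, not taken from any of the papers above.

```python
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Toy rule-based reward: format reward + correctness reward."""
    reward = 0.0
    # Format reward: thinking enclosed in <think> tags and a boxed final answer.
    if re.search(r"<think>.*?</think>", response, re.DOTALL) and "\\boxed{" in response:
        reward += 0.1
    # Correctness reward: exact match of the boxed answer against the reference.
    boxed = re.search(r"\\boxed\{(.+?)\}", response)
    if boxed and boxed.group(1).strip() == ground_truth.strip():
        reward += 1.0
    return reward
```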

Some might be confused about these RL algorithms; I'll briefly clarify them:

• PPO: Treats each generated token as an action; a separate value model (critic) estimates per-token advantages (typically with GAE), and the policy is updated with a clipped surrogate objective.
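For reference, a minimal sketch of that token-level clipped surrogate objective is given below; the variable names are my own, and the mask is the same generated-token mask used above to exclude tool outputs.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, mask, clip_eps=0.2):
    """Token-level PPO clipped surrogate loss.

    All tensors are (batch, seq_len); mask excludes tool-output and padding tokens.
    """
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    per_token = -torch.min(unclipped, clipped)
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```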


