The MLNLP community is a well-known machine learning and natural language processing community in China and abroad, reaching NLP master's and doctoral students, university professors, and industry researchers.
The community's vision is to promote communication and progress between academia, industry, and enthusiasts in natural language processing and machine learning, especially for beginners.
Source | PaperWeekly
With the rapid development of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) has become an essential path for AI to acquire knowledge. However, traditional RAG has a fatal flaw: it mechanically "looks up information once, answers once," leaving it helpless when faced with complex problems that require multi-layered, step-by-step reasoning. This is like asking a student who can only use a dictionary to work through a mathematical proof: destined to fail.
"Agentic RAG" emerged in response, enabling AI to act like human experts, autonomously deciding when to consult information, how to extract key questions, and how to integrate diverse information. Star projects like Deep-research are pioneers in this revolution.
Recent academic advances, such as the Search-R1 method, incorporate outcome-supervised reinforcement learning into Agentic RAG training, using the correctness of the final answer as the sole reward signal, and have achieved notable results. However, outcome-supervised strategies, which only care about whether the final answer is correct and use that single reward signal to guide the entire training process, are like teaching a child to solve problems by only saying "the answer is wrong" without pointing out where the mistake occurred.
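To make the contrast concrete, here is a minimal sketch of what such an outcome-only reward looks like; the exact-match normalization is an illustrative assumption, not necessarily how Search-R1 implements it:

```python
def outcome_reward(predicted_answer: str, gold_answer: str) -> float:
    """Outcome-only reward: 1 if the final answer matches the gold answer, 0 otherwise.

    Every intermediate step (query formulation, retrieval, evidence reading)
    receives this same single scalar, which is exactly the coarse signal that
    process supervision aims to refine.
    """
    def normalize(s: str) -> str:
        return " ".join(s.lower().strip().split())

    return 1.0 if normalize(predicted_answer) == normalize(gold_answer) else 0.0
```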
A research team from City University of Hong Kong and Huawei Noah's Ark Lab found that outcome-supervised reinforcement learning in Agentic RAG has three key issues:
• Blind and inefficient exploration: The model fumbles in the dark, only learning whether it was right or wrong after completing all the steps.
• Unclear credit assignment: Correct early reasoning steps are often wrongly "punished" due to subsequent errors.
• Overly coarse feedback: Lack of fine-grained guidance makes it difficult for the model to master complex decision-making skills.
The research team put forward a crucial insight: to train an Agentic RAG system that truly possesses "thinking ability," relying solely on the final answer as a reward is far from enough; every critical decision during the reasoning process should be precisely supervised and optimized.
Based on this concept, the team systematically introduced process-supervised reinforcement learning into the Agentic RAG training process for the first time, building a new framework called ReasonRAG. This method significantly improved model performance through three innovative mechanisms:
• Fine-grained reward mechanism
• Search-based optimal path construction
• Preference optimization training strategy
The experimental results are remarkable: on multiple authoritative evaluation sets, ReasonRAG, using only 5k training samples, surpassed the Search-R1 model, which required 90k samples, demonstrating excellent data efficiency and reasoning capability.
Paper Title:
Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning
Paper Address:
https://arxiv.org/abs/2505.14069
Code Address:
https://github.com/wlzhang2020/ReasonRAG
Technical Challenges
Achieving process-supervised optimization for Agentic RAG faces two core challenges:
• How to define high-quality process rewards? The reward should not only judge whether the model's reasoning is correct but also guide it to take the shortest and most effective path. Among equally correct answers, shorter paths should be encouraged more.
• How to automatically annotate process supervision data? High-quality intermediate steps usually require manual annotation, which is time-consuming, labor-intensive, and difficult to scale. Enabling the model to automatically generate supervised intermediate reasoning steps therefore becomes critical.
Core Technology Analysis
ReasonRAG constructs a tightly integrated, closed-loop reasoning system, with the entire path from reward design to model decision-making revolving around five key steps: setting process rewards → searching reasoning paths → constructing preference data → optimizing decision strategies → real-time dynamic reasoning. Together, these five steps teach the model to use search to complete reasoning paths that are "both accurate and fast."
Step One: The reward mechanism considers not just the result, but also the process. In traditional methods, models only score points if they get the answer right. ReasonRAG, however, "scores" each reasoning step, introducing Shortest Path Reward Estimation (SPRE). By simulating multiple paths, it rewards quick and accurate decisions and penalizes redundant, ineffective thoughts, teaching the model to "take fewer detours and hit the target more often."
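A rough sketch of how such a shortest-path-style process reward could be estimated from rollouts follows; the function names, the rollout interface, and the exponential length discount are illustrative assumptions rather than the paper's exact SPRE formulation:

```python
import statistics
from typing import Callable, List

def spre_reward(
    state: str,
    rollout_fn: Callable[[str], List[str]],   # samples one continuation: a list of remaining steps
    is_correct: Callable[[str], bool],         # checks whether the final step answers correctly
    num_rollouts: int = 8,
    length_penalty: float = 0.9,               # discount per extra reasoning step (assumed value)
) -> float:
    """Estimate a process reward for an intermediate reasoning state.

    Idea (paraphrasing the article): simulate several continuations from the
    current state; reward states whose continuations reach a correct answer
    quickly, and penalize those that need many redundant steps.
    """
    scores = []
    for _ in range(num_rollouts):
        steps = rollout_fn(state)              # e.g. ["query: ...", "evidence: ...", "answer: ..."]
        if steps and is_correct(steps[-1]):
            # Correct rollouts are discounted by their length: shorter paths score higher.
            scores.append(length_penalty ** (len(steps) - 1))
        else:
            scores.append(0.0)
    return statistics.mean(scores)
```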
Step Two: Reasoning paths are not chosen impulsively, but found with a tree. Faced with a vast number of possible thought paths, ReasonRAG doesn't rely on intuition but leverages Monte Carlo Tree Search (MCTS) to systematically search multi-turn combinations of "whether to retrieve, whether to answer." Each reasoning step is like navigating a maze, progressively approaching the optimal path through a state-action tree.
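A compact sketch of what MCTS over "retrieve or answer" decisions could look like; the action names, the UCB constant, and the single-evaluation backup are simplifying assumptions and not ReasonRAG's exact search procedure:

```python
import math
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Assumed action set for illustration; the paper's actual action space may differ.
ACTIONS = ["generate_query", "retrieve_evidence", "generate_answer"]

@dataclass
class Node:
    state: str
    parent: Optional["Node"] = None
    children: Dict[str, "Node"] = field(default_factory=dict)
    visits: int = 0
    value: float = 0.0

    def ucb(self, c: float = 1.4) -> float:
        # Unvisited nodes are explored first; otherwise balance value and uncertainty.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def mcts_select_action(
    root_state: str,
    step_fn: Callable[[str, str], str],   # (state, action) -> next state
    reward_fn: Callable[[str], float],    # e.g. an SPRE-style process reward
    iterations: int = 100,
) -> str:
    root = Node(root_state)
    for _ in range(iterations):
        # 1. Selection: descend via UCB while the node is fully expanded.
        node = root
        while node.children and len(node.children) == len(ACTIONS):
            node = max(node.children.values(), key=Node.ucb)
        # 2. Expansion: try one untried action from this node.
        action = random.choice([a for a in ACTIONS if a not in node.children])
        child = Node(step_fn(node.state, action), parent=node)
        node.children[action] = child
        # 3. Evaluation: score the new state with the process reward.
        reward = reward_fn(child.state)
        # 4. Backpropagation: update statistics along the path back to the root.
        while child is not None:
            child.visits += 1
            child.value += reward
            child = child.parent
    # Prefer the root action that was explored most often.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```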
Step Three: Preference samples are self-generated. A shortage of process-supervision data is not a problem; ReasonRAG simply generates its own dataset, RAG-ProGuide. In this dataset, the reasoning paths constructed through the first two steps are automatically scored and ranked, ultimately forming good-versus-bad comparison pairs that let the model optimize its decision preferences through reinforcement learning.
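A minimal sketch of how scored candidate responses at the same intermediate state could be turned into preference pairs; the data layout and the score-margin threshold are assumptions for illustration, not the actual RAG-ProGuide schema:

```python
from typing import Dict, List, Tuple

def build_preference_pairs(
    candidates: Dict[str, List[Tuple[str, float]]],
    margin: float = 0.1,   # assumed minimum score gap for a reliable pair
) -> List[dict]:
    """Turn scored candidate responses into DPO-style preference pairs.

    `candidates` maps an intermediate reasoning state (the prompt) to a list of
    (candidate_response, process_reward) pairs produced by the tree search.
    """
    pairs = []
    for prompt, scored in candidates.items():
        ranked = sorted(scored, key=lambda x: x[1], reverse=True)
        best_resp, best_score = ranked[0]
        worst_resp, worst_score = ranked[-1]
        if best_score - worst_score >= margin:
            pairs.append({
                "prompt": prompt,
                "chosen": best_resp,
                "rejected": worst_resp,
            })
    return pairs
```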
Step Four: Preference learning makes choices systematic. With clear preference comparisons in hand, ReasonRAG uses Direct Preference Optimization (DPO) to help the model progressively learn to make better decisions.
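For reference, the standard DPO loss over such (chosen, rejected) pairs looks like the following PyTorch sketch; the per-response log-probabilities are assumed to be summed over tokens, and beta is a generic default rather than the paper's hyperparameter:

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log-prob of the preferred response under the policy
    policy_rejected_logps: torch.Tensor,  # log-prob of the dispreferred response under the policy
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # temperature; a common default, not necessarily the paper's setting
) -> torch.Tensor:
    """Standard DPO objective applied to the preference pairs built above."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the implicit reward of the chosen step above that of the rejected step.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```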
Step Five: Flexible scheduling of the reasoning process. ReasonRAG defines a clear reasoning control flow: the model dynamically decides whether to retrieve or generate an answer based on the current task state, flexibly calling on its capability modules so that reasoning proceeds in an intelligent, orderly way.
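This control flow can be pictured as a simple loop in which the policy decides, turn by turn, whether to issue another query or commit to an answer; the "query:"/"answer:" action format and the turn cap below are assumptions for illustration, not the framework's exact protocol:

```python
from typing import Callable

def agentic_rag_answer(
    question: str,
    decide: Callable[[str], str],    # policy LLM: returns "query: ..." or "answer: ..."
    retrieve: Callable[[str], str],  # retriever: query -> concatenated evidence passages
    max_turns: int = 6,              # cap on reasoning turns (assumed safeguard)
) -> str:
    """Minimal agentic RAG loop: each turn, the model decides whether to
    retrieve more evidence or commit to a final answer."""
    context = f"Question: {question}\n"
    for _ in range(max_turns):
        decision = decide(context)
        if decision.startswith("answer:"):
            return decision[len("answer:"):].strip()
        if decision.startswith("query:"):
            query = decision[len("query:"):].strip()
            evidence = retrieve(query)
            context += f"Query: {query}\nEvidence: {evidence}\n"
        else:
            # Unexpected output: fold it back into the context as a thought.
            context += f"Thought: {decision}\n"
    # Fall back to forcing an answer if the turn budget is exhausted.
    return decide(context + "Answer:").strip()
```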
Experimental Results
Performance Comparison
The paper systematically compared ReasonRAG with 12 SOTA methods on five authoritative question-answering datasets. The results demonstrated ReasonRAG's significant advantages in data efficiency, multi-hop reasoning, and generalization ability:
High data efficiency: Using only 5k training samples, ReasonRAG surpassed Search-R1 (trained on 90k samples; EM 32.8%, F1 40.7%) in average EM (34.4%) and F1 (42.3%). Process rewards significantly outperformed traditional outcome rewards.
Stronger multi-hop reasoning: On HotpotQA, ReasonRAG achieved an F1 score of 48.9%, outperforming AutoRAG (43.7%) and Search-R1 (47.0%), demonstrating strong complex reasoning integration capabilities.
Good cross-domain generalization ability: ReasonRAG consistently showed leading performance on challenging test sets like Bamboogle and MuSiQue, indicating that its reasoning strategy possesses good transferability and robustness.
Training Efficiency
ReasonRAG's EM performance on PopQA, HotpotQA, and 2WikiMultiHopQA consistently increased faster than Search-R1's as GPU hours increased, indicating its higher training efficiency.
Optimization Strategy
The experiments further compared the effects of different optimization strategies, including the base model, Supervised Fine-Tuning (SFT), Outcome-based Reinforcement Learning (ORL), and Process-based Reinforcement Learning (PRL).
The results show that ReasonRAG achieved the best performance across all datasets, indicating that the fine-grained feedback mechanism provided by process rewards is more conducive to learning complex reasoning strategies.
Summary and Future Directions
ReasonRAG proposes a process reward-based Agentic RAG reinforcement learning training paradigm, demonstrating potential in training efficiency, complex reasoning ability, and generalization performance. Compared to traditional outcome supervision methods, process-level supervision provides more fine-grained and stable optimization signals, especially suitable for multi-turn, complex task learning.
Future directions include:
• Building a richer process reward system, introducing multi-dimensional feedback signals such as information redundancy penalties;
• Extending to more task scenarios, such as multimodal question answering, code reasoning, complex tool calling, and other agentic applications.
Invitation to Technical Exchange Group
To apply to join technical exchange groups on Natural Language Processing, PyTorch, and more, add the community assistant on WeChat and include a note in the format Name-School/Company-Research Direction (e.g., Xiao Zhang-Harbin Institute of Technology-Dialogue Systems).
About Us
The MLNLP community is a grassroots academic community jointly established by machine learning and natural language processing scholars from China and abroad. It has since grown into a well-known machine learning and natural language processing community worldwide, aiming to promote progress among academia, industry, and enthusiasts in machine learning and natural language processing.
The community provides an open exchange platform for practitioners pursuing further study, employment, and research. Everyone is welcome to follow and join us.