Here's another technology worth learning about. This time it's a framework from Alibaba, aimed at building general AGI capabilities: if self-learning can be achieved, large language models really could pull together knowledge across domains. To that end, the Alibaba team proposed ZeroSearch, which incentivizes the search capability of LLMs without ever interacting with a real search engine, by transforming an LLM into a retrieval module that generates both relevant and noisy documents in response to queries.
Reinforcement learning (RL) is currently one of the most promising strategies for training large models: it further improves LLM performance by strengthening reasoning and decision-making abilities. Notably, RL-trained models such as OpenAI-o1 and DeepSeek-R1 have achieved significant advances in logical and iterative reasoning purely through reward-driven learning, without relying on explicit step-by-step supervision (see DeepSeek-R1: In-depth analysis, the first step for domestic AGI).
Under this paradigm, some research explores using reinforcement learning to train policy models capable of searching for relevant information more effectively. DeepResearcher introduced real-time interaction with commercial search engines like Google, allowing models to be trained in an environment very similar to real-world web search (Magentic-One: Implementation of AI networked search, a general multi-agent solution). Despite these advances, combining RL with real-world search scenarios still faces significant challenges:
Uncontrolled document quality: The quality of documents retrieved from real-time search engines is often unpredictable, introducing noise and instability to the training process.
Excessive API costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of API calls, which incurs enormous financial cost and severely limits scalability.
Key points of this article:
Understanding ZeroSearch Architecture & Technical Principles
Understanding AI Knowledge
ZeroSearch
ZeroSearch is a reinforcement learning framework that enables LLMs to learn search strategies without interacting with real search engines. At its core is a simple observation: LLMs acquire extensive world knowledge during large-scale pre-training, which allows them to generate relevant documents in response to search queries (basically making their own cheat sheet from the textbook, huh?).
The main difference between a real search engine and a simulating LLM lies in the text style of the returned content. Through lightweight supervised fine-tuning or prompt-based constraints, the behavior of a real search engine can be simulated effectively. Besides eliminating API costs, a major advantage of using an LLM to generate documents is the ability to control document quality.
How does this work? During supervised fine-tuning, prompt design is used to distinguish documents that lead to correct answers from those that lead to incorrect ones, so the simulating LLM learns to generate either relevant or noisy documents when just a few words in the prompt are changed. Building on this, a curriculum rollout mechanism is introduced during training, in which the quality of the generated documents gradually decreases over time to simulate increasingly challenging retrieval scenarios. This lets the policy model first learn the basic output format and task requirements, and then progressively adapt to harder, noisier retrieval settings.
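To make this concrete, here is a minimal, hypothetical prompt sketch for the simulating LLM (the paper's actual wording differs); the key point is that flipping a single phrase switches between useful and noisy retrieval, and that the original question and its answer are included to widen the simulator's knowledge:

```python
# Hypothetical prompt template for the simulating LLM (not the paper's exact wording).
# Toggling one phrase ("relevant and helpful" vs. "plausible-looking but noisy")
# controls the quality of the generated documents.

SIM_SEARCH_PROMPT = (
    "You are simulating a search engine. Given the query below, write {quality} "
    "documents that a search engine might return.\n"
    "Original question: {question}\n"
    "Ground-truth answer: {answer}\n"
    "Query: {query}\n"
    "Documents:"
)

def build_sim_prompt(query: str, question: str, answer: str, useful: bool) -> str:
    """Return a simulation prompt that elicits useful or noisy documents."""
    quality = "five relevant and helpful" if useful else "five plausible-looking but noisy"
    return SIM_SEARCH_PROMPT.format(
        quality=quality, question=question, answer=answer, query=query
    )
```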
Regarding scalability, increasing the number of GPUs speeds up the simulating LLM's generation throughput, enabling efficient large-scale rollouts. Even a 3B LLM used as the simulated search engine is enough to effectively incentivize the policy model's search capability; a 7B retrieval module achieved performance comparable to Google Search, while a 14B retrieval module even surpassed it.
ZeroSearch is compatible with base and instruction-tuned models of various parameter sizes, without requiring a separate supervised warm-up phase (not even a warm-up? The author thinks Alibaba is boasting a bit). Furthermore, it integrates seamlessly with widely used reinforcement learning algorithms, including Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and REINFORCE++.
Architecture & Underlying Principles
Before getting into the core technical points, let's clarify what LLM retrieval means. LLM inference can itself be viewed as a kind of retrieval: the model picks the highest-probability token from a probability distribution (softmax) to predict and summarize, a bit like consulting an expert system (a browser, say). Of course, retrieval also comes in more complete forms: external tools (RAG) and reinforced chain-of-thought approaches.
RAG enhances generation performance by integrating relevant external knowledge into the generation process. It guides the LLM through processes like query generation, query decomposition, and multi-turn information retrieval. Although these methods are effective, they often require complex prompt engineering and place high demands on the model's reasoning capabilities. To improve efficiency and reduce dependence on powerful black-box LLMs, subsequent research proposed supervised fine-tuning strategies for smaller LLMs. However, these enhancements simultaneously introduce performance and time costs during deployment.
Self-RAG employs a self-reflection mechanism to iteratively refine model output through predicted reflection tokens.
RetroLLM integrates retrieval and generation capabilities by enabling the model to directly generate fine-grained evidence from the corpus through constrained decoding.
RAG-star integrates retrieved information into the Monte Carlo Tree Search (MCTS)-based reasoning process, dynamically expanding the search space during inference.
AirRAG adopts Monte Carlo Tree Search (MCTS) to activate intrinsic reasoning capabilities and expand the solution space.
Reinforced chain-of-thought is simple; it uses a DeepResearcher-like framework to set up an agent to retrieve necessary knowledge.
Zero Retrieval
Returning to the main text's definition of zero retrieval: the Alibaba team describes it as using an LLM to simulate a search engine, thereby eliminating the need for a real one. As shown below, the team demonstrates how two reinforcement learning algorithms (PPO and GRPO) are applied within the ZeroSearch framework. The rollout sequence contains both tokens generated by the policy model and document tokens returned by the simulating LLM.
Here's a problem: applying the same optimization procedure uniformly to two types of tokens can lead to training instability because the retrieved content is externally generated and not under the direct control of the policy model.
To mitigate this issue, the team introduced a loss masking mechanism for retrieved tokens, ensuring that gradients are calculated only for the model's own output. This strategy stabilizes the reinforcement learning training process while maintaining the effectiveness of retrieval-augmented generation.
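A minimal sketch of what such a loss mask might look like (hypothetical tensor names; the actual implementation is in the ZeroSearch repository's generation code): tokens that came back from the simulated search engine are simply excluded from the policy-gradient loss.

```python
import torch

def masked_policy_loss(per_token_loss: torch.Tensor, token_roles: torch.Tensor) -> torch.Tensor:
    """Average the per-token RL loss over policy-generated tokens only.

    per_token_loss: shape (T,), loss for each token of the rollout sequence.
    token_roles:    shape (T,), 1 where the token was generated by the policy model,
                    0 where it was inserted from the simulated search engine's documents.
    No gradient flows through externally generated document tokens.
    """
    mask = token_roles.float()
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1.0)
```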
The overall interaction is divided into three distinct stages: first, the model explicitly articulates its internal reasoning within <think>...</think> tags; next, if external evidence is needed, it issues a search query within <search>...</search> tags, and the returned documents are wrapped in <information>...</information> tags; finally, it gives its answer within <answer>...</answer> tags.
(Image from appendix)
Let's look at PPO first. It's a policy gradient algorithm that aims to maximize expected reward while keeping training stable by limiting the magnitude of each policy update. Its core idea: policy updates should not be too aggressive, so that a single update cannot collapse performance. The policy model π(θ) generates trajectories from the input question, while a separate value model (critic) estimates expected returns used to compute advantages. Several tag-related details also need handling, such as the reasoning (<think>), search (<search>), information (<information>), and answer (<answer>) segments.
GRPO, on the other hand, optimizes the policy through relative reward comparisons within a group, reducing reliance on absolute reward values and thereby improving training efficiency and stability. The policy model generates a group of trajectories (o1, ..., oG), each corresponding to a different search-query and answer-generation attempt. Based on the resulting rewards, the policy probabilities are adjusted according to the within-group ranking: high-reward behaviors are encouraged and low-reward behaviors suppressed. Finally, this feedback updates the policy model.
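As a rough sketch of the group-relative idea (standard GRPO practice, not ZeroSearch-specific code), each trajectory's reward is normalized against its group's mean and standard deviation to obtain an advantage, with no value model required:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (G,), rewards of the G trajectories sampled for one question.
    Returns group-normalized advantages: positive for above-average trajectories,
    negative for below-average ones."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```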
PPO is like a "strict coach": requiring students to improve scores on each test but prohibiting sudden changes in learning methods (like switching from rote memorization to divergent thinking).
GRPO is like a "competitive popular vote": students learn which methods are more effective through within-group rankings (e.g., Method A scores higher than Method B), thereby self-adjusting their strategy.
A bit confused, huh?
Let's summarize the above case in plain terms.
Suppose the policy model receives the question "Who is the author? Where does he live?". The model "thinks aloud", analyzing the question within <think> tags, issues a search query within <search> tags whenever it needs external evidence, reads the documents returned inside <information> tags, and finally writes its answer within <answer> tags.
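A toy trajectory under those assumptions might look like the following (illustrative only; the placeholders stand in for real content):

```text
<think> The question asks two things: who the author is and where they live.
I can't answer from memory, so I should search. </think>
<search> who is the author of <book title> </search>
<information> (documents returned by the simulated search engine) </information>
<think> The documents name the author; now I need their place of residence. </think>
<search> where does <author name> live </search>
<information> (documents returned by the simulated search engine) </information>
<answer> <author name>, who lives in <city> </answer>
```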
To achieve this, Alibaba proposed a formula to maximize the reward (answer accuracy) while constraining the difference between the policy model π(θ) and the reference model π(ref) through KL divergence, ensuring stable and controllable policy updates.
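Reconstructed from the symbol definitions below (and, as far as I can tell, matching the paper's formulation), the objective looks roughly like this:

$$
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x;\, \pi_\psi)}\bigl[r_\phi(x, y)\bigr]\;-\;\beta\, \mathbb{D}_{\mathrm{KL}}\bigl[\pi_\theta(y \mid x;\, \pi_\psi)\,\|\,\pi_{\mathrm{ref}}(y \mid x;\, \pi_\psi)\bigr]
$$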
Here, π(θ) is the policy model being optimized, responsible for generating search queries and the final answer. π(ref) is the reference model (usually the initial policy model), used to constrain policy updates and prevent excessive drift. r(ϕ) is the reward function, providing feedback based on answer accuracy. π(ψ) is the simulated search-engine LLM, whose parameters stay fixed, generating documents from queries. β is the weight on the KL divergence term, balancing reward maximization against policy stability. The overall goal is to find the output y from π(θ) that maximizes the reward while staying close to π(ref).
Another key point is the design of the reward function, which is a core mechanism. First, the reward guides the model's learning direction: it quantifies correctness based on how well the generated answer y matches the reference answer, so a fully correct answer earns a high reward and a wrong one a low reward. Second, it prevents reward hacking. Relying on Exact Match (EM) alone would let the model generate overly long answers to "get lucky" and include the correct answer somewhere; instead, the score considers both precision (the proportion of the predicted answer that is correct) and recall (the proportion of the reference answer that is covered), encouraging concise and accurate answers. Finally, there is dynamic adjustment: the size of the reward directly influences how the parameters of the policy model π(θ) are updated, gradually steering it toward retrieval and reasoning behaviors that earn high rewards.
Fine-tuning
The team proposed a lightweight Supervised Fine-Tuning (SFT) procedure. Specifically, interaction trajectories are collected by prompting the LLM to interact with a real search engine in a multi-turn manner until a final answer is reached. Trajectories that produced correct answers are labeled as positive, indicating useful retrieved documents. Conversely, trajectories that led to incorrect answers are labeled as negative, indicating noisy retrieval results.
Then, the team extracts query-document pairs from the positive and negative trajectories and performs lightweight SFT to improve the LLM's ability to simulate a real search engine. As shown below, adjusting a few words in the prompt is enough to distinguish useful retrieval from noisy retrieval. Additionally, the input question and its corresponding answer are incorporated into the prompt to broaden the LLM's knowledge boundary. After fine-tuning, the LLM can generate either useful or noisy documents on demand, enabling dynamic control of document quality during rollout.
(Image from appendix)
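A rough sketch of how such SFT data might be assembled (hypothetical field names; the paper describes the procedure only in prose): trajectories are labeled by answer correctness, and every query-document pair inherits that label so the simulator can later be steered toward useful or noisy generation.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    query: str
    documents: list[str]

@dataclass
class Trajectory:
    question: str
    gold_answer: str
    turns: list[Turn]
    final_answer: str

def build_sft_examples(trajectories: list[Trajectory]) -> list[dict]:
    """Turn search trajectories into SFT examples for the simulating LLM.
    Trajectories ending in a correct answer contribute 'useful' examples;
    the rest contribute 'noisy' examples."""
    examples = []
    for traj in trajectories:
        correct = traj.final_answer.strip().lower() == traj.gold_answer.strip().lower()
        label = "useful" if correct else "noisy"
        for turn in traj.turns:
            examples.append({
                "prompt": (
                    f"Generate {label} documents for the query: {turn.query}\n"
                    f"Question: {traj.question}\nAnswer: {traj.gold_answer}"
                ),
                "completion": "\n".join(turn.documents),
            })
    return examples
```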
Automated Learning Mechanism
Through the policy model design and prompts described above, the team runs a fully automated agent loop for self-learning. In this process, the policy model performs interactive reasoning and generates search queries, which are fed to the simulating LLM to produce the corresponding documents. To gradually raise the training difficulty, a curriculum-based rollout mechanism is introduced in which the quality of the generated documents decreases over time, controlled by a probability function.
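Reconstructed from the symbol definitions that follow (and, as far as I can tell, matching the paper's curriculum schedule):

$$
p_i \;=\; p_s \;+\; \frac{b^{\,i/m} - 1}{b - 1}\,\bigl(p_e - p_s\bigr)
$$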
Here, p(s) and p(e) are the initial and final noise probabilities, i and m are the current training step and the total number of training steps, and b is the base of the exponent, with a default value of 4. As training progresses, the ratio i/m grows and p(i) rises with it; in other words, the probability of generating noisy documents starts low and increases over time. This lets the policy model first learn the basic output structure and task requirements, and then progressively adapt to more challenging, noisier retrieval scenarios.
The reward signal serves as the primary supervision during reinforcement learning. Here the team adopted a rule-based reward function that focuses solely on answer accuracy. In preliminary experiments, they observed that using Exact Match (EM) as the reward metric often led to reward hacking: the policy model tended to generate overly long answers to increase the chance of including the correct answer somewhere (padding its answer to cheat, right?). To mitigate this, the team adopted an F1-based reward function, which balances precision and recall. It is calculated as follows:
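The formula itself appears to have been an image in the original; reconstructed from the variable definitions below:

$$
r_\phi(x, y) \;=\; \frac{2 \times \mathrm{IN}}{\mathrm{PN} + \mathrm{RN}}
$$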
Where IN represents the number of words overlapping between the predicted result and the true result, PN represents the number of words in the predicted result, and RN represents the number of words in the true result.
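A minimal word-level implementation under those definitions (the exact tokenization and normalization used in the paper may differ):

```python
from collections import Counter

def f1_reward(prediction: str, reference: str) -> float:
    """Word-level F1 between predicted and reference answers:
    2 * IN / (PN + RN), where IN counts overlapping words."""
    pred_words = prediction.lower().split()
    ref_words = reference.lower().split()
    overlap = Counter(pred_words) & Counter(ref_words)  # IN, counted with multiplicity
    in_count = sum(overlap.values())
    if in_count == 0:
        return 0.0
    return 2.0 * in_count / (len(pred_words) + len(ref_words))
```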
Performance Comparison
To evaluate ZeroSearch's effectiveness, the team applied their method to open Qwen models of different sizes and compared it against the following baselines:
Original Prompt Methods: This category includes direct prompting, Chain-of-Thought (CoT), and standard Retrieval-Augmented Generation (RAG).
Advanced RAG Methods: This category includes RAgent and Search-o1, which search for relevant information iteratively.
Reinforcement Learning Tuning Methods: This category includes R1 and Search-R1. In R1, the policy model is trained based solely on its internal knowledge to perform deep reasoning.
(Image from appendix)
ZeroSearch consistently outperformed all baseline methods. This performance advantage was evident in both in-domain datasets (e.g., NQ and HotpotQA) and out-of-domain datasets (e.g., TriviaQA, PopQA, 2WikiMultiHopQA, Musique, and Bamboogle), fully demonstrating the robustness of the team's method.
Furthermore, ZeroSearch surpassed methods relying on real search engines. Compared to Search-R1, which uses real search engines, ZeroSearch achieved better performance, highlighting its potential as an effective alternative to real search engines in large-scale reinforcement learning. ZeroSearch also demonstrated strong generalization capabilities. Across different model families, parameter sizes, and types (e.g., base models or instruction-tuned models), ZeroSearch consistently outperformed the baseline models. Moreover, its performance further improved with increasing model scale, emphasizing its scalability.
Conclusion
ZeroSearch is a novel reinforcement learning framework that enhances LLMs' search capabilities without interacting with real search engines. Through supervised fine-tuning, the LLM is transformed into a retrieval module capable of generating relevant and noisy documents. The overall design employs a curriculum rollout mechanism to progressively improve reasoning ability by exposing the model to increasingly challenging retrieval scenarios. Experimental results show that ZeroSearch's performance surpasses models based on real search, exhibits good generalization ability across base LLMs and instruction-tuned LLMs of different sizes, and supports various reinforcement learning algorithms.
Appendices:
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
https://arxiv.org/html/2505.04588v1
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
https://github.com/Alibaba-NLP/ZeroSearch/blob/main/llm_agent/generation.py