Today, mainstream large language model vendors, and even companies building LLM agents, rely almost without exception on external online search APIs to obtain up-to-date data. Calling real search engines, however, means uncontrollable document quality and high API costs. To address these problems, Alibaba's Tongyi team has open-sourced a new solution, ZeroSearch: a novel reinforcement learning framework that trains "search capability" without ever interacting with a real search engine. The following is a complete translation of the paper "ZeroSearch: Incentivize the Search Capability of LLMs without Searching". Enjoy.
Introduction
Effective information search is crucial for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve the search capabilities of LLMs by having them interact with real-time search engines in real-world environments. While these methods achieve encouraging results, they face two major challenges: (1) Uncontrolled document quality: the quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Excessive API costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incurs significant API overhead and severely limits scalability. To address these challenges, we introduce ZeroSearch (ZS), a reinforcement learning framework that incentivizes the search capabilities of LLMs without interacting with real search engines. Our approach begins with lightweight supervised fine-tuning that transforms an LLM into a retrieval module capable of generating both relevant and noisy documents in response to queries. During RL training, we employ a curriculum-based rollout strategy that gradually degrades the quality of generated documents, progressively strengthening the model's reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments show that ZS can effectively incentivize the search capabilities of LLMs even when using a 3B LLM as the retrieval module. Notably, a 7B retrieval module matches the performance of a real search engine, while a 14B retrieval module even surpasses it. Furthermore, ZS generalizes well across base and instruction-tuned models of various parameter sizes and is compatible with a range of RL algorithms.
1. Introduction
Large language models (LLMs) have demonstrated remarkable performance across a range of downstream tasks, including mathematical reasoning, question answering, and code generation. However, the knowledge encoded within these models is inherently static, limited by the scope of the data encountered during pre-training. Consequently, LLMs remain prone to generating fabricated or outdated information, which undermines their reliability in practical applications. Therefore, enabling LLMs to access external information sources is crucial for generating more accurate and well-grounded responses.
A widely adopted approach to this issue is Retrieval Augmented Generation (RAG), which incorporates external knowledge into the generation process. Early work in this area focused on prompt-based strategies that guide LLMs through query generation, query decomposition, and multi-turn information retrieval. While effective, these strategies often require meticulous prompt engineering and place high demands on the model's reasoning capabilities. To improve efficiency, subsequent research explored supervised fine-tuning (SFT) to enhance the performance of smaller LLMs. Further advances have focused on test-time scaling techniques, such as Monte Carlo Tree Search (MCTS), which dynamically expand the search space during inference. Although promising, these methods introduce significant computational overhead, posing challenges for practical deployment.
Recently, reinforcement learning (RL) has emerged as a promising strategy for further improving LLM performance by enhancing reasoning and decision-making abilities. Notably, RL-based models such as OpenAI-o1 and DeepSeek-R1 have achieved significant advances in logical and iterative reasoning, driven entirely by reward signals rather than explicit step-by-step supervision. Within this paradigm, several studies have explored using RL to train policy models that can effectively search for relevant information. Representative examples include Search-R1, R1-Searcher, and ReSearch. Notably, DeepResearcher introduced real-time interaction with commercial search engines such as Google, allowing the model to be trained in an environment that closely resembles real-world web search. Despite these advances, combining RL with real-world search scenarios still presents significant challenges: (1) Uncontrolled document quality: the quality of documents retrieved from real-time search engines is often unpredictable, introducing noise and instability into the training process. (2) Excessive API costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of API calls, which incurs substantial financial costs and severely limits scalability.
To address these challenges, we propose ZS, a reinforcement learning framework that enables LLMs to learn search strategies without interacting with real search engines. Our key insight is that LLMs acquire extensive world knowledge during large-scale pre-training and are capable of generating relevant documents in response to search queries; the primary difference between a real search engine and a simulation LLM lies in the textual style of the returned content.
However, through lightweight supervised fine-tuning, even relatively small LLMs can effectively simulate the behavior of real search engines. Besides eliminating API costs, a significant advantage of using LLMs for document generation is the ability to control document quality. Specifically, during supervised fine-tuning, by designing prompts that differentiate documents leading to correct or incorrect answers, the simulated LLM can learn to generate relevant or noisy documents by adjusting a few words in the prompt. Building on this, we introduce a Curriculum Rollout mechanism during training, where the quality of generated documents gradually decreases over time to simulate increasingly challenging retrieval scenarios. This allows the policy model to first learn basic output formats and task requirements before gradually adapting to more challenging and noisier retrieval scenarios. More importantly, ZS exhibits strong scalability: increasing the number of GPUs can significantly accelerate the generation throughput of the simulated LLM, enabling efficient large-scale rollouts. Empirical results show that even using a 3B LLM as the simulated search engine can effectively incentivize the policy model's search capabilities. The 7B retrieval module's performance is comparable to Google Search, while the 14B retrieval module even surpasses Google Search. ZS is compatible with base models and instruction-tuned models of various parameter sizes and does not require a separate supervised warm-up phase. Furthermore, it seamlessly integrates with widely used reinforcement learning algorithms, including Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Reinforce++.
Our contributions can be summarized as follows:
We propose ZS, a novel reinforcement learning framework that incentivizes LLM search capabilities without interacting with real search engines.
Through supervised fine-tuning, we transform an LLM into a retrieval module capable of generating both relevant and noisy documents based on queries. We further introduce a curriculum rollout mechanism that progressively enhances reasoning capabilities by exposing the model to increasingly challenging retrieval scenarios.
We conducted extensive experiments on in-domain and out-of-domain datasets. The results show that ZS outperforms models based on real search engines while incurring no API costs. Furthermore, it demonstrates excellent generalization across base LLMs and instruction-tuned LLMs of various parameter sizes and supports different reinforcement learning algorithms.
2. Related Work
2.1 Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) enhances generation performance by integrating relevant external knowledge into the generation process. Early studies primarily employed prompt-based methods, guiding LLMs through processes such as query generation, query decomposition, and multi-turn information retrieval. Although these methods are effective, they often require complex prompt engineering and impose high demands on the model's reasoning capabilities. To improve efficiency and reduce reliance on strong black-box LLMs, subsequent research explored supervised fine-tuning (SFT) to enhance the performance of smaller LLMs. For example, Self-RAG employs a self-reflection mechanism to iteratively refine model output through predicted reflection tokens. RetroLLM integrates retrieval and generation by enabling the model to generate fine-grained evidence directly from a corpus through constrained decoding. Recent advancements also include Test-Time Scaling techniques, particularly Monte Carlo Tree Search (MCTS), which can dynamically expand the search space during inference. For instance, RAG-star integrates retrieved information into tree-based reasoning, while AirRAG uses MCTS to activate intrinsic reasoning capabilities and expand the solution space. Despite achieving encouraging results, these methods introduce significant computational overhead, limiting their practical application.
2.2 Learning to Search via Reinforcement Learning
In recent years, reinforcement learning (RL) has emerged as a promising paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). Prominent RL-based models such as OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable logical and iterative reasoning abilities, driven purely by reward signals without explicit step-by-step supervision. Several studies have also explored RL techniques specifically for training models to perform effective information retrieval. For example, Search-R1 utilizes reinforcement learning to autonomously generate multiple search queries during step-by-step reasoning. Similarly, R1-Searcher proposes a two-stage, outcome-based RL approach aimed at enhancing search capabilities. ReSearch utilizes RL to teach models to reason through searching, entirely without supervision on intermediate reasoning steps. However, these methods typically use static local text corpora (e.g., Wikipedia) and fail to capture the complexity of real-world interaction. To bridge this gap, DeepResearcher introduced direct interaction with commercial search engines like Google, allowing the training environment to closely align with real-world search scenarios. While these real-time retrieval methods have achieved outstanding performance, they also face numerous challenges, such as unpredictable document quality and excessive API costs, factors that negatively impact system scalability. To overcome these limitations, we propose ZS, a method that uses LLMs to simulate real-time search, effectively eliminating reliance on expensive, rate-limited real search APIs. Through lightweight supervised fine-tuning, ZS can explicitly control document quality and enable a curriculum rollout mechanism, thereby enhancing training stability and robustness.
3. ZeroSearch
In this section, we first formally state the reinforcement learning objective without using search engines. Then, we detail the design of ZS, covering the training template, search simulation tuning, curriculum-based Rollout strategy, reward design, and training algorithm.
3.1 Reinforcement Learning without Search Engines
We propose a reinforcement learning framework that eliminates the need for real search engines by leveraging LLMs to simulate them. The optimization objective formula is as follows:
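In the standard KL-regularized form assumed here (consistent with the symbol definitions that follow):

$$
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x;\, \pi_\psi)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\, \pi_\theta(y \mid x;\, \pi_\psi) \,\big\|\, \pi_{\mathrm{ref}}(y \mid x;\, \pi_\psi) \,\big]
$$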
Where πθ is the policy model to be optimized, πref is the reference model, and rϕ represents the reward function. πψ represents the simulated LLM, whose parameters remain fixed throughout the training process.
Figure 1: PPO and GRPO training demonstration without using search engines.
3.2 Training Template
Table 1: Training Template. During training and inference, the question is appended at the end.
In ZS, rather than relying on supervised fine-tuning of the policy model, we follow prior work and apply a multi-turn interaction template that guides the policy model through iterative reasoning and information retrieval until a final answer is reached.
As shown in Table 1, the interaction is divided into three distinct phases: first, the model thinks within <think>...</think> tags; second, if additional evidence is required, it searches within <search>...</search> tags; finally, once sufficient information has been retrieved, the model answers within <answer>...</answer> tags. This clear separation of thinking, searching, and answering enforces a structured decision-making process, enhancing model transparency and reliability.
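To make the three-phase format concrete, the sketch below shows one way a rollout produced under this template could be parsed; the tag names follow Table 1, while the function and the example rollout are illustrative assumptions rather than part of the paper.

```python
import re

def parse_rollout(text: str) -> dict:
    """Split a rollout produced under the ZS-style template into its phases.
    Illustrative sketch: assumes the policy emits <think>, <search>, and
    <answer> spans as described in Table 1."""
    return {
        "thoughts": re.findall(r"<think>(.*?)</think>", text, re.DOTALL),
        "queries": re.findall(r"<search>(.*?)</search>", text, re.DOTALL),
        "answer": (re.findall(r"<answer>(.*?)</answer>", text, re.DOTALL) or [None])[-1],
    }

rollout = (
    "<think>I need the capital of Australia.</think>"
    "<search>capital of Australia</search>"
    "<think>The documents say it is Canberra.</think>"
    "<answer>Canberra</answer>"
)
print(parse_rollout(rollout))
```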
3.3 Search Simulation Tuning
During the Rollout process, we use an LLM to simulate a real search engine, generating documents based on queries. A simple approach is to directly prompt the LLM to generate documents. However, this often results in a significant stylistic gap compared to the output of real search engines.
To bridge this gap, we propose a lightweight supervised fine-tuning (SFT) procedure. Specifically, we first collect interaction trajectories by prompting the LLM to interact with a real search engine in a multi-turn manner until a final answer is reached. Trajectories yielding correct answers are labeled as positive, indicating useful retrieved documents; trajectories leading to incorrect answers are labeled as negative, indicating noisy retrieval results.
Then, we extract query-document pairs from positive and negative trajectories and perform lightweight SFT to enhance the LLM's ability to simulate real search engines. As shown in Table 2, useful and noisy retrievals can be distinguished by adjusting a few words in the prompt. Furthermore, we incorporate the input question and its corresponding answer into the prompt to broaden the LLM's knowledge boundary. After fine-tuning, the LLM can generate both useful and noisy documents, enabling dynamic document quality control during the Rollout process.
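As a rough illustration of how document quality can be switched by adjusting a few words, the sketch below builds a simulation prompt in two modes. The exact prompt wording of Table 2 is not reproduced in this translation, so the template text, function name, and parameters here are assumptions.

```python
def build_simulation_prompt(query: str, question: str, answer: str, useful: bool) -> str:
    """Hypothetical prompt for the fine-tuned simulation LLM.
    Only one word ('useful' vs. 'noisy') changes between the two modes; the
    original question and its answer are included to broaden the knowledge
    boundary, as described in Section 3.3."""
    mode = "useful" if useful else "noisy"
    return (
        f"You are a search engine. Generate five {mode} documents for the query below.\n"
        f"Query: {query}\n"
        f"Original question: {question}\n"
        f"Reference answer: {answer}\n"
    )

print(build_simulation_prompt("capital of Australia",
                              "What is the capital of Australia?",
                              "Canberra", useful=False))
```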
3.4 Rollout with Curriculum Search Simulation
During the Rollout process, the policy model performs interactive reasoning and generates search queries, which are fed into the simulated LLM to generate corresponding documents. To gradually increase training difficulty, we introduce a curriculum-based Rollout mechanism where the quality of generated documents gradually decreases over time. This is controlled by a probability function pi, which determines the likelihood of generating a noisy document at step i:
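Assuming the exponential interpolation that the definitions below describe, the schedule can be written as:

$$
p_i = p_s + \frac{b^{\,i/m} - 1}{b - 1}\,\big(p_e - p_s\big)
$$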
Where ps and pe represent the initial and final noise probabilities, i and m represent the current and total training steps, and b is the exponential base, defaulting to 4. As training progresses, the ratio i/m increases, leading to a higher value of pi, meaning a greater probability of generating noisy documents. This allows the policy model to first learn the basic output structure and task requirements before gradually adapting to more challenging and noisier retrieval scenarios.
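A minimal sketch of this curriculum, assuming the exponential form above; the function names and the default values of p_s, p_e, and b used here are illustrative.

```python
import random

def noise_probability(step: int, total_steps: int, p_start: float = 0.0,
                      p_end: float = 0.75, base: float = 4.0) -> float:
    """Probability of requesting a noisy document at the given training step
    (exponential interpolation from p_start to p_end; defaults are illustrative)."""
    ratio = step / total_steps
    return p_start + (base ** ratio - 1) / (base - 1) * (p_end - p_start)

def sample_document_mode(step: int, total_steps: int) -> str:
    """Decide whether the simulation LLM is asked for a noisy or a useful document."""
    return "noisy" if random.random() < noise_probability(step, total_steps) else "useful"

for step in (0, 50, 100):
    print(step, round(noise_probability(step, 100), 3))
```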
3.5 Reward Design
Reward signals serve as the primary supervision in the reinforcement learning process. In this study, we adopt a rule-based reward function that solely focuses on answer accuracy. In preliminary experiments, we observed that using Exact Match (EM) as the reward metric often led to reward cheating: the policy model tended to generate overly long answers to increase the probability of including the correct answer. To mitigate this issue, we employ an F1-score-based reward function, which balances precision and recall, calculated as follows:
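With the quantities defined below, this is the standard word-level F1:

$$
\mathrm{precision} = \frac{I_N}{P_N}, \qquad \mathrm{recall} = \frac{I_N}{R_N}, \qquad r_\phi = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} = \frac{2\, I_N}{P_N + R_N}
$$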
Where IN represents the number of words overlapping between the predicted and true results, PN represents the number of words in the predicted result, and RN represents the number of words in the true result. We did not include additional rewards for output format, as we observed that the model consistently generated well-formatted responses without explicit supervision.
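A minimal implementation sketch of this reward, assuming simple whitespace tokenization and bag-of-words overlap; the normalization details are an assumption.

```python
from collections import Counter

def f1_reward(prediction: str, ground_truth: str) -> float:
    """Word-level F1 between the predicted and gold answers, used as the scalar reward.
    Tokenization and normalization here are simplifying assumptions."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    if not pred_tokens or not gold_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())  # I_N
    if overlap == 0:
        return 0.0
    return 2 * overlap / (len(pred_tokens) + len(gold_tokens))  # 2 * I_N / (P_N + R_N)

print(f1_reward("the capital is canberra", "canberra"))  # 0.4
```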
3.6 Training Algorithm
Our method is compatible with various reinforcement learning algorithms, including Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Reinforce++, each offering unique advantages in optimizing retrieval-augmented reasoning.
In ZS, the rollout sequence includes tokens generated by the policy model and document tokens returned by the simulated LLM. Applying the same optimization procedure uniformly to both types of tokens can lead to training instability because the retrieved content is externally generated and not directly controlled by the policy model.
To alleviate this issue, we introduce a loss masking mechanism for the retrieved tokens, ensuring that gradients are computed only for the model's own output. This strategy stabilizes the reinforcement learning training process while maintaining the effectiveness of retrieval-augmented generation.
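A sketch of how such a mask might be applied when computing a token-level loss; the tensor names and the cross-entropy form used here are illustrative stand-ins, not the paper's actual PPO/GRPO objective.

```python
import torch
import torch.nn.functional as F

def masked_token_loss(logits: torch.Tensor, target_ids: torch.Tensor,
                      is_policy_token: torch.Tensor) -> torch.Tensor:
    """Per-token loss over the rollout, zeroed on tokens produced by the simulated
    search engine so that gradients flow only through the policy model's own output.

    logits:          [batch, seq_len, vocab]
    target_ids:      [batch, seq_len]
    is_policy_token: [batch, seq_len] bool mask (False for retrieved document tokens)
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        reduction="none",
    ).reshape(target_ids.shape)
    mask = is_policy_token.float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```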
4 Main Results
4.1 Datasets and Evaluation Metrics
We evaluated ZS on a range of different question-answering benchmarks: (1) Single-hop QA, including NQ, TriviaQA, and PopQA. (2) Multi-hop QA, including HotpotQA, 2WikiMultiHopQA, Musique, and Bamboogle.
Following prior work, we use Exact Match (EM) as the evaluation metric. A prediction is considered correct if its normalized form exactly matches any one of the normalized ground-truth answers.
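A sketch of this metric under one common normalization scheme (lowercasing, stripping punctuation and articles); the exact normalization used in the paper is an assumption.

```python
import re
import string

def normalize(text: str) -> str:
    """Common QA answer normalization: lowercase, drop punctuation and articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """A prediction counts as correct if it matches any normalized gold answer."""
    return any(normalize(prediction) == normalize(gold) for gold in gold_answers)

print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))  # True
```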
4.2 Baselines
To evaluate the effectiveness of ZS, we compared our method against the following baselines: (1) Vanilla prompting methods, including direct prompting, Chain-of-Thought (CoT), and standard Retrieval Augmented Generation (RAG). (2) Advanced RAG methods: we consider RAgent and Search-o1, which iteratively search for relevant information. (3) RL fine-tuning methods, including R1 and Search-R1. In R1, the policy model is trained to perform deep reasoning based solely on its internal knowledge; in contrast, Search-R1 enables the policy model to interact with a real search engine multiple times during reasoning.
To ensure a fair comparison, we used the F1 score as the reward metric for all RL methods. Notably, among RL-based search baselines, we only compare with Search-R1 because it avoids complex reward design, data selection, or tedious training procedures. This setup allows for a direct and fair comparison between real search engines and our simulated search engine.
4.3 Experimental Setup
We conducted experiments with three backbone model series: Qwen-2.5-7B (Base/Instruct), Qwen-2.5-3B (Base/Instruct), and LLaMA-3.2-3B (Base/Instruct). To simulate real retrieval scenarios, we used Google Web Search via SerpAPI as the external search engine. To ensure a fair comparison, the number of retrieved documents was fixed at 5 for all methods.
For datasets, we followed the setup in Search-R1 and merged the training sets of NQ and HotpotQA, creating a unified dataset for all fine-tuning-based methods. We evaluated on seven datasets to assess in-domain and out-of-domain performance. For prompt-based baseline models, we used the Instruct models, as Base models often struggle to follow task instructions. For RL-based methods, we evaluated both Base and Instruct variants to assess generality across model types.
To train the simulated LLM, we performed lightweight SFT using Qwen-2.5-3B, Qwen-2.5-7B, and Qwen-2.5-14B as backbone networks. The learning rate was set to 1e-6. To train ZS, we employed two reinforcement learning algorithms: GRPO and PPO. In the GRPO setting, the policy LLM was trained with a learning rate of 1e-6, sampling 5 responses per prompt. In the PPO setting, the policy LLM was trained with a learning rate of 1e-6, while the value model was trained with a separate learning rate of 1e-5. We applied Generalized Advantage Estimation (GAE) with hyperparameters λ = 1 and γ = 1. Unless otherwise specified, GRPO was used as the default reinforcement learning algorithm, and Qwen-2.5-14B was used as the default simulated LLM in all experiments.
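For reference, the hyperparameters above can be gathered into a single configuration sketch; the key names below are illustrative and not tied to any particular training framework.

```python
zerosearch_config = {
    "policy_lr": 1e-6,                        # policy model learning rate (GRPO and PPO)
    "value_lr": 1e-5,                         # value model learning rate (PPO only)
    "grpo_samples_per_prompt": 5,             # responses sampled per prompt under GRPO
    "gae_lambda": 1.0,                        # GAE lambda
    "gae_gamma": 1.0,                         # GAE gamma
    "default_rl_algorithm": "GRPO",
    "default_simulation_llm": "Qwen-2.5-14B (SFT)",
    "retrieved_docs_per_query": 5,
    "simulation_sft_lr": 1e-6,                # learning rate for simulation-LLM SFT
}
```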
4.4 Performance
Table 3 below shows the comparison of ZS with several baseline methods on 7 datasets. Based on the results, several key observations can be made:
Table 3: Main results using different LLMs as backbone models. Best performance is shown in bold.
ZS consistently outperforms all baseline methods. This advantage holds on both the in-domain datasets (NQ and HotpotQA) and the out-of-domain datasets (TriviaQA, PopQA, 2WikiMultiHopQA, Musique, and Bamboogle), demonstrating the robustness of our method.
ZS surpasses methods that rely on real search engines. Compared to Search-R1, which uses real search engines, ZS achieved better performance, highlighting its potential as an effective alternative to real search engines in large-scale reinforcement learning.
ZS exhibits strong generalization capability. Across different model families, parameter sizes, and types (e.g., base models or instruction-tuned models), ZS consistently outperforms the baseline models. Furthermore, performance improves as the model size increases, demonstrating its scalability.
5 Further Analysis
5.1 Comparison between ZS and Real Search Engine
We compared the reward curves of ZS and Search-R1 (which uses a real search engine) on LLaMA-3.2-3B, as shown in Figures 2a and 2b below. Several key observations can be drawn:
The overall reward trends of both methods are similar. As training progresses, the reward scores for both ZS and Search-R1 steadily increase, indicating that the policy models in both settings can effectively learn to interact with search engines and produce correct answers.
ZS achieves a more stable and smoother learning curve. As shown in Figure 2b, ZS initially lagged behind Search-R1 but eventually surpassed it with smaller fluctuations, thanks to the curriculum mechanism helping the model gradually master the use of search tools.
ZS generalizes well across both base and instruction-tuned models. In both model types, ZS steadily improved reward performance, highlighting its generality.
Figure 2: (a-b): Reward curve comparison between ZS and Search-R1 using LLaMA-3.2-3B. (c): Interaction turns and reward progression during LLaMA-3.2-3B-Base training.
Table 4: Performance of simulated search engines using different LLM configurations. We compared prompt-based and fine-tuned simulated LLMs (3B to 14B) with Google Search.
5.2 Choice of Large Language Models
In this section, we investigate how different simulated search engine configurations affect performance, covering prompt-based and fine-tuned LLMs ranging from 3B to 14B parameters. Based on the results in Table 4, we make the following observations:
First, the performance of the fine-tuned 7B simulated engine (SFT-7B) is comparable to Google Search, while the 14B version (SFT-14B) even surpasses Google Search. This demonstrates the feasibility of using trained LLMs as alternatives to real search engines in reinforcement learning environments.
Second, the performance of fine-tuned simulated engines is significantly better than prompt-based engines. Although prompt-based methods explicitly simulate the response style of real search engines, a significant distribution gap remains, leading to poorer performance.
Third, performance continues to improve with increasing model scale. Larger simulated LLMs not only exhibit stronger simulation capabilities but also more accurately distinguish between relevant and irrelevant documents, enabling more effective curriculum learning during training.
5.3 Interaction Turn Study
In this section, we use the LLaMA-3.2-3B-Base model to analyze the training dynamics of ZS by examining reward progression and the number of interaction turns during training. The results are shown in Figure 2c above.
In the early stages of training, the number of interactions decreases sharply, while the reward increases slowly. This is mainly because the policy model initially lacks knowledge of how to correctly call the search engine, leading to redundant interactions. However, it quickly learns the correct format and begins to effectively eliminate unnecessary steps.
As training progresses, both the number of interactions and the reward curve increase sharply and then stabilize. This is mainly because the policy model can effectively retrieve relevant documents and ultimately obtain the correct answer, resulting in higher rewards. It is worth noting that although the reward appears stable in the later stages of training, the underlying task difficulty continues to increase due to the influence of the curriculum mechanism. Therefore, continuous improvement of the policy and enhancement of reasoning capabilities are necessary to maintain stable performance.
Table 5: Performance of ZS under different RL algorithms. We compared PPO and GRPO using the Qwen-2.5-3B-Base and LLaMA-3.2-3B-Base models.
Table 6: Inverse curriculum study. We compared the performance of the standard and inverse curriculum rollout settings using the Qwen-2.5-3B-Base and Qwen-2.5-3B-Instruct models.
5.4 Different RL Algorithms: PPO vs GRPO
In this section, we use the Qwen-2.5-3B-Base and LLaMA-3.2-3B-Base models to evaluate two widely adopted reinforcement learning (RL) training algorithms, PPO and GRPO, under the ZS framework. The comparison results are shown in Table 5 above.
We observe that both GRPO and PPO successfully incentivize search capability within our framework, demonstrating its versatility. GRPO exhibits more stable performance on both models, highlighting its advantage in training stability. Notably, GRPO's repeated-rollout mechanism would incur higher API costs when interacting with a real search engine, further underscoring the practicality of our simulated search setup.
5.5 Inverse Curriculum Study
In this section, we compare the standard curriculum rollout strategy with an inverse curriculum setting to analyze its effectiveness. In the inverse setting, training difficulty decreases over time, as the quality of retrieved documents gradually increases. The results are shown in Table 6 above.
The results clearly indicate that in both models, the standard "easy-to-difficult" curriculum mode consistently outperforms the inverse "difficult-to-easy" curriculum mode, demonstrating the effectiveness of curriculum learning within our framework. Starting with better search results allows the policy model to first learn how to call the search engine and understand the basic output format. As training progresses, the model is exposed to increasingly challenging scenarios, thereby developing stronger reasoning capabilities.
6 Conclusion
In this paper, we propose a novel reinforcement learning (RL) framework ZS, which enhances LLM search capabilities without interacting with real search engines. Through supervised fine-tuning, an LLM is transformed into a retrieval module capable of generating both relevant and noisy documents. We employ a curriculum rollout mechanism to progressively enhance reasoning capabilities by exposing the model to increasingly challenging retrieval scenarios. Experimental results show that ZS outperforms real search-based models, generalizes well across base LLMs and instruction-tuned LLMs of different scales, and supports multiple reinforcement learning algorithms.
However, our method has some limitations. Deploying the simulation LLM that serves search requests during rollouts requires access to GPU servers. While more cost-effective than using commercial APIs, this introduces additional infrastructure costs. We discuss these costs in detail in the appendix below.
Table 8: Cost comparison between real search engine and our simulated search method.
Deeper Thoughts on the Paper:
ZeroSearch represents a key technical advance in retrieval-augmented training for language models. The framework introduces a simulation-based paradigm in which a large language model (LLM) stands in for the search engine, thereby eliminating reliance on commercial APIs like Google Search. This shift not only reduces the financial burden of reinforcement learning-based training but also provides a controllable environment for shaping the retrieval process. ZeroSearch challenges a core assumption in modern LLM training: that live, high-quality external search is essential for learning effective information retrieval and question answering.
A key technical advantage of ZeroSearch is its ability to decouple retrieval quality from search-engine output noise. While traditional methods inherit the variability and bias of commercial engines, ZeroSearch allows fine-grained control over the retrieved data. This introduces a new optimization dimension for LLM training: the quality and diversity of retrieved documents can be systematically adjusted to support specific capabilities such as fact verification, grounded generation, or multi-hop reasoning. Developers and researchers can now integrate ZeroSearch into their own training pipelines, enabling cost-effective, large-scale reinforcement learning and retrieval-conditioning experiments without the limitations of external APIs.
ZeroSearch sets a precedent for the future of retrieval-augmented generation. It provides a reliable alternative to web-based search as a training signal, with significant implications for cost reduction, improved model alignment, and safety. For AI developers focused on scalable training mechanisms, reinforcement learning, and search-augmented reasoning, ZeroSearch offers a technically rigorous and open alternative, redefining how retrieval capabilities can be integrated into foundational model development.