Introduction
Models that tightly integrate reasoning with search have become a frontier topic in AI research. By invoking search tools during reasoning, large models can retrieve key information on demand and feed it into subsequent inference, opening new paths for tackling complex tasks.
Earlier work from Tongyi Lab's search team, such as ZeroSearch and OmniSearch, used reinforcement learning to train large models to operate search engines on specific downstream tasks. In practice, however, this single-task training mode has clear limitations: the resulting models generalize poorly and struggle to meet retrieval and reasoning demands across diverse scenarios.
To break through this bottleneck, Tongyi Lab drew on the "pre-train, then fine-tune" paradigm and released MaskSearch, a universal pre-training framework. MaskSearch introduces the Retrieval-Augmented Masked Prediction (RAMP) task, inspired by BERT's masking mechanism: the model must use search tools to predict masked text spans. During pre-training it therefore learns task decomposition, reasoning strategies, and how to operate a search engine all at once, laying a solid foundation for adaptation across domains.
MaskSearch is compatible with both supervised fine-tuning and reinforcement learning. Validated with a two-stage training recipe, it delivers significant performance gains over conventional training on multiple open-domain question answering datasets.
Paper Title: MaskSearch: A Universal Pre-Training Framework to Enhance Agentic Search Capability
Paper Link: https://arxiv.org/abs/2505.20285
Code Link: https://github.com/Alibaba-NLP/MaskSearch
MaskSearch
Next, we delve into the core architecture and operational mechanism of MaskSearch.
2.1 Task Definition
Retrieval-Augmented Masked Prediction (RAMP) is MaskSearch's pre-training task. Its core idea: key information in the input text is masked, and the model must actively leverage external knowledge by invoking search tools to predict the masked spans.
To raise the difficulty of the masked spans, beyond the named entities (person, place, and organization names), dates, and numbers commonly masked in prior masked prediction work, MaskSearch also masks the following types of key information:
1. Ontological Knowledge: Key concepts involved in classification systems or knowledge systems within the text;
2. Specific Terminology: Professional terms specific to a particular domain or topic;
3. Numerical Values: Specific numerical values involved in the text, such as statistical data, measurement values, etc.
This not only increases the task's difficulty but also prompts the model to process information more finely during retrieval and reasoning, thereby enhancing its adaptability and generalization capabilities in multi-domain tasks.
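To make the task concrete, here is a minimal sketch of how a RAMP training sample could be constructed once key spans have been identified; the mask token, field names, and example sentence are illustrative, not taken from the paper's actual data pipeline.

```python
MASK_TOKEN = "[mask]"  # placeholder; the actual mask token used by MaskSearch may differ

def build_ramp_sample(text: str, key_spans: list[str]) -> dict:
    """Replace each key span (entity, term, number, ...) with the mask token and
    keep the originals as the spans the model must recover via search."""
    masked_text = text
    answers = []
    for span in key_spans:
        if span in masked_text:
            masked_text = masked_text.replace(span, MASK_TOKEN, 1)
            answers.append(span)
    return {"masked_text": masked_text, "answers": answers}

sample = build_ramp_sample(
    "Alexander Fleming discovered penicillin in 1928 at St Mary's Hospital.",
    key_spans=["Alexander Fleming", "penicillin", "1928"],
)
print(sample["masked_text"])  # [mask] discovered [mask] in [mask] at St Mary's Hospital.
print(sample["answers"])      # ['Alexander Fleming', 'penicillin', '1928']
```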
2.2 Training Methods
Supervised Finetuning
To generate Chain-of-Thought (CoT) data for Supervised Finetuning (SFT), the authors propose a data generation method that combines (1) Agent Synthesis and (2) Distillation (a simplified sketch follows the list):
Agent Synthesis: A multi-agent system is first established, with roles such as planning, search rewriting, and observation analysis collaborating to generate the chain of thought. An LLM then judges the final answer, and only chains of thought that lead to correct answers are retained.
Distillation: To expand the dataset quickly while keeping quality high, a teacher model trained on the existing data directly generates new reasoning trajectories; the teacher is then iteratively updated, progressively improving data quality.
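Below is a highly simplified sketch of the agent-synthesis-and-filtering loop described above; the role interfaces, the search call, and the trajectory format are assumptions for illustration rather than the paper's implementation.

```python
from typing import Callable, Optional

def synthesize_cot(
    question: str,
    gold_answer: str,
    plan: Callable[[str], list[str]],        # planner: decompose the task into sub-goals
    rewrite_query: Callable[[str], str],     # search rewriter: sub-goal -> search query
    search: Callable[[str], str],            # search-tool call, returns observations
    analyze: Callable[[str, str], str],      # observation analysis for one sub-goal
    conclude: Callable[[list[str]], str],    # produce the final answer from the analyses
    judge: Callable[[str, str], bool],       # LLM judge: does the answer match the gold one?
) -> Optional[str]:
    """Generate one chain of thought and keep it only if the final answer is judged correct."""
    steps = []
    for sub_goal in plan(question):
        observation = search(rewrite_query(sub_goal))
        steps.append(analyze(sub_goal, observation))
    final_answer = conclude(steps)
    if judge(gold_answer, final_answer):
        return "\n".join(steps) + f"\nFinal answer: {final_answer}"
    return None  # discard trajectories whose answer is wrong
```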
Reinforcement Learning
For the reinforcement learning stage, the authors adopt the DAPO algorithm (Decoupled Clip and Dynamic Sampling Policy Optimization) and build a hybrid reward system that combines a rule-based format reward with a model-based answer reward. The format reward checks whether the model's output follows the specified format, while the answer reward evaluates the consistency between generated answers and standard answers.
The authors explored various answer reward functions and ultimately chose a model-based reward function, using the Qwen2.5-72B-Instruct model as an arbiter to score the consistency between generated and standard answers.
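As a rough sketch of this hybrid reward (the answer-tag format, the equal weighting, and the judge interface are assumptions, not the paper's exact implementation):

```python
import re
from typing import Callable

def hybrid_reward(
    output: str,
    gold_answer: str,
    judge_score: Callable[[str, str], float],  # arbiter LLM (e.g. Qwen2.5-72B-Instruct), returns 0..1
) -> float:
    """Rule-based format reward combined with a model-based answer reward."""
    # Format reward: the output must contain a well-formed answer block.
    match = re.search(r"<answer>(.*?)</answer>", output, flags=re.DOTALL)
    format_reward = 1.0 if match else 0.0

    # Answer reward: the arbiter model scores consistency with the standard answer.
    answer_reward = judge_score(match.group(1).strip(), gold_answer) if match else 0.0

    # Equal weighting here is illustrative only.
    return 0.5 * format_reward + 0.5 * answer_reward
```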
Curriculum Learning
To facilitate learning from easy to difficult, the authors proposed classifying training samples by the number of masks, allowing the model to first learn basic reasoning skills through simple samples, and then gradually improve its capabilities to cope with more challenging scenarios.
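A minimal sketch of this easy-to-hard ordering, assuming each sample records its masked spans as in the earlier sketch (the field name and mask-count range are illustrative):

```python
from collections import defaultdict
from typing import Iterable, Iterator

def curriculum_order(samples: Iterable[dict], max_masks: int = 4) -> Iterator[dict]:
    """Yield RAMP samples from fewest to most masks, so easier instances come first."""
    buckets: dict[int, list[dict]] = defaultdict(list)
    for sample in samples:
        buckets[len(sample["answers"])].append(sample)  # mask count as a difficulty proxy
    for num_masks in range(1, max_masks + 1):
        yield from buckets.get(num_masks, [])
```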
Experiments
3.1 Main Results
Through experiments on Qwen and LLaMA models of various sizes, the authors show that the two-stage MaskSearch training framework significantly enhances the search and reasoning capabilities of large models.
With RAMP as the pre-training task and HotpotQA as the downstream task, MaskSearch consistently improved model recall on in-domain datasets; the gains were even larger on out-of-domain datasets such as Bamboogle, where smaller models could even rival larger ones, validating RAMP's effectiveness as a scalable learning signal.
Experiments further validated that both supervised fine-tuning (SFT) and reinforcement learning (RL) are compatible with the MaskSearch framework. Of the two, RL showed a higher performance ceiling on the RAMP task, achieving the best results across all sizes of Qwen models on in-domain tasks such as HotpotQA.
This indicates that RL, through dynamic sampling strategies and hybrid reward mechanisms, can more accurately optimize the model's multi-step search and reasoning processes, providing a stronger training paradigm for enhancing the adaptability of retrieval-augmented models.
3.2 Scaling Performance
Under the supervised fine-tuning setting, the authors verified MaskSearch's scalability with experiments at different numbers of training steps: smaller models (e.g., 1B) showed significant improvement after pre-training, while larger models (e.g., 7B), limited by the diversity of self-evolved data, saw more gradual gains, though recall still increased over fine-tuning-only baselines.
This proves that RAMP has the potential for continuous improvement across models of different scales and also indicates that data quality and diversity are key factors determining the performance ceiling of SFT methods.
3.3 Supervised Curriculum Learning Effect
Experiments also validated the curriculum learning strategy designed around the number of masks. Concretely, training data is sampled stratified by mask count, with 10K training samples per count, supplemented with 6K HotpotQA examples to keep the tasks balanced.
As the number of masks increased from 1 to 4, the Qwen2.5-7B model's validation score rose significantly and clearly exceeded training on data with mixed mask counts. Curriculum learning also further improved post-training performance on downstream tasks, confirming that a difficulty gradient helps build reasoning capability.
More Analysis
4.1 Impact of Masking Strategy
Masking strategy is another important factor in the difficulty of the RAMP pre-training task. The authors compared random masking with a difficulty-oriented strategy based on perplexity (PPL), which scores how hard each masked span is to recover using the model's loss when restoring it and preferentially masks the hardest spans.
Experiments showed that the PPL strategy improved recall on the FanoutQA dataset, but on other datasets the single-minded pursuit of difficulty could degrade performance, indicating that task difficulty must still match the model's current search and reasoning capability. Combining it with curriculum learning to balance difficulty therefore yields better overall results.
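One way such a difficulty signal could be implemented, sketched here with Hugging Face Transformers (the model name and scoring details are assumptions, not the paper's exact setup), is to score each candidate span by its loss under the model given the preceding context and mask the highest-loss spans first:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def span_difficulty(model, tokenizer, context: str, span: str) -> float:
    """Average loss of `span` given its left context; higher = harder to recover,
    so a difficulty-oriented strategy would prefer to mask this span."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    span_ids = tokenizer(span, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, span_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position t predict the token at t+1, so score only the span tokens.
    span_logits = logits[0, ctx_ids.shape[1] - 1 : -1]
    return torch.nn.functional.cross_entropy(span_logits, span_ids[0]).item()

# Illustrative usage (model name is an assumption):
# tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
# lm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
# hardest = max(candidate_spans, key=lambda s: span_difficulty(lm, tok, left_context, s))
```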
4.2 Impact of RL Reward Function
Different reward functions have markedly different effects on RL training. Taking the Qwen2.5-7B model as an example, a reward based on token-level recall encouraged the model to pile large amounts of irrelevant information into its answers to boost recall, sharply increasing answer length while actual performance dropped noticeably compared with other RL reward functions.
Although introducing penalty terms to suppress answer length can reduce information redundancy to some extent, the model can still exploit rule loopholes through enumeration within a limited length.
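To see why this reward is easy to game, consider a sketch of token-level recall with a crude length penalty; the penalty form and length cap are hypothetical, not the paper's formulation.

```python
def token_recall_reward(prediction: str, gold: str, length_cap: int = 50) -> float:
    """Token-level recall minus a simple length penalty. Piling extra entities into
    the answer still raises recall as long as the prediction stays under the cap,
    which is exactly the loophole described above."""
    pred_tokens = set(prediction.lower().split())
    gold_tokens = gold.lower().split()
    if not gold_tokens:
        return 0.0
    recall = sum(t in pred_tokens for t in gold_tokens) / len(gold_tokens)
    penalty = max(0.0, (len(prediction.split()) - length_cap) / length_cap)
    return max(0.0, recall - penalty)
```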
In contrast, the model-based reward function performed best, beating the other two reward schemes on generated answer length, token-level recall, and scores from the Qwen2.5-72B-Instruct judge. It effectively avoided reward gaming and remained stable and efficient throughout RL training.
Conclusion
MaskSearch aims to enhance the agentic reasoning and search capabilities of large language models (LLMs). The framework is built around the Retrieval-Augmented Masked Prediction (RAMP) pre-training task, which trains the model to autonomously perform multi-step search and reasoning to fill masked spans in text, deeply integrating external knowledge.
Trained via both supervised fine-tuning (SFT) and reinforcement learning (RL), and combined with a curriculum learning strategy, MaskSearch achieves significant improvements over baseline methods on both in-domain and out-of-domain open-domain question answering tasks.