The CAS work addresses two genuine engineering pain points of "deep search agents": first, training problems that are not difficult enough, so the model never has to think deeply; and second, a context that is quickly overwhelmed by long tool outputs, so the process fails prematurely. The researchers tackle both head-on, reshaping the training and inference pipelines from the data side and the system side at once, making complex reasoning both worthwhile and executable.
The engineering stance is clear: use high-quality, verifiable, cross-source problems as training fuel; treat early tool outputs as a cache that can be re-fetched on demand rather than a permanent burden; and keep this context state consistent between training and inference. The resulting change is straightforward: the agent is no longer forced to wrap up after ten or so rounds, but reliably sustains nearly a hundred tool interactions within a standard 32k context while preserving the complete reasoning chain. This lets a medium-sized 32B open-source model work stably, explainably, and cost-effectively on tasks that require searching across sites, gathering evidence, and reasoning over it, which is exactly what many companies need in order to deploy and reuse an "AI research assistant or analyst."
Where Exactly is the Problem?
You may have encountered this pitfall: training data that is too "shallow" fails to teach genuine research behavior. Common open-source multi-hop QA data is often Wikipedia-style, letting the model "guess correctly" from memory or single-page retrieval; once deployed, the model struggles with tasks that require cross-site, cross-timeline searching and verification. The other issue is context explosion, where long processes fail quickly: effective interaction usually lasts only 10 to 15 rounds in a 32k context, because web snippets returned by tools are generally five to ten times longer than the assistant's reasoning text and quickly crowd out the available space. Many systems use a summarization model to compress tool output, but this loses information granularity, adds system complexity, and, more critically, is hard to fold into verifiable end-to-end reinforcement learning, so the trained policy drifts away from online behavior.
Core Idea 1: Reverse-Constructed Complex Problem Data
The goal here is to create problems that must be answered through multi-page, multi-step reasoning, and where the answer can be verified by web evidence. This forces the model during training to learn expert strategies like "verification—backtracking—sub-goal decomposition—cross-document synthesis." Researchers call this a "reverse construction" method for task generation.
The Three Steps in Detail
Step 1: Anchor on an "entity" and collect real web evidence first (information must be sufficient and complementary)
Start from person names on Wikipedia, but keep only figures with moderate exposure: if too obscure, there is not enough data; if too famous, the model may recall the answer directly from its parameters, defeating the purpose of training. The researchers quantify "moderate" as page views falling within a certain range over the last 6 months. Two kinds of retrieval are then run around each entity: one for biographical information using the name directly, and another for recent news, gathering dozens of candidate web pages.
Apply a triple filter to the collected web pages: 1. Entity Correspondence: Compare with Wikipedia to exclude "same name, different person" confusion; 2. Information Complementarity: Keep only pages providing new and independent information, removing redundant statements; 3. Site Credibility: Remove unreliable sources, retaining only credible sites. This ensures that when questions are generated later, the information is "scattered across multiple pages," "mutually complementary," and "reliable."
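To make Step 1 concrete, here is a minimal sketch of the screening logic. The page-view thresholds and the filter callbacks are illustrative assumptions; the paper only states that exposure must be moderate and that pages must be entity-matched, complementary, and credible.

```python
from dataclasses import dataclass

# Hypothetical bounds for "moderate exposure" over the last 6 months;
# the paper specifies a range but these exact numbers are placeholders.
MIN_VIEWS, MAX_VIEWS = 5_000, 200_000

@dataclass
class WebPage:
    url: str
    domain: str
    text: str

def is_moderate_entity(pageviews_6m: int) -> bool:
    """Keep entities that are neither too obscure nor too famous."""
    return MIN_VIEWS <= pageviews_6m <= MAX_VIEWS

def triple_filter(pages: list[WebPage],
                  matches_entity,        # callable: page -> bool, checked against the Wikipedia profile
                  adds_new_information,  # callable: (page, kept_pages) -> bool
                  credible_domains: set[str]) -> list[WebPage]:
    """Entity correspondence, information complementarity, site credibility."""
    kept: list[WebPage] = []
    for page in pages:
        if page.domain not in credible_domains:
            continue  # 3. drop low-credibility sites
        if not matches_entity(page):
            continue  # 1. drop "same name, different person" pages
        if not adds_new_information(page, kept):
            continue  # 2. drop redundant statements
        kept.append(page)
    return kept
```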
Step 2: Generate questions based on multi-source evidence, and intentionally increase the difficulty (force multi-source, forbid Wikipedia, and apply "secondary fuzzification")
Forbid using the Wikipedia page itself as evidence, preventing the model from extracting answers from a single structured source.
Explicitly require each question to integrate information from at least four different sources, forcing "cross-document inference" instead of single-page retrieval.
Apply "secondary fuzzification" to the generated questions: replace "specific references" with "more generalized descriptions" (e.g., changing "born on July 2nd" to a generalization like "born early 21st century"), while ensuring the answer remains unique. This means the model cannot instantly match during search; it must gradually correspond the "generalized description" with details from multiple pages to locate the unique answer. This step transforms "information matching" into "reasoning reduction."
Step 3: Double Filtering – first reject all easy questions, then strictly remove quality issues
Difficulty Filter: Use two automated "probes" to eliminate overly easy questions: 1. check whether a single direct search-engine query can surface the entity or the answer; 2. check whether a zero-shot LLM can guess the answer outright. If either succeeds easily, the question is not the desired "must be multi-step, multi-source" kind and is rejected.
Quality Filter: Remove all questions that compromise verifiability, including: 1. Vague statements prone to ambiguity; 2. Answers that are ambiguous or non-unique; 3. Answers that cannot be logically deduced from the given reference documents (i.e., insufficient evidence chain). Only the remaining Q&A pairs are the "difficult and verifiable" high-quality training samples.
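A hedged sketch of the double filter, treating the two probes and the quality checks as black-box callables (their names and signatures are assumptions):

```python
def passes_difficulty_filter(question: str, gold_answer: str,
                             search_one_step, zero_shot_answer, judge_equivalent) -> bool:
    """Reject questions solvable by one direct search or a zero-shot LLM guess."""
    if judge_equivalent(search_one_step(question), gold_answer):
        return False  # a single search step already surfaces the answer
    if judge_equivalent(zero_shot_answer(question), gold_answer):
        return False  # the model can guess it without any tools
    return True

def passes_quality_filter(question: str, gold_answer: str, evidence: list[str],
                          is_ambiguous, is_answer_unique, is_entailed) -> bool:
    """Reject vague questions, non-unique answers, and broken evidence chains."""
    return (not is_ambiguous(question)
            and is_answer_unique(question, gold_answer)
            and is_entailed(gold_answer, evidence))
```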
Why is this approach effective? Because it directly addresses real-world long-range retrieval tasks: information is scattered, signal-to-noise ratio varies, and cross-page comparison and backtracking confirmation are essential. Many existing multi-hop datasets rely heavily on structured Wikipedia information, which can often be solved by "shallow retrieval + model memory," failing to induce "verification, backtracking, and planning," which are true "expert cognitive behaviors."
Core Idea 2: Dynamic Sliding Window
Why is a new strategy needed? Researchers first performed empirical analysis: in a common 32k context, most models hit the limit after about 10–15 rounds. The reason is that tool-returned web content is typically 5–10 times longer than the assistant's reply, accumulating like a snowball and rapidly consuming dialogue space. However, these "very long tool outputs" often only influence the decision of the "immediately following step," having little impact on decisions made ten or more rounds later. Thus, retaining all historical tool outputs is both context-wasting and inefficient.
Based on this observation, researchers proposed "Sliding Window" context management:
Let a multi-turn trajectory be denoted $\tau = \{q, a_1, t_1, a_2, t_2, \ldots\}$, where $q$ is the user question, $a_i$ the assistant's $i$-th response, and $t_i$ the corresponding tool output.
Set a window size $W$ and a slide step $S$. When the cumulative number of tool responses reaches $W$, older tool responses are replaced in a batch with a placeholder prompt (e.g., "Previous tool outputs omitted, rerun the tool if needed"), and only the originals of the most recent $W$ tool outputs are kept. Crucially, the assistant's own reasoning content is always retained in full, never truncated. This preserves the reasoning chain while clearing out the historical long web pages.
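A minimal sketch of this mechanism, assuming messages are stored as (role, content) pairs; the trigger condition and the bookkeeping are my reading of the description above, not the authors' implementation:

```python
PLACEHOLDER = "Previous tool outputs omitted, rerun the tool if needed."

def slide_window(messages: list[tuple[str, str]], window: int = 5, step: int = 3):
    """Replace the oldest `step` raw tool outputs once more than `window` are visible.

    Assistant turns are never touched, so the reasoning chain stays intact.
    """
    raw_tool_idx = [i for i, (role, content) in enumerate(messages)
                    if role == "tool" and content != PLACEHOLDER]
    while len(raw_tool_idx) > window:
        for i in raw_tool_idx[:step]:  # placeholder the oldest `step` tool outputs
            messages[i] = ("tool", PLACEHOLDER)
        raw_tool_idx = raw_tool_idx[step:]
    return messages
```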
Training-Inference Consistency (How it's done during training)
Simply applying the sliding window at inference is not enough: if the model is trained on full history but forced to infer inside a sliding-window context, the distribution mismatch leads to instability. To prevent this, the researchers segment each trajectory into multiple training sequences following the same sliding rhythm used at inference, so the model becomes accustomed during training to context states in which some old web pages have been replaced by placeholders:
If a trajectory contains $T$ tool calls, it yields $1 + \lfloor (T - W) / S \rfloor$ training sequences. The first sequence contains the initial full context; in each subsequent sequence, tool calls older than the sliding boundary are replaced with placeholders and only the originals inside the window are retained, replicating the context that is actually visible during inference.
To avoid conflicts from repeatedly optimizing the same piece of assistant output, a mask is applied to each sequence so that every assistant response is trained exactly once. The paper's masking formula amounts to this: in the $k$-th sequence, only the newly generated portion of the assistant text participates in backpropagation; assistant text that already appeared earlier is treated as read-only context.
The loss is computed only at the positions of the newly generated assistant responses $a_i$, which exactly reproduces the inference-phase sliding-window visibility during training.
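Below is a sketch of how the segmentation and masking could be computed, assuming (for simplicity) one assistant turn per tool call and the boundary convention that the $k$-th sequence ends after tool call $W + (k-1)S$; these indexing details are assumptions layered on top of the formula quoted above.

```python
def num_training_sequences(T: int, W: int = 5, S: int = 3) -> int:
    """1 + floor((T - W) / S) training sequences for a trajectory with T tool calls."""
    return 1 + max(0, (T - W) // S)

def sequence_end_points(T: int, W: int = 5, S: int = 3) -> list[int]:
    """Assumed convention: sequence k ends after tool call W + (k-1)*S, the last one after T."""
    K = num_training_sequences(T, W, S)
    return [W + (k - 1) * S for k in range(1, K)] + [T]

def new_assistant_turns(k: int, T: int, W: int = 5, S: int = 3) -> range:
    """Assistant turns (0-indexed) whose loss is computed in the k-th sequence;
    everything generated before the previous boundary is read-only context."""
    ends = sequence_end_points(T, W, S)
    start = 0 if k == 1 else ends[k - 2]
    return range(start, ends[k - 1])

# Example with the paper's settings (W=5, S=3): a 14-tool-call trajectory
# yields 4 sequences, and sequence 3 trains only assistant turns 8..10.
assert num_training_sequences(14) == 4
assert list(new_assistant_turns(3, 14)) == [8, 9, 10]
```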
Results and Advantages (Why this is better than "summarizing old pages")
Since there is no external summarization, the model can still retrieve the original web content by "rerunning the tool" when needed, ensuring no information loss.
The mechanism itself does not increase the complexity or computational cost of an extra summarization model, and it is easier to incorporate into end-to-end reinforcement learning optimization (summarization components often act as "training blind spots").
With the same, or even smaller, context budget, the sliding window method pushes the number of effective interaction rounds higher. The paper reports stable interaction approaching 100 rounds in a 32k context, and superior performance across multiple benchmarks compared to "no management" or "summarization compression." The tables and figures show that under 32k/64k/128k limits, the 32k sliding window solution achieves ≈33.3%, a level the other two strategies only approach at larger contexts.
Training Process: From Cold Start to Verifiable Reinforcement Learning
The cold-start phase uses supervised fine-tuning (SFT) to establish the basics of tool use and stepwise thinking. The researchers used a more capable model to generate action trajectories in a real web environment, applying the dynamic sliding window during generation so that context length would not cut trajectories short; trajectories with incorrect final answers or excessive length were filtered out, and the remaining high-quality examples train the model to adapt to the dynamic context via multi-sequence construction. The reinforcement learning phase uses Group Relative Policy Optimization (GRPO) for policy improvement: for the same question, multiple complete trajectories are generated and each receives a verifiable binary reward based on whether its final answer is correct. Advantages are then standardized within the group, and each trajectory-level advantage is propagated to all training sequences derived from that trajectory, so trajectory feedback is used stably for sequence-level parameter updates.
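The group-relative advantage described here is, at its core, a per-question standardization of binary rewards; a minimal sketch (the epsilon term is my addition for numerical stability):

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize trajectory-level rewards within one question's rollout group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 8 rollouts for one question, rewarded 1.0 only when the final answer verifies.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
# Every training sequence cut from rollout i then reuses advantages[i] unchanged.
```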
The specific engineering details are practically laid out: the base model is Qwen3‑32B with the thinking mode enabled. SFT uses about 3000 high-quality trajectories, batch size 256, learning rate $1 \times 10^{-5}$. RL uses about 4000 questions, batch size 32, learning rate $2 \times 10^{-6}$. Eight rollouts are generated per question, maximum trajectory length is 40,000 tokens, and the single-question round limit is 60. The tool window size is set to 5, the slide step to 3. The training implementation is based on the VERL framework. Evaluation across BrowseComp, BrowseComp‑zh, XBench‑DeepSearch, and GAIA uniformly uses temperature 0.6, top‑p 0.9, a maximum of 100 interactions, and the same window 5 plus slide 3 context management. The final answer correctness is judged by a reviewer model using structured prompts.
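For quick reference, the reported setup collected into a single config block; the field names are mine, the values are the ones quoted above.

```python
TRAINING_CONFIG = {
    "base_model": "Qwen3-32B (thinking mode enabled)",
    "sft": {"trajectories": 3000, "batch_size": 256, "learning_rate": 1e-5},
    "rl": {
        "questions": 4000,
        "batch_size": 32,
        "learning_rate": 2e-6,
        "rollouts_per_question": 8,
        "max_trajectory_tokens": 40_000,
        "max_rounds_per_question": 60,
    },
    "context_management": {"tool_window": 5, "slide_step": 3},
    "framework": "VERL",
    "evaluation": {
        "benchmarks": ["BrowseComp", "BrowseComp-zh", "XBench-DeepSearch", "GAIA"],
        "temperature": 0.6,
        "top_p": 0.9,
        "max_interactions": 100,
        "tool_window": 5,
        "slide_step": 3,
    },
}
```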
Tool Suite and Interaction Details: Three Tools are Enough
A common first instinct is to add a summarization model to compress web content, but the researchers instead focused on "what to read, how to read it, and when to stop." They kept only three lightweight, high-leverage tools: a search service that returns titles, links, and snippets; a fetch service that converts web pages into scrollable Markdown text, page by page; and an in-page find tool that locates keywords and nearby context within long documents. This gives the agent control: it can skim, pause, or exit content page by page instead of being passively buried under thousands of words at once. Because no external summarization is done, information details are never prematurely cropped, and end-to-end training avoids the optimization discontinuity of "not seeing the real text."
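The three tools can be pictured as an interface roughly like the stubs below; the function names, parameters, and return shapes are assumptions for illustration, not the authors' actual API.

```python
from typing import TypedDict

class SearchHit(TypedDict):
    title: str
    url: str
    snippet: str

def search(query: str, top_k: int = 10) -> list[SearchHit]:
    """Search service: return titles, links, and snippets for a query."""
    ...

def fetch(url: str, page: int = 1) -> str:
    """Fetch service: return one page of the document converted to Markdown,
    so the agent can scroll through long content incrementally."""
    ...

def find_in_page(url: str, keyword: str, context_chars: int = 300) -> list[str]:
    """In-page find: return occurrences of `keyword` with nearby context."""
    ...
```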
Operational Demonstration: How a Real Problem is Solved
The researchers provide a case-study trajectory in the appendix. The question required pinning down a unique historical building from a pile of clues: whether it sits in a national capital, whether it is near a river, the ranges of its start and completion years, a numerical range for wall thickness, whether it sustained tornado and earthquake damage in specific decades, whether it was acquired by the government between 1980 and 1990, and the birth-year range of the national president at the time of acquisition. Such problems force the agent to cross-verify repeatedly across multiple web pages and backtrack when necessary. The agent first uses the search service to identify candidates, then the fetch service to read the critical pages in detail, and the in-page find tool to jump straight to keyword-relevant paragraphs; meanwhile the sliding window keeps moving older long tool outputs out of view while the thinking process is preserved in full. The final answer is Ahsan Manzil in Dhaka.
First, use search to locate candidate buildings related to the conditions, recording whether they are in the capital and near a river to quickly exclude obvious mismatches.
Use the fetch service to read the most promising candidates page by page, focusing on verifying if the start and completion years fall within the specified closed intervals, and simultaneously noting engineering details like wall structure and thickness.
Use in-page find to locate keywords like "tornado" and "earthquake," confirming item by item whether there are records of tornado damage between 1880 and 1890, and earthquake damage between 1890 and 1900, strictly comparing dates within the range.
Continue comparing the "government acquisition" year across different sources for the same entity, cross-verifying who the national president was in that year, and whether that president's birth year falls within the closed interval of 1920 to 1935, thus closing the constraint chain.
For details uncommon in encyclopedia summaries, such as "wall thickness," supplementary retrieval is performed using more specialized or local sources, and the value is checked against existing conditions to ensure all conditions are met simultaneously, not in isolation.
Maintain the sliding window throughout the verification process, allowing early long tool outputs to be replaced by placeholders. If information is uncertain, the tool is called again to re-fetch the original text, preventing loss of traceability while ensuring the context is not burdened by historical snippets.
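The closed-interval checks running through this walkthrough boil down to simple comparisons; a toy illustration with placeholder values (not facts recovered from the cited pages):

```python
def in_closed_interval(value: int, low: int, high: int) -> bool:
    return low <= value <= high

candidate = {  # placeholder values standing in for whatever the fetched pages state
    "government_acquisition_year": 1985,
    "president_birth_year": 1930,
}

checks = [
    in_closed_interval(candidate["government_acquisition_year"], 1980, 1990),
    in_closed_interval(candidate["president_birth_year"], 1920, 1935),
]
# A candidate survives only if every constraint holds simultaneously, not in isolation.
all_conditions_met = all(checks)
```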
Experimental Results and Reproducibility Settings
After streamlining the process, the numbers speak for themselves.
The researchers tested the model on four "deep web research" benchmarks:
| Dataset | Description |
|---|---|
| BrowseComp-en | Complex English web QA task requiring searching multiple pages and reasoning. |
| BrowseComp-zh | Chinese version, similar task type to the above. |
| XBench-DeepSearch | A cross-lingual "deep search" evaluation set focusing on multi-turn interactive reasoning. |
| GAIA | A general AI assistant benchmark whose complex tasks require web browsing, tool use, and multi-step reasoning. |
These benchmarks require real web tools; the answers cannot be retrieved directly from the model's memory but must be searched, integrated, and verified using web pages.
DeepMiner‑32B achieved an accuracy of 33.5 on BrowseComp‑en, a significant improvement over previous open-source agents, with similar gains on BrowseComp‑zh, XBench‑DeepSearch, and GAIA. More telling is the SFT-only version, which already surpassed many open-source agents on several benchmarks: the high-difficulty, verifiable data yields benefits on its own, which verifiable reinforcement learning and dynamic context management then amplify. Taken together, this puts DeepMiner at near-commercial deep web reasoning performance within the open-source ecosystem. Evaluation consistently used temperature 0.6 and top‑p 0.9, a maximum of 100 interaction rounds, and window 5 plus slide 3 context management, along with structured reviewer prompts to keep the judging process traceable, details that matter for local reproducibility.
Validation of the Sliding Window Mechanism Effect
This section specifically measures the differences between three context management strategies:
| Management Strategy | Features | Performance in 32k Context |
|---|---|---|
| No Management | All web content retained | About 22%; only runs 15–20 rounds before hitting context limit |
| Summarize Old Pages | External summarization model compresses historical pages | About 27%; runs 30–40 rounds, but loses details |
| Sliding Window (Researcher's method) | Only deletes old web page originals, preserves assistant reasoning text | 33.3%; runs stably up to nearly 100 rounds |
Comparison across 64k and 128k context lengths:
The no-management strategy performance increases slowly due to long web pages and high noise.
The summarization strategy improves slightly but still lags behind the sliding window.
The sliding window strategy already reaches at 32k the level that the summarization strategy only attains at 128k.
Conclusion: Sliding window management not only saves context but also maintains reasoning stability. For the same context capacity, it allows the model to perform almost 4–6 times more reasoning rounds.
The curves in the experimental chart show that the sliding-window method is already close to its peak at 32k, while the other methods only approach this level at 128k.
What is the Significance of This Work?
It tackles the two major pain points of "deep search" at once: on one hand, it makes the training task difficult and genuine through "reverse construction + multi-source synthesis + fuzzification + strict filtering"; on the other, it uses the sliding window to extend, at the mechanism level, how long multi-turn reasoning can be sustained, while keeping training and inference consistent, without relying on an extra summarization model, without losing details, and without adding system complexity.
Data Efficiency and Capability Transfer: Even the SFT-only version significantly outperforms models trained on traditional multi-hop data like HotpotQA, demonstrating that the constructed data aligns better with the real requirements of "deep web research." Capability is further enhanced by stacking RL.
Engineering Feasibility: The ability to push interaction rounds to ~100 within the common 32k context is critical for practical systems, as simply expanding the context (to 128k or more) incurs high costs.
Possible Limitations and Notes
Data and Ethics: Training data comes from public web pages, which inevitably contains personal information. Researchers commit to using only public sites, filtering non-standard sites and social media, performing anonymization before release, and setting access review for weights to mitigate misuse risk.
Evaluation Relies on an LLM Judge: Subjective evaluation uses a strong model as the judge, which is common practice but means the results depend somewhat on the reviewer prompt and model version; the researchers provide the reviewer template in the appendix to aid reproducibility.
Final Thoughts
Ultimately, this approach ties three elements together: problems must be genuine and difficult, context must be controllable and consistent, and feedback must be verifiable and stable. That combination is what lets multi-turn search agents move from superficial attempts to sustained deep exploration. I prefer to read it as an engineering playbook: first safeguard the continuity of the reasoning chain, then move the biggest context overhead out of the way on demand, and finally make sure training and inference share the same "world state." If you are turning web search, intelligent analysis, or enterprise knowledge Q&A into a practical product, these changes can be folded into an existing system gradually, without a complete overhaul, and they address at once the long-standing questions of "how long it can keep thinking" and "whether it thinks correctly."