Scores for open-source Agents on the BrowseComp benchmark have consistently been zero. This chasm seemed insurmountable. Yesterday, Alibaba's Tongyi open-sourced their latest Web Agent model, WebSailor. Beyond releasing the model, code, and paper, WebSailor presents a complete and reproducible methodology, showing everyone that open-source Agents can also achieve superhuman reasoning and challenge the dominance of closed-source systems!

First, we need to understand why previous open-source Agents struggled. The paper points out that the problem lies in the difficulty of the training data. Previous training methods primarily focused on two types of tasks:
Level 1: Low uncertainty tasks, such as questions answerable with a single search.
Level 2: Multi-hop tasks with clear paths, such as “Who was the first academician of the Chinese Academy of Sciences from the alma mater of Alibaba's current CEO?”. Although the task is complex, the reasoning path is fixed and linear.
However, many real-world challenges fall into Level 3: extremely high uncertainty + extremely complex exploration paths. These have no standard answer path, requiring the Agent to act like a true researcher, constantly exploring, pruning, integrating, and reasoning within a sea of information. Training models with Level 1 and Level 2 data and then expecting them to solve Level 3 problems is akin to teaching only addition and subtraction and then asking students to solve calculus. The results are naturally dismal.
So, how do we create sufficiently difficult Level 3 training data? WebSailor open-sourced SailorFog-QA, and its generation method is remarkably ingenious:
1. Constructing Complex Knowledge Graphs: Starting from real-world websites, a highly interconnected knowledge graph containing numerous entities and complex relationships is built through random walks. This ensures that the problem source is authentic and the structure is non-linear.
2. Sampling + Questioning: A subgraph is randomly sampled from this complex graph, and then questions and answers are generated based on this subgraph.
3. Introducing Difficulty (Crucial Step): When generating questions, information is intentionally obscured. This trick is brilliant.
- Precise dates become “early 21st century”.
- Clear names become “an institution founded by someone whose name starts with F”.
- Specific values become “market share less than 1%”.
This masking directly maximizes the initial uncertainty of the task, forcing the Agent to learn to compare, reason, and synthesize information, rather than simply performing lookups.
As shown in the figure above, the number of tool calls required by SailorFog-QA is strikingly similar in distribution to the BrowseComp-en benchmark (orange line) and far exceeds other datasets. Models trained with such high-difficulty data naturally possess strong practical capabilities.
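To make this recipe concrete, here is a minimal Python sketch of the idea: random-walk a toy knowledge graph, sample a connected subgraph, and mask precise facts into vague clues. The toy graph, entity names, and obfuscation rules below are illustrative assumptions, not the released SailorFog-QA pipeline.

```python
# Minimal sketch of the SailorFog-QA idea (my reading of the paper, not the official code).
# The toy graph, entity names, and obfuscation rules are illustrative assumptions.
import random

# Toy knowledge graph: entity -> list of (relation, neighbor) edges.
GRAPH = {
    "Company_X": [("founded_by", "Person_F"), ("founded_in", "2003")],
    "Person_F": [("alma_mater", "University_Y"), ("born_in", "1968")],
    "University_Y": [("located_in", "City_Z")],
    "City_Z": [],
    "2003": [],
    "1968": [],
}

def random_walk_subgraph(graph, start, steps=3):
    """Sample a connected set of triples by walking random edges from a seed entity."""
    triples, node = [], start
    for _ in range(steps):
        edges = graph.get(node, [])
        if not edges:
            break
        relation, neighbor = random.choice(edges)
        triples.append((node, relation, neighbor))
        node = neighbor
    return triples

def obfuscate(value):
    """Replace precise facts with vague descriptions to maximize initial uncertainty."""
    if value.isdigit() and 2000 <= int(value) <= 2010:
        return "in the early 21st century"                   # precise date -> fuzzy period
    if value.startswith("Person_"):
        initial = value.split("_")[1][0]
        return f"someone whose name starts with {initial}"   # clear name -> partial clue
    return value

if __name__ == "__main__":
    # In the real pipeline an LLM turns the masked subgraph into a QA pair;
    # here we just print the obfuscated triples that would seed a question.
    for head, relation, tail in random_walk_subgraph(GRAPH, "Company_X"):
        print(head, relation, obfuscate(tail))
```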
With high-quality QA, the next step is to generate solution trajectories for the model to learn from.
Traditional methods involve using a stronger expert model (e.g., QwQ-32B) to generate complete thought and action trajectories, which our model then imitates. But there's a big pitfall here: expert models are usually very verbose! Their thought processes are filled with lengthy, stylized “fluff”. Learning directly from these not only contaminates our model's thinking style and limits its flexibility, but more critically, in long tasks requiring dozens of tool calls, this fluff quickly overwhelms the context window!
WebSailor's approach is a textbook example of extracting the essence and discarding the dross:
1. Have the expert model generate the full trajectory, but only retain the action-observation sequence. This is equivalent to observing the master's actions without listening to their rambling.
2. Then, use another powerful instruction-following model to work backwards and generate a concise, distilled, goal-oriented “thought” for each action in the successful trajectory.
The resulting training trajectories preserve the expert's core problem-solving logic while being clean and concise, without fluff, making them highly suitable for training on long tasks.
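The reconstruction loop can be sketched in a few lines of Python, assuming a generic call_llm helper (a placeholder of mine, not WebSailor's actual interface): keep each (action, observation) pair from the expert rollout and ask a concise instruction-following model to write a short, goal-oriented thought that justifies the action.

```python
# Sketch of the trajectory-reconstruction step as described above.
# `call_llm` and the prompt wording are placeholder assumptions, not WebSailor's code.
from typing import Callable, Dict, List

def reconstruct_trajectory(
    expert_trajectory: List[Dict[str, str]],  # dicts with 'thought', 'action', 'observation'
    call_llm: Callable[[str], str],
    question: str,
) -> List[Dict[str, str]]:
    rebuilt = []
    context = f"Task: {question}\n"
    for step in expert_trajectory:
        # 1) Discard the expert's verbose 'thought'; keep only what it did and what it saw.
        action, observation = step["action"], step["observation"]
        # 2) Ask a concise model to justify the action given the task and the steps so far.
        prompt = (
            context
            + f"Next action taken: {action}\n"
            + "Write one or two sentences of goal-oriented reasoning that justify this action."
        )
        concise_thought = call_llm(prompt)
        rebuilt.append({"thought": concise_thought, "action": action, "observation": observation})
        context += f"Action: {action}\nObservation: {observation}\n"
    return rebuilt
```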
Finally, for the training phase, WebSailor adopts a “two-step” strategy.
Step One: RFT Cold Start.
They found that directly applying RL (reinforcement learning) yielded poor results: the tasks were too difficult and the rewards too sparse, so the model initially had no idea which direction to take. Therefore, it was necessary to first “cold start” with a small amount (only 2k samples) of filtered, high-quality SFT data, allowing the model to grasp basic tool usage and the “skeleton” of long-chain reasoning.
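For intuition, a rejection-sampling-style filter for that cold-start set might look like the sketch below; the criteria (answer correctness and a cap on trajectory length) are my assumptions, not the paper's exact recipe.

```python
# Hypothetical filter for building a small, high-quality cold-start SFT set.
# Field names and thresholds are assumptions for illustration only.
def filter_cold_start(trajectories, max_samples=2000, max_steps=32):
    kept = []
    for traj in trajectories:
        correct = traj["final_answer"] == traj["gold_answer"]  # keep only successful rollouts
        short_enough = len(traj["steps"]) <= max_steps         # drop degenerate, overlong runs
        if correct and short_enough:
            kept.append(traj)
        if len(kept) >= max_samples:
            break
    return kept
```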
Step Two: DUPO Algorithm Reinforcement.
This is a more efficient RL algorithm they proposed: Duplicating Sampling Policy Optimization (DUPO). Compared with previous methods like DAPO, its biggest advantage is speed. In RL training for Agents, the rollout process of interacting with the environment is very time-consuming. DUPO uses a simple trick: instead of pulling new samples from the environment, it fills a training batch by duplicating samples whose rollouts show diverse outcomes (some successful, some failed). This greatly improves training efficiency, achieving roughly 2-3x acceleration.
As seen from the figure above, the RL stage (green portion) brings significant performance improvements to the model, especially on high-difficulty tasks like BrowseComp.
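The batch-filling trick behind DUPO can be sketched in a few lines, assuming each task comes with a group of rollout rewards; the field names are illustrative, not the actual implementation.

```python
# Sketch of DUPO-style batch filling as described above: drop groups whose rollouts all
# got the same reward (no learning signal), then duplicate the remaining "diverse" groups
# to pad the batch instead of launching new, slow environment rollouts.
import random

def fill_batch_dupo(groups, batch_size):
    """groups: list of dicts like {'task': ..., 'rewards': [r1, r2, ...]}."""
    # Keep only groups with mixed outcomes: some rollouts succeeded, some failed.
    diverse = [g for g in groups if len(set(g["rewards"])) > 1]
    if not diverse:
        return []  # nothing informative to train on this round
    batch = list(diverse)
    while len(batch) < batch_size:
        batch.append(random.choice(diverse))  # duplicate instead of re-rolling out
    return batch[:batch_size]
```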
Data remains the moat in the Agent era. The true barrier lies not in model architecture but in the ability to create high-difficulty, high-uncertainty training data. As open-source Agents continue down this path, the engineering burden eases, and on complex Agent tasks open foundation models can even catch up to and rival top-tier closed-source systems.
Open source, a promising future!
paper: https://arxiv.org/pdf/2507.02592
code: https://github.com/Alibaba-NLP/WebAgent