Anthropic Reveals Multi-Agent System Details: Claude Replicates Human Collective Intelligence, Outperforms Single Opus by 90%!

The large model Agent race has reached new heights.

From software engineering to research assistance, from daily life to business decisions, an all-purpose AI agent seems to have become an indispensable step on the path to AGI. The major players have all entered the fray, but most are still working within the single-agent paradigm.

Today, however, Anthropic published a lengthy engineering article disclosing, for the first time, the design principles, architectural details, and engineering lessons behind its internally developed multi-agent research system.


This system is precisely the hidden force behind the latest "research" capabilities of its flagship model, Claude.

The core data is even more astonishing: in a multi-agent system where Claude Opus serves as the "leader" and multiple Claude Sonnets serve as "subordinates," its performance on internal research evaluation benchmarks is 90.2% higher than that of the strongest single Claude Opus 4.

This is not just a quantitative change, but a qualitative leap.

The Anthropic team put forward a thought-provoking idea in the article: once individual intelligence reaches a certain threshold, multi-agent systems become the key to extending capability further, just as human society's exponential development over the past hundred thousand years has relied not on leaps in individual intelligence, but on the emergence of collective wisdom and the ability to collaborate.

They frankly state that the essence of this system is to "replicate" the collective intelligence of human society through architectural design.

Even more surprisingly, they arrived at a counter-intuitive conclusion: in their benchmark tests, 80% of the variance in agent performance was explained by a single, blunt factor: token consumption.

In other words, "brute force works wonders" might actually be effective in the agent domain. And multi-agent systems are the best way to smartly "burn" enough tokens to solve complex problems while keeping economic costs under control.

This article is highly informative, covering almost all critical aspects of building a production-grade multi-agent system, from architectural design, prompt engineering, and tool selection to evaluation methods and engineering challenges.

Without further ado, let's dive directly into Anthropic's firsthand sharing, which is full of hardcore insights.

Why use multi-agents? Are single large models not enough?

Before delving into the architecture, a fundamental question must be answered: why do we need multi-agents? Are powerful single models like Claude Opus or GPT-4o not enough?

Anthropic's answer is: for tasks like open-ended research, they are truly not enough.

The essence of research work is non-linear and path-dependent. When human experts explore a complex topic, they constantly adjust direction based on new findings, and may at any time delve into an unexpected angle. You cannot hardcode this exploratory process with a fixed, linear workflow.

This is precisely the Achilles' heel of single LLMs. They excel at "one-shot" Q&A but struggle with complex tasks requiring continuous autonomous decision-making and multi-round exploration.

Multi-agent systems, however, perfectly fit this demand.

Parallel Compression and Separation of Concerns

The essence of research is extracting insights from vast amounts of information, which is fundamentally a compression process.

Multi-agent systems greatly accelerate this process through parallelization. The system can dispatch multiple "subagents," each with its own independent context window, toolset, and exploration trajectory, much like different members of a research group simultaneously approaching a problem from various angles.

They each complete information gathering and preliminary analysis, "compressing" and refining the most important tokens, and finally reporting to the "leader agent."

This separation of concerns design is not only more efficient but also reduces the risk of single-path dependency, making research more comprehensive and in-depth.

90% Performance Boost Demonstrated

Actions speak louder than words; data provides proof.

Anthropic's internal evaluations show that multi-agent systems exhibit overwhelming advantages when handling breadth-first queries that require exploring multiple independent directions simultaneously.

A system with Claude Opus 4 as the leader and Claude Sonnet 4 as subagents performed 90.2% better in internal research evaluations than a single-agent system using Claude Opus 4 alone.

A classic example is: "Find all board members of companies in the S&P 500 Information Technology sector."

• Single-agent system: Fell into a slow, continuous search loop and ultimately failed to find the complete answer.

• Multi-agent system: The leader agent quickly broke down the task, assigned a subagent to each company or group of companies for parallel searching, and successfully compiled all correct answers.

The Secret to Success: "Brute Force Works Wonders"

The most surprising finding came from Anthropic's analysis of the BrowseComp evaluation benchmark, which tests an agent's ability to locate hard-to-find information on the web.

They found that 95% of the model's performance variance could be explained by three factors. Among them, token usage alone accounted for 80% of the variance! The other two factors were the number of tool calls and model selection.

This finding fundamentally validates the correctness of their architectural design: by distributing work to multiple agents with independent context windows, the system can effectively scale token usage to handle complex tasks that a single agent cannot. This is equivalent to investing more "computing power" and "depth of thought" into problem-solving.
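Anthropic does not publish the underlying regression, but the analysis is easy to illustrate. The sketch below, using entirely synthetic data and hypothetical variable names, shows how one might estimate the share of evaluation-score variance explained by token usage alone versus all three factors (tokens, tool calls, model choice), using R² from an ordinary least-squares fit.

```python
# Hypothetical illustration of the variance-decomposition idea described above.
# The data is synthetic; only the method (R^2 of a linear fit) is real.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500

tokens = rng.lognormal(mean=10, sigma=0.6, size=n)   # total tokens spent per run
tool_calls = rng.poisson(lam=25, size=n)             # number of tool calls per run
model_idx = rng.integers(0, 2, size=n)               # 0 = smaller model, 1 = larger model

# Synthetic eval score that, by construction, depends mostly on token usage.
score = (0.8 * np.log(tokens) + 0.1 * tool_calls / 25
         + 0.3 * model_idx + rng.normal(0, 0.3, size=n))

X_tokens = np.log(tokens).reshape(-1, 1)
X_full = np.column_stack([np.log(tokens), tool_calls, model_idx])

r2_tokens = LinearRegression().fit(X_tokens, score).score(X_tokens, score)
r2_full = LinearRegression().fit(X_full, score).score(X_full, score)

print(f"variance explained by tokens alone:  {r2_tokens:.0%}")
print(f"variance explained by all 3 factors: {r2_full:.0%}")
```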

Of course, this also comes with an obvious cost: burning money.

Data shows that agent interactions consume approximately 4 times more tokens than regular chats, and multi-agent systems consume as much as 15 times more.

This means that multi-agent systems are economically viable only for scenarios where the task value is high enough to cover their performance cost.

Additionally, not all tasks are suitable for multi-agents. For example, most programming tasks have much lower parallelism than research tasks, and LLM agents are currently not very good at real-time coordination and delegation of coding work.

In summary, multi-agent systems excel in high-value, highly parallelizable tasks where the information volume exceeds a single context window and requires interaction with numerous complex tools.

Architecture Revealed: Commander + Workers, a Multi-Step Research Process

Anthropic's research system employs a classic orchestrator-worker pattern. A leader agent coordinates the entire process and delegates specific tasks to parallel specialized subagents.

The official architecture diagram below clearly illustrates its workflow:

[Architecture diagram: the LeadResearcher coordinates parallel Subagents, persists its plan to Memory, and hands the final report to a CitationAgent]

We can break it down into the following key steps:

1. Launch and Planning: When a user submits a query (e.g., "What are the top companies in the AI agent field in 2025?"), the system creates a LeadResearcher agent. It enters an iterative research process whose first step is to think through the problem and save its research plan to Memory. This is a crucial detail: even a 200K-token context window can fill up, so persisting the core plan to external memory ensures the agent does not "lose its memory" during long-running tasks.

2. Task Decomposition and Delegation: The LeadResearcher creates multiple specialized Subagents based on the plan. The diagram shows two, but the actual number is adjusted dynamically. Each Subagent is given a very specific research task, such as "research the latest developments of company A" or "find the funding history of company B."

3. Parallel Execution and Dynamic Adjustment: Each Subagent works independently, gathering information with tools such as search. A key design choice is interleaved thinking: after each tool call, the Subagent pauses to think, evaluates the quality of the results, identifies information gaps, and plans the next query. This lets subagents adapt dynamically to their tasks.

4. Result Synthesis and Iteration: After completing their tasks, subagents return their findings to the LeadResearcher, which synthesizes all subagent reports and determines whether further research is needed. If so, it can create more subagents or adjust its strategy, forming a research loop.

5. Citation and Attribution: Once the LeadResearcher determines that enough information has been gathered, the research loop exits. The system then passes the research report and the original documents to a dedicated CitationAgent, whose sole responsibility is to match every statement in the report to its original source. This greatly improves the factual accuracy and traceability of the final answer.

6. Final Delivery: Finally, a research report with complete, precise citations is presented to the user.
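Anthropic has not released its implementation, but the orchestrator-worker loop above can be sketched in a few dozen lines. The Python sketch below is a simplified illustration under stated assumptions: call_llm and web_search are hypothetical stand-ins for a model API and a search tool, the plan lives in a plain dict rather than real external memory, and the real system adds error handling and far richer tooling.

```python
# Minimal sketch of the orchestrator-worker pattern described above.
# call_llm() and web_search() are hypothetical placeholders, not Anthropic's API.
import asyncio
import json

async def call_llm(role: str, prompt: str) -> str:
    # Placeholder: swap in a real model call (a strong model for "lead",
    # a cheaper one for "subagent"). Returns "[]" so the sketch runs end to end.
    return "[]" if "JSON list" in prompt else f"({role} output)"

async def web_search(query: str) -> str:
    # Placeholder: swap in a real search tool.
    return f"(search results for: {query})"

async def run_subagent(task: str) -> str:
    """Worker: search, evaluate the results, and return compressed findings."""
    results = await web_search(task)
    return await call_llm(
        "subagent",
        f"Task: {task}\nResults: {results}\n"
        "Evaluate the results, note gaps, and summarize the key findings.")

async def lead_researcher(query: str, max_rounds: int = 3) -> str:
    # Step 1: think and persist a research plan to memory.
    memory = {"plan": await call_llm("lead", f"Write a research plan for: {query}")}
    findings: list[str] = []

    for _ in range(max_rounds):
        # Step 2: decompose remaining work into specific subagent tasks.
        tasks = json.loads(await call_llm(
            "lead",
            f"Plan: {memory['plan']}\nFindings so far: {findings}\n"
            "Return a JSON list of specific subagent tasks, or [] if research is done."))
        if not tasks:
            break
        # Step 3: run subagents in parallel, each with its own task and context.
        findings += await asyncio.gather(*(run_subagent(t) for t in tasks))

    # Steps 4-6: synthesize, then hand off to a citation pass before delivery.
    report = await call_llm("lead", f"Synthesize a report for '{query}' from: {findings}")
    return await call_llm("citation", f"Attribute each claim to its source:\n{report}")

if __name__ == "__main__":
    print(asyncio.run(lead_researcher("Top companies in the AI agent field in 2025")))
```

In practice, the fragile part is getting the lead model to return well-formed, non-overlapping task lists; the prompt-engineering rules below (clear objectives, output formats, effort-scaling guidance) exist largely to make this delegation reliable.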


The entire architecture is fundamentally different from traditional Retrieval-Augmented Generation (RAG). Traditional RAG is static; it retrieves text chunks most similar to a query once and then generates an answer. Anthropic's system, however, is dynamic and multi-step, actively discovering, adapting to, and analyzing information to generate answers of much higher quality than RAG.

The "Eight Golden Rules" of Prompt Engineering

If architecture is the skeleton, then Prompts are the incantations that infuse the agent with soul.

The Anthropic team admits that in the early stages of the system, agents behaved chaotically: generating 50 subagents for a simple query, endlessly searching for non-existent sources, and even interfering with each other.

Prompt engineering was their core lever for taming these "wild horses." They summarized eight golden rules:

1. Think like an agent: To write good Prompts, you must first become an agent. The team built simulated environments and observed agent behavior step-by-step, which immediately revealed failure modes: for example, continuing to search after finding an answer; overly lengthy search queries; selecting the wrong tool, etc. Building an accurate mental model of agent behavior is a prerequisite for effective iteration.

2. Teach the commander how to delegate: The leader agent needs to give subagents clear instructions. A simple instruction like "research semiconductor shortages" is far from enough, because it can lead subagents to duplicate work or miss critical information. For example, one subagent might research the 2021 automotive chip crisis while two others both research the 2025 supply chain situation. Good instructions must include clear objectives, an output format, suggested tools and data sources, and clear task boundaries.

3. Adjust workload based on complexity: Agents find it difficult to determine how much effort they should put into different tasks themselves. Therefore, the team directly embedded scaling rules into the Prompt.

• Simple fact-finding: Requires 1 agent, 3-10 tool calls.

• Direct comparison: Requires 2-4 subagents, 10-15 tool calls each.

• Complex research: May require more than 10 subagents with a clear division of labor.

These explicit guidelines help the leader agent allocate resources efficiently and avoid over-investing in simple problems.

4. Tool design is crucial: The interface between agents and tools is as important as the human-machine interface, and using the right tool makes the job far easier. If an agent is asked to find information that exists only in an internal Slack but searches the web instead, it is doomed from the start. Poor tool descriptions can send agents in completely wrong directions, so each tool needs a distinct purpose and a clear description. The team even gave agents heuristic rules for tool selection in the prompt: first examine all available tools, match tool purpose to user intent, prioritize specialized tools, and so on.

5. Enable agents to self-improve: This is a "meta-cognitive" insight: the Claude 4 models are themselves excellent prompt engineers. Given a failing prompt and a failure case, they can accurately diagnose the problem and suggest improvements. The team even created a "tool-testing agent": given a flawed tool, it tries to use the tool and then rewrites the tool's description to avoid future failures. Over dozens of trials, this agent uncovered subtle nuances and bugs, and the improved descriptions ultimately cut task completion time by 40% for subsequent agents.

6. Cast a wide net first, then focus precisely: Search strategy should mimic how human experts research: first survey the field broadly, then drill into the details. Agents often default to overly long, overly specific queries that return sparse results. The team corrected this tendency by prompting agents to start with short, broad queries, evaluate the available information, and then progressively narrow their focus.

7. Guide the thinking process: Claude's "extended thinking mode" (outputting thought process within tags) can serve as a controlled scratchpad. The leader agent uses it to plan methods, evaluate tools, and determine the number and roles of subagents. Subagents use it to plan queries and evaluate result quality after tool calls. Tests show that this method significantly improved instruction following, reasoning, and efficiency.

8. Experiment with parallel execution: Early agents searched serially, which was painfully slow. The team introduced two types of parallelization:

• Macro-parallelism: The leader agent launches 3-5 subagents at once, rather than serially.

• Micro-parallelism: Each subagent makes 3+ tool calls in parallel rather than one at a time.

These two changes cut research time for complex queries by up to 90%, allowing the system to finish in minutes tasks that previously took hours.
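As a concrete illustration of micro-parallelism, here is a minimal sketch of a subagent batching several tool calls with asyncio instead of awaiting them one at a time. The web_search and fetch_page helpers and the example URL are hypothetical stand-ins, not the actual tools in Anthropic's system.

```python
# Minimal sketch of micro-parallelism: one subagent firing several tool calls at once.
# web_search() and fetch_page() are hypothetical async tool wrappers.
import asyncio

async def web_search(query: str) -> str:
    await asyncio.sleep(1)                    # stand-in for network latency
    return f"(results for: {query})"

async def fetch_page(url: str) -> str:
    await asyncio.sleep(1)
    return f"(contents of {url})"

async def gather_evidence(topic: str) -> list[str]:
    # Awaiting these serially would take ~3s here; batching them with
    # asyncio.gather takes roughly the time of the slowest call (~1s).
    return list(await asyncio.gather(
        web_search(f"{topic} overview"),
        web_search(f"{topic} recent funding news"),
        fetch_page("https://example.com/industry-report"),   # placeholder URL
    ))

if __name__ == "__main__":
    print(asyncio.run(gather_evidence("AI agent startups")))
```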

How to Evaluate Effectively? From LLM-as-Judge to Human Red Teams

Evaluation is the cornerstone of building reliable AI applications, but evaluating multi-agent systems is particularly difficult.

Traditional evaluation assumes that for input X, the system should follow path Y to get output Z. But agents are non-deterministic; they can reach the same correct goal through completely different valid paths.

Therefore, evaluation methods must be flexible enough to judge both the correctness of the result and the reasonableness of the process.

1. Start immediately with small-sample evaluation: This is valuable advice for any AI development team. Many teams believe that only large evaluation sets with hundreds of cases are worthwhile and therefore delay evaluating at all. Anthropic's experience is the opposite: start evaluating immediately with small samples. In early development, a small prompt change can push success rates from 30% to 80%, and an effect that large can be detected with about 20 representative queries.

2. Well-designed LLM-as-judge: Research reports are free-form text and hard to evaluate programmatically, so LLMs naturally became the best "examiners." Anthropic uses an LLM judge that scores reports against a detailed rubric:

• Factual accuracy: Does the statement match the source?

• Citation accuracy: Does the cited source support the statement?

• Completeness: Does it cover all requested content?

• Source quality: Were high-quality primary sources used, not SEO farms?

• Tool efficiency: Were the correct tools used a reasonable number of times?

They found that a single LLM call with a single prompt, outputting a 0.0-1.0 score and a pass/fail grade, was the most stable approach and the most consistent with human judgment (a minimal judge sketch follows this list).

3. Human evaluation is indispensable: Automated evaluation always has blind spots. Human testers (red teams) can discover unexpected edge cases. For example, human testers found that early agents tended to choose SEO-optimized content farms rather than more authoritative but lower-ranked sources, such as academic PDFs or personal blogs. The team solved this problem by adding heuristic rules about source quality to the prompt.
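Anthropic does not publish its rubric prompt, so the sketch below is one plausible way to implement a single-call judge along these lines; the rubric wording, the judge_model placeholder, and the JSON output format are assumptions, not Anthropic's implementation.

```python
# Minimal sketch of a single-call LLM-as-judge grading a research report against a rubric.
# judge_model() is a hypothetical stand-in for a real model call; the rubric text is illustrative.
import json

RUBRIC_PROMPT = """You are grading a research report. Score each criterion from 0.0 to 1.0:
- factual_accuracy: do the claims match the sources?
- citation_accuracy: do the cited sources support the claims?
- completeness: does the report cover everything the query asked for?
- source_quality: were authoritative primary sources used (not SEO content farms)?
- tool_efficiency: were the right tools used a reasonable number of times?
Respond with JSON: {"scores": {...}, "overall": 0.0-1.0, "pass": true or false, "reason": "..."}"""

def judge_model(prompt: str) -> str:
    # Placeholder: swap in a real model call here.
    return '{"scores": {}, "overall": 0.0, "pass": false, "reason": "stub"}'

def grade_report(query: str, report: str, sources: str) -> dict:
    """One LLM call per report: returns an overall 0.0-1.0 score plus a pass/fail grade."""
    response = judge_model(
        f"{RUBRIC_PROMPT}\n\nQuery: {query}\n\nReport:\n{report}\n\nSources:\n{sources}")
    return json.loads(response)

if __name__ == "__main__":
    print(grade_report("Example query", "Example report text", "Example source list"))
```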

From Prototype to Product: Hard-Earned Engineering Lessons

The gap between a well-functioning agent prototype on a development machine and a reliable, production-grade system is much wider than imagined. Anthropic calls this "the last mile is critical."

1. Statefulness and error accumulation: Agents are long-running, stateful processes. A bug that would be minor in traditional software can be amplified in an agent system until it derails the entire task. Simply "starting over after an error" is therefore unacceptable: it is both expensive and frustrating for users. Their solutions are:

• Recoverability: By setting checkpoints, the system can recover from where an error occurred, rather than restarting.

• Enabling agents to adapt to errors: When a tool fails, tell the agent directly and let it use its own intelligence to adapt and find alternatives. This approach has been surprisingly effective (a minimal sketch of both recovery patterns follows this list).

2. Debugging non-determinism: Due to the non-deterministic nature of agents, reproducing bugs becomes extremely difficult. User reports often state "the agent didn't find obvious information," but the cause is difficult to trace. The solution is full production tracing. This allows them to diagnose the root causes of failures and fix them systematically. Furthermore, they monitor the agent's decision patterns and interaction structures (without monitoring specific content, while protecting user privacy) to discover unexpected behaviors.

3. Cautious deployment: An agent system is a highly stateful network composed of Prompts, tools, and execution logic that runs almost continuously. When deploying updates, you cannot simply interrupt agents that are currently running. They adopted a "rainbow deployments" strategy, where new and old versions of the system run simultaneously, and traffic is gradually migrated from the old to the new version, thus avoiding interference with agents performing tasks.

4. Future: Asynchronous execution: Currently the system is synchronous: the leader agent must wait for a batch of subagents to finish before moving on. This simplifies coordination but creates bottlenecks. The future direction is asynchronous execution, where agents work concurrently and spawn new subagents as needed. Although this raises significant challenges in coordinating results and keeping state consistent, the performance gains will be worth it as models become capable of handling longer, more complex tasks.
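A minimal sketch of the two recovery patterns from point 1, checkpointing so a run can resume and surfacing tool errors to the model so it can adapt, is shown below. The save/load helpers, call_tool, and call_llm are hypothetical stand-ins, not Anthropic's implementation; a production system would also checkpoint conversation state, not just findings.

```python
# Minimal sketch of two recovery patterns: resumable checkpoints and
# feeding tool errors back to the agent so it can adapt instead of aborting.
# call_tool() and call_llm() are hypothetical placeholders.
import json
from pathlib import Path

CHECKPOINT = Path("agent_state.json")

def save_checkpoint(state: dict) -> None:
    CHECKPOINT.write_text(json.dumps(state))

def load_checkpoint() -> dict:
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())   # resume where we left off
    return {"step": 0, "findings": []}

def call_tool(name: str, args: dict) -> str:
    raise TimeoutError("search backend unavailable")    # simulate a flaky tool

def call_llm(prompt: str) -> str:
    return "(model chooses an alternative tool or query)"   # placeholder

def run_step(state: dict, tool: str, args: dict) -> None:
    try:
        result = call_tool(tool, args)
    except Exception as exc:
        # Instead of failing the whole task, tell the agent what went wrong
        # and let it adapt with an alternative approach.
        result = call_llm(f"The tool '{tool}' failed with: {exc}. "
                          "Pick an alternative way to achieve the same goal.")
    state["findings"].append(result)
    state["step"] += 1
    save_checkpoint(state)   # later runs resume from here instead of restarting

if __name__ == "__main__":
    state = load_checkpoint()
    run_step(state, "web_search", {"query": "2025 chip supply chain"})
    print(state)
```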

Summary and Outlook: The Dawn of an AI "Virtual Company"

Anthropic's article reveals a profound reality: in building production-grade AI agents, the last mile is critical. The chasm between prototype and product stems from the way errors continuously accumulate in agent systems.

Despite the challenges, multi-agent systems have demonstrated immense value. User feedback indicates that Claude's research function has helped them discover unforeseen business opportunities, navigate complex healthcare options, resolve thorny technical bugs, and save days of work by revealing deep connections between research areas.

From its analysis of usage data, Anthropic found that the most common use cases for this feature currently include:


• Developing domain-specific software systems

• Developing and optimizing specialized technical content

• Formulating business growth and revenue generation strategies

• Assisting academic research and educational material development

• Researching and verifying information about people, places, or organizations

Behind this lies careful engineering design, comprehensive testing, meticulous refinement of prompts and tools, robust operational practices, a deep understanding of agents' current capabilities, and close collaboration between the research, product, and engineering teams.

The "iPhone moment" for agents may not have arrived yet, but Anthropic's exploration undoubtedly lights the way. An "AI virtual company" composed of a leader agent (CEO), subagents (expert employees), tools (departmental capabilities), and memory (a knowledge base) is rising on the horizon.

Human collective intelligence is being "replicated" and "accelerated" in a new digital form. This, perhaps, is the most exciting future for multi-agent systems.

Reference link: https://www.anthropic.com/engineering/built-multi-agent-research-system
