Multi-Agent Systems Are "Burning" Tokens! Everything Anthropic Has Discovered

Synced Review

Synced Editorial Department

A must-read guide for multi-agent system research.

“Anthropic has released an excellent explanation of how they built a multi-agent research system using multiple Claude AI agents. This is a must-read guide for anyone building multi-agent systems.” With these words, the well-known X blogger Rohan Paul strongly recommended a new engineering article from Anthropic.


New research on AI agents has been appearing in rapid succession lately. It has also left researchers with open questions: Which tasks actually call for multiple agents? How should multiple AI agents collaborate? How should context and memory limitations be handled?

If you are wrestling with these questions, Anthropic's article is well worth a read; you may find the answers there.


Article link: https://www.anthropic.com/engineering/built-multi-agent-research-system

Advantages of Multi-Agent Systems

Some research involves open-ended problems, where the required steps are often difficult to pre-determine. For complex problem exploration, humans cannot rigidly dictate a fixed path, as the process is inherently dynamic and path-dependent. When people conduct research, they typically adjust their methods continuously based on discoveries, following leads that emerge during the investigation.

This unpredictability makes AI agents particularly suitable for research-type tasks. Research work requires flexibility, the ability to pivot or explore related content as developments unfold during the investigation. Models must be able to perform multi-round reasoning autonomously, deciding further exploration directions based on intermediate findings. A linear, one-shot process cannot handle such tasks.

The essence of research is compression: distilling valuable insights from a vast corpus. Subagents assist in this compression process by running in parallel, each with its own independent context window. They can simultaneously explore different aspects of a problem, then distill the most important findings to the main research agent. Each subagent also serves to separate concerns — they use different tools, prompts, and exploration paths, thereby reducing path dependence and ensuring a more comprehensive and independent research process.

Once intelligence reaches a certain threshold, multi-agent systems become a key way to improve performance. For example, although individual humans have grown more intelligent over the past hundred thousand years, it is collective intelligence and the ability to coordinate that have made human societies exponentially more capable in the information age. Likewise, even a generally intelligent agent has limits as an individual; multiple agents working together can tackle far more complex tasks.

Anthropic’s internal evaluations show that multi-agent research systems perform exceptionally well in "breadth-first" query tasks, which typically require simultaneously exploring multiple independent directions. They found that a multi-agent system composed of Claude Opus 4 as the main agent and Claude Sonnet 4 as subagents performed 90.2% better than a single Claude Opus 4 agent.

A core advantage of multi-agent systems is their ability to solve problems through substantial token consumption. Analysis shows that in BrowseComp evaluations (which measure the ability of browsing agents to locate challenging information), three factors collectively explain 95% of the performance variance. The study found:

  • token consumption alone explained 80% of the variance;

  • the number of tool calls and model selection constituted the other two key factors.

This finding validates Anthropic’s previously adopted architecture: by distributing tasks to different agents, each with its own context window, capacity for parallel reasoning is increased. The latest Claude models demonstrate a powerful multiplier effect on token usage efficiency. For example, upgrading Claude Sonnet to version 4 delivers performance improvements that even surpass doubling the token budget of Claude Sonnet 3.7. For tasks that exceed the processing limits of a single agent, a multi-agent architecture can effectively scale token usage, thereby achieving stronger processing capabilities.

Of course, this architecture also has a drawback: in practice, these systems consume tokens very quickly. According to Anthropic's statistics, an agent typically uses about 4 times as many tokens as a normal chat interaction, while a multi-agent system consumes about 15 times as many tokens as chat.

Therefore, to achieve economic viability, multi-agent systems need to be used in scenarios where the task value is high enough to cover the costs associated with their performance improvements. Furthermore, some domains are not suitable for current multi-agent systems, such as tasks requiring all agents to share the same context, or where there are significant dependencies between agents.

For instance, most programming tasks contain relatively few parts that can truly be parallelized, and current large language model agents are not yet strong enough at real-time coordination and task allocation.

Therefore, multi-agent systems excel in high-value tasks characterized by: requiring extensive parallel processing, information volume exceeding a single context window, and needing interaction with a large number of complex tools.

Architecture

Anthropic's research system adopts a multi-agent architecture, using an "orchestrator-worker" pattern: a lead agent is responsible for overall coordination while dispatching tasks to multiple specialized subagents running in parallel. 


How the multi-agent architecture actually works: The user's query first goes through the lead agent, which creates multiple specialized subagents to search different aspects of the query in parallel.


The workflow shown above is as follows: When a user submits a query, the system creates a lead research agent called LeadResearcher, which enters an iterative research process. LeadResearcher first considers research methods and saves its plan to Memory (memory module) to persist context information — because once the context window exceeds 200,000 tokens, content will be truncated, and retaining the research plan is crucial for subsequent reasoning.

Subsequently, LeadResearcher creates multiple specialized subagents (the diagram shows two, but there can be any number) and assigns specific research tasks to each. Each Subagent independently performs web searches, uses an interleaved thinking approach to evaluate the results returned by tools, and feeds research findings back to LeadResearcher.

LeadResearcher comprehensively analyzes these results and determines whether further research is needed — if so, it can create more subagents or optimize existing research strategies.

Once sufficient information is collected, the system exits the research loop and passes all research findings to the CitationAgent (citation annotation agent), which processes all documents and research reports, identifying the specific citation locations corresponding to each assertion, thereby ensuring that all points are supported by clear sources.

Finally, the research results, including complete citation information, are returned to the user.
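
To make this loop concrete, here is a minimal Python sketch of the orchestrator-worker pattern described above. All helper functions (plan_research, decompose, run_subagent, add_citations) and the Memory class are placeholders standing in for model and tool calls; this is an illustration under assumptions, not Anthropic's actual code.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Persists the research plan so it survives context-window truncation."""
    plan: str = ""
    notes: list[str] = field(default_factory=list)

# The helpers below are placeholders for model and tool calls.
def plan_research(query: str) -> str:
    return f"Plan: break '{query}' into independent sub-questions and search each in parallel."

def decompose(query: str, findings: list[str]) -> list[str]:
    # Decide which sub-questions still need work; stop once we have results.
    return [] if findings else [f"{query} -- aspect {i}" for i in (1, 2)]

def run_subagent(task: str) -> str:
    # A subagent would search the web, think over the tool results, and compress them.
    return f"compressed findings for: {task}"

def add_citations(report: str) -> str:
    # A CitationAgent-style pass would map each claim back to its source document.
    return report + "\n[citations attached]"

def lead_researcher(query: str, max_rounds: int = 3) -> str:
    memory = Memory(plan=plan_research(query))   # saved before the context grows past its limit
    findings: list[str] = []
    for _ in range(max_rounds):                  # iterative research loop
        tasks = decompose(query, findings)
        if not tasks:                            # enough information collected: exit the loop
            break
        findings.extend(run_subagent(t) for t in tasks)  # production code would run these in parallel
    report = memory.plan + "\n" + "\n".join(findings)
    return add_citations(report)

if __name__ == "__main__":
    print(lead_researcher("impact of the chip shortage on wafer fab capacity"))
```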

Prompt Engineering and Evaluation Methods for Research Agents

A key difference between multi-agent and single-agent systems is that coordination complexity rapidly increases. In early stages, agents often exhibit erroneous behaviors, such as generating up to 50 subagents for simple problems, endlessly searching for non-existent resources online, or frequently interfering with each other and sending too many irrelevant updates.

Since each agent's behavior is driven by prompts, prompt engineering becomes the primary means for researchers to optimize these behaviors. Here are some principles Anthropic summarized during the process of designing prompts for agents:

Efficient Prompt Design. To optimize prompts, one must understand their actual impact. To that end, Anthropic built simulations in its Console using the exact prompts and tool configurations from the system, then watched agents work step by step. This immediately exposed typical failure modes: redundant execution (continuing to operate after sufficient results had been obtained), inefficient queries (long, vague search instructions), and tool misuse (choosing the wrong tool for the job). Efficient prompt design therefore depends on building an accurate mental model of the agent's behavior; once that understanding is in place, the most effective improvements become obvious.

Teaching the Coordinator How to Divide Work Correctly. In the system adopted by Anthropic, the lead agent is responsible for breaking down user queries into several sub-tasks and assigning these tasks to subagents. Each subagent requires clear objectives, output formats, guidance on which tools and information sources to use, and clear task boundaries. If task descriptions are not specific enough, agents may engage in redundant work, leave tasks undone, or fail to find the necessary information.

Anthropic learned this lesson the hard way: early on, they used broad instructions like "research chip shortages," which proved too vague and led subagents to misinterpret the task or run searches that duplicated other agents' work. In one case, one subagent drifted off to the 2021 automotive chip crisis while two others duplicated work on 2025 supply chain data, and none covered manufacturing bottlenecks; the final report contained roughly 60% duplicated content and no wafer fab capacity analysis.
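
As a hypothetical illustration of what a well-specified delegation might contain, the sketch below encodes the elements named above (objective, output format, suggested tools, boundaries) as a small Python schema. The field names, tool names, and example tasks are assumptions, not Anthropic's actual format.

```python
from dataclasses import dataclass

@dataclass
class SubagentTask:
    """A hypothetical schema for a well-specified delegation."""
    objective: str              # one clearly scoped sub-question
    output_format: str          # what the subagent should return to the lead agent
    suggested_tools: list[str]  # which tools and information sources to use
    boundaries: str             # what the subagent should NOT spend effort on

# Vague: "research chip shortages" invites duplicated or drifting subagents.
# Specific delegations instead:
tasks = [
    SubagentTask(
        objective="Summarize 2025 semiconductor supply chain constraints",
        output_format="Bullet list with one source URL per claim",
        suggested_tools=["web_search"],
        boundaries="Skip pre-2023 history; another subagent covers manufacturing capacity",
    ),
    SubagentTask(
        objective="Quantify current wafer fab capacity bottlenecks by region",
        output_format="Short table: region, capacity, utilization, source",
        suggested_tools=["web_search", "internal_reports"],   # tool names are illustrative
        boundaries="Do not analyze demand-side effects",
    ),
]
```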

Adjusting Effort Based on Query Complexity. Because agents have difficulty determining the appropriate effort required for different tasks, Anthropic embedded tiered effort rules into their prompts. Simple fact-finding only requires 1 agent calling tools 3-10 times; direct comparison tasks might need 2-4 subagents, each calling tools 10-15 times; while complex research tasks could use over 10 subagents, with clearly defined responsibilities.

These clear guidelines help the lead agent allocate resources more effectively, avoiding over-investment in simple queries.
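
A minimal sketch of how such tiered rules might be encoded and injected into the lead agent's prompt is shown below; the exact thresholds mirror the tiers described above, but the structure and helper are assumptions for illustration.

```python
# Hypothetical scaling rules, embedded in the lead agent's prompt so it can budget
# subagents and tool calls according to query complexity.
EFFORT_RULES = {
    "simple_fact_finding": {"subagents": 1, "tool_calls_each": "3-10"},
    "direct_comparison":   {"subagents": "2-4", "tool_calls_each": "10-15"},
    "complex_research":    {"subagents": "10+", "tool_calls_each": "divide responsibilities clearly"},
}

def effort_guidance(complexity: str) -> str:
    rule = EFFORT_RULES[complexity]
    return (f"For a {complexity.replace('_', ' ')} query, use {rule['subagents']} subagent(s), "
            f"with {rule['tool_calls_each']} tool calls each.")

print(effort_guidance("direct_comparison"))
```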

Tool Design and Selection Are Crucial. The interface between agents and tools is as important as the human-computer interface. Using the right tools can significantly improve efficiency — in many cases, this is not just an optimization but a necessary condition. For example, if an agent attempts to retrieve context information that only exists in Slack through web search, it is doomed to fail from the outset.

As MCP servers enable models to access external tools, this problem becomes more complex — agents may encounter tools they have never used before, and the quality of these tool descriptions can vary.

Therefore, Anthropic designed clear heuristic rules for agents, such as: first review all available tools, match tool usage to user intent, use web search for broad information exploration, prioritize specialized tools over general ones, etc.

Poor tool descriptions can lead agents completely down the wrong path, so every tool must have a clear purpose and description.
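
The sketch below shows one way such heuristics and tool descriptions could be rendered into an agent's prompt. The registry, tool names, and wording are illustrative assumptions, not Anthropic's actual configuration.

```python
# A hypothetical tool registry: each tool gets a distinct purpose and a description the
# agent can match against user intent, plus heuristics injected into the agent's prompt.
TOOLS = {
    "web_search":   "Broad external exploration; use for general or unfamiliar topics.",
    "slack_search": "Internal team discussions; required for context that never reaches the web.",
    "drive_search": "Internal documents and reports owned by the organization.",
}

TOOL_HEURISTICS = """\
Before acting: review all available tools.
Match the tool to the user's intent; do not default to web search for internal questions.
Prefer specialized tools over general ones when both could work.
"""

def render_tool_prompt() -> str:
    described = "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
    return f"Available tools:\n{described}\n\n{TOOL_HEURISTICS}"

print(render_tool_prompt())
```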

Enabling Agents to Self-Improve. Anthropic found that the Claude 4 series models perform exceptionally well in prompt engineering. When given a prompt and a corresponding failure mode, they can diagnose the cause of agent failure and suggest improvements.

Anthropic even built a tool testing agent: when it receives a problematic MCP tool, it attempts to use it and then rewrites its tool description to prevent similar failures. By testing the tool dozens of times, this agent can discover crucial usage details and potential bugs.

This process of optimizing the tool interaction experience reduced task completion time by 40% for subsequent agents using the new descriptions, as they could avoid most common errors.
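
A minimal sketch of such a tool-testing loop is shown below: exercise a flawed tool many times, collect the failure modes it hits, and rewrite its description so future agents avoid them. The failure simulation and rewrite logic are placeholders; in Anthropic's setup a model performs the rewriting.

```python
import random

def try_tool(description: str) -> str | None:
    # Placeholder failure: the tool breaks on relative paths unless the description warns about it.
    if "absolute path" not in description and random.random() < 0.5:
        return "relative paths are not resolved"
    return None

def rewrite_description(description: str, failures: list[str]) -> str:
    # Stand-in for a model call that rewrites the description based on observed pitfalls.
    return (description + " Pitfalls observed: " + "; ".join(sorted(set(failures)))
            + ". Always pass an absolute path.")

def improve_tool_description(description: str, trials: int = 30) -> str:
    # Dozens of test runs surface rare failure modes before real agents hit them.
    failures = [err for err in (try_tool(description) for _ in range(trials)) if err]
    return rewrite_description(description, failures) if failures else description

print(improve_tool_description("Reads a file from the project workspace."))
```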

Broad First, Then Narrow; Gradual Progress. Search strategies should mimic the research methods of human experts: first explore broadly, then refine in depth. However, agents often tend to start with long, specific queries, resulting in very limited content being returned.

To address this, Anthropic guides agents in their prompts to start with short, broad queries, first assessing available information, and then gradually focusing and deepening the research direction.
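
As a small illustration of this broad-then-narrow progression, the sketch below starts with a short general query and only drills down once initial results are available. The refinement logic is a placeholder; a real agent would decide the follow-up queries itself.

```python
def plan_queries(topic: str, initial_results: list[str] | None = None) -> list[str]:
    if initial_results is None:
        # Start short and broad to survey what information is actually available.
        return [topic]
    # Narrow based on what the broad pass surfaced (placeholder refinement logic).
    return [f"{topic} {lead}" for lead in initial_results[:3]]

print(plan_queries("semiconductor shortage"))
print(plan_queries("semiconductor shortage",
                   ["wafer fab capacity", "automotive chips", "export controls"]))
```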

Guiding the Thinking Process. "Extended Thinking Mode" allows Claude to display its visible thought process in its output, functioning as a controllable "scratchpad." The lead agent utilizes this thinking process to plan the overall strategy, including evaluating which tools are suitable for the current task, judging the complexity of the query and the number of subagents needed, and defining the responsibilities of each subagent.

Tests show that extended thinking significantly improves agents' instruction following, reasoning capabilities, and execution efficiency.

Subagents also first formulate a plan and then, after tool calls, use Interleaved Thinking to evaluate result quality, identify information gaps, and refine the next query. This gives subagents stronger adaptability when facing different tasks.
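
For reference, the sketch below shows how extended thinking can be requested through the Anthropic Python SDK. The model name, token budgets, and response handling here are assumptions and should be checked against the current API documentation; it is a minimal sketch, not the research system's actual code.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # model ID is an assumption; verify against current docs
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},  # extended thinking with a token budget
    messages=[{
        "role": "user",
        "content": "Plan a research strategy for: impact of the chip shortage on wafer fab capacity",
    }],
)

# The response interleaves "thinking" blocks (the scratchpad) with regular "text" blocks.
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200])
    elif block.type == "text":
        print("[answer]", block.text[:200])
```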

Parallel tool invocation completely transforms the speed and performance of research tasks. Complex research tasks inherently require consulting numerous information sources. Anthropic's early agents used serial search, which was extremely inefficient.

To address this, they introduced two parallel mechanisms:

  • the lead agent simultaneously creates 3-5 subagents instead of generating them sequentially;

  • each subagent uses more than 3 tools concurrently instead of calling them one by one.

These improvements reduced the research time for complex queries by up to 90%, allowing the research system to complete in minutes what previously took hours, while also covering a far broader range of information than other systems.
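
A minimal asyncio sketch of these two parallel mechanisms is shown below: the lead agent spawns several subagents at once, and each subagent fires its tool calls concurrently. The tool names, task strings, and latency simulation are placeholders, not Anthropic's implementation.

```python
import asyncio

async def call_tool(name: str, query: str) -> str:
    await asyncio.sleep(0.1)                      # stands in for network / model latency
    return f"{name}({query})"

async def subagent(task: str) -> list[str]:
    # Fire 3+ tool calls concurrently instead of one after another.
    calls = [call_tool(t, task) for t in ("web_search", "fetch_page", "web_search")]
    return await asyncio.gather(*calls)

async def lead_agent(query: str) -> list[list[str]]:
    # Create 3-5 subagents together rather than sequentially.
    tasks = [f"{query} -- aspect {i}" for i in range(1, 4)]
    return await asyncio.gather(*(subagent(t) for t in tasks))

print(asyncio.run(lead_agent("chip shortage")))
```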

Effective Evaluation Methods

Good evaluation mechanisms are crucial for building reliable AI applications, and agent systems are no exception. However, evaluating multi-agent systems faces unique challenges.

Traditional evaluations usually assume that AI will follow the same steps every time: given input X, the system should execute path Y and output result Z. But multi-agent systems do not work this way. Even starting from the same point, agents may take entirely different but equally valid paths to achieve their goals. Some agents might consult only 3 information sources, others 10; they might also use different tools to arrive at the same answer.

Since we don't always know which set of operational steps is correct, it's often impossible to evaluate agent performance merely by checking adherence to a preset process. Instead, we need more flexible evaluation methods that assess whether the agent achieved the correct result and whether its execution process was reasonable.

Starting with Small-Sample Evaluations. In the early stages of agent development, changes tend to have dramatic effects. For example, simply adjusting a prompt might boost the success rate from 30% to 80%. When effect sizes are this large, only a small number of test cases are needed to see whether a change helps.

Anthropic initially used a set of approximately 20 queries, which represented real usage patterns. Testing these queries was usually sufficient to clearly determine the effect of a change.

People often hear AI development teams say they postpone creating evaluation mechanisms because they believe only large-scale evaluations with hundreds of test cases are valuable. But in reality, the best practice is to start small immediately, beginning evaluation with a few examples rather than waiting until a complete evaluation system is built.

When used properly, the "LLM-as-judge" evaluation method is also a good option.

Research-type outputs are difficult to evaluate programmatically because they are often free-form text and rarely have a single correct answer. LLMs are naturally suited to act as evaluators for such outputs.

Anthropic used an "LLM judge" to evaluate each output based on a set of criteria (rubric), specifically including the following dimensions:

  • Factual accuracy: Does the statement align with the cited source?

  • Citation accuracy: Does the cited content genuinely support the corresponding statement?

  • Completeness: Did it cover all requested content?

  • Information source quality: Did it prioritize high-quality primary sources over lower-quality secondary materials?

  • Tool usage efficiency: Were relevant tools reasonably selected and appropriately used?

Anthropic experimented with using multiple LLMs to evaluate each dimension separately, but ultimately found that a single LLM call, using one prompt to generate a 0.0–1.0 score and a "pass/fail" judgment, was the most stable method and best aligned with human judgment.

This method is particularly effective when test cases themselves have clear answers, such as: "Did it accurately list the top three pharmaceutical companies by R&D spending?" Such questions allow for direct judgment of correctness.

Leveraging LLMs as judges enables efficient scaling to evaluate hundreds of outputs, significantly boosting the scalability and practicality of the evaluation system.
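
Below is a hedged sketch of such a single-call LLM judge: one prompt, one rubric, per-dimension scores from 0.0 to 1.0, and an overall pass/fail verdict. The prompt wording, JSON schema, and the `call_llm` hook are assumptions for illustration.

```python
import json

RUBRIC = [
    "factual_accuracy",      # do claims match the cited sources?
    "citation_accuracy",     # do citations genuinely support the claims?
    "completeness",          # is everything the user asked for covered?
    "source_quality",        # primary sources preferred over weak secondary ones?
    "tool_efficiency",       # were the right tools chosen and used a reasonable number of times?
]

JUDGE_PROMPT = """You are grading a research report.
For each dimension below, give a score between 0.0 and 1.0, then an overall "pass" or "fail".
Dimensions: {dimensions}
Question: {question}
Report: {report}
Respond with JSON: {{"scores": {{...}}, "verdict": "pass" or "fail"}}"""

def judge(question: str, report: str, call_llm) -> dict:
    prompt = JUDGE_PROMPT.format(dimensions=", ".join(RUBRIC), question=question, report=report)
    return json.loads(call_llm(prompt))       # expects the model to return the JSON schema above

# Example with a stubbed model call:
fake_llm = lambda _prompt: '{"scores": {"factual_accuracy": 0.9}, "verdict": "pass"}'
print(judge("Top 3 pharma companies by R&D spend?", "report text...", fake_llm))
```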

Human evaluation can uncover issues missed by automated evaluations. People who actually test agents discover edge cases that evaluation systems cannot capture, such as hallucinated answers to unusual queries, system failures, or subtle biases in source selection. Even in an era of ubiquitous automated evaluation, manual testing remains indispensable.

Production Reliability and Engineering Challenges

In traditional software, program defects can lead to functional failures, performance degradation, or system crashes. In agent systems, however, subtle changes can trigger massive behavioral shifts, making it exceptionally difficult to write code for complex agents that need to maintain state over long execution times.

Agents are stateful, and errors accumulate. Agents can run for extended periods, maintaining state across multiple tool calls. This means we need to persistently execute code and handle errors in the process. Without effective mitigation, minor system glitches can be catastrophic for an agent. When an error occurs, we cannot simply restart from scratch: restarting is costly and frustrating for users. Instead, Anthropic built systems capable of resuming execution from the state where an agent encountered an error.
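
The sketch below illustrates the general checkpoint-and-resume idea: persist the agent's state after every completed step so a crash resumes where the agent stopped instead of restarting an expensive run. The file-based storage and step function are illustrative choices, not Anthropic's design.

```python
import json
import pathlib

STATE_FILE = pathlib.Path("agent_state.json")

def load_state() -> dict:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())   # resume from the last persisted checkpoint
    return {"step": 0, "findings": []}

def save_state(state: dict) -> None:
    STATE_FILE.write_text(json.dumps(state))

def run_agent(steps: list[str]) -> dict:
    state = load_state()
    for i in range(state["step"], len(steps)):       # skip steps that already completed
        try:
            state["findings"].append(f"result of {steps[i]}")   # placeholder for a tool call
            state["step"] = i + 1
            save_state(state)                        # checkpoint after each completed step
        except Exception:
            save_state(state)                        # keep partial progress, then surface the error
            raise
    return state

print(run_agent(["search sources", "read documents", "write summary"]))
```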

Debugging. Agents make dynamic decisions during runtime, and results are non-deterministic even with identical prompts, which makes debugging more challenging. By adding full production tracing, Anthropic can systematically diagnose the causes of agent failures and fix issues.
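
A small tracing sketch is shown below: every tool call is tagged with a trace id, its inputs, duration, and outcome, so failures in non-deterministic runs can be diagnosed after the fact. The logging target (stdout here) and field names are assumptions, not Anthropic's tracing stack.

```python
import functools
import json
import time
import uuid

def traced(tool):
    @functools.wraps(tool)
    def wrapper(*args, **kwargs):
        span = {"trace_id": str(uuid.uuid4()), "tool": tool.__name__,
                "args": args, "kwargs": kwargs}
        start = time.time()
        try:
            result = tool(*args, **kwargs)
            span["status"] = "ok"
            return result
        except Exception as exc:
            span["status"] = f"error: {exc}"
            raise
        finally:
            span["duration_s"] = round(time.time() - start, 3)
            print(json.dumps(span, default=str))     # in production this would go to a trace store
    return wrapper

@traced
def web_search(query: str) -> str:
    return f"results for {query}"

web_search("chip shortage 2025")
```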

Deployment Requires Careful Coordination. Agent systems are highly stateful networks of prompts, tools, and execution logic, operating almost continuously. This means that whenever an update is deployed, agents might be at any stage of their execution. Since it is not possible to update all running agents to a new version at once, Anthropic uses rainbow deployments, gradually shifting traffic from older versions to newer ones while keeping both running in parallel, thereby avoiding disruption to active agents.

Synchronous Execution Creates Bottlenecks. Currently, Anthropic's orchestrator agent executes subagent tasks synchronously, waiting for each batch of subagents to complete before proceeding. This simplifies coordination but also creates bottlenecks in information flow between agents. For example, the lead agent cannot guide subagents in real-time, subagents cannot collaborate with each other, and the entire system might be blocked waiting for a specific subagent to complete its search.

Asynchronous execution, however, offers more parallelism: agents can work simultaneously and create new subagents as needed. But this asynchronous nature also brings challenges in terms of result coordination, state consistency, and error propagation. As models become capable of handling longer and more complex research tasks, Anthropic expects the performance gains to outweigh the increased complexity. 

Summary

When building AI agents, the "last mile" often accounts for the majority of the journey. Transforming a codebase that runs on a developer's machine into a reliable production system requires substantial engineering effort. The compounding nature of errors in agent systems means that issues which would be minor in traditional software can completely derail an agent's operation. A failure at one step can send the agent down an entirely different path, producing unpredictable results. For the reasons outlined in this article, the gap between prototype and production is typically larger than expected.

Despite these challenges, multi-agent systems have demonstrated immense value in open-ended research tasks. With meticulous engineering design, comprehensive testing, attention to detail in prompt and tool design, robust operational practices, and close collaboration between research, product, and engineering teams with a deep understanding of current agent capabilities, multi-agent research systems can operate stably in large-scale scenarios. We are already seeing these systems change how complex problems are solved.


