Google Reveals: Scaling Through Multi-Agent Reasoning Is the Future.

Google DeepMind and MIT jointly published a paper called TUMIX (Tool-Use Mixture).

It argues that Multi-Agent systems are the ultimate way to achieve test-time scaling: at half the inference cost, accuracy on HLE (Humanity's Last Exam) soared from 21.6% to 34.1%, surpassing Gemini-2.5-Pro Deep Research.

They also report a bonus finding: letting the LLM design new Agents produced even better results than the human-designed set.


A Counter-Intuitive Finding

Agent diversity > Extensive sampling

What is the current mainstream method for test-time scaling?

Repeatedly sampling the same strongest model and then using majority voting to select the answer.
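In code, that baseline fits in a few lines. The sketch below is illustrative only; `sample_answer` is a hypothetical stand-in for a single call to the strongest model.

```python
from collections import Counter
from typing import Callable

def repeated_sampling_vote(question: str,
                           sample_answer: Callable[[str], str],
                           n_samples: int = 15) -> str:
    # Draw n independent samples from the same model...
    answers = [sample_answer(question) for _ in range(n_samples)]
    # ...and keep whichever final answer appears most often.
    return Counter(answers).most_common(1)[0][0]
```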

This seems reasonable, but Google's experiments showed it to be the wrong approach.

They conducted an experiment:

  • Single Agent repeated 15 times vs 15 different Agents reasoning once each

  • At the same inference cost, the 15 different Agents showed significantly higher accuracy and coverage.

Why?

Because different Agents utilize different tool usage strategies (pure text reasoning, code execution, web search, mixed dual-tool use, etc.), allowing them to explore a broader solution space. Repeating a single Agent essentially keeps revolving within the same cognitive framework.
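To make that concrete, here is a rough sketch of what such an agent pool looks like as configuration. The names and fields are illustrative, not the paper's actual 15 agents; the point is that each entry grants a different mix of tools, so each call explores a different part of the solution space.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentConfig:
    name: str
    use_code: bool
    use_search: bool
    style: str   # free-text hint such as "search-first" or "plan then act"

# An illustrative slice of a 15-agent pool; repeating a single config 15 times
# spends the same budget without this diversity of tool strategies.
AGENT_POOL = [
    AgentConfig("cot_text",       use_code=False, use_search=False, style="pure chain-of-thought"),
    AgentConfig("code_solver",    use_code=True,  use_search=False, style="write and run code"),
    AgentConfig("web_researcher", use_code=False, use_search=True,  style="search-first"),
    AgentConfig("dual_tool",      use_code=True,  use_search=True,  style="mix code and search"),
    # ... 11 more variations in a full 15-agent setup
]
```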


They also compared three tool combinations:

  • Code_Text (text reasoning plus code only, no search)

  • Search_Text (text reasoning plus search only, no code)

  • Code_Search_Text (can use both code and search)

The results showed that the dual-tool Agent group had significantly higher coverage and accuracy than the single-tool groups.

What does this indicate? Code Interpreter and Search are not substitutes for each other; they are complementary.

Text reasoning excels at semantics and common sense, code excels at precise calculation, and search excels at acquiring the latest knowledge. Only a mix of all three can unlock the full potential of the LLM.

The Core Mechanism of TUMIX

The TUMIX architecture is not complex, revolving around three core steps:

First Round

15 different Agents perform inference in parallel.

Each Agent has its own tool usage strategy (CoT, pure code, search-first, mixed code/search, guided dual-tool, etc.).

Each Agent can call tools up to 5 times, generating 15 preliminary answers.
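A minimal sketch of this first round, assuming a hypothetical `ToolAgent` interface; how each agent actually reasons and calls its tools is left abstract.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List

MAX_TOOL_CALLS = 5  # each agent may call its tools at most five times

@dataclass
class ToolAgent:
    name: str       # e.g. "cot", "code_first", "search_first", ...
    strategy: str   # free-text description of the tool-usage policy

    def answer(self, prompt: str, max_tool_calls: int = MAX_TOOL_CALLS) -> str:
        """Hypothetical: reason over the prompt, calling tools at most
        max_tool_calls times, and return a final answer string."""
        raise NotImplementedError

def first_round(agents: List[ToolAgent], question: str) -> List[str]:
    # All agents reason over the same question independently and in parallel,
    # producing one preliminary answer each (15 answers for 15 agents).
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        return list(pool.map(lambda a: a.answer(question), agents))
```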

Second Round and Beyond

Answer Sharing + Iterative Optimization:

All answers from the previous round are appended to the original question. Each Agent generates a new answer based on the original question plus the answers from the other Agents.

This process repeats until the LLM judges that the answer has converged.
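The shared prompt can be sketched as a simple template: the original question plus every candidate answer from the previous round. The wording below is illustrative, not the paper's exact prompt.

```python
from typing import List

def refinement_prompt(question: str, previous_answers: List[str]) -> str:
    # In round 2+, every agent sees the original question plus all candidate
    # answers from the previous round, then writes a revised answer of its own.
    shared = "\n".join(
        f"Candidate answer {i}: {ans}"
        for i, ans in enumerate(previous_answers, start=1)
    )
    return (
        f"Question:\n{question}\n\n"
        f"Answers proposed by other agents in the previous round:\n{shared}\n\n"
        "Taking these candidates into account, give your own best final answer."
    )
```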

Termination

LLM-as-Judge is used to automatically determine when to stop iteration (minimum 2 rounds). The final answer is selected via majority voting.
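Putting the three steps together, the outer loop might look like the sketch below. `judge_converged` stands in for the LLM-as-Judge call, and the five-round cap is only a safety bound for the sketch; the two-round minimum and the final majority vote follow the paper's description.

```python
from collections import Counter
from typing import Callable, List

def tumix_loop(question: str,
               agents: list,                                   # objects exposing .answer(prompt) -> str
               build_prompt: Callable[[str, List[str]], str],  # e.g. refinement_prompt above
               judge_converged: Callable[[str, List[str]], bool],
               min_rounds: int = 2,
               max_rounds: int = 5) -> str:
    # Round 1: every agent answers the bare question independently.
    answers = [agent.answer(question) for agent in agents]
    for round_idx in range(2, max_rounds + 1):
        # Later rounds: each agent revises after seeing the whole answer pool.
        prompt = build_prompt(question, answers)
        answers = [agent.answer(prompt) for agent in agents]
        # LLM-as-Judge decides when to stop, never before the minimum of 2 rounds.
        if round_idx >= min_rounds and judge_converged(question, answers):
            break
    # The final answer is chosen by majority vote over the last round.
    return Counter(answers).most_common(1)[0][0]
```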


This design maintains exploratory diversity while improving answer quality through iterative optimization.

They also observed an interesting phenomenon: as the iteration rounds increase, coverage (at least one Agent answers correctly) decreases, but average accuracy increases.

This suggests that while Agents learn from each other and converge, they sometimes erroneously discard some correct answers.

Therefore, the key is to find the right stopping point: enough iteration to refine the answers, but not so much that the pool over-converges.
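Both metrics are straightforward to track per round, as in this small sketch (it assumes you have each agent's answer and a gold label for evaluation):

```python
from typing import List, Tuple

def round_metrics(answers_per_round: List[List[str]], gold: str) -> List[Tuple[bool, float]]:
    """For each round, report coverage (any agent correct) and mean accuracy."""
    metrics = []
    for answers in answers_per_round:
        correct = [a == gold for a in answers]
        coverage = any(correct)                  # at least one agent got it right
        accuracy = sum(correct) / len(correct)   # fraction of agents that got it right
        metrics.append((coverage, accuracy))
    return metrics
```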

Conclusion

Let's look at TUMIX's practical performance:

On Gemini-2.5-Pro, HLE increased from 21.6% to 32.3%, GPQA improved from 84.6% to 87.9%, and AIME 24&25 improved from 87.3% to 96.7%.

Compared to other Test-time Scaling methods (Self-MoA, Symbolic-MoE, DEI, SciMaster, GSA), TUMIX shows a clear advantage in average accuracy at the same inference cost.


Can LLMs Automatically Design Stronger Agents?

The paper included a bonus finding: they tried letting Gemini-2.5-Pro design new Agents itself.

The method was simple:

  1. Show the LLM the existing 15 human-designed Agents.

  2. Ask it to generate more diverse and higher-quality Agents.

  3. Select the 15 best-performing Agents from the 25 newly generated ones.
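As a rough sketch of that pipeline, with every helper hypothetical (`generate_specs` for the LLM call, `dev_accuracy` for scoring a design on a held-out set):

```python
from typing import Callable, List

def auto_design_agents(human_agents: List[str],
                       generate_specs: Callable[[List[str], int], List[str]],
                       dev_accuracy: Callable[[str], float],
                       n_generate: int = 25,
                       n_keep: int = 15) -> List[str]:
    # 1. Show the LLM the existing human-designed agent specs as context,
    # 2. and ask it to propose more diverse, higher-quality designs.
    candidates = generate_specs(human_agents, n_generate)
    # 3. Keep only the best-performing new designs, ranked on a held-out set.
    ranked = sorted(candidates, key=dev_accuracy, reverse=True)
    return ranked[:n_keep]
```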

The result?

The group mixing human-designed and LLM-generated Agents performed 1.2% higher than the purely human-designed group.

What did the LLM-generated Agents look like? For example:

  • Plan-Verify-Refine: Plan first, then execute (code or search), then verify and optimize.

  • SearchThenCode: Forces a search step first, then code execution.

  • Debate-CrossExam: Simulating a debate between a proposer and a skeptic to guide tool usage.
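As an illustration only, the first of these strategies might be wired up roughly as below; all of the helpers (`plan`, `execute_step`, `synthesize`, `verify`, `refine`) are hypothetical, not the paper's code.

```python
from typing import Callable, List

def plan_verify_refine(question: str,
                       plan: Callable[[str], List[str]],             # break the task into steps
                       execute_step: Callable[[str], str],           # run one step via code or search
                       synthesize: Callable[[str, List[str]], str],  # draft an answer from step results
                       verify: Callable[[str, str], bool],           # check the draft
                       refine: Callable[[str, str], str],            # improve a failed draft
                       max_revisions: int = 3) -> str:
    steps = plan(question)                            # plan first
    results = [execute_step(step) for step in steps]  # execute with code or search
    draft = synthesize(question, results)             # draft an initial answer
    for _ in range(max_revisions):                    # then verify and optimize
        if verify(question, draft):
            break
        draft = refine(question, draft)
    return draft
```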

These strategies were completely different from the human-designed ones, indicating that the LLM possesses certain Meta-Agent design capabilities.

Final Thoughts

The paths of OpenAI o1 and DeepSeek R1 involve having a single model engage in deep thought, which essentially still scales within the same reasoning framework.

TUMIX shows us that using diverse Agents and mixing tools can achieve better results at a lower cost.

Furthermore, LLMs can design stronger Agent architectures, meaning future AI systems may optimize their own workflow without human intervention.

Main Tag: Multi-Agent Systems

Sub Tags: LLMs, Test-Time Scaling, AI Research, Google DeepMind

