Haozhen Zhang is a first-year Ph.D. student at Nanyang Technological University (NTU); this work was completed during his internship at the University of Illinois Urbana-Champaign (UIUC). Tao Feng is a second-year Ph.D. student at UIUC, and Jiaxuan You is an Assistant Professor in the UIUC Computer Science Department. The team has worked extensively on LLM routing, producing representative results such as GraphRouter, FusionFactory, and the Router-R1 discussed here.
"If a question can be answered by a small model, why deploy a more expensive large model to think about it?"
In the age of explosive growth in Large Language Models (LLMs), this seemingly simple question is becoming a crucial bottleneck in AI system design. Balancing performance, latency, and cost requires intelligently distributing tasks among different LLMs, presenting a new challenge for AI infrastructure.
Recently, a research team from the University of Illinois Urbana-Champaign (UIUC) published their new work at NeurIPS 2025: "Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning." This paper introduces Router-R1, the first multi-round LLM Router framework, enabling LLMs not just to "answer," but to "think, dispatch, and coordinate other models" to achieve a controllable balance between performance and cost.
- Paper Title: Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning
- Authors: Haozhen Zhang, Tao Feng, Jiaxuan You
- Institution: University of Illinois at Urbana-Champaign
- Paper Link: https://arxiv.org/abs/2506.09033
- Code Link: https://github.com/ulab-uiuc/Router-R1
🧭 Background: From "One Model Answers All" to "Intelligent Scheduling"
ChatGPT, Claude, Gemini, Qwen, LLaMA... In just two years, the LLM family has grown from a handful to hundreds of models. Different models have varying strengths: some excel at logical reasoning, others are precise in knowledge retrieval, and some offer quick response times and lower costs.
However, most current AI applications rely on single-model inference: a user query is sent directly to one fixed LLM for answering. While simple, this approach wastes compute on easy questions, while complex, multi-hop problems may be answered incorrectly because the fixed model lacks the required capability.
Consequently, the "LLM Router" has emerged and is becoming the new front-end brain of AI systems. Unlike Token-level Routers (like MoE), the LLM Router operates at the Query-level. It assesses the complexity of a question, matches it to the most suitable model, and can even dynamically combine multiple models to complete the inference.
Yet, existing LLM Routers (like GraphRouter, RouterDC) typically use a single-round decision mechanism: a given question is routed to only one candidate model for a complete answer. This single-round routing struggles with complex tasks requiring multi-hop reasoning or cross-domain knowledge.
🚀 Router-R1: Turning the Router Itself into a "Thinking LLM"
The core innovation of Router-R1 is transforming the router itself into a Policy LLM equipped with reasoning capabilities.
Router-R1 is thus no longer merely a query dispatcher, but an intelligent agent with its own chain of thought that actively thinks, selects models, and aggregates their outputs. It can switch repeatedly among thinking, routing, and aggregation across multiple rounds, iteratively constructing the final answer:
- 1️⃣ Think: Upon receiving a User Query, Router-R1 first executes the "Think" stage to perform internal reasoning analysis and determine if external information is needed for assistance.
- 2️⃣ Route: If external information is required, Router-R1 triggers the "Route" instruction to dynamically call suitable external candidate models (e.g., Qwen, LLaMA, Gemma, Mixtral) based on each LLM's Descriptor Prompt to answer subproblems.
- 3️⃣ Aggregate: The responses from the external models are returned, inserted into the Policy LLM's Evolving Context for aggregation, and the multi-round reasoning continues iteratively to generate the final answer.
This alternating "Think–Route–Aggregate" mechanism allows Router-R1 to fully leverage the complementary strengths of different LLMs (e.g., one excelling at mathematical reasoning, another at knowledge retrieval), potentially achieving true multi-model collaborative reasoning. A minimal sketch of the loop follows.
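To make the loop concrete, here is a minimal Python sketch of how such a Think–Route–Aggregate controller could be wired up. The tag names (`<route>`, `<response>`, `<answer>`), the `policy_llm` and `call_candidate` callables, and the candidate pool are illustrative assumptions, not the paper's exact syntax; the real prompt templates live in the Router-R1 repository.

```python
import re

# Hypothetical candidate pool: model name -> short descriptor prompt.
# The real descriptor prompts in the Router-R1 repo are richer and model-specific.
CANDIDATES = {
    "qwen-7b": "Strong at mathematical and logical reasoning.",
    "llama-8b": "Broad factual knowledge; reliable for knowledge retrieval.",
}

def route_and_aggregate(query, policy_llm, call_candidate, max_rounds=4):
    """Sketch of the Think-Route-Aggregate loop.

    policy_llm(context) -> str: one generation step of the router policy.
    call_candidate(name, subquery) -> str: answer from an external LLM.
    """
    context = f"Question: {query}\n"
    for _ in range(max_rounds):
        step = policy_llm(context)  # Think: internal reasoning, possibly a route request
        context += step
        route = re.search(r"<route>(.+?):(.+?)</route>", step, re.DOTALL)
        if route:
            name, subquery = route.group(1).strip(), route.group(2).strip()
            response = call_candidate(name, subquery)  # Route: call an external model
            context += f"\n<response>{response}</response>\n"  # Aggregate into context
            continue
        answer = re.search(r"<answer>(.+?)</answer>", step, re.DOTALL)
        if answer:
            return answer.group(1).strip()  # final answer reached
    return None  # round budget exhausted without an answer
```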
🎯 Using Reinforcement Learning to Balance Performance and Cost
Router-R1 formalizes the entire multi-round routing process as a sequential decision problem. Reinforcement Learning is used to train the Router to optimize the Performance-Cost Trade-off within a complex decision space. The paper designs three intuitive reward functions:
1️⃣ Format Reward: Output Format Correctness Reward
Ensures the model output strictly adheres to format constraints such as <think> and <answer>, preventing the generation of invalid text during early training.
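As an illustration, a format check of this kind might look like the following; the exact structural constraints (how many tags, in what order) are assumptions, not the paper's precise rule.

```python
import re

def format_reward(output: str) -> float:
    """1.0 if the output has well-formed tags, else 0.0 (assumed constraint:
    at least one <think> block and exactly one <answer> block)."""
    has_think = re.search(r"<think>.+?</think>", output, re.DOTALL) is not None
    answers = re.findall(r"<answer>.+?</answer>", output, re.DOTALL)
    return 1.0 if has_think and len(answers) == 1 else 0.0
```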
2️⃣ Final Outcome Reward: Result Correctness Reward
Uses the Exact Match (EM) metric to measure whether the generated answer is completely identical to the ground truth, directly incentivizing the Router to output correct results.
$$R_{\text{outcome}} = \mathrm{EM}(y_{\text{pred}}, y_{\text{gold}}) = \begin{cases} 1, & y_{\text{pred}} = y_{\text{gold}} \\ 0, & \text{otherwise} \end{cases}$$
where $y_{\text{pred}}$ is the LLM's prediction and $y_{\text{gold}}$ is the ground truth.
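A minimal EM reward in code, with light normalization (lowercasing and whitespace collapsing are assumptions here; QA-style EM often also strips punctuation and articles):

```python
def exact_match_reward(pred: str, gold: str) -> float:
    """1.0 iff the normalized prediction equals the normalized ground truth."""
    norm = lambda s: " ".join(s.lower().split())
    return 1.0 if norm(pred) == norm(gold) else 0.0
```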
3️⃣ Cost Reward: Cost Constraint Reward
Router-R1 innovatively introduces a computational cost reward mechanism, designing an inversely proportional reward function based on the parameter size and output token count of the called model:
$$\mathrm{Cost} = \sum_{i} c(p_i)\, n_i, \qquad R_{\text{cost}} \propto \frac{1}{\mathrm{Cost}}$$
where $c(\cdot)$ is the per-token unit cost function of the API service, $p_i$ is the parameter size of the $i$-th external model called, and $n_i$ is its number of output tokens. This mechanism encourages Router-R1 to weigh performance against cost when answering questions, achieving controllable, dynamic cost-aware routing and inference.
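One way to instantiate "reward inversely proportional to parameter-size-weighted token cost" is sketched below; the exact functional form and normalization used in the paper may differ.

```python
def cost_reward(calls, cost_per_token) -> float:
    """Inverse-cost reward over one episode's routed calls.

    calls: iterable of (param_size, output_tokens) pairs, one per external call.
    cost_per_token: maps a model's parameter size to its per-token unit cost.
    """
    total_cost = sum(cost_per_token(p) * n for p, n in calls)
    return 1.0 / (1.0 + total_cost)  # assumed inverse form, bounded in (0, 1]
```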
The total reward for Router-R1 combines the three components:
$$R = R_{\text{format}} + R_{\text{outcome}} + \alpha \cdot R_{\text{cost}}$$
where the hyperparameter $\alpha$ controls the degree of the performance-cost trade-off.
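Putting the three components together under the same assumptions (reusing the component sketches above), with α = 0 recovering the performance-only setting evaluated below:

```python
def total_reward(output, pred, gold, calls, cost_per_token, alpha=0.5):
    """Combine format, outcome, and cost rewards; alpha sets the trade-off."""
    return (format_reward(output)
            + exact_match_reward(pred, gold)
            + alpha * cost_reward(calls, cost_per_token))
```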
🧪 Leading Across Seven Benchmarks: Higher Accuracy and Stronger Generalization
The research team systematically evaluated Router-R1 on seven QA benchmarks covering both single-hop and multi-hop reasoning: NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, Musique, and Bamboogle. Router-R1 was trained only on NQ and HotpotQA, with the remaining datasets used for out-of-domain evaluation.
As shown above, when α = 0 (optimizing only for performance, ignoring cost), Router-R1 achieved the strongest overall performance across all datasets, outperforming single-round routing methods such as GraphRouter and RouterDC and demonstrating strong generalization on unseen datasets.
As shown above, when the hyperparameter α was swept to explore the performance-cost trade-off, invocation cost decreased significantly as α increased, pointing toward a new paradigm of cost-controllable, intelligent LLM scheduling.
Furthermore, to test Router-R1's generalization to new candidate LLMs, the team added external models that were not involved in training. As shown above, performance remained stable and even improved without any retraining, demonstrating Router-R1's strong zero-shot transferability.
🧩 Summary: Towards the Era of "Multi-Model Collaborative Agents"
Router-R1 is not just "another bigger model"; it represents a new paradigm in which multiple models work collaboratively. Through reinforcement learning, Router-R1 evolves the LLM from a "single responder" into a "multi-agent coordinator," dynamically balancing performance and cost. This lets Router-R1 maintain high-quality output while reducing compute and cost overheads, easing the environmental and resource pressure of deploying large models. Router-R1 also naturally supports model reuse and modular composition: a new model can be integrated simply by adding its description (illustrated below), laying the foundation for scalable, multi-model AI infrastructure.
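For instance, under the hypothetical candidate pool from the earlier sketch, onboarding a new model could be as small as registering its descriptor; no retraining of the router policy is implied:

```python
# Hypothetical: extend the candidate pool with a descriptor only
# (the CANDIDATES dict comes from the earlier Think-Route-Aggregate sketch).
CANDIDATES["mixtral-8x7b"] = (
    "Sparse mixture-of-experts model; strong on multilingual and coding queries."
)
```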
It is noteworthy that the latest GPT-5 technical report also explicitly confirms the adoption of an LLM Router mechanism for dynamic scheduling of different model versions. This further validates the trend represented by Router-R1: multi-model collaborative routing will become an indispensable underlying infrastructure for the future LLM ecosystem.