Did "More is Better" Fail? ModelSwitch Jumps Out of the Sampling Black Hole, Rewriting the LLM Inference Paradigm


As large language models (LLMs) develop rapidly, how to further improve their performance has become a central question for researchers. Many recent works build on a "repeated sampling and voting" framework, drawing a large number of samples at test time to improve answer accuracy. A single question can require hundreds or even thousands of samples, leading to enormous computational overhead. This raises an obvious question: do we really need so many samples?

The ModelSwitch strategy introduced in this article seeks a balance between performance and efficiency. Rather than blindly increasing the number of samples drawn from a single model, it allocates the sampling budget across multiple LLMs and exploits their potentially complementary strengths.


▲ Figure 1. Performance comparison of ModelSwitch with Self-Consistency on Math and MathBench datasets

As shown in Figure 1, on the MATH dataset, ModelSwitch (combining GPT-4o mini and Gemini 1.5 Flash) reached 81% accuracy with just 35 samples. This not only beats the stronger of the two models, Gemini 1.5 Flash, which reached 79.8% accuracy with Self-Consistency using as many as 512 samples, but also amounts to an efficiency gain of up to 14x.

On the MathBench dataset, ModelSwitch (combining Gemma-2-9B-It and Llama-3.1-8B-Instruct) reached 75% accuracy with only 48 samples, surpassing the 73.7% that the stronger Gemma-2-9B-It achieved with Self-Consistency at 512 samples, a roughly 10-fold efficiency gain.


Paper Title: Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute

Paper Link: https://arxiv.org/abs/2504.00762

Project Code: https://github.com/JianhaoChen-nju/ModelSwitch


Detailed Explanation of ModelSwitch Algorithm Mechanism

What is the core mechanism of ModelSwitch? The answer: use the consistency of the answers a model generates as the signal for intelligently switching between models. This design rests on a key empirical observation: a model's accuracy is often closely tied to how consistent its generated answers are.

Intuitively, when a model produces a spread of highly inconsistent answers to a question, it is "uncertain" about that question, and its likelihood of being correct is naturally low.

When ModelSwitch detects this uncertainty signal, it does not force the current model to continue; instead, it decisively switches to another LLM, on the chance that the next model knows something the previous one did not. If a subsequent model produces highly consistent answers, the probability of arriving at the correct solution rises significantly.


▲ Figure 2. Schematic diagram of ModelSwitch working between two LLMs

As Figure 2 illustrates, ModelSwitch lets the LLMs generate their pre-allocated numbers of samples one model at a time. If all answers produced by the current model are perfectly consistent, the algorithm adopts that answer and terminates the whole process early, saving the compute that would have gone to subsequent models.

If the current model's answers are inconsistent, however, the algorithm hands the query to the next model for further sampling, continuing until some model produces perfectly consistent answers. If none does, or once every model has been sampled, the answers from all models are aggregated.
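To make the control flow concrete, here is a minimal Python sketch of this switching loop. The `model.generate` interface and per-model budgets are assumptions for illustration, and the simple majority-vote fallback stands in for the weighted voting described below; the authors' actual implementation lives in the linked repository.

```python
from collections import Counter

def model_switch(query, models, samples_per_model):
    """Sketch of the switching loop: query each LLM in turn, stop early as
    soon as one model's sampled answers are perfectly consistent, otherwise
    aggregate everything collected so far."""
    collected = []  # answers gathered from every model queried so far
    for model, n in zip(models, samples_per_model):
        # `model.generate` is a hypothetical interface returning the final
        # answers extracted from n sampled responses for `query`.
        answers = model.generate(query, n=n)
        collected.extend(answers)

        if len(set(answers)) == 1:
            # Perfect agreement: adopt this answer, skip the remaining models.
            return answers[0]

    # No model was perfectly consistent: aggregate over all collected answers.
    # (Shown here as simple majority voting; the paper uses the weighted
    # voting described below.)
    return Counter(collected).most_common(1)[0][0]
```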

This dynamic switching not only aims to improve the accuracy of the final answer but, more importantly, it also significantly reduces unnecessary computational costs. When aggregating answers, ModelSwitch employs a weighted voting algorithm.


The weighted voting algorithm combines two kinds of weights: first, each model's answer consistency on the current query, measured by the entropy of its answer distribution (higher consistency means lower entropy and a larger weight); second, the model's prior performance. This lets the vote capture a model's confidence on the specific problem while also accounting for its historical track record.
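Below is a hedged sketch of how such a two-factor weight could be computed: each model's vote is scaled by an entropy-based consistency term and a prior performance score. The exact functional form used in the paper may differ; this only illustrates the idea.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy of one model's answer distribution on a single query;
    lower entropy means the model is more internally consistent."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def weighted_vote(per_model_answers, prior_scores):
    """Aggregate answers across models. Each model's contribution is scaled by
    (a) a consistency weight derived from its answer entropy on this query and
    (b) a prior performance score (e.g. accuracy on a held-out set).
    The combination shown here is illustrative, not the paper's exact formula."""
    tally = Counter()
    for answers, prior in zip(per_model_answers, prior_scores):
        consistency = 1.0 / (1.0 + answer_entropy(answers))  # lower entropy -> larger weight
        for ans, cnt in Counter(answers).items():
            tally[ans] += consistency * prior * cnt / len(answers)
    return tally.most_common(1)[0][0]
```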


Performance Evaluation

So how does ModelSwitch fare in broader practical tests? The research team evaluated it extensively on seven diverse, challenging datasets covering mathematical reasoning (GSM8K, MATH, MathBench, AIME24), knowledge and domain understanding (MMLU-Pro), symbolic reasoning (DATE), and multilingual tasks (MGSM).

The experiments utilized various closed-source LLMs including GPT-4o mini, Gemini 1.5 Flash, Claude 3 Haiku, GPT-4o, Gemini 1.5 Pro, and several open-source LLMs including Llama-3.1-8B-Instruct, Gemma-2-9B-It, Qwen2.5-7B-Instruct, and Llama-3.1-70B-Instruct.

The main comparisons were made against the single-LLM repeated sampling-voting method Self-Consistency and several advanced multi-agent debate methods including MAD, ChatEval, AgentVerse, and MOA.

Several key findings from the experimental results highlight the value of ModelSwitch:

First, a finding consistent across all experiments is a universal, strong positive correlation between the consistency of model-generated answers (measured by entropy, where lower entropy indicates greater consistency) and the accuracy of the final answers.

As shown in Figure 3, answer entropy and accuracy are significantly negatively correlated, with correlation coefficients |r| often above 0.8 and high statistical significance (p < 0.001). This pattern holds across models and datasets, giving solid empirical grounding to ModelSwitch's use of consistency as its core decision signal: consistency tends to mean correctness, while confusion tends to mean error.
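Readers who want to reproduce this kind of check on their own samples can correlate per-question answer entropy with majority-vote correctness, for example with scipy. The data layout below (a list of sampled answers per question plus gold answers) is assumed for illustration, not taken from the paper's evaluation code.

```python
import math
from collections import Counter

from scipy.stats import pearsonr  # pip install scipy

def entropy_accuracy_correlation(per_question_answers, gold_answers):
    """Correlate answer entropy (consistency) with majority-vote correctness
    across questions. A strongly negative Pearson r mirrors the pattern in
    Figure 3; this is an illustrative sketch, not the paper's exact script."""
    entropies, correct = [], []
    for answers, gold in zip(per_question_answers, gold_answers):
        counts = Counter(answers)
        total = len(answers)
        entropies.append(-sum((c / total) * math.log(c / total) for c in counts.values()))
        correct.append(1.0 if counts.most_common(1)[0][0] == gold else 0.0)
    r, p = pearsonr(entropies, correct)
    return r, p
```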


▲ Figure 3. Correlation between answer consistency (entropy) and accuracy for six common LLMs on MATH and MathBench

Second, compared with single-model Self-Consistency, ModelSwitch shows advantages in both performance and efficiency. As Figure 4 shows, across all datasets, ModelSwitch with a two-LLM combination (Gemini 1.5 Flash and GPT-4o mini) consistently outperformed single-model Self-Consistency.

For instance, when the sampling budget increased from 1 to 16 samples, ModelSwitch's accuracy on MathBench improved by 7 percentage points (from 72.7% to 79.7%), well beyond the gains Self-Consistency brought to the single models: 2.6 percentage points for Gemini 1.5 Flash (from 72.7% to 75.3%) and 1 percentage point for GPT-4o mini (from 71.7% to 72.7%).

At the same time, ModelSwitch saves 34% of the samples on average, noticeably reducing API call costs and compute. Moreover, through ModelSwitch, combinations of smaller models can surpass a single larger model: on GSM8K, for example, ModelSwitch outperformed both GPT-4o and Gemini 1.5 Pro.


▲ Figure 4. Performance comparison of ModelSwitch using GPT-4o mini and Gemini 1.5 Flash combination vs. both models individually using Self-Consistency

Furthermore, against mainstream multi-agent debate methods, ModelSwitch also showed superior overall performance. As shown in Figure 5, under a uniform, fair sampling budget of 15 samples, ModelSwitch surpassed five complex multi-agent debate frameworks on multiple datasets.

Especially on the highly challenging MMLU-Pro dataset, ModelSwitch achieved an accuracy of 63.2%, which is a full 10.2 percentage points higher than the best-performing single LLM (53%) and significantly better than MAD (47.6%) and MOA (52.6%).

The reason is that ModelSwitch relies on a concise switching mechanism, which avoids the error propagation that can occur in complex multi-agent interactions.


▲ Figure 5. Performance comparison of ModelSwitch and multi-agent debate methods


Analysis of Factors Affecting ModelSwitch Performance

The experiments also examined how the number and ordering of LLMs affect ModelSwitch's performance. As shown in Figure 6, the largest performance gain typically comes from going from one LLM to two; adding further LLMs yields diminishing returns, with performance plateauing or even declining slightly.

This suggests that selecting a small number of LLMs (typically two) with comparable performance and good diversity is often the key to getting the best results from ModelSwitch.

As for model ordering, placing models from strongest to weakest usually improves overall efficiency because consensus is reached earlier. ModelSwitch is nevertheless robust to ordering: even a weakest-to-strongest arrangement causes no sharp drop in final performance.


▲ Figure 6. Impact of Model Count and Order on ModelSwitch Performance

Finally, ModelSwitch can also be combined with stronger verification mechanisms for a further jump in performance. As shown in Figure 7, when ModelSwitch is paired with a reward-model-based Best-of-N selection strategy (RM-BoN) built on a strong reward model such as Qwen2.5-MATH-RM-72B, its performance improves further.

On the MATH dataset, the accuracy after combining with RM-BoN increased from 80% with majority voting to 84%. Furthermore, the ModelSwitch+RM-BoN combination still outperformed the strategy of combining the best single LLM with RM-BoN.
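As a rough illustration of how the two pieces fit together, the sketch below collects candidate responses ModelSwitch-style and then lets a reward model pick the best one. The `model.generate` and `reward_model.score` interfaces are hypothetical placeholders standing in for the actual models (e.g. Qwen2.5-MATH-RM-72B), not the paper's implementation.

```python
def model_switch_rm_bon(query, models, samples_per_model, reward_model):
    """Sketch of ModelSwitch + RM-BoN: sample from each model in turn with
    early exit on perfect consistency, then select the final answer with a
    reward-model Best-of-N pass over all collected candidates.
    All interfaces here are illustrative placeholders."""
    candidates = []
    for model, n in zip(models, samples_per_model):
        responses = model.generate(query, n=n)
        candidates.extend(responses)
        if len(set(responses)) == 1:  # perfectly consistent -> stop sampling
            break
    # Reward-model Best-of-N: keep the highest-scoring candidate.
    return max(candidates, key=lambda r: reward_model.score(query, r))
```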


▲ Figure 7. Performance comparison of ModelSwitch and single models combined with reward models as verification mechanisms


Paper Summary

ModelSwitch is a simple, efficient strategy that requires no additional training or complex model fusion. Through a dynamic model switching mechanism based on answer consistency, it cleverly leverages the complementary advantages of multiple LLMs during test-time computation, significantly improving overall performance and computational efficiency across various benchmarks.

The core mechanism of this method is based on the empirical observation of a strong correlation between model answer consistency and accuracy, and is supported by solid theoretical analysis.

Overall, ModelSwitch provides a simple, universal, and highly effective solution for how to efficiently scale the computational capabilities of large language models during inference.
