Introduction
Advances in reasoning capabilities have significantly boosted the performance of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) across a wide range of tasks. However, over-reliance on Chain-of-Thought (CoT) reasoning can degrade model performance, produce verbose outputs, and reduce efficiency.
Research shows that prolonged CoT reasoning does not always improve accuracy and can even weaken a model's ability to handle simple tasks. To address this, we propose the Certainty-Based Adaptive Reasoning (CAR) framework, which dynamically selects between short answers and detailed long-text reasoning based on the model's perplexity. It first generates a brief answer and evaluates its perplexity, triggering a reasoning process only when the model's confidence is low (i.e., perplexity is high).
Across multiple benchmarks, including multimodal visual question answering, key information extraction, and text reasoning, CAR outperforms methods relying solely on short answers or prolonged reasoning, achieving an optimal balance between accuracy and efficiency.
Paper Title:
Prolonged Reasoning Is Not All You Need: Certainty-Based Adaptive Routing for Efficient LLM/MLLM Reasoning
Paper Link:
https://arxiv.org/abs/2505.15154
Related Work
CAR is the first approach to automatically switch between long and short reasoning. The most closely related line of work is token reduction during reasoning, which aims to curb the computational overhead caused by generating excessive tokens in the reasoning process.
Concise Thoughts [1] limits generation with a fixed global token budget, while Token-Budget-Aware LLM Reasoning (TALE) [2] dynamically adjusts the token budget according to problem complexity.
However, these methods may introduce additional LLM calls or impose impractical token limits. Chain of Draft (CoD) [3] reduces verbosity by generating only minimal intermediate steps, substantially cutting the number of output tokens without compromising accuracy.
More recently, some works have also proposed parallel reasoning methods [4] and methods that trade interpretability for fewer predicted tokens [5,6].
Pilot Experiments
Pilot Experiment Setup
We conducted pilot experiments in the fields of text-intensive Visual Question Answering (VQA) and Key Information Extraction (KIE), selecting 8 representative datasets. These included VQA datasets: DocVQA, InfoVQA, ChartQA, VisualMRC (covering various types of visual text like documents, charts, infographics); and KIE datasets: SROIE, CORD, FUNSD, POIE (focusing on structured information extraction from receipts, tables, etc.).
Based on this data, we fine-tuned Qwen2.5-0.5B to evaluate performance on in-domain (DocVQA, ChartQA, etc.) and out-of-domain (POIE, InfoVQA, etc.) datasets. Models were required to generate two types of responses: short answers (prompt: “Please directly output the answer”) and long-text reasoning + answer (prompt: “Please output the reasoning process before outputting the answer”).
After evaluation, we compiled each dataset's answer accuracy and the corresponding perplexity (PPL), where a lower PPL indicates higher model confidence in the answer.
▲ Figure 1 Dataset PPL scores vs. accuracy
▲ Figure 2 Distribution of PPL and answer correctness across datasets
The experiments revealed a strong negative correlation between PPL and accuracy. At the dataset level (as shown in Figure 1), datasets with higher accuracy had lower average PPL.
Furthermore, as shown in Figure 2, within datasets, the average PPL score for correctly predicted examples was also lower than for incorrectly predicted examples.
The above experiments revealed the potential of PPL as a model confidence indicator. Therefore, we first propose a basic PPL-based dynamic reasoning decision: triggering long-text reasoning in low-confidence scenarios (PPL exceeding a threshold) to avoid hasty decisions; and directly outputting short answers in high-confidence scenarios to improve reasoning efficiency.
Specifically, we used the 75th percentile of the test set PPL distribution as the threshold to evaluate performance (as shown in Table 1). Experiments showed significant performance improvements for the model on most datasets.
▲ Table 1 Performance comparison when PPL is taken as the 75th percentile threshold
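As a minimal illustration of this pilot routing rule, the following sketch (assuming we already have the short-answer PPL values; `numpy` is used only for the percentile) marks which samples would be re-routed to long-text reasoning:

```python
import numpy as np

def route_by_ppl_threshold(ppl_scores, percentile=75):
    """Return a boolean mask: True = low confidence, trigger long-text reasoning."""
    threshold = np.percentile(ppl_scores, percentile)   # 75th percentile of the PPL distribution
    return [ppl > threshold for ppl in ppl_scores]       # high PPL = low confidence -> reason
```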
Method (Certainty-based Adaptive Reasoning)
Building on these exploratory findings, we develop Certainty-based Adaptive Reasoning (CAR), a dynamic reasoning-decision framework that uses perplexity (PPL) to adaptively switch between short answers and long-text reasoning during inference.
By avoiding redundant computation, CAR significantly improves both reasoning efficiency and accuracy. As shown in Figure 3(a), we first train Large Language Models (LLMs) or Multimodal Large Language Models (MLLMs) on examples containing both short answers and long-text reasoning responses.
Subsequently, using the PPL values computed on the training set, we estimate the PPL distributions of correct and incorrect short answers, and these distributions drive the decision. Specifically, if the estimated distributions indicate that the short answer is likely correct, the method outputs it directly; otherwise, it performs long-text reasoning. The inference process is shown in Figure 3(b).
▲ Figure 3 Schematic diagram of CAR model training and inference process
Model Training: We mix training examples that contain both short answers and long-text reasoning responses to build a new dataset.
To guide the model to generate short answers, the instruction “Please directly output the answer” is used; if a long-text answer with a reasoning process is required, the instruction “Please output the reasoning process before outputting the answer” is used.
Subsequently, a standard instruction fine-tuning process is adopted, where the model receives sequences composed of input text and output text, and the optimization objective is cross-entropy loss:
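In the standard token-level form (with input $x$, target output sequence $y = (y_1, \dots, y_T)$, and model parameters $\theta$):

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_{\theta}\left(y_t \mid x, y_{<t}\right)$$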
After model training is complete, short-answer inference is performed on all training samples to generate predicted answers and compute their perplexity values (PPL). The perplexity of a token sequence is defined as:
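In the usual formulation, for a generated sequence $y = (y_1, \dots, y_T)$ conditioned on input $x$:

$$\mathrm{PPL}(y \mid x) = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log P_{\theta}\left(y_t \mid x, y_{<t}\right)\right)$$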
Gaussian Distribution Modeling: Let the binary variable C denote whether the short answer is correct (C=1 for correct, C=0 for incorrect). Assume that the PPL distributions for both correct and incorrect answers follow a Gaussian distribution:
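In symbols, with $\mu_1, \sigma_1^2$ (resp. $\mu_0, \sigma_0^2$) denoting the mean and variance of the PPL of correct (resp. incorrect) short answers:

$$\mathrm{PPL} \mid C = 1 \sim \mathcal{N}(\mu_1, \sigma_1^2), \qquad \mathrm{PPL} \mid C = 0 \sim \mathcal{N}(\mu_0, \sigma_0^2)$$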
The probability density functions are:
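That is, the standard Gaussian densities:

$$p(\mathrm{PPL} \mid C = 1) = \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left(-\frac{(\mathrm{PPL} - \mu_1)^2}{2\sigma_1^2}\right), \qquad p(\mathrm{PPL} \mid C = 0) = \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left(-\frac{(\mathrm{PPL} - \mu_0)^2}{2\sigma_0^2}\right)$$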
Finally, the parameters are estimated from the training data (where n_1 and n_0 denote the numbers of correct and incorrect short answers in the training set, respectively):
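Using the standard maximum-likelihood (sample mean and variance) estimates over the training-set PPL values $\mathrm{PPL}_i$:

$$\hat{\mu}_1 = \frac{1}{n_1} \sum_{i: C_i = 1} \mathrm{PPL}_i, \quad \hat{\sigma}_1^2 = \frac{1}{n_1} \sum_{i: C_i = 1} \left(\mathrm{PPL}_i - \hat{\mu}_1\right)^2, \quad \hat{\mu}_0 = \frac{1}{n_0} \sum_{i: C_i = 0} \mathrm{PPL}_i, \quad \hat{\sigma}_0^2 = \frac{1}{n_0} \sum_{i: C_i = 0} \left(\mathrm{PPL}_i - \hat{\mu}_0\right)^2$$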
Inference Process: For a new input x, the inference steps are as follows:
1. Short Answer Inference: The model generates a short answer and calculates its corresponding PPL as PPL_new;
2. Probability Calculation: Based on Bayes' theorem, substitute PPL_new into the probability density function to calculate the posterior probability;
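In the standard Bayes form (with the priors $P(C=1)$ and $P(C=0)$ given below):

$$P(C = 1 \mid \mathrm{PPL}_{\mathrm{new}}) = \frac{p(\mathrm{PPL}_{\mathrm{new}} \mid C = 1)\, P(C = 1)}{p(\mathrm{PPL}_{\mathrm{new}} \mid C = 1)\, P(C = 1) + p(\mathrm{PPL}_{\mathrm{new}} \mid C = 0)\, P(C = 0)}$$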
Where the prior probabilities are:
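Estimated from the training-set counts:

$$P(C = 1) = \frac{n_1}{n_1 + n_0}, \qquad P(C = 0) = \frac{n_0}{n_1 + n_0}$$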
3. Decision Rule: If the probability of the short answer being correct is higher than its probability of being incorrect, output the short answer directly; otherwise, trigger the model's long reasoning.
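To make the routing concrete, here is a minimal sketch of the inference step in Python. It assumes the Gaussian parameters and priors above have already been estimated on the training set; the names `generate_short_answer`, `generate_long_reasoning`, and `compute_ppl` are hypothetical placeholders for the model's short-answer generation, long-form reasoning, and perplexity computation.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Gaussian density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def car_inference(x, params):
    """Certainty-based adaptive routing for one input x.

    params holds mu1, sigma2_1, mu0, sigma2_0, prior1, prior0 estimated
    on the training set from correct / incorrect short answers.
    """
    # Step 1: short-answer inference and its perplexity.
    short_answer = generate_short_answer(x)        # hypothetical model call
    ppl_new = compute_ppl(short_answer, x)         # hypothetical PPL computation

    # Step 2: class-conditional likelihoods and unnormalized posteriors (Bayes' rule;
    # the shared denominator cancels when comparing the two classes).
    post_correct = gaussian_pdf(ppl_new, params["mu1"], params["sigma2_1"]) * params["prior1"]
    post_incorrect = gaussian_pdf(ppl_new, params["mu0"], params["sigma2_0"]) * params["prior0"]

    # Step 3: decision rule - keep the short answer if it is more likely correct,
    # otherwise fall back to long-text reasoning.
    if post_correct >= post_incorrect:
        return short_answer
    return generate_long_reasoning(x)              # hypothetical model call
```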
Experimental Results
5.1 Implementation Details
We used Qwen2-VL-7B-Instruct as the multimodal large language model, and Qwen2.5-7B-Instruct and Llama3.1-8B-Instruct as the large language models; the corresponding models are named CAR_Qwen2VL, CAR_Qwen2.5, and CAR_Llama3.1, respectively.
All models were trained for 3 epochs using the AdamW optimizer with a batch size of 32 and a learning rate of 1e-6. The maximum input and output sequence lengths were set to 4096 and 1024, respectively. Training was conducted on 8 NVIDIA A100 GPUs.
To eliminate randomness, no sampling was used at test time; all models generated with greedy decoding (beam size 1). Additionally, the maximum number of generated tokens was set to 1024, and the maximum number of input tokens was set to 4096.
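For reference, this decoding setup corresponds to a configuration like the following (a sketch assuming Hugging Face Transformers, which the source does not specify):

```python
from transformers import GenerationConfig

# Deterministic decoding: no sampling, beam size 1, capped output length.
gen_config = GenerationConfig(do_sample=False, num_beams=1, max_new_tokens=1024)
```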
To verify the effectiveness of our proposed method, we conducted experiments on three multimodal datasets: DocVQA, ChartQA, and FUNSD.
Unlike the pilot experiments in the previous sections, here we feed in image-modality inputs and evaluate performance with a multimodal large language model. Since these datasets lack reasoning-process annotations, we reused the reasoning-process data obtained from the pilot experiments.
Furthermore, we also evaluated the CAR method on text datasets, selecting three widely used reasoning datasets: mathematical reasoning datasets GSM8K and MathQA, and common sense reasoning dataset StrategyQA.
5.2 Multimodal Dataset Performance Comparison
Table 2 shows the performance on the multimodal datasets. First, the superior performance of CAR_Qwen2VL over the baseline variants demonstrates the effectiveness of using perplexity (PPL) as the indicator for reasoning-path selection.
In addition, CAR_Qwen2VL achieved the highest average accuracy of 77.9%, improving on the two Qwen2VL baselines by 2.8% and 5.5%, respectively.
Notably, our method also kept token usage low (86.9 tokens on average), only about 15% of the tokens consumed by the long-reasoning Qwen2VL baseline. These results demonstrate the practicality of CAR in multimodal scenarios.
▲ Table 2 Performance comparison on multimodal datasets
5.3 Text Dataset Performance Comparison
Tables 3 and 4 present the performance comparison for text-based reasoning tasks. The CAR method demonstrates robust performance. Specifically, when using the Qwen2.5-7B model, the average accuracy reached 81.1%, and with Llama3.1-8B, it reached 74.9%, both outperforming the short answer baseline models (55.8% and 51.5%) and the long-text reasoning models (75.0% and 70.8%).
Notably, compared with purely long-text reasoning, CAR reduced token usage by 45.1% (with Qwen2.5) and 45.6% (with Llama3.1), respectively. On the Qwen2.5 backbone, CAR also consistently outperformed the baseline variants, once again demonstrating the effectiveness of perplexity (PPL) as a path-selection metric.
Furthermore, CAR surpassed advanced token-reduction methods such as TALE and CoD. Specifically, on the Qwen2.5 model, CAR's average accuracy was 8.3% higher than TALE's and 6.9% higher than CoD's, while maintaining the lowest token usage (69.2 tokens).
Similarly, on the Llama3.1 model, CAR’s average accuracy was 6.6% higher than TALE and 5.5% higher than CoD, while generating the fewest tokens.
It is worth noting that CAR's adaptive routing was particularly effective on the MathQA dataset (e.g., 70.2% vs. CoD's 59.1% with Llama3.1, and 83.8% vs. CoD's 67.1% with Qwen2.5). A likely reason is that CAR eliminates unnecessary reasoning steps. This highlights CAR's practicality across different reasoning paradigms.
▲ Table 3 Performance comparison on text datasets (based on Qwen2.5 model)
▲ Table 4 Performance comparison on text datasets (based on Llama3.1 model)
5.4 Performance Comparison after Integrating the TALE Method
We further explored the feasibility of combining the CAR framework with token-reduction techniques such as TALE. By replacing the original reasoning processes with the short reasoning steps generated by TALE, we constructed CAR-TALE variants on Qwen2.5-7B and Llama3.1-8B.
The results in Table 5 and Table 6 show that on the Qwen2.5 model, after combining CAR with TALE, the average accuracy increased from 78.8% to 85.5% (+6.7%), and the number of generated tokens decreased from 127.8 to 111.3, achieving both performance and efficiency improvements.
On the Llama3.1 model, after combining with TALE, the average accuracy increased from 71.6% to 80.8% (+9.2%), validating the effectiveness of the integrated solution.
These experiments show that CAR and token-reduction techniques are complementary: combining the adaptive framework's dynamic path selection with token reduction further improves both the efficiency and the accuracy of large-model inference.
▲ Table 5 Performance comparison of CAR with TALE integration (based on Qwen2.5)
▲ Table 6 Performance comparison of CAR with TALE integration (based on Llama3.1)
Conclusion
We propose the Certainty-Based Adaptive Reasoning (CAR) framework, which dynamically switches between short answer and long-text reasoning modes based on model confidence.
By quantifying model confidence in answers using perplexity (PPL), CAR outputs short answers directly for efficiency when confidence is high, and triggers long-text reasoning for accuracy when confidence is low.
Experiments show that in multimodal (e.g., DocVQA, ChartQA) and text reasoning (e.g., GSM8K, MathQA) tasks, CAR reduces token usage by over 45% compared to pure long-text reasoning, with an average accuracy improvement of 6%-8%. It outperforms baseline methods on models like Qwen2.5 and Llama3.1, notably reducing redundant steps in mathematical reasoning tasks.
CAR challenges the prevailing assumption that "long-text reasoning necessarily performs better," offering a more flexible and efficient solution for large-model reasoning and pushing it toward smarter, more lightweight inference.
References
1. Nayab, Sania, et al. "Concise thoughts: Impact of output length on llm reasoning and cost." arXiv preprint arXiv:2407.19825 (2024).
2. Han, Tingxu, et al. "Token-budget-aware llm reasoning." arXiv preprint arXiv:2412.18547 (2024).
3. Xu, Silei, et al. "Chain of draft: Thinking faster by writing less." arXiv preprint arXiv:2502.18600 (2025).
4. Ning, Xuefei, et al. "Skeleton-of-thought: Large language models can do parallel decoding." Proceedings ENLSP-III (2023).
5. Hao, Shibo, et al. "Training large language models to reason in a continuous latent space." arXiv preprint arXiv:2412.06769 (2024).
6. Shen, Zhenyi, et al. "Codi: Compressing chain-of-thought into continuous space via self-distillation." arXiv preprint arXiv:2502.21074 (2025).