DeepSeek Accuracy and Efficiency Both Improved: Huawei & CAS Propose Chain-of-Thought "Early Exit" Mechanism

Long chains of thought enable large language models to reason, but overthinking can become a burden.

Huawei, together with the Institute of Information Engineering, Chinese Academy of Sciences, has proposed a new mechanism that allows large models to terminate thinking early to avoid this problem.

Using this method, the accuracy and efficiency of large models can be improved simultaneously without additional training.

src="http://mmbiz.qpic.cn/mmbiz_png/YicUhk5aAGtBComrB1IsiaZXelaXxYvtnIpjicokns9wxhevX5LORMhgKibSI5OKYMdkz2iaZFibe0RciawQKKeZQpqhw/640" alt="图片">

This method is called DEER, short for Dynamic Early Exit in Reasoning.

Its core is to find the critical point before the quality of reasoning information declines and promptly interrupt the large model's reasoning at this point.

Results across multiple reasoning benchmarks show that DEER is consistently effective on DeepSeek-series reasoning LLMs, reducing chain-of-thought generation length by an average of 31% to 43% while improving accuracy by 1.7 to 5.7 percentage points.

To date, DEER has also been verified on more reasoning models such as QwQ, Qwen3, and Nemotron, and remains consistently effective across 11 evaluation sets.

src="https://mmbiz.qpic.cn/mmbiz_png/YicUhk5aAGtBComrB1IsiaZXelaXxYvtnIBevnibyT7nJzepw8lUp4wJtpuooCQicwT64s0kKlkvU7wYVjCooibmZsg/640" alt="图片">

The critical point for stopping reasoning requires dynamic planning.

Intuitively, as the number of reasoning paths in the chain of thought increases, more information is available for generating conclusions.

If the critical point where the reasoning information becomes just sufficient can be identified (called Pearl Reasoning), and the model is forced to stop thinking further and output the conclusion directly at this point, both accuracy and efficiency can be achieved simultaneously.

The key to this research is finding such a pearl during the generation of long chains of thought.

To verify this motivation, the authors forced the model to switch from thinking to generating the answer directly at the transition points of each reasoning path. If the resulting answer was correct, the existence of this Pearl Reasoning was verified.

As shown in the figure below, approximately 75% of samples indeed contain such a pearl (i.e., early exit can still generate the correct answer), and even 36.7% of samples can obtain the correct answer with less than half of the original reasoning paths.

src="https://mmbiz.qpic.cn/mmbiz_png/YicUhk5aAGtBComrB1IsiaZXelaXxYvtnI8tEsGQOIJicgHwoWXK9niaEIefqn1LozPNmeOpYMKY6SoYRLI2TUhmicg/640" alt="图片">

Therefore, how to find Pearl Reasoning from a long chain of thought is a potential and valuable research topic for achieving efficient reasoning.

To this end, the authors analyzed, in preliminary experiments, the overthinking problem of reasoning models and explored the impact of static early exit on model performance. All experiments were conducted on DeepSeek-R1-Distill-Qwen-14B.

The authors first had the model perform complete reasoning on the test set (both the chain of thought enclosed in the think tags and the conclusion that follows), then kept the complete chain of thought and divided it into thought blocks at thought-transition points (marked by words such as "wait" or "alternatively").

For these samples, the authors kept different proportions (20%–90%) of the thought blocks and appended the end-of-thinking delimiter at each truncation point, forcing the chain-of-thought process to terminate and the final conclusion to be generated.
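A short sketch of this static truncation, reusing the `TRANSITION` split and variables from the snippet above; the 20%–90% keep ratios come from the text, while the prompt format is an assumption:

```python
def truncate_cot(question: str, blocks: list, keep_ratio: float) -> str:
    """Keep the first keep_ratio of thought blocks and append the end-of-thinking
    delimiter, forcing the model to produce its conclusion immediately."""
    k = max(1, round(len(blocks) * keep_ratio))
    return f"{question}\n<think>\n{''.join(blocks[:k])}\n</think>\n"

# Static early-exit settings explored in the preliminary experiment.
for ratio in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
    prompt = truncate_cot(question, TRANSITION.split(full_cot), ratio)
    # ...generate the conclusion from `prompt` and score it against the gold answer.
```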

Quantitative results show that, in a static early-exit setting using only 20% of the reasoning steps, 60.8% of the samples originally answered correctly on MATH-500 remained correct;

on the more difficult GPQA, 35.1% of samples remained correct.

src="https://mmbiz.qpic.cn/mmbiz_png/YicUhk5aAGtBComrB1IsiaZXelaXxYvtnIDCK4QyLdxsicAicm7j2H8aNGTcgwJTOUFys5SHiaYoUVFc6CcFC3m4nTA/640" alt="图片">

The figure below shows what proportion of originally incorrect answers can be corrected by exiting early at different positions.

For the MATH dataset, the highest error correction rate is achieved when exiting at 40% of the reasoning steps; while for the GPQA dataset, the best error correction rate is achieved when exiting at 50% of the reasoning steps.

src="https://mmbiz.qpic.cn/mmbiz_png/YicUhk5aAGtBComrB1IsiaZXelaXxYvtnIkUe86Jfg2g0GnVS4g0lMP10Nyp1ozl9libdf6qrzTibGzibyt0icsRIhMQ/640" alt="图片">

This suggests that the optimal early-exit point differs from problem to problem and is closely tied to the inherent difficulty of the problem itself.

Therefore, static early-exit strategies based on fixed heuristics are suboptimal. Motivated by this, the authors designed a dynamic early-exit mechanism that finds Pearl Reasoning, correcting errors and improving accuracy while reducing generation length.

So, how does DEER specifically work?

Three steps to determine when to exit reasoning

DEER considers the critical moments when the model switches its chain of thought during reasoning as opportunities for early exit, prompting the large model to stop thinking at these moments and generate tentative answers.

The confidence of each trial answer serves as a reference for the early exit decision in reasoning.

src="https://mmbiz.qpic.cn/mmbiz_png/YicUhk5aAGtBComrB1IsiaZXelaXxYvtnI0YHtmjU1lJpyaHPFfxH1q1gkj7vVjhiaVLjohYDYmic09MBicD0BzJ3qg/640" alt="图片">

Specifically, the DEER method includes three actions: Reasoning Transition Monitor, Trial Answer Inducer, and Confidence Evaluation.

The Reasoning Transition Monitor is inspired by the budget forcing technique: it treats words like "wait" and "alternatively" as critical thought-transition points and monitors for their appearance.

When a thought-transition point appears, it triggers the Trial Answer Inducer: the authors replace "wait" with a marker similar to "Final Answer:" to induce the model to immediately generate a trial answer.

This trial answer is then used by the third action, Confidence Evaluation:

If the confidence is high enough, the model is set to stop further thinking and generate the conclusion directly based on the already generated chain of thought;

Otherwise, the answer induction action is withdrawn, and reasoning continues along the original path.
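For concreteness, here is a simplified greedy-decoding sketch of this loop, assuming a Hugging Face R1-distilled model. The transition cues and the answer-inducing marker follow the description above; the 0.95 threshold is the value reported as robust later in this article; the confidence measure (mean token probability of the trial answer) and all helper names are assumptions rather than the authors' released implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # any R1-distilled reasoning model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

TRANSITIONS = ("wait", "alternatively")  # cues watched by the Reasoning Transition Monitor
INDUCER = "\n**Final Answer:**\n"        # marker swapped in for the cue to induce a trial answer
THRESHOLD = 0.95                         # confidence required to take the early exit

@torch.no_grad()
def greedy(prompt: str, max_new: int, stop: tuple = ()):
    """Greedy-decode a continuation; return the new text and the mean probability
    of its tokens (used here as the trial answer's confidence)."""
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    new_ids, probs, text = [], [], ""
    for _ in range(max_new):
        p = torch.softmax(model(ids).logits[0, -1].float(), dim=-1)
        nxt = int(p.argmax())
        new_ids.append(nxt)
        probs.append(float(p[nxt]))
        ids = torch.cat([ids, torch.tensor([[nxt]], device=ids.device)], dim=-1)
        text = tok.decode(new_ids, skip_special_tokens=False)
        if nxt == tok.eos_token_id or any(s in text.lower()[-20:] for s in stop):
            break
    return text, sum(probs) / max(len(probs), 1)

def deer(question: str, max_rounds: int = 32) -> str:
    """Dynamic early exit: at each thought transition, draft a trial answer and
    leave the chain of thought as soon as its confidence clears THRESHOLD."""
    ctx = f"{question}\n<think>\n"
    for _ in range(max_rounds):
        # 1) Reasoning Transition Monitor: think until the next transition cue.
        chunk, _ = greedy(ctx, max_new=1024, stop=TRANSITIONS + ("</think>",))
        ctx += chunk
        if "</think>" in chunk:
            break  # the model finished thinking on its own
        cut = max(ctx.lower().rfind(w) for w in TRANSITIONS)
        if cut < 0:
            continue  # decoding budget hit without a cue; keep thinking
        # 2) Trial Answer Inducer: replace the cue with an answer-inducing marker.
        trial, conf = greedy(ctx[:cut] + INDUCER, max_new=256)
        # 3) Confidence Evaluation: exit early only if the trial answer looks reliable.
        if conf >= THRESHOLD:
            return trial  # take the trial answer as the conclusion and stop thinking
        # Low confidence: discard the trial answer and keep reasoning from the cue.
    if "</think>" not in ctx:
        ctx += "\n</think>\n"
    conclusion, _ = greedy(ctx, max_new=512)  # conclusion from the full chain of thought
    return conclusion
```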

The figure below shows that the confidence of the trial answer in DEER can indeed reflect whether the already generated chain of thought is sufficient for the large model to produce the final answer.

It can be observed that when the model's reasoning process is incomplete or flawed, the trial answer often shows significantly lower confidence; conversely, when the reasoning is comprehensive and logically sound, the answer generated by the model has higher confidence.

src="http://mmbiz.qpic.cn/mmbiz_png/YicUhk5aAGtBComrB1IsiaZXelaXxYvtnInVhicutrfGydRUmNic6HwGVHCMr3hD6XAULfumviagJjBr5sSeQMyMA0g/640" alt="图片">

Intuitively, the answer induction and confidence evaluation in DEER introduce additional latency during inference, especially for code-generation tasks where the trial answers are themselves very long, which cuts into the efficiency gains obtained by shortening the chain-of-thought sequence.

To address this issue, the authors proposed a branch-parallel acceleration strategy to further resolve these efficiency limitations:

Multiple branches are linearized into a single sequence and generated in parallel using a specialized Causal Attention Mask;

Dynamic KV cache management is achieved through confidence-based pruning. This strategy lets the Trial Answer Inducer and Confidence Evaluation overlap in time with the ongoing generation of the reasoning chain, improving overall inference efficiency.
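The details of this mask are not reproduced in the article, but the idea can be sketched as follows: the trial-answer branch and the continued-reasoning branch are concatenated after the shared chain-of-thought prefix, and the mask lets each branch attend to the prefix and its own earlier tokens but not to the other branch. Below is a minimal, assumed construction, not the authors' code:

```python
import torch

def branch_parallel_mask(prefix_len: int, branch_lens: list) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for several branches linearized
    into one sequence after a shared prefix: every position attends causally to
    the shared chain-of-thought prefix and to its own branch, never across branches."""
    total = prefix_len + sum(branch_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Causal attention within the shared chain-of-thought prefix.
    mask[:prefix_len, :prefix_len] = torch.tril(torch.ones(prefix_len, prefix_len, dtype=torch.bool))
    start = prefix_len
    for blen in branch_lens:
        end = start + blen
        mask[start:end, :prefix_len] = True  # branch tokens see the whole shared prefix
        # Causal attention within the branch itself only.
        mask[start:end, start:end] = torch.tril(torch.ones(blen, blen, dtype=torch.bool))
        start = end
    return mask

# Example: a 6-token shared CoT prefix, a 3-token trial-answer branch,
# and a 4-token continued-reasoning branch decoded in the same forward pass.
m = branch_parallel_mask(6, [3, 4])
```

If the trial answer's confidence clears the threshold, the continued-reasoning branch and its KV cache entries can be dropped; otherwise the trial-answer branch is pruned, which is presumably what the confidence-based KV cache management described above refers to.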

src="http://mmbiz.qpic.cn/mmbiz_png/YicUhk5aAGtBComrB1IsiaZXelaXxYvtnI8FAG9IWgEe5PDUsHx48ibTgIhlT6pibp8AYMlSSD6grWtjaFX8MFM3dQ/640" alt="图片">

In addition, more discussion of end-to-end latency will be included in an upcoming version of the paper.

Making Reasoning Models Faster and Stronger

To verify DEER's performance, the authors evaluated it on six challenging reasoning benchmarks: three mathematical reasoning tasks (MATH-500, AMC 2023, AIME 2024), one scientific reasoning task (GPQA Diamond), and two code-generation tasks (HumanEval, BigCodeBench).

Accuracy and generation length were used as the evaluation metrics, measuring effectiveness and efficiency respectively. Experiments used DeepSeek-R1-Distill-Qwen models of different sizes (1.5B, 7B, 14B, 32B).

Experimental results show that DEER delivers strong gains across all model sizes and evaluation sets.

Numerically, DEER improves accuracy by an average of 1.7 to 5.7 points compared to the conventional Long CoT method, while reducing generation length by 31% to 43%.

On smaller models, DEER shows more significant improvements for the two slightly less difficult benchmarks, MATH-500 and AMC 2023.

On larger models, DEER shows more significant improvements for the two more challenging benchmarks, AIME 2024 and GPQA.

In particular, the method is most effective when the model's reasoning ability matches the difficulty of the problem.

src="https://mmbiz.qpic.cn/mmbiz_png/YicUhk5aAGtBComrB1IsiaZXelaXxYvtnITBOh7mYzKPlDqSnR3ARWxwcdCcgK1qChSicQEic23cTZgyNKk2TtpdTA/640" alt="图片">

On the two programming benchmarks, HumanEval and BigCodeBench, the method reduced generation length by an average of 64.9% while improving pass@1 by 2.1 points, and the results were robust to confidence thresholds around 0.95, without significant fluctuation.

src="https://mmbiz.qpic.cn/mmbiz_png/YicUhk5aAGtBComrB1IsiaZXelaXxYvtnIibnRro8LRItuSaicSNbp9Suz03g0icbw0HTkibicibfKjjlITCwGmTES66aA/640" alt="图片">

To further verify DEER's improvement in end-to-end inference efficiency, the authors measured the average inference latency per sample on the MATH and AMC datasets using Hugging Face Transformers.

The results show that even without using the branch-parallel decoding acceleration proposed by the authors, DEER already reduced inference latency by 43.4% to 47.3%.

After adopting branch-parallel decoding, the reduction in inference latency scales super-linearly with the reduction in sequence length.

src="https://mmbiz.qpic.cn/mmbiz_png/YicUhk5aAGtBComrB1IsiaZXelaXxYvtnItBUpmmxeMF3aA5kc2ia8YpJnYy38micc3ibsheyE70dUUkibVFDsiasSXZQ/640" alt="图片">

The authors also further proved the effectiveness of DEER through sample analysis.

When solving a problem, the original reasoning model tends to switch ideas and explore multiple solution methods; often, however, only one solution path is viable, and in the subsequent attempts the model makes mistakes and fails to reach the correct answer.

Then, to determine which of the two conflicting results is correct, the model performs endless self-checking and ultimately fails to provide an answer.

However, under the DEER working mode, this problem is effectively avoided.

src="https://mmbiz.qpic.cn/mmbiz_png/YicUhk5aAGtBComrB1IsiaZXelaXxYvtnIZolAUia7aIuycqUpuZqL3SMIuH2sR1j5YSicdSwpzTibDv9ibG8RsXvqvg/640" alt="图片">

Paper: https://arxiv.org/abs/2504.15895
Project: https://github.com/iie-ycx/DEER


