Meta Introduces Deep Think with Confidence: Boosting Reasoning Accuracy and Efficiency with Minimal Changes

In recent years, large language models (LLMs) have shown impressive performance on complex reasoning tasks, especially with test-time strategies that sample multiple reasoning chains and aggregate them by "self-consistency" majority voting, significantly improving answer accuracy. However, this "parallel thinking" carries a huge computational cost: generating hundreds or even thousands of reasoning paths per question makes token consumption grow linearly with the number of paths, which is nearly unsustainable in practical deployment. Worse still, the gains saturate or even decline as more paths are generated, yet traditional voting treats all paths equally and cannot distinguish high-quality reasoning from low-quality reasoning.

Paper: Deep Think with Confidence

Link: https://arxiv.org/pdf/2508.15260

It is against this backdrop that research teams from Meta AI and UCSD proposed Deep Think with Confidence (DeepConf), a simple yet powerful method that dynamically identifies and filters out low-confidence reasoning paths at test time. The approach improves both reasoning accuracy and efficiency, without extra training or hyperparameter tuning. This article provides a comprehensive walkthrough of the method, showing how it leverages the model's intrinsic "confidence" signal to achieve smarter, more efficient reasoning aggregation.

Why is "Deep Thinking with Confidence" Needed?

Traditional self-consistency methods, while effective, suffer from two major drawbacks:

1. Huge computational overhead: For example, on AIME 2025 math competition problems, raising the Qwen3-8B model's accuracy from 68% to 82% required generating 511 additional reasoning paths (512 in total), consuming hundreds of millions of tokens.

2. Diminishing returns: Generating more paths does not always lead to performance improvements; sometimes it can introduce noise, as low-quality paths might "bias" the voting results.

Previous work has attempted to use global confidence (e.g., the average confidence over an entire reasoning path) to filter paths, but this approach has two flaws:

1. Masking local errors: averaging over the whole path can hide severe uncertainty or mistakes at specific intermediate steps.

2. No early termination: the full path must be generated before its confidence can be computed, so generation cannot be stopped partway.

DeepConf's motivation is precisely to solve these problems: by leveraging finer-grained, local confidence signals to dynamically filter low-quality paths during or after generation, thereby achieving efficient and accurate reasoning.

How Does DeepConf Work?

I. Design and Understanding of Confidence Metrics

At the core of DeepConf is a series of innovative confidence measures that capture the quality of reasoning paths from different angles.

1. Token-level metrics:

Token Entropy: Measures the model's uncertainty about the next token; lower entropy means higher confidence. At position i,

$H_i = -\sum_{j} P_i(j) \log P_i(j)$

where $P_i(j)$ is the probability of the j-th candidate token at the i-th position.

Token Confidence: Defined by the authors as the negative average log-probability of the top-k candidate tokens:

$C_i = -\frac{1}{k}\sum_{j=1}^{k} \log P_i(j)$

Note: because log-probabilities are negative, $C_i$ is positive, and a distribution sharply peaked on the top candidates yields a larger value; higher values therefore indicate higher confidence. (Both token-level metrics are sketched in code below.)
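To make these definitions concrete, here is a minimal Python sketch of the two token-level metrics. It assumes you already have the per-position probability vector (e.g., from the top-k logprobs that serving engines such as vLLM can return); the function names and the choice of k are ours, not the paper's.

```python
import math

def token_entropy(probs):
    # H_i = -sum_j P_i(j) * log P_i(j) over the candidate distribution at position i
    return -sum(p * math.log(p) for p in probs if p > 0)

def token_confidence(probs, k=5):
    # C_i = -(1/k) * sum_{j<=k} log P_i(j), averaged over the top-k candidates;
    # a sharply peaked distribution yields a higher (more confident) value
    top_k = sorted(probs, reverse=True)[:k]
    return -sum(math.log(p) for p in top_k) / k
```

For instance, a peaked top-5 distribution [0.99, 0.0025, 0.0025, 0.0025, 0.0025] gives C ≈ 4.8, while a flat top-5 of [0.2] * 5 gives C ≈ 1.6, illustrating why higher values signal higher confidence.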

2. Trace-level metrics:

Average Trace Confidence: The average of all token confidences across the entire path. While commonly used, it can easily mask local errors.

3. Innovative metrics (key contributions):

Group Confidence: Divides the trace into fixed-length overlapping windows (e.g., groups of 1024 tokens) and calculates the average confidence within each group. This provides a smoother local signal.

Bottom-10% Group Confidence: Takes the average of the 10% of groups with the lowest confidence among all groups. This captures the weakest and most uncertain links in the reasoning.

Lowest Group Confidence: The confidence value of the group with the lowest confidence among all groups. This is the most extreme local quality indicator, very suitable for making early termination decisions during online generation.

Tail Confidence: Computes the average confidence over only the last fixed number of tokens in the trace (e.g., 2048 tokens), on the grounds that the quality of the final stretch of reasoning (the steps that produce the answer) is crucial. (A code sketch of these trace-level metrics follows below.)
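The trace-level and group-level metrics all reduce per-token confidences to a single score per trace. A minimal sketch, assuming `confs` is the list of token confidences for one trace; we use a stride-1 sliding window for the overlapping groups, which is one reasonable reading of the paper's "overlapping windows", and the function names are ours.

```python
from itertools import accumulate

def group_confidences(confs, window=1024):
    # Mean confidence over overlapping fixed-length windows (stride 1 here;
    # the exact overlap scheme is an implementation detail of the paper).
    if len(confs) <= window:
        return [sum(confs) / len(confs)]
    prefix = [0.0] + list(accumulate(confs))  # prefix sums for O(1) window means
    return [(prefix[i + window] - prefix[i]) / window
            for i in range(len(confs) - window + 1)]

def average_trace_confidence(confs):
    return sum(confs) / len(confs)            # global mean; can mask local errors

def bottom10_group_confidence(confs, window=1024):
    groups = sorted(group_confidences(confs, window))
    n = max(1, int(len(groups) * 0.10))
    return sum(groups[:n]) / n                # mean of the weakest 10% of groups

def lowest_group_confidence(confs, window=1024):
    return min(group_confidences(confs, window))  # single weakest group

def tail_confidence(confs, tail=2048):
    tail_part = confs[-tail:]
    return sum(tail_part) / len(tail_part)    # mean over the final tokens only
```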

Figure: Confidence distributions of correct and incorrect reasoning paths under different confidence measures.

II. Offline Inference Mode

In offline mode, all reasoning paths have been generated. DeepConf enhances the effect of majority voting through two strategies:

1. Confidence-Weighted Majority Voting: Instead of "one path, one vote," each path's voting weight is proportional to its confidence. High-confidence paths have a greater influence on the final result.

2. Confidence Filtering: Before weighted voting, a portion of paths is filtered out based on their confidence scores. The paper mainly explored two filtering ratios:

Top 10%: Only retains the top 10% most confident paths. An aggressive strategy aiming for high precision, but risks failure if the model is overconfident in incorrect answers.

Top 90%: Retains the top 90% most confident paths. A conservative strategy that filters out only the worst 10% while maintaining diversity, offering stability and reliability. (Both strategies are sketched in code below.)
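Both offline strategies fit in a few lines. A sketch under the assumption that each finished trace has already been reduced to an (answer, confidence) pair using one of the metrics above; the paper's actual pipeline and data structures may differ.

```python
from collections import defaultdict

def deepconf_offline_vote(traces, keep_ratio=0.9):
    # traces: list of (answer, trace_confidence) pairs.
    # Step 1: confidence filtering - keep the top `keep_ratio` most confident
    # traces (keep_ratio=0.1 is the aggressive Top10% setting, 0.9 the safe one).
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    kept = ranked[:max(1, int(len(ranked) * keep_ratio))]
    # Step 2: confidence-weighted majority voting over the survivors.
    weights = defaultdict(float)
    for answer, conf in kept:
        weights[answer] += conf
    return max(weights, key=weights.get)

# Example: the high-confidence traces agree on "42", so it wins the weighted vote.
print(deepconf_offline_vote([("42", 9.1), ("42", 8.7), ("17", 9.5), ("17", 1.2)]))
```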

Figure: Computing the various confidence metrics and performing filtering and weighted voting in offline mode.

III. Online Inference Mode

The goal of online mode is to assess path quality in real time during generation and to terminate early those paths that are doomed to fail, saving computation. Its core signal is the lowest group confidence metric.

Online DeepConf consists of two steps:

1. Offline Warmup: For a new problem, first generate a small number of complete reasoning paths (e.g., 16) as usual, compute each one's lowest group confidence, and derive a stopping threshold s from a chosen retention ratio η (10% or 90%). For example, DeepConf-low (η = 10%) sets s to the minimum confidence among the top 10% most confident warmup paths.

2. Adaptive Sampling: Begin large-scale path generation. Each time a token group (e.g., 2048 tokens) is generated, calculate the current group confidence.

If the current value is below threshold s, immediately terminate that path.

At the same time, whenever a path completes, check whether the share of completed paths agreeing on the most common answer exceeds a consensus threshold τ (e.g., 95%). If consensus is reached, stop generating and output that answer directly.

This design lets the online process approximately reproduce the effect of offline filtering while avoiding a large amount of unnecessary token generation. A sketch of the warmup step and the generation loop follows.
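The online procedure reduces to a warmup step that turns the retention ratio η into a concrete threshold, plus a generation loop with two stopping rules. The sketch below captures only the control flow; `generate_groups` and `extract_answer` are hypothetical hooks standing in for the serving engine, not part of the paper's code.

```python
def warmup_threshold(warmup_confs, eta=0.10):
    # warmup_confs: lowest-group-confidence of each warmup trace (e.g., 16 traces).
    # Keep the top eta fraction most confident traces and use their minimum as the
    # stopping threshold s (eta=0.10 -> DeepConf-low, eta=0.90 -> DeepConf-high).
    ranked = sorted(warmup_confs, reverse=True)
    kept = ranked[:max(1, int(len(ranked) * eta))]
    return min(kept)

def online_deepconf(generate_groups, extract_answer, threshold, tau=0.95, budget=512):
    answers = []
    for _ in range(budget):
        trace, killed = [], False
        # generate_groups() is assumed to start a fresh trace and yield
        # (group_tokens, group_confidence) as each token group completes.
        for group_tokens, group_conf in generate_groups():
            if group_conf < threshold:
                killed = True      # local confidence fell below s: abandon the path
                break
            trace.extend(group_tokens)
        if killed:
            continue
        answers.append(extract_answer(trace))
        # Consensus check: stop early once a tau share of completed paths agree.
        top = max(set(answers), key=answers.count)
        if answers.count(top) / len(answers) >= tau:
            break
    return answers
```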

Figure: Early-termination decisions during online generation, based on comparing real-time group confidence against the threshold.

How Effective is DeepConf?

Experimental Setup

Models: Covered advanced open-source models of various scales, such as DeepSeek-8B, Qwen3-8B/32B, GPT-OSS-20B/120B.

Benchmarks: Multiple high-difficulty mathematical reasoning datasets, such as AIME 2024/2025, HMMT 2025, BRUMO 2025, and GPQA-Diamond.

Baselines: Standard self-consistency + majority voting (Cons@K), and single-path accuracy (Pass@1).

Evaluation: All results averaged over 64 independent runs, reporting both accuracy and total tokens generated.

Offline Evaluation Results

Figure: Comprehensive comparison of confidence measures and filtering strategies across models and datasets with 512 paths.

Key findings:

Confidence-weighted + filtering almost universally outperforms ordinary voting. For example, on AIME25, GPT-OSS-120B using tail confidence + Top10% filtering achieved an astonishing 99.9% accuracy, compared to 97.0% for ordinary voting.

The Top10% filtering strategy usually brings the largest gains, but also risks performance degradation when the model is "confidently wrong" (e.g., GPT-OSS-120B on some datasets).

Top90% filtering is a very safe choice, almost always matching or slightly exceeding the accuracy of ordinary voting while preserving answer diversity.

Local confidence signals (tail, bottom-10%, and lowest-group) generally perform as well as or better than the global average, confirming the value of focusing on local quality.

Figure: DeepSeek-8B example showing the accuracy advantage of the Top10% and Top90% strategies over ordinary voting when filtering by lowest group confidence.

Online Evaluation Results

Figure: Token consumption and accuracy of DeepConf-high and DeepConf-low versus ordinary voting under a 512-path budget in online mode.

Key findings:

Remarkable efficiency gains: DeepConf-low reduced token consumption by 43-84% across models and datasets, with a maximum saving of 84.7% (GPT-OSS-120B on AIME25).

Accuracy maintained or even improved: In most cases, DeepConf achieved comparable or higher accuracy than the baseline while significantly saving tokens. For example, DeepSeek-8B saved 77.9% of tokens on AIME24, while accuracy increased by 5.8%.

Trade-offs between the two modes: DeepConf-low (η = 10%) pursues maximum efficiency but occasionally loses some accuracy to overly aggressive filtering; DeepConf-high (η = 90%) is more robust, holding accuracy steady with smaller efficiency gains (saving 18-59% of tokens).

Figure: Total tokens generated by DeepConf versus baseline methods across tasks for the GPT-OSS-120B model.

Figure: Accuracy as a function of generated tokens for DeepSeek-8B.

By cleverly exploiting the internal confidence signals of large language models, DeepConf offers an elegant and effective answer to the cost-benefit dilemma of test-time reasoning. It demonstrates that not all generated paths are equal, and that attending to the local quality of the reasoning process matters far more than looking only at the final answer. Beyond significantly improving the inference efficiency of advanced models, this research points the way toward more "self-aware", resource-efficient AI systems: models that gauge their own certainty while thinking and allocate computation accordingly.

Main Tag: Large Language Models

Sub Tags: AI Reasoning, Natural Language Processing, Deep Learning Research, Confidence Scores, Inference Optimization, Self-Consistency

