When Large Language Models (LLMs) tackle complex tasks like mathematical problems and logical reasoning, a very popular and effective method is called "Self-Consistency," often referred to as "parallel thinking." The logic is simple: instead of letting the model think only once, let it generate multiple solution processes (which we call "reasoning paths" or "traces") under sampling randomness (temperature > 0), then select the most frequent answer through "Majority Voting." This is similar to a student solving a problem many times and choosing the answer they arrived at most often, which usually leads to higher accuracy.
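To make the mechanics concrete, here is a minimal sketch of plain self-consistency voting; the `sample_answer` helper is a hypothetical stand-in for whatever sampling backend you use to draw one reasoning path and extract its final answer:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Plain self-consistency: every reasoning path gets exactly one vote."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical usage: sample_answer(question) would call an LLM with
# temperature > 0 and parse the final answer from one reasoning path.
# answers = [sample_answer(question) for _ in range(512)]
# print(majority_vote(answers))
```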
While this method has been useful, it has two significant drawbacks:
* Extremely High Cost: Suppose you need the model to generate 512 reasoning paths to improve accuracy by a small margin. Your computational cost (token consumption) is then roughly 512 times that of a single attempt, which is unsustainable in practical applications.
* Performance Bottleneck: "Majority voting" relies on a naive assumption that every reasoning path has equal "voting power." It's like a large village assembly where everyone gets one vote to solve a complex problem. However, this assembly includes thoughtful experts, random guessers, and even "disruptors" who misunderstand the problem. Because the rule is "one person, one vote," if enough "random guessers" and "disruptors" happen to guess the same wrong answer, their votes will drown out the correct opinions of the few experts. This is why, paradoxically, increasing the number of paths can sometimes lead to accuracy quickly saturating or even decreasing, as too much "noise" is introduced.
Researchers from Meta and UCSD have proposed a lightweight inference framework called Deep Think with Confidence (DeepConf), which effectively addresses this "expensive and inefficient" dilemma. It boosted GPT-OSS-120B's accuracy on AIME 2025 to an astonishing 99.9%, significantly higher than standard majority voting's 97.0% (paper: https://arxiv.org/abs/2508.15260).
DeepConf: The Model's "Confidence" is a Treasure Trove
DeepConf's starting point is very clever: can the model itself determine which reasoning path is of higher quality without adding external judges?
The answer is yes, through the model's "Internal Confidence Signals." When the model generates each word (token), it calculates a probability distribution over all words in its vocabulary.
* If the model is very certain about what the next word should be, this probability distribution will be very "sharp," concentrating on a few words (low entropy).
* If the model is very uncertain, it might consider several words equally likely, and the probability distribution will be relatively "flat" (high entropy).
DeepConf's core idea is: For a high-quality reasoning path, the model should be confident in most steps, and its overall "confidence" during the entire generation process should be generally high. Conversely, paths full of uncertainty and errors will inevitably show "hesitation" in certain parts, resulting in low confidence.
Key Concept: How to Quantify AI's "Confidence"?
To more accurately "diagnose" AI's thought process, researchers have explored various methods to measure its "confidence." This is a layered process, progressing from basic units to complex applications.
Step One: Defining the Most Basic "Token Confidence"
This is the foundation for all "confidence" calculations, defining the model's certainty when generating each token.
The token confidence at position i is defined as:

C_i = −(1/k) · Σ_{j=1}^{k} log P_i(j)

* C_i: the confidence score of the token generated at position i.
* P_i(j): the probability of the j-th most likely candidate token predicted by the model at position i.
* k: the number of top candidate tokens considered (e.g., k = 20).
* log P_i(j): the logarithm of that probability. Since probability values are between 0 and 1, the logarithm is negative.
* −(1/k) Σ: the formula sums the log-probabilities of the top k candidate tokens, averages them, and then takes the negative of that average.
Why calculate it this way? This formula is quite clever. When the model is very "confident," it assigns almost all of the probability (close to 1) to a single token, so the other top-k candidates receive tiny probabilities; their log-probabilities are large negative numbers, the average is strongly negative, and negating it produces a high confidence Ci. Conversely, if the model is "uncertain," it spreads similar, moderate probabilities over many candidate tokens; their log-probabilities are only mildly negative (around −log k), so the resulting confidence Ci is lower.
In simple terms, this formula converts the model's predicted probability distribution into an intuitive, numerical "confidence score." The higher the score, the more certain the model.
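As a concrete illustration, here is a minimal sketch of the token-confidence formula, computed from the top-k log-probabilities an inference engine can return; the function name and the toy numbers are my own:

```python
import math

def token_confidence(top_k_logprobs: list[float]) -> float:
    """C_i = -(1/k) * sum of the top-k candidates' log-probabilities.
    A sharply peaked distribution pushes most of these logprobs far below 0,
    so the negated average (the confidence) becomes large."""
    return -sum(top_k_logprobs) / len(top_k_logprobs)

# Confident step: one dominant candidate, the other 19 nearly impossible.
confident = [math.log(0.99)] + [math.log(0.0005)] * 19
# Uncertain step: probability spread evenly across 20 candidates.
uncertain = [math.log(0.05)] * 20

print(round(token_confidence(confident), 2))  # ~7.22 (high confidence)
print(round(token_confidence(uncertain), 2))  # ~3.0  (low confidence)
```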
Step Two: Starting with "Average Trace Confidence"
With individual token confidence, the most straightforward approach is to calculate the average score for the entire reasoning path.
* Average Trace Confidence: This is the most basic method, averaging the confidence of all tokens in a complete reasoning path. While effective, its drawback is that it "averages out" localized, critical reasoning failures, and it must wait for the entire path to be generated before calculation, preventing early termination.
Unlike previous methods that directly calculated the average confidence for the entire path (a global metric), DeepConf argues that this approach can mask problems: a path might look highly confident on average yet contain one critical step where confidence collapses, and the average would still be high. Therefore, it proposes a series of more refined local confidence metrics.
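Given per-token confidences (for example, from the `token_confidence` helper sketched earlier), the average trace confidence is simply their mean over the whole path; a minimal sketch:

```python
def average_trace_confidence(token_confidences: list[float]) -> float:
    """Global metric: mean confidence over every token in one reasoning path.
    A short low-confidence stretch gets diluted by the rest of the path."""
    return sum(token_confidences) / len(token_confidences)
```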
Step Three: Local Confidence Measurements
The group confidence formula calculates the average confidence for a small segment of continuous text (a "group"). The benefit is that it smooths out fluctuations in individual token confidence, giving a more stable picture of the model's state during a particular reasoning phase:

C_{G_i} = (1 / |G_i|) · Σ_{t ∈ G_i} C_t

* C_{G_i}: the confidence of the "group" ending with the i-th token.
* G_i: a sliding window containing n tokens (e.g., n = 2048).
* |G_i|: the number of tokens in this group.
* Σ_{t ∈ G_i} C_t: the sum of the confidences of all tokens in this group (each C_t computed by the token-confidence formula above).
* Group Confidence: This is a sliding window measurement. Instead of looking globally, it calculates the average confidence for a small segment of continuous tokens (e.g., 1024 tokens), which better captures local confidence fluctuations during the reasoning process.
* Tail Confidence: This metric is very targeted, focusing only on the confidence of the last part of the reasoning path (e.g., the last 2048 tokens). This is because success or failure often hinges on the final conclusion steps, and confidence at the end is crucial.
* Bottom 10% Group Confidence: This metric is very clever; it focuses on the average of the 10% lowest-confidence groups within a path. This is like finding the "weakest links in the chain," where a sharp drop in confidence often signals a breakdown in the reasoning chain.
* Lowest Group Confidence: This is the most extreme case, directly using the single lowest group confidence within the entire path to represent the quality of the whole path. This metric punishes "weak links" the hardest.
These metrics are like equipping the AI's thought process with microscopes of different focal lengths, capable of precisely capturing where it starts to get confused at different stages.
Using "Bottom 10%" or "Tail" confidence combined with filtering (especially retaining 10%) typically yields the highest accuracy.
DeepConf's Dual-Track Strategy: Offline and Online Thinking
Based on these confidence metrics, DeepConf has designed two very practical operating modes. You can imagine them as two different project management styles: one is a "hindsight expert" who reviews all proposals after they've been submitted, and the other is a "real-time supervisor" who stops unreliable proposals as the project progresses.
Offline Mode: Letting the Most "Confident" Reasoning Paths Decide
Offline mode is straightforward: once the model has generated all N reasoning paths, we post-process them. It primarily optimizes results through two key techniques (a minimal code sketch follows the list):
* Confidence-Weighted Majority Voting: This changes the traditional "one vote per path" rule. The voting weight of each reasoning path is no longer 1, but its confidence score. This gives higher-confidence, higher-quality paths more influence in the final decision.
* Confidence Filtering: This is more direct, involving a "pre-selection" round before voting. For example, you can directly eliminate the bottom 90% of paths by confidence score, allowing only the top 10% elite paths to participate in the final decision, thereby significantly reducing noise from low-quality paths.
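A minimal sketch of these two offline techniques, assuming each finished path has already been reduced to an (answer, confidence) pair; the filtering ratio and the linear weighting are illustrative choices:

```python
from collections import defaultdict

def confidence_weighted_vote(paths: list[tuple[str, float]],
                             keep_top: float = 0.10) -> str:
    """Offline DeepConf-style voting sketch:
    1) Confidence filtering: keep only the top `keep_top` fraction of paths.
    2) Weighted voting: each surviving path votes with its confidence, not 1."""
    ranked = sorted(paths, key=lambda p: p[1], reverse=True)
    kept = ranked[:max(1, int(len(ranked) * keep_top))]

    weights: dict[str, float] = defaultdict(float)
    for answer, conf in kept:
        weights[answer] += conf
    return max(weights, key=weights.get)

# Illustrative usage with made-up (answer, confidence) pairs:
paths = [("42", 8.1), ("42", 7.9), ("17", 4.2), ("17", 4.0), ("42", 7.5)]
print(confidence_weighted_vote(paths, keep_top=0.6))  # -> "42"
```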
Online Mode: Real-time Halting of "Unreliable" Ideas
The online mode is the essence of this work; it is what truly achieves cost reduction and efficiency gains, and its operation is quite elegant. Its workflow is carefully designed and can be broken down into the following steps (a simplified code sketch follows the list):
* Offline Warmup: For a new problem, the system first completely generates a small batch (e.g., 16) of reasoning paths. These "pioneer troops" serve to establish a baseline.
* Setting Stopping Threshold: The system analyzes the "lowest group confidence" of these 16 warmup paths and sets a dynamic "passing line" (i.e., stopping threshold s) accordingly. For instance, it can take the lowest value among the top 90% of confidence scores from this batch as the threshold.
* Dynamic Generation with Early Stopping: Next, the system begins generating new reasoning paths. During generation, it continuously monitors their local "group confidence." As soon as confidence drops below the previously set threshold s, the system decisively cuts off that path, preventing it from wasting another token!
* Adaptive Sampling: There's also a complementary design. The system continuously tracks the answer consensus among completed paths. If it finds that most paths already point to the same answer (e.g., consensus exceeding 95%), it stops generating more paths, as the answer is already clear.
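The sketch below is a simplified, self-contained simulation of this online loop: warmup, a percentile-based stopping threshold s, and early termination of a new path as soon as its sliding-window confidence drops below s. In a real deployment this check hooks into the token-by-token generation loop of an engine such as vLLM, and the adaptive (consensus-based) sampling step is omitted for brevity; the window size, percentile, and synthetic confidence streams are all illustrative:

```python
import random

WINDOW = 64           # sliding-window size (kept small for the demo)
KEEP_PERCENTILE = 90  # keep roughly the top 90% -> threshold at the 10th percentile

def lowest_group_conf(confs, window=WINDOW):
    """Lowest sliding-window mean confidence of a finished path."""
    if len(confs) <= window:
        return sum(confs) / len(confs)
    return min(sum(confs[i:i + window]) / window
               for i in range(len(confs) - window + 1))

def warmup_threshold(warmup_paths):
    """Steps 1-2: finish a small batch of paths, then set the stopping
    threshold s at a low percentile of their lowest-group confidences."""
    scores = sorted(lowest_group_conf(p) for p in warmup_paths)
    idx = int(len(scores) * (100 - KEEP_PERCENTILE) / 100)
    return scores[idx]

def generate_with_early_stop(token_conf_stream, threshold):
    """Step 3: while a new path is generated, monitor its sliding-window
    confidence and abort the moment it drops below the threshold s."""
    confs = []
    for conf in token_conf_stream:   # stands in for per-token decoding
        confs.append(conf)
        window = confs[-WINDOW:]
        if len(window) == WINDOW and sum(window) / WINDOW < threshold:
            return confs, False      # path aborted early
    return confs, True               # path completed

# Synthetic demo: 16 healthy warmup paths, then one path that degrades midway.
random.seed(0)
warmup = [[random.uniform(5.0, 8.0) for _ in range(512)] for _ in range(16)]
s = warmup_threshold(warmup)
shaky = [random.uniform(5.0, 8.0) for _ in range(200)] + \
        [random.uniform(1.0, 3.0) for _ in range(300)]
confs, completed = generate_with_early_stop(iter(shaky), s)
print(f"threshold s={s:.2f}, stopped after {len(confs)} tokens, completed={completed}")
```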
Experimental Results: Data Doesn't Lie
The paper conducted extensive experiments on multiple high-difficulty math and science reasoning benchmarks (such as AIME, HMMT, GPQA) with various advanced open-source models (e.g., DeepSeek-8B, Qwen3-32B, GPT-OSS-120B). The results were truly astonishing:
* Significant Accuracy Improvement: In offline mode, taking the AIME 2025 test set and the GPT-OSS-120B model as an example, standard majority voting (Cons@512) achieved an accuracy of 97.0%. However, with DeepConf (Tail Conf@512 + top 10% filtering), the accuracy reached 99.9%, almost perfectly solving this benchmark.
* Substantial Cost Reduction: In online mode, compared to majority voting that generates full paths, DeepConf-low (aggressive filtering strategy) can reduce token consumption by up to 84.7% while maintaining or improving accuracy. This means that computational resources that previously cost 100 units might now only cost 15 units to achieve equivalent or even better results.
Application: DeepConf for Customer Churn Prediction
To validate DeepConf's effectiveness in real-world business scenarios, I built a DeepConf-based customer churn prediction example Agent using the researchers' open-source code and tested it on a customer dataset from Kaggle.
Technology Stack and Environmental Requirements
DeepConf's deployment is relatively simple, but there are several key technical requirements you need to understand:
* vLLM Inference Engine: This is the core dependency for running DeepConf, used for efficient batch inference and for obtaining token-level log probabilities (logprobs), the raw data needed to calculate confidence (see the minimal example after this list).
* Models Supporting Logprobs: Not all model APIs support returning detailed token probabilities; open-source models such as DeepSeek-R1 and Qwen fully support this via vLLM.
* Reasonable Computing Resources: Although it saves a lot compared to traditional self-consistency methods, multi-path inference still requires sufficient GPU/CPU resources.
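As a rough sketch of how the raw ingredients are obtained, assuming a recent vLLM version in which `SamplingParams(logprobs=k)` returns per-token top-k log-probabilities (the model name, prompt, and sampling settings are illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1-0528-Qwen3-8B")   # illustrative model
params = SamplingParams(temperature=0.6, max_tokens=2048, logprobs=20, n=8)

outputs = llm.generate(["Solve: ..."], params)
for completion in outputs[0].outputs:            # one entry per sampled path
    token_confs = []
    for pos_logprobs in completion.logprobs:     # per position: {token_id: Logprob}
        top_k = [lp.logprob for lp in pos_logprobs.values()]
        token_confs.append(-sum(top_k) / len(top_k))   # token confidence C_i
    print(f"average trace confidence: {sum(token_confs) / len(token_confs):.2f}")
```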
Actual Performance
I used the DeepSeek-R1-8B model and designed 8 different "expert perspectives" (credit score analyst, customer behavior expert, financial status analyst, etc.) for the customer churn prediction task, generating 4 rounds of reasoning for each perspective, totaling 32 reasoning paths.
From the actual operational results, we can see:
* Multi-angle analysis: 32 reasoning paths analyzed the same customer from different professional angles, generating churn probability predictions ranging from 10% to 60%, reflecting the diversity of reasoning.
* Quantified Confidence: Each path had a clear confidence score (range 4.049-8.262), providing a basis for subsequent intelligent filtering.
* Intelligent Voting Mechanism: Confidence-weighted voting was carried out with linear, exponential, and softmax weighting, producing a final predicted churn probability of 49.9% (classified as "retention"), which matched the customer's actual outcome.
* Efficient Execution: The entire analysis process took only 162 seconds, with an average of 63.2 tokens generated per trajectory, achieving a throughput of 12.5 tokens/second.
In tests with 3 random customers, the prediction accuracy reached 66.7%. Considering this is a complex business prediction task, this result is quite encouraging.
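For reference, a minimal sketch of how confidence-weighted aggregation of churn predictions might look; the numbers, names, and the choice of linear versus softmax weighting are purely illustrative (the real agent aggregates 32 expert-perspective paths):

```python
import math

def weighted_churn_probability(preds: list[tuple[float, float]],
                               scheme: str = "softmax") -> float:
    """Aggregate (churn_probability, confidence) pairs from many reasoning paths.
    'linear' weights each prediction by its raw confidence; 'softmax' exponentiates
    the confidences first, so the most confident paths dominate the result."""
    if scheme == "linear":
        weights = [conf for _, conf in preds]
    else:
        weights = [math.exp(conf) for _, conf in preds]
    total = sum(weights)
    return sum(p * w for (p, _), w in zip(preds, weights)) / total

# Hypothetical paths: (predicted churn probability, confidence score)
paths = [(0.60, 4.2), (0.35, 7.8), (0.55, 5.1), (0.40, 8.0)]
print(f"weighted churn probability: {weighted_churn_probability(paths):.1%}")
```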
Key Findings
This practical application revealed three prominent advantages of DeepConf in real-world business scenarios:
* Richness of Business Insights: Through multiple expert perspectives, we not only obtained prediction results but, more importantly, gained 32 different analytical approaches, providing rich references for business decisions.
* Strong Result Interpretability: Each reasoning path has a complete analysis process and clear confidence, making the prediction results no longer a "black box."
* Controllable Computational Efficiency: Compared to traditional methods requiring hundreds of paths, 32 paths can yield stable and reliable results.
What Does This Mean for Developing AI Products?
In the past, we attempted to "brute-force" reasoning challenges with a "sea of paths" (generating massive numbers of paths), which was not only costly but often drowned in noise. DeepConf demonstrates that through intelligent filtering and guidance, we can precisely achieve our goals with a small number of high-quality paths. This represents a shift from pursuing breadth of computation to delving into the depth of intelligence. Therefore, DeepConf's value extends far beyond a cost-reduction and efficiency-boosting "secret weapon." For frontline AI engineers and product managers, it offers strategic inspiration, marking a significant evolution in how we collaborate with large models.
Friends who haven't tried it yet are encouraged to give it a go, and don't forget to star the authors' repo: https://github.com/facebookresearch/deepconf/tree/main
The future is here, may we walk together!