Current Language Models (LMs) have achieved breakthroughs on complex question-answering tasks by generating Chains of Thought trained with Reinforcement Learning (RL). However, mainstream methods rely on a binary reward function (1 point for a correct answer, 0 for a wrong one), which encourages models to guess blindly in pursuit of high scores, displaying high confidence even when they are uncertain about the answer. This phenomenon is called calibration degradation: model confidence severely mismatches true accuracy, potentially leading to serious consequences in high-stakes domains like healthcare and law.
Paper: Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
Link: https://www.arxiv.org/pdf/2507.16806
The RLCR method (Reinforcement Learning with Calibration Rewards) proposed in this paper is the first to directly integrate probability calibration into the RL training objective. By requiring the model to output both an answer and a numerical confidence score, and designing a novel reward function (correctness score + Brier calibration score), RLCR achieves:
- Theoretical Guarantee: Proof that the model is incentivized to optimize both accuracy and calibration.
- Performance Breakthrough: Calibration error (ECE) reduced by over 85% in mathematical reasoning (GSM8K) and factual question answering (HotPotQA) tasks.
- Generalization Advantage: Maintains calibration capability across cross-domain tasks, outperforming traditional RL and post-processing classifiers.
Methodology Explained: RLCR Design and Theory
Reward Function Reconstruction
Traditional RLVR (Reinforcement Learning with Verifiable Rewards) uses only a binary reward: its reward function R(y, y*) equals 1 if the model output y matches the correct answer y*, and 0 otherwise, i.e. R(y, y*) = I(y = y*).
RLCR's innovative reward function additionally incorporates the Brier score (a proper scoring rule that measures the calibration of probability predictions). The reward R(y, c, y*) = I(y = y*) − (c − I(y = y*))² combines a correctness score with a Brier calibration term, where:
- c: The model's output confidence (a numerical value between 0 and 1)
- Second term's effect: penalizes the deviation of confidence c from the true correctness I(y=y*) (see the reward sketch below). For example:
- If the answer is correct but c is low, a penalty is incurred.
- If the answer is wrong but c is high, a large penalty is incurred.
Binary Rewards Encourage Guessing vs. RLCR Rewards Balance Correctness and Calibration
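As a minimal sketch of this combined reward (the `is_correct` matcher below is a placeholder assumption; the paper uses task-specific verifiers):

```python
def rlcr_reward(answer: str, confidence: float, gold: str, is_correct=None) -> float:
    """Combined RLCR-style reward: correctness score plus Brier calibration term.

    correctness  = 1 if the answer matches the gold answer, else 0
    brier penalty = (confidence - correctness)^2
    """
    if is_correct is None:
        # Placeholder verifier; in practice a task-specific checker is used
        # (e.g. exact match for QA, a math checker for numerical answers).
        is_correct = lambda y, y_star: y.strip() == y_star.strip()
    correctness = 1.0 if is_correct(answer, gold) else 0.0
    return correctness - (confidence - correctness) ** 2


# A wrong answer stated with high confidence is penalized most heavily:
print(rlcr_reward("Paris", 0.9, "Paris"))  # 1 - 0.01 =  0.99
print(rlcr_reward("Paris", 0.2, "Paris"))  # 1 - 0.64 =  0.36
print(rlcr_reward("Lyon",  0.9, "Paris"))  # 0 - 0.81 = -0.81
print(rlcr_reward("Lyon",  0.1, "Paris"))  # 0 - 0.01 = -0.01
```

Note how the wrong-but-confident case receives the harshest penalty, which is exactly the guessing behavior that binary rewards fail to discourage.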
Theoretical Guarantee: Core Idea of Theorem 1
Theorem 1: When the model's true probability of correctness for an answer is p*:
- Calibration Incentive: When p* is fixed, the expected reward is maximized when c=p*.
- Correctness Incentive: among calibrated predictions (c = p*), the expected reward increases with p*, so choosing the answer most likely to be correct is optimal.
Proof sketch:
- Decompose the Brier score via its Savage-Dawid representation and differentiate the expected reward with respect to c.
- For c ≤ p* the expected reward is monotonically increasing in c, and for c ≥ p* it is monotonically decreasing, so it peaks at c = p*; evaluated at that peak it grows with p*, ensuring that higher accuracy still receives a higher reward (worked out below).
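As a quick sanity check of these claims, the expected reward for a fixed answer with true correctness probability p* can be written out directly from the reward form given above:

```latex
\mathbb{E}[R(c)]
  = p^*\bigl(1-(c-1)^2\bigr) + (1-p^*)\bigl(0 - c^2\bigr)
  = p^* - p^*(1-c)^2 - (1-p^*)\,c^2,
\qquad
\frac{d}{dc}\,\mathbb{E}[R(c)] = 2\bigl(p^* - c\bigr).
```

The derivative is positive for c < p* and negative for c > p*, so the expected reward peaks at c = p*; plugging that in gives E[R(p*)] = p* − p*(1 − p*) = (p*)², which increases with p*, matching both incentives stated in the theorem.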
Why not log loss? Log loss is also a strictly proper scoring rule, but it is unbounded: as the confidence c approaches 0 for a correct answer, the loss tends to infinity, so the bounded correctness reward can no longer dominate the calibration term and the theorem's conditions cannot be satisfied, potentially incentivizing the model to output incorrect answers with confidence 0.
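For contrast, a log-scoring version of the same combined reward (notation mine, not the paper's) would be:

```latex
R_{\log}(y, c, y^*) \;=\; \mathbb{1}[y=y^*]
  \;+\; \mathbb{1}[y=y^*]\,\log c \;+\; \bigl(1-\mathbb{1}[y=y^*]\bigr)\,\log(1-c).
```

Because log c tends to negative infinity as c approaches 0 when the answer is correct, the calibration term is unbounded below, and the bounded correctness bonus can no longer guarantee that the more accurate answer is always preferred.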
Experimental Design and Results Analysis
Dataset Innovation: Forcing Uncertainty Reasoning
- HotPotQA-Modified:
- The original multi-hop QA dataset provides 10 paragraphs per question (2 relevant + 8 distractors).
- The modified version randomly removes 0, 1, or 2 of the key paragraphs, creating information deficits (1/3 of samples lack key information).
- Goal: force the model to recognize missing evidence and lower its confidence accordingly (a minimal sketch of this modification follows).
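A minimal sketch of this modification, assuming a HotPotQA-style record with a paragraph list and the indices of its two supporting paragraphs (the field names are illustrative, not the paper's code):

```python
import random

def make_information_deficit(example: dict, rng: random.Random) -> dict:
    """Randomly drop 0, 1, or 2 of the supporting paragraphs from a
    HotPotQA-style example so the question may become unanswerable."""
    paragraphs = list(example["paragraphs"])          # 10 paragraphs in total
    supporting = list(example["supporting_indices"])  # indices of the 2 key paragraphs
    n_drop = rng.choice([0, 1, 2])
    dropped = set(rng.sample(supporting, n_drop))
    kept = [p for i, p in enumerate(paragraphs) if i not in dropped]
    return {
        "question": example["question"],
        "context": kept,
        # Rough flag kept only for analysis; it is not shown to the model.
        "answerable": n_drop == 0,
    }
```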
- Big-Math:
- Math problems are filtered to those on which LLaMA-8B achieves 0-70% accuracy (about 15,000 problems in total).
- Only problems with numerical answers are retained, so correctness can be verified exactly with the math-verify tool (usage sketched below).
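For the verification step, here is a sketch of how the open-source math-verify package can compare a model's final expression against the gold answer, assuming its parse/verify interface:

```python
from math_verify import parse, verify  # pip install math-verify

gold = parse("$\\frac{1}{2}$")
prediction = parse("0.5")

# verify() checks mathematical equivalence rather than string equality,
# so equivalent forms of the same numerical answer are accepted.
print(verify(gold, prediction))  # expected: True
```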
Baseline Method Comparison
| Baseline | Approach | Limitation |
| --- | --- | --- |
| RLVR | Traditional binary reward + CoT | Severe calibration degradation |
| RLVR + BCE classifier | An additional classifier is trained to predict confidence | Requires two models, roughly doubling inference cost |
| RLVR + Brier classifier | Brier loss replaces BCE for classifier training | Limited calibration improvement |
| Answer Probability (AnswerProb) | Directly uses the average token probability within the <answer> tag | Ignores the reasoning process, overestimates confidence |
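To make the two classifier baselines concrete, here is a minimal PyTorch sketch of the two training losses applied to a probe's predicted confidence (the probe itself and the batch shapes are illustrative assumptions, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def classifier_losses(logits: torch.Tensor, correct: torch.Tensor):
    """logits: raw confidence scores from a probe head, shape (batch,)
    correct: 1.0 if the underlying answer was right, else 0.0"""
    confidence = torch.sigmoid(logits)
    bce_loss = F.binary_cross_entropy(confidence, correct)  # RLVR + BCE classifier
    brier_loss = ((confidence - correct) ** 2).mean()        # RLVR + Brier classifier
    return bce_loss, brier_loss

logits = torch.tensor([2.0, -1.0, 0.5])
correct = torch.tensor([1.0, 0.0, 0.0])
print(classifier_losses(logits, correct))
```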
Core Results: Win-Win for Calibration and Accuracy
Key Data: RLCR vs. Baselines Calibration Error Comparison on HotPotQA
In-domain Performance (HotPotQA):
- Accuracy: RLCR (62.1%) ≈ RLVR (63.0%)
- Calibration Improvement (the metrics are sketched in code after this results block):
- ECE: from 0.37 → 0.03 (a 92% reduction)
- Brier Score: from 0.37 → 0.21 (a 43% reduction)
- Qualitative analysis: the model explicitly reasons about its uncertainty inside the <analysis> tag (e.g., "the conclusion from paragraph 3 might be invalid due to a data conflict")
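The calibration metrics used throughout these results can be computed from (confidence, correctness) pairs as follows; this is a standard sketch with 10 equal-width bins for ECE (binning choices vary between papers), not the paper's exact evaluation code:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def calibration_metrics(confidence: np.ndarray, correct: np.ndarray, n_bins: int = 10):
    """confidence: predicted probabilities in [0, 1]; correct: 0/1 outcomes."""
    brier = np.mean((confidence - correct) ** 2)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(bins[:-1], bins[1:])):
        if i == 0:
            mask = (confidence >= lo) & (confidence <= hi)  # include 0.0 in the first bin
        else:
            mask = (confidence > lo) & (confidence <= hi)
        if mask.any():
            gap = abs(confidence[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # each bin weighted by its share of samples
    # AUROC measures how well confidence separates correct from incorrect answers
    # (requires both correct and incorrect examples in the evaluation set).
    auroc = roc_auc_score(correct, confidence)
    return {"ECE": ece, "Brier": brier, "AUROC": auroc}
```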
Cross-domain Generalization (6 Out-of-Domain Datasets):
- Accuracy: RLCR (56.2%) > RLVR (53.9%)
- The calibration advantage is even more pronounced:
- ECE: 0.21 vs. RLVR's 0.46 (54% better than the baseline)
- AUROC: 0.68, a 36% improvement in how well confidence separates correct from incorrect answers
- Key Conclusion: RLCR's generalization ability stems from the transferability of uncertainty reasoning.
Mathematical Reasoning (GSM8K+Math500):
- Role of SFT Warm-up:
- The base model is first fine-tuned on 500 uncertainty analyses generated by DeepSeek-R1.
- Result: SFT+RLCR's ECE dropped to 0.058 (pure RLCR was 0.119).
- Typical Error:
- Original RLCR output: "71 movies" (confidence 0.6)
- SFT+RLCR output: "76 movies" (confidence 0.3; the actual correct answer is 63)
- The <analysis> tag notes: "Overlap with the full set was not considered when adding in-group movies"
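The outputs above follow a tagged format; here is a minimal parser sketch, assuming the completion carries <analysis>, <answer>, and a numeric <confidence> tag (the confidence tag name is my assumption based on the tags quoted in this write-up):

```python
import re

def parse_tagged_output(text: str):
    """Extract the analysis, answer, and confidence from a tagged completion."""
    def grab(tag: str):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        return m.group(1).strip() if m else None

    analysis = grab("analysis")
    answer = grab("answer")
    conf_text = grab("confidence")  # assumed tag name for the numeric confidence
    try:
        confidence = min(max(float(conf_text), 0.0), 1.0) if conf_text else None
    except ValueError:
        confidence = None  # malformed confidence; could be treated as 0.5 or rejected
    return {"analysis": analysis, "answer": answer, "confidence": confidence}
```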
Innovative Discoveries and Technical Extensions
Consistency Validation of Confidence
Confidence Stability for Same Answers:
- Fix an answer and sample 5 analysis chains, each producing a confidence estimate.
- Result: 82% of samples had standard deviation < 0.1.
Confidence Conservation for Mutually Exclusive Answers:
- Ideal Requirement: if a set of candidate answers is mutually exclusive, their confidences should sum to 1.
- Actual Results:
- In-domain (HotPotQA): RLCR total confidence ≈ 0.98.
- Out-of-domain: still somewhat overconfident (total ≈ 1.2), but significantly better than RLVR (total ≈ 1.8); both consistency checks are sketched below.
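Both checks can be reproduced from repeated samples of (question, answer, confidence); a small illustrative sketch:

```python
import numpy as np
from collections import defaultdict

def consistency_checks(samples):
    """samples: list of (question_id, answer, confidence) triples from repeated sampling."""
    by_answer = defaultdict(list)    # confidences given to the same (question, answer) pair
    by_question = defaultdict(dict)  # mean confidence per distinct answer of a question
    for qid, answer, conf in samples:
        by_answer[(qid, answer)].append(conf)
    for (qid, answer), confs in by_answer.items():
        by_question[qid][answer] = float(np.mean(confs))

    # Stability: spread of confidence across analysis chains for the same answer.
    stds = [np.std(confs) for confs in by_answer.values() if len(confs) > 1]
    # Conservation: confidences of mutually exclusive answers should sum to roughly 1.
    totals = [sum(answer_confs.values()) for answer_confs in by_question.values()]
    return float(np.mean(stds)) if stds else None, float(np.mean(totals))
```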
Confidence Weighted Scaling at Test Time
Core Idea: use the model's stated confidence c as a reliability proxy that requires no additional training (both strategies are sketched in code below).
- Max-Confidence Selection: Select the highest confidence answer from N samples.
- Weighted Majority Voting: Vote weighted by c.
Graph: Accuracy curve of confidence-weighted voting with increasing sample size
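A minimal sketch of both strategies over N sampled (answer, confidence) pairs (illustrative, not the paper's implementation):

```python
from collections import defaultdict

def max_confidence_selection(samples):
    """samples: list of (answer, confidence) pairs from N independent generations."""
    return max(samples, key=lambda s: s[1])[0]

def confidence_weighted_vote(samples):
    """Majority vote where each sample's vote is weighted by its confidence."""
    scores = defaultdict(float)
    for answer, confidence in samples:
        scores[answer] += confidence
    return max(scores, key=scores.get)

samples = [("63", 0.8), ("71", 0.6), ("63", 0.7), ("76", 0.3), ("71", 0.4)]
print(max_confidence_selection(samples))  # "63" (highest single confidence: 0.8)
print(confidence_weighted_vote(samples))  # "63" (weighted score 1.5 vs. 1.0 vs. 0.3)
```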
Key Advantages:
- At N=5, confidence-weighted voting accuracy exceeds ordinary voting by 3.2%.
- Analysis chain integration (sampling K <analysis> entries) can further reduce the Brier score.
Correlation between Model Scale and Calibration Ability
Smaller models rely more on explicit uncertainty reasoning chains to improve calibration.
- Experimental Design: Compared two classifiers on 0.5B/1.5B/7B models:
- Baseline Classifier: Only inputs question and answer.
- Analysis Classifier: Additionally inputs RLCR's <analysis> content.
- Results:
- 0.5B model: Analysis classifier's Brier score 37% lower.
- 7B model: The gap disappears.
- Conclusion: Smaller models need explicit reasoning chains to compensate for insufficient representational capacity.
Conclusion
Through a concise but far-reaching redesign of the reward function, RLCR for the first time unifies the optimization of language models' accuracy and calibration within a reinforcement learning framework. Its core contributions can be summarized as:
- Theoretical Rigor: Proved that the combination of Brier score and correctness reward can simultaneously incentivize optimal answer selection and true confidence expression.
- Empirical Superiority: Significantly reduced calibration error (maximum reduction of 92%) across 12 datasets without sacrificing accuracy.
- Practical Scalability: the confidence output enables lightweight test-time improvements such as confidence-weighted voting.
This work marks a crucial step for language models from "only seeking correct answers" to "understanding their own cognitive boundaries," laying the foundation for reliable AI deployment in high-risk scenarios like medical diagnosis and legal consultation. Future research needs to further address cross-domain calibration generalization and uncertainty propagation in complex reasoning.
Note: Title references "Gelu AI@xhs"