When solving complex problems, Large Language Models (like ChatGPT) typically generate a "reasoning process" before producing a final answer. Traditional evaluation only checks whether that final answer is correct, but this paper proposes a counter-intuitive view: the final answer may simply reflect the model's "last-minute change of mind," and the intermediate reasoning steps are more informative than they appear.
Paper: Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
Link: https://arxiv.org/pdf/2504.20708
For example, when solving a math problem, the model might make a mistake mid-step and then force a correction at the end, producing a wrong answer; or the correct answer might appear in an intermediate step only to be overwritten by subsequent incorrect derivations. The paper shows experimentally that relying solely on the final answer can miss a better solution.
Finding: Intermediate Steps Hold Clues, Answer Consistency Determines Accuracy
The researchers segmented the model's reasoning process into multiple "subthoughts," treating cues like "wait a minute" or "let's look from another angle" as markers of a new thinking phase. From each intermediate step, they then regenerated the remaining reasoning to obtain a distribution of final answers.
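The segmentation step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cue list here is a hypothetical stand-in for whatever transition markers the authors actually use.

```python
import re

# Hypothetical transition cues signaling a new "subthought";
# the paper's actual cue list may differ.
CUES = ["wait", "alternatively", "let's look from another angle", "hmm"]

def split_subthoughts(trace: str) -> list[str]:
    """Split a reasoning trace into subthoughts at transition cues,
    keeping each cue at the start of its own segment."""
    pattern = r"(?i)(?=\b(?:" + "|".join(re.escape(c) for c in CUES) + r")\b)"
    return [p.strip() for p in re.split(pattern, trace) if p.strip()]

trace = ("First, compute 12 * 8 = 96. "
         "Wait, maybe I should re-check the multiplication. "
         "Alternatively, 12 * 8 = 12 * 10 - 12 * 2 = 96.")
print(split_subthoughts(trace))  # three subthoughts
```

Each returned prefix boundary is a candidate truncation point from which the rest of the reasoning can be regenerated.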
Key Findings:
Correct answers often appear frequently in intermediate steps, while incorrect answers fluctuate significantly.
The more concentrated the answer distribution (low entropy), the more likely the model is correct; the more dispersed the distribution (high entropy), the higher the chance of error.
Formula for understanding:
Entropy Calculation (measures answer consistency): H = -Σᵢ pᵢ log pᵢ, where pᵢ is the fraction of regenerations that produce answer i.
Low Entropy → Concentrated Answers → High Confidence ✅
High Entropy → Dispersed Answers → Potential Error ❌
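A minimal sketch of the entropy check, assuming the answers collected from regenerations are simple strings:

```python
import math
from collections import Counter

def answer_entropy(answers: list[str]) -> float:
    """Shannon entropy (in bits) of the final-answer distribution."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Concentrated answers -> low entropy -> likely correct
print(answer_entropy(["96", "96", "96", "50"]))   # ~0.811 bits
# Dispersed answers -> high entropy -> likely wrong
print(answer_entropy(["96", "50", "42", "17"]))   # 2.0 bits
```

A lower value means the regenerated answers agree with each other, which the paper correlates with correctness.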
Method: How to Improve Model Performance Using "Step-by-Step Checking"?
The paper proposes a simple but effective process:
1. Truncate Reasoning: Cut the reasoning trace at each intermediate step (e.g., keep only the trace up through step 3).
2. Restart Generation: Regenerate subsequent reasoning from the pause point to get multiple candidate answers.
3. Voting Decision: Select the answer that appears most frequently (mode).
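The three steps above can be sketched as a small pipeline. The model call is stubbed out with a deterministic fake; in a real system `generate_from` would sample the LLM's continuation from each truncation point (both names are illustrative, not from the paper).

```python
from collections import Counter

def mode_answer(prefixes, generate_from, n_samples=2):
    """Steps 1-3: for each truncated prefix, regenerate n_samples
    completions and return the most frequent final answer (the mode)."""
    answers = [generate_from(p) for p in prefixes for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Deterministic stand-in for an LLM continuation call (assumption:
# real completions would be sampled from the model).
fake_completions = iter(["96", "50", "96", "96", "50", "96"])
def fake_generate(prefix):
    return next(fake_completions)

prefixes = ["...step 1", "...step 2", "...step 3"]
print(mode_answer(prefixes, fake_generate))  # -> 96 (appears 4 of 6 times)
```

Majority voting over regenerations from every truncation point is what lets an early correct answer outvote a late mistake.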
Example 🌰:
Suppose that, while solving an equation, the model produces the correct answer (96) three times across intermediate steps but ends with an incorrect final answer (50). "Step-by-step checking" counts 96 as the most frequent answer and thereby corrects the error.
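The arithmetic of this example, with the answer counts as stated above (three 96s from intermediate steps plus the erroneous final 50):

```python
from collections import Counter

# Final answers obtained by regenerating from each intermediate step.
candidates = ["96", "96", "96", "50"]
print(Counter(candidates).most_common(1)[0][0])  # -> 96, overriding the final 50
```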