The More You Think, The More You Err: CoT "Deep Deliberation" as a Catalyst for LLM Hallucinations!


In a nutshell: don't blindly trust models' "deep thinking" anymore. Through extensive experiments, this paper shows that for knowledge-intensive tasks, longer reasoning chains not only fail to uncover more knowledge but become a hotbed for models to fabricate falsehoods and fall into "confirmation bias": the more they think, the more wildly they err. (The original paper title is at the end of the article. Published on arXiv on 08 Sep 2025, by the National University of Singapore.)

Phase One: Identifying Core Concepts

Analysis of the Paper's Motivation

Research Background—In recent years, Large Language Models (LLMs) have made significant progress in solving complex problems, especially tasks requiring step-by-step reasoning (such as math problems). A key technique behind this is called Test-Time Scaling, which, simply put, involves allowing the model to spend more time "thinking" before answering a question, generating a long string of "internal monologues" or "reasoning chains" (Chain-of-Thought, CoT). The intuition behind this strategy is simple: the longer and more deeply one thinks, the more likely the answer is to be correct.

Research Gap—However, the authors noted that while this "more effort, more miracles" approach is effective in many domains, it is unclear whether it applies to a well-known LLM "Achilles' heel"—handling tasks that require precise factual knowledge. In these knowledge-intensive tasks, models not only need to answer correctly but also to avoid confidently fabricating information, which is what we commonly call hallucination.

Research Motivation—Therefore, the motivation of this paper is very clear: can the currently popular strategy of "letting the model think a bit longer" help models become more knowledgeable and reliable in knowledge-based questioning? Or, does thinking more lead to even greater errors? The authors aimed to answer this open question through comprehensive experiments.

Analysis of the Paper's Main Contributions

Main Innovations:

Discovery of a counter-intuitive phenomenon: Through detailed testing of 12 mainstream reasoning models and 2 knowledge-intensive datasets, the paper reached a surprising conclusion: increasing the model's "thinking time" does not consistently improve the accuracy of answering factual questions, and in many cases it leads to more severe hallucinations.

Revealing the deeper reasons behind the phenomenon: The paper did not stop at discovering the phenomenon but delved into the internal mechanisms of hallucination changes. The authors found that the reduction in hallucinations was not because the model "recalled" correct knowledge, but because it "thought about it and decided against it," choosing to abstain from answering. Conversely, the increase in hallucinations was due to longer thinking time giving the model the "courage" to attempt questions it was unsure about, which naturally led to numerous errors.

Proposing an explanation based on confirmation bias: Through case studies, the paper points out that longer reasoning processes may induce a human-like confirmation bias in models. A model may first generate an initial, possibly incorrect guess, and then, in subsequent "thinking," continuously seek or even fabricate evidence to support this guess, ultimately producing a seemingly logical but actually incorrect "confident hallucination."

Providing a balanced perspective: Although "thinking longer" was not effective, the paper also compared "thinking" with "not thinking" (directly outputting answers). The results showed that enabling the thinking mode (even brief thinking) is generally still better than not thinking at all, especially for complex problems requiring multiple reasoning steps.

Key Technologies or Methods Supporting the Innovations:

Systematic experimental framework: The authors designed a unified experimental process to evaluate the "test-time scaling" effect of different models. They categorized the scaling methods into three types based on the interfaces supported by the models: reasoning effort, thinking budget, and budget forcing, thereby enabling controllable and comparable evaluations across different models.

Behavioral change analysis method: This is the key method supporting their core argument that "hallucination changes stem from the willingness to answer." They specifically compared those questions where models transitioned from "hallucinating" to "not hallucinating" (and vice versa) under different thinking levels. By analyzing these transitions, they were able to quantitatively show that the reduction in hallucinations primarily came from "abstaining from answering," while the increase came from "attempting new questions."

Case study: By presenting the complete "internal monologues" (reasoning chains) of models at different thinking lengths, the formation process of "confirmation bias" was intuitively exposed.

Identifying Understanding Difficulties

Understanding Key Concepts/Methods of the Paper:

Test-Time Scaling: It is crucial to understand what this operation specifically refers to. It is not about retraining the model but, during inference (answering questions), using specific techniques to make the model generate longer intermediate thinking steps.

Confirmation Bias: This is central to understanding "why thinking more leads to more errors." It refers to the tendency to search for, interpret, and recall information that supports one's existing beliefs or hypotheses.

Willingness to Answer: This is the core mechanism the paper uses to explain changes in the number of hallucinations. It describes the trade-off a model makes between "daring to try when uncertain" and "choosing to abstain when uncertain."

Most Challenging Part—How confirmation bias arises in large models: This process is relatively abstract. Models do not have subjective consciousness, so how do they "convince" themselves? Understanding this requires combining specific reasoning chain examples to observe how a model gradually moves from uncertainty to overconfidence.

Core Concepts Requiring Emphasis—Core concept: How test-time scaling, by inducing confirmation bias, affects the model's willingness to answer, ultimately leading to more hallucinations in knowledge-intensive tasks. This sentence connects all the key points of the paper and is the core we need to delve into in the second phase.

Concept Dependencies

Relationships Between Core Concepts:

Starting point: Begin with the most basic operation, test-time scaling, explaining what it does.

Core problem: Then introduce the hallucination problem encountered when this operation is applied to knowledge-intensive tasks.

Core mechanism: Next, use confirmation bias to explain why longer thinking (test-time scaling) can exacerbate hallucinations.

Final manifestation: Finally, explain how this confirmation bias manifests in the model's willingness to answer, thereby fully explaining all experimental phenomena observed in the paper.

Phase Two: In-depth Explanation of Core Concepts

Designing a Real-Life Analogy

Scenario Setting and Core Mechanism—Imagine a scenario: a student, not a history expert, is participating in a closed-book history quiz. One question is: "Which dynasty did Bi Sheng, the inventor of movable type printing, belong to?" The student has only a vague impression of the answer and is not entirely sure. Now, let's look at his performance under two different conditions.

Condition A (time pressure, short thinking time): The student quickly recalls; several possible dynasties flash through his mind, but he feels unsure about all of them. Due to time constraints, he doesn't have time to "fabricate" a seemingly reasonable explanation. To avoid losing points for a wrong answer, his most rational choice is to write "unsure" on the answer sheet or skip the question.

Condition B (ample time, long thinking time): The student has plenty of time to "ponder." He vaguely remembers that Bi Sheng's name sounds somewhat like the Song Dynasty. This "Song Dynasty" idea becomes his initial hypothesis. Next, he doesn't look for evidence to negate this hypothesis (because he doesn't have any in his mind anyway), but instead starts to construct a seemingly logical chain for this hypothesis on scratch paper: "Hmm, the Song Dynasty had a prosperous economy and culture, and advanced technology. Several of the Four Great Inventions are related to the Song Dynasty. Bi Sheng's name also sounds like a scholar from that era. And I remember he was mentioned in my textbook when talking about Song Dynasty technology. Yes, it must be the Song Dynasty!" In this process, he treats vague, neutral, or even irrelevant information (like a "scholar-like aura") as evidence to support his "Song Dynasty" hypothesis. After this "deep thinking," his initially uncertain guess has become very firm. Finally, he confidently writes "Song Dynasty" on the answer sheet.

Summary of Confirmation Bias Mechanism—This process is a typical example of confirmation bias. Longer thinking time did not expose him to new correct information; instead, it gave him an opportunity to "justify" and strengthen an initial, possibly incorrect, intuition using his limited knowledge.

Establishing Correspondence Between Analogy and Actual Technology

Correspondence Table

| Analogy element | Corresponding technical concept | Explanation |
| --- | --- | --- |
| Student | Large Language Model (LLM) | Both are intelligent agents that reason and answer questions based on existing knowledge. |
| History quiz question | Query in a knowledge-intensive task | A direct test of the agent's factual knowledge reserves. |
| Student's knowledge reserve | The model's trained internal parameters / world knowledge | The agent's sole source of information for answering questions (no internet access in the experiment). |
| Allowed thinking time | Computational budget for test-time scaling | For example, setting a higher reasoning_effort or more thinking_tokens. |
| Reasoning written on scratch paper | The model's Chain-of-Thought (CoT) | The model's "internal monologue" or intermediate thinking steps before outputting the final answer. |
| Time pressure, choosing to skip | The model abstaining from answering under a low computational budget | The model quickly judges that it lacks the knowledge and outputs "I don't know." |
| Ample time, constructing a logic chain and answering confidently | The model producing overconfident hallucinations via confirmation bias under a high computational budget | The model generates a long, seemingly reasonable CoT and ultimately gives a confident but incorrect answer. |
| Final answer "Song Dynasty" | The model's hallucinated output | An answer that contradicts the facts (strictly, the correct answer is the Northern Song Dynasty; the analogy simplifies it to "Song Dynasty" because the key point is the process). |

Delving into Technical Details

Technical Background—The core of this paper lies in experimental observation and analysis, not in proposing new mathematical formulas or algorithms. Its technical details are mainly reflected in its experimental design and analysis methods. We can use this analogy to understand the two most important metrics in the paper: Accuracy and Hallucination Ratio.

Accuracy Formula:
Original mathematical form: Number of correct answers / Total number of questions.
Symbolic replacement version: Score Rate = Number of questions the student answered correctly / Total number of questions.
Technical implementation: A powerful "referee" model (such as GPT-4o-mini) judges whether the model's answer is consistent with the standard answer.

Hallucination Ratio Formula:
Original mathematical form: Number of incorrect answers / Total number of questions.
Symbolic replacement version: Random Answering Rate = Number of questions the student answered incorrectly / Total number of questions.
Technical implementation: The "referee" model marks the model's answer as "incorrect."
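Written out side by side (with a third category for abstentions, as described in Step Three of Phase Three), the three evaluation outcomes partition every question, so the metrics satisfy a simple identity:

```latex
\mathrm{Accuracy} = \frac{N_{\mathrm{correct}}}{N_{\mathrm{total}}}, \qquad
\mathrm{Hallucination\ Ratio} = \frac{N_{\mathrm{incorrect}}}{N_{\mathrm{total}}}, \qquad
\mathrm{Accuracy} + \mathrm{Hallucination\ Ratio} + \mathrm{Not\text{-}Attempted\ Ratio} = 1 .
```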

Mapping Technical Details to Analogy

Mapping Relationship Analysis—In the analogy, when thinking time increases (from Condition A to Condition B), the student turns a question he would have skipped (contributing nothing to the hallucination rate) into a question he answers incorrectly (raising the hallucination rate). This directly increases the hallucination ratio. At the same time, if the student already had a correct first impression of a question, longer thinking time might make him waver or introduce wrong reasoning, causing him to answer incorrectly; this can lead to stagnating or even decreasing accuracy.

Figure 2 in the paper shows a trend where the Hallucination Ratio of multiple models does not decrease but instead increases as thinking length (Average Reasoning Tokens) grows, which corresponds exactly to our analogy.

Figure 4's case study illustrates the thinking process of the gpt-oss-20b model. Under a low thinking budget, it says "I'm uncertain. I'll say I don't know," just like the student in Condition A. Under a high thinking budget, it continuously self-suggests and fabricates "evidence" (e.g., "We can check his resume...", "I did see it on the AAAI website list..."), ultimately confidently giving the wrong answer "2005," which exactly mirrors the mental activity of the student in Condition B.

Limitations of the Analogy—This analogy effectively explains the increase in hallucinations caused by "confirmation bias." However, it does not fully cover all situations. For example, for some knowledge-based questions that indeed require multiple reasoning steps to get the correct answer (such as the FRAMES dataset mentioned in the paper), longer thinking time can sometimes help the model integrate information and improve accuracy (although the paper found this situation to be uncommon). Our analogy simplifies this point, primarily focusing on explaining the core mechanism of increased hallucinations.

Summary

Reiteration of Core Connection—Allowing the model to "think longer" (test-time scaling) is like giving a student with shaky knowledge more exam time. He won't magically conjure the correct answer; instead, he'll have more opportunities to package his vague, incorrect intuition, through a self-constructed logic (confirmation bias), into a seemingly credible final answer.

Summary of Key Principle—This process leads to situations where the model would have initially admitted "I don't know" transforming into confidently "talking nonsense." Therefore, at a macro data level, we observe the paper's core finding: with increasing thinking time, the model's hallucination rate does not decrease but instead rises.

Phase Three: Detailed Description of the Process Steps

The core of this paper is not proposing a new model, but rather designing a set of evaluation and analysis procedures to study the behavior of existing models. Below, we detail how this process works, assuming we want to replicate the paper's evaluation of the gpt-oss-20b model on the SimpleQA dataset.

Inputs:

1. Model: gpt-oss-20b

2. Dataset: A list of multiple factual questions, each with a standard answer (e.g., 800 questions extracted from SimpleQA).

3. Query Template (Prompt): A fixed instruction template, such as: "Give me the answer to the following question only when you are sure of it. Otherwise, say 'I don't know'. Put your answer on its own line after 'Answer:'."

4. Evaluator: A powerful LLM, such as gpt-4o-mini, acting as a "referee."

Processing Flow:

Step One: Setting Different Thinking Levels

Models like gpt-oss-20b support controlling their thinking depth via a parameter called reasoning_effort.

The first step in this process is to define the levels to be tested. We set three thinking levels for gpt-oss-20b: 'low', 'medium', and 'high'. These constitute the experiment's independent variable.
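To make this concrete, here is a minimal sketch of how such a configuration might be written down. The three interface names come from the paper's categorization of scaling methods; the dictionary layout, variable names, and the example token budgets are purely illustrative assumptions:

```python
# Illustrative configuration: each model exposes one of the three test-time-scaling
# interfaces described in the paper (reasoning effort, thinking budget, budget forcing).
SCALING_CONFIG = {
    "gpt-oss-20b": {"method": "reasoning_effort", "levels": ["low", "medium", "high"]},
    # A model with a "thinking budget" interface would list hypothetical token budgets instead, e.g.:
    # "some-thinking-model": {"method": "thinking_budget", "levels": [1024, 4096, 16384]},
}

MODEL = "gpt-oss-20b"
LEVELS = SCALING_CONFIG[MODEL]["levels"]  # the experiment's independent variable
```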

Step Two: Batch Generation of Model Answers

Initiate a loop, iterating through each question in the SimpleQA dataset.

Inside this loop, run another loop for the three thinking levels ('low', 'medium', 'high').

For each combination of question and thinking level, perform the following operations:

1. Embed the current question into the predefined query template to form a complete input text.

2. Call the API of the gpt-oss-20b model, providing the text as input and, crucially, setting the reasoning_effort parameter to the current loop's level (e.g., 'low').

3. The model returns text containing the chain-of-thought and the final answer; we extract the final answer that follows 'Answer:'.

4. Store this question, standard answer, thinking level, and the model's generated final answer as a record.

Flow Output: After this step, we will obtain a large set of results. For example, for 800 questions, with 3 different thinking levels for each, we will collect a total of 800 * 3 = 2400 records.
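Below is a minimal Python sketch of Step Two, assuming an OpenAI-compatible chat endpoint that serves gpt-oss-20b and accepts a reasoning_effort argument. The client setup, the tiny stand-in dataset, and the answer-extraction regex are illustrative assumptions; the prompt template is the one quoted above.

```python
import re
from openai import OpenAI  # assumes an OpenAI-compatible endpoint serving gpt-oss-20b

client = OpenAI()

PROMPT_TEMPLATE = (
    "Give me the answer to the following question only when you are sure of it. "
    "Otherwise, say 'I don't know'. Put your answer on its own line after 'Answer:'.\n\n"
    "Question: {question}"
)

# Tiny illustrative stand-in for the SimpleQA subset (question + gold answer).
simpleqa_questions = [
    {"question": "What year did John Mylopoulos receive his AAAI Fellow award?", "gold": "1993"},
]

def generate_answer(question: str, effort: str, model: str = "gpt-oss-20b") -> str:
    """Query the model at a given reasoning effort and extract the final answer line."""
    response = client.chat.completions.create(
        model=model,
        reasoning_effort=effort,  # 'low' | 'medium' | 'high'
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(question=question)}],
    )
    text = response.choices[0].message.content
    match = re.search(r"Answer:\s*(.+)", text)  # keep only what follows 'Answer:'
    return match.group(1).strip() if match else text.strip()

# One record per (question, thinking level) combination.
records = [
    {
        "question": item["question"],
        "gold": item["gold"],
        "effort": effort,
        "prediction": generate_answer(item["question"], effort),
    }
    for item in simpleqa_questions
    for effort in ["low", "medium", "high"]
]
```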

Step Three: Automated Evaluation

Now, we iterate through the 2400 records generated in the previous step.

For each record, we call the API of the "referee" model (gpt-4o-mini).

The referee model's input is structured and includes:

1. Original question (e.g., "What year did John Mylopoulos receive his AAAI Fellow award?")

2. Standard answer (e.g., "1993")

3. gpt-oss-20b's generated answer at that thinking level (e.g., "2005")

The referee model's task is to classify gpt-oss-20b's answer into one of three categories based on predefined instructions: "correct", "incorrect" (i.e., hallucination), or "not attempted" (the model responded with "I don't know" or similar expressions of uncertainty).

Flow Output: Add an "evaluation tag" (correct, incorrect, not attempted) to each record.
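A sketch of the automated evaluation in Step Three follows. Using gpt-4o-mini as the referee matches the paper's setup, but the exact grading prompt and the conservative fallback are assumptions made for this illustration; the code continues from the records and client defined in the Step Two sketch.

```python
JUDGE_PROMPT = (
    "You are grading a model's answer to a factual question.\n"
    "Question: {question}\n"
    "Gold answer: {gold}\n"
    "Model answer: {prediction}\n"
    "Reply with exactly one label: correct, incorrect, or not attempted.\n"
    "Use 'not attempted' if the model said it does not know or declined to answer."
)

VALID_LABELS = {"correct", "incorrect", "not attempted"}

def judge(record: dict, judge_model: str = "gpt-4o-mini") -> str:
    """Classify one (question, prediction) record into one of the three categories."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(**record)}],
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in VALID_LABELS else "incorrect"  # conservative fallback

for record in records:
    record["label"] = judge(record)  # attach the evaluation tag to each record
```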

Step Four: Calculation and Analysis of Metrics

Grouping Process—Group all records by "thinking level" ('low', 'medium', 'high'). For each thinking-level group, calculate the following core metrics:
Accuracy: the number of records in that group with an "evaluation tag" of 'correct', divided by the total number of records in the group (i.e., 800).
Hallucination Ratio: the number of records in that group with an "evaluation tag" of 'incorrect', divided by the total number of records in the group (800).

Flow Output—Obtain the accuracy and hallucination ratio values corresponding to each thinking level. For example: low: Accuracy=25%, Hallucination Ratio=40%; medium: Accuracy=24%, Hallucination Ratio=50%; high: Accuracy=23%, Hallucination Ratio=55%. These data points form the basis for plotting the curves in Figures 1 and 2 of the paper.
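Step Four is then a straightforward group-by. The sketch below continues from the records produced above; the level names and the 'not attempted' label follow the earlier steps.

```python
from collections import Counter

def metrics_by_level(records: list[dict]) -> dict[str, dict[str, float]]:
    """Compute accuracy, hallucination ratio, and abstention ratio per thinking level."""
    results = {}
    for effort in ["low", "medium", "high"]:
        group = [r for r in records if r["effort"] == effort]
        counts = Counter(r["label"] for r in group)
        total = len(group)
        results[effort] = {
            "accuracy": counts["correct"] / total,
            "hallucination_ratio": counts["incorrect"] / total,
            "not_attempted_ratio": counts["not attempted"] / total,
        }
    return results

print(metrics_by_level(records))  # one metric triple per thinking level
```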

Step Five: Deep Behavioral Analysis (Optional, but Core to the Paper)

To investigate "why the hallucination rate increases," a more detailed comparison is needed.

Filter for all questions whose evaluation tag changed between the 'low' and 'high' levels.

Among these questions, focus on one type: those that were 'not attempted' at the 'low' level but became 'incorrect' at the 'high' level.

Calculate the proportion of such questions among all newly added hallucinations. If this proportion is very high (as found in the paper, e.g., 95%), it is strong evidence that the increase in hallucinations primarily stems from the model starting to attempt questions it was unsure about.

Flow Output: Obtain a behavioral transition analysis chart similar to Figure 3 in the paper, revealing the internal mechanism of hallucination changes.
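The behavioral analysis of Step Five can be expressed as a simple transition count. The sketch below computes, among questions that become hallucinations at the 'high' level, the fraction that were abstentions at the 'low' level; the data layout matches the records used above.

```python
def share_of_new_hallucinations_from_abstention(records: list[dict]) -> float:
    """Among questions that are 'incorrect' at high effort but were not hallucinations
    at low effort, return the fraction that were 'not attempted' at low effort."""
    labels_by_question: dict[str, dict[str, str]] = {}
    for r in records:
        labels_by_question.setdefault(r["question"], {})[r["effort"]] = r["label"]

    new_hallucinations = [
        labels for labels in labels_by_question.values()
        if labels.get("high") == "incorrect" and labels.get("low") != "incorrect"
    ]
    if not new_hallucinations:
        return 0.0
    from_abstention = sum(labels.get("low") == "not attempted" for labels in new_hallucinations)
    return from_abstention / len(new_hallucinations)
```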

This complete process, from data input, model interaction, automated evaluation, to multi-dimensional analysis, constitutes a rigorous research methodology, making the paper's conclusions not just anecdotal evidence but systematic findings based on large-scale data.

Phase Four: Experimental Design and Validation Analysis

Interpretation of Main Experimental Design: Validation of Core Argument

Core Claim and Experimental Design:

Core Claim: Increasing test-time computation (i.e., making the model "think longer") does not effectively improve the model's performance on knowledge-intensive tasks, and may even be detrimental.

Experimental Design: The authors adopted a direct and clever "self-comparison" method. Instead of comparing Model A with Model B, they compared the performance of the same model under different thinking intensities. Specifically, they selected 12 mainstream large models that support reasoning chains, systematically increased the models' reasoning computation on two datasets, SimpleQA (factual questioning) and FRAMES (multi-step reasoning questioning), and observed the trends in accuracy and hallucination ratio. This design eliminated interference from differences in model capabilities, allowing for a pure examination of the effect of "increased thinking" itself.

Analysis of the Rationality of Choices:

Datasets: Choosing SimpleQA and FRAMES was a sound choice. SimpleQA is a classic factual-verification dataset whose direct questions primarily test the model's knowledge recall. FRAMES goes further, requiring the model to perform multi-step reasoning (e.g., "What album did Pink Floyd release the year Picasso died?"), which tests the model's ability to integrate and apply knowledge. Covering both types of dataset makes the experimental conclusions more general, showing that they hold for simple knowledge extraction as well as complex knowledge reasoning.

Evaluation Metrics: Accuracy and Hallucination Ratio go straight to the heart of the matter. In knowledge-intensive tasks, we care not only about "how many were answered correctly" (Accuracy) but also "how many were answered incorrectly" (Hallucination Ratio), because a wrong answer can be more harmful than no answer. Evaluating both metrics simultaneously provides a comprehensive characterization of the model's "reliability," avoiding the potential misdirection of looking only at accuracy.

Baseline Method: The baseline for this study is the model's own performance at the lowest thinking setting. This is a clean controlled-variable design: all performances under higher thinking settings are compared to this baseline, so any change in performance can be clearly attributed to "increased thinking effort." This is far more informative than choosing another model as a baseline.

Experimental Results Supporting Core Contributions—The experimental results are mainly presented in Figures 1 and 2. Figure 1 (Accuracy) shows that, as thinking length (the x-axis) increases, the accuracy curves of most models stay flat, fluctuate, or even decline, with only a very few models (like Gemini 2.5 Flash) showing a significant initial improvement before quickly plateauing. This strongly supports the claim that "increasing thinking does not necessarily improve accuracy." Figure 2 (Hallucination Ratio) is even more striking: the hallucination-rate curves of many models are flat or even rising, which directly supports the core finding that "thinking longer can even be more harmful." The main experiment clearly indicates that test-time scaling is not a "panacea" for knowledge-intensive tasks, and its benefits fall far short of what it delivers on other kinds of tasks.

Ablation Experiment Analysis: Contributions of Internal Components

Analysis Background—Traditionally, ablation experiments involve removing a specific module of a model, but this paper analyzes the behavior of existing models. Thus, its "ablation experiment" is reflected in its deep analysis, aiming to "eliminate" different explanations for the phenomenon and pinpoint the true cause.

Key Analysis Method—The paper's core insight is that "changes in hallucination rate are driven by the model's willingness to answer, not by an improvement in knowledge recall ability." To verify this, they designed the analysis experiment shown in Figure 3.

Hypothesis Being Ablated—A possible, more optimistic hypothesis is: "When hallucinations decrease, it is because the model, after deeper thinking, successfully recalled the correct knowledge, thereby correcting wrong answers."

Experimental Design and Results—The authors specifically examined cases where models hallucinated in "low thinking" but did not hallucinate in "high thinking." They analyzed the state of these cases in "high thinking" and found that the vast majority (e.g., 93.1% for Grok-3 mini) became "not attempted" rather than "correctly answered."

Necessity of Proof—This result quantitatively "ablates" the optimistic hypothesis mentioned above. It proves that the reduction in hallucinations does not stem from "fixing" knowledge but from the model becoming more "cautious" and choosing to abstain. This greatly strengthens the authors' core argument: the change in model behavior is strategic (whether to answer) rather than capability-based (whether it can recall).

In-depth Innovative Experiment Analysis: Insights into the Intrinsic Characteristics of the Method

Case Study Experiment Analysis:

Experiment Goal: This experiment aimed to open the "black box" and visually demonstrate how an abstract psychological concept, confirmation bias, actually occurs in the model's reasoning chain. It sought to answer the question: "Why does a model, after thinking longer, go from 'uncertain' to 'overconfident'?"

Experimental Design: As shown in Figure 4, the authors selected a highly representative case of the gpt-oss-20b model. They presented side by side the complete "internal monologue" (thought process) generated by the model for the same question under low and high thinking settings.

Low thinking setting: The reasoning process is brief, and after trying a few possibilities, the model frankly admits "I'm uncertain," eventually abstaining from answering.

High thinking setting: The reasoning process is extremely long. The model starts with an uncertain guess ("...maybe in 2005"), then continuously searches for "evidence" to support this guess, even fabricating verification steps ("We can check his resume...", "I did see it on the AAAI website list..."). Each such "verification" increases its confidence, eventually transforming "maybe" into "I'm fairly sure it's 2005," and it provides this incorrect answer.

Significance of Experimental Conclusion—This case study convincingly reveals that longer reasoning chains provide the model with space for "self-justification" and falling into a "confirmation bias" loop. It is not performing objective knowledge retrieval but constructing a seemingly perfect, yet factually detached, narrative for an initial guess. This provides the underlying, mechanistic explanation for "why thinking longer leads to more errors."

Paper Title: Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet

Main Tag: Large Language Models

Sub Tags: LLM Hallucinations, AI Research, Confirmation Bias, Chain-of-Thought

