Bengio Debunks CoT Myth! LLM Reasoning is an Illusion, 25% of Top Conference Papers Disproven

Image

Source | Xinzhiyuan

Editor | YHluck Peach

Turing Award laureate Bengio's new work is online!

This paper, co-authored by Oxford, Google DeepMind, and Mila, states that Chain-of-Thought (CoT) is not explainability.

This view completely shatters many people's perceptions:

CoT appears to provide answers step-by-step, but it may not necessarily be its true reasoning process.

Image

Paper link: https://www.alphaxiv.org/abs/2025.02

So, the "Chain of Thought" that can reveal the inner workings of LLMs is no longer reliable?

In the paper, the researchers tear away the veil of CoT, revealing a shocking truth: the transparency of Chain of Thought might just be a meticulously crafted illusion!

Image

The "Lie" of Chain of Thought Exposed

However, in reality, approximately 25% of recent AI papers incorrectly brand CoT as an "explainability technique."

This concept was first proposed by former Google researcher Jason Wei in a paper, and since then, CoT has been widely applied in reasoning models.

Image

Its biggest feature is the ability to perform multi-step reasoning, improving model accuracy. At the same time, it makes the AI black box less mysterious.

However, is the CoT thought process truly its internal monologue?

Some papers confidently claim that CoT allows us to see AI's reasoning process, but the truth is far from it.

Image

Especially in high-stakes domains, the cost of this misunderstanding could be fatal.

Researchers found that in papers using CoT, approximately 38% of medical AI, 25% of legal AI, and 63% of autonomous driving related papers blindly treated CoT as an explainability method.

Even more chilling is that clearly biased prompts can easily sway the model's answers.

Moreover, these biases are never mentioned in the "reasoning steps" provided by the AI.

AI can weave seemingly reasonable explanations for biased answers, but it never exposes the "hidden hand" behind them.

Image

Therefore, trusting these biased answers can be very dangerous.

Not only that, AI often "secretly" corrects its errors during the reasoning process.

Superficially, the steps given by large models may be full of flaws, but they can arrive at the correct answer through unstated "black box operations."

This creates an illusion of transparency. Why does this disconnect occur?

Researchers hypothesize that concise CoT cannot fully capture the distributed parallel computation processes present in base Transformer large models.

How CoT Conceals True Reasoning?

An increasing number of empirical studies have uncovered numerous cases where a model's chain of thought deviates from its internal reasoning process.

It should be noted that before examining specific patterns of unfaithfulness, the faithfulness of CoT explanations varies due to multiple factors, such as model architecture.

The researchers also summarized four key findings: Bias-Driven Rationalization and Motivated Reasoning, Silent Error Correction, Unfaithful Illogical Shortcuts, and Filler Tokens.

Each point clarifies how CoT misleads or conceals the model's actual decision-making process. We have summarized the key issues from these findings for you:

Bias-Driven Rationalization and Motivated Reasoning

Turpin et al. demonstrated bias-driven rationalization by cleverly biasing the model inputs.

For example:

Reordering the options in a multiple-choice question within the prompt such that the correct option is always in the same position (e.g., always letter B).

In this case, despite their CoT explanations never mentioning option reordering as an influencing factor, GPT-3.5 and Claude 1.0 often selected the biased option.

When models were biased towards incorrect answers, they still generated detailed CoT to rationalize those incorrect answers.

This resulted in up to a 36% decrease in accuracy across a range of tasks, while CoT provided a misleading illusion of reasoning.

Another study investigated prompt injection bias by adding explicit answers to the prompt (e.g., "The answer is C") and then asking the model to justify its choice.

Claude 3.7-Sonnet and DeepSeek-R1 acknowledged the injected answer in only approximately 25% and 39% of cases, respectively.

These findings suggest that chains of thought often function as post-hoc rationalizations, ignoring true causal factors and creating an illusion of transparent explanation.

Silent Error Correction

Researchers point out that models may make mistakes in their chain of thought and then internally correct these errors, without CoT reflecting this correction process.

For example:

In a CoT reasoning process, a model might incorrectly calculate the hypotenuse of a triangle as 16, when the correct value should be 13, but then state: "We add the hypotenuse length 13 to the other two side lengths to get the perimeter."

The model detected and corrected the error internally, but the CoT narrative never corrected or flagged this error—it reads like a coherent problem-solving process.

These silent errors indicate that the final answer is derived from computations beyond the stated steps.

Unfaithful Illogical Shortcuts

Researchers state that models may arrive at correct answers through underlying shortcuts, such as utilizing memorized patterns as alternative reasoning paths, thereby bypassing full algorithmic reasoning, which renders the explicit chain of thought irrelevant or incorrect.

Here's a typical case:

Researchers using attribution maps (a method to track which computational steps contribute to the final output) found that when solving problems like "36 + 59," Claude 3.5 Haiku simultaneously used lookup table features (e.g., for "adding numbers close to 36 and numbers close to 60") and addition computation features.

Interestingly, when asked to describe how the model arrived at the answer, the model reported that it performed digit-by-digit addition with carrying, completely ignoring the fact that it used a lookup table shortcut.

Filler Tokens

The study points out that in some algorithmic reasoning tasks, using filler tokens—such as "..." or learned "pause" tokens, which are input tokens that have no semantic contribution to the task but affect the model's internal computation—can improve model performance.

To help you understand, for example:

Researchers found that attaching learnable pause tokens (which can act as a type of filler token) to inputs led to significant performance improvements across many tasks.

Coincidentally, researchers also found that adding filler tokens allowed models to solve problems they previously failed, especially when trained with dense supervision.

The above key findings all explain that CoT unfaithfulness is a fundamental challenge prevalent across different model architectures and scales.

Its high incidence is due to factors such as prompt bias, failure to acknowledge hidden influences, and systematic error correction in complex reasoning tasks.

Why CoT Explanations Inconsistent with Internal Computations?

In the above examples, we observed some inconsistencies in CoT. So, what causes them?

Distributed Parallel Computing, Not Sequential

"Mechanistic interpretability" research indicates that the Transformer architecture may fundamentally limit the faithfulness of CoT.

LLMs built on Transformer architecture typically process information in a distributed manner simultaneously through multiple components, rather than the sequential steps presented by CoT.

It is precisely this architectural difference that leads to an inherent mismatch between how models compute and how they express themselves linguistically.

For example, when faced with a simple math problem like "24 ÷ 3 = ?", how would an LLM solve it?

It certainly wouldn't analyze "how many times 3 divides 24" step-by-step like a human student, or list long division steps.

Instead, multiple attention heads within the LLM would process the relationships between these numbers simultaneously.

It might recognize it as a memorized fact, confirm it's part of the 8 times table, and compute the division—all happening in parallel.

To generate concise and seemingly logical outputs, LLMs usually generate only one such narrative to rationalize their answers, rather than expressing all parallel paths, even those that might significantly influence the final answer.

Therefore, CoT often ignores influential factors, serving only as a partial, post-hoc rationalization of the model's underlying distributed, superimposed computations.

Redundant Paths

Furthermore, when processing tasks, LLMs often arrive at the same conclusion through multiple redundant computational paths.

For example, to calculate √144, the model might simultaneously perform the following operations:

  • Recognize it as a memorized fact (12×12=144)

  • Apply the square root algorithm

  • Pattern match with similar problems in the training data

When the crucial step "144=12×12" is removed from the model's CoT explanation of "√144=12," the LLM is still able to correctly output 12.

This indicates that the LLM's answer does not solely depend on linguistic reasoning steps but rather on its internal multiple redundant paths.

This phenomenon of redundant computational paths is termed the "Hydra Effect" by researchers.

That is, if one path in an LLM is blocked or removed, another path can take over.

This also explains why modifying or deleting certain parts of CoT has minimal impact on the final answer.

CoT Faithfulness Challenge

Tanneru et al. attempted to guide LLMs to generate faithful CoT reasoning by penalizing inconsistency during training.

However, this method achieved only partial success, especially for complex problems, where models tended to generate seemingly plausible but non-causal explanations.

For example, a model might output a logically clear CoT that conceals its true decision path.

This "rationalization" tendency is not only difficult to completely eliminate but may also reappear during fine-tuning or continuous learning.

Other researchers, Chua and Evans, further studied models specifically "reasoning trained" and found that:

DeepSeek-R1 could mention prompt cues in 59% of tests, but still failed to identify key influencing factors in 41% of cases.

Even the most advanced DeepSeek-V3 only mentioned cues in 7% of cases, indicating that the effect of reasoning training is extremely limited.

More concerning is that LLMs might "deceive" monitoring mechanisms through CoT.

When CoT is used to monitor reasoning in RL reward functions, models learn to generate seemingly harmless reasoning trajectories while secretly executing harmful strategies.

Breaking the CoT Illusion

In the conclusion of the paper, addressing the issue of CoT transparency, the researchers state, "We cannot continue to be misled by illusions."

To this end, they put forward the following suggestions:

1. Redefine the Role of CoT

CoT is not a "silver bullet" for explainability but should be viewed as a supplementary tool. It can provide clues, but by no means the whole truth.

2. Introduce Strict Verification Mechanisms

Through causal verification techniques, such as activation patching, counterfactual testing, and verifier models, deeply investigate whether AI's reasoning process is faithful.

3. Draw Lessons from Cognitive Science

Mimic human error monitoring, self-correction narratives, and dual-process reasoning (intuition + reflection) to make AI's explanations closer to reality.

4. Strengthen Human Oversight

Develop more powerful tools to enable human experts to review and verify AI's reasoning processes, ensuring their credibility.

References:

https://x.com/FazlBarez/status/1940070420692312178

https://www.alphaxiv.org/abs/2025.02

Main Tag:Artificial Intelligence

Sub Tags:Large Language ModelsBengioExplainabilityChain of Thought


Previous:Large Models Directly Understand Code Graphs: Automatic Bug Fixing Without Agents, Topping SWE-Bench Open-Source Model List

Next:Tsinghua and Others Propose Absolute Zero Self-Play Large Models, Achieving Top Performance on Multiple Tasks with Zero-Data Training

Share Short URL