When Thinking Becomes a Burden: Unveiling the "Thinking Traps" of Large Language Models

Would you believe that making an AI think more can actually make it dumber? New research shows that this counterintuitive phenomenon is real.

Imagine you ask an assistant to complete a task, detailing all requirements and constraints. But when you encourage the assistant to "think carefully before acting," they become more likely to ignore some of your instructions. This might sound incredible, but in the world of Large Language Models (LLMs), this phenomenon is real.

Today, we present a new research finding that defies conventional wisdom: having AI models perform Chain-of-Thought (CoT) reasoning may significantly reduce their ability to follow instructions. The study tested 15 models, including Claude 3.7, the GPT series, DeepSeek-R1, and more, revealing a key flaw in how these models "think."

1. Does Thinking Make AI Dumber? This Study Challenges a Common Assumption

Currently, Chain-of-Thought (CoT) prompting is widely regarded as a go-to technique for improving AI models' ability to solve complex problems. Many recent models, such as DeepSeek-R1, the Claude series, and OpenAI's o-series, promote CoT-style reasoning as a core feature.

However, after evaluating models on two benchmark datasets, IFEval and ComplexBench, the researchers found that when models were prompted to reason with CoT, their instruction-following accuracy generally decreased. For example, Llama3-8B-Instruct plummeted from 75.2% to 59.0%, a drop of more than 16 percentage points.

This phenomenon appears in almost all of the tested models, open-source or closed-source, small or large. Even more surprising, models specifically trained for reasoning (such as Claude 3.7-Sonnet-Think and DeepSeek-R1) follow instructions worse than their base versions.

2. How Does Thinking Become an Obstacle? Unveiling AI's Attention Shift

Why does this counterintuitive phenomenon occur? The researchers conducted an in-depth analysis using two methods:

(1) Large-scale case studies

The researchers manually analyzed over 1,500 samples and found that the impact of thinking on instruction following falls into four typical situations:

Situations where thinking is helpful:

1) Format and structure following: Thinking helps the model generate valid JSON, correctly use quotes, or follow markdown syntax and other structural requirements.

2) Vocabulary and keyword precision: Thinking improves the model's adherence to specific wording requirements, such as inserting rare characters or omitting ending punctuation.

Situations where thinking is harmful:

1) Over-focusing on high-level content while ignoring simple constraints: When faced with multiple requirements, thinking often leads the model to concentrate on content planning while neglecting basic constraints such as word-count limits or letter-case requirements.

2) Introducing unnecessary content that violates constraints: Thinking often causes the model to add well-intentioned but redundant content (such as explanations, translations, or emphasis), unintentionally breaking the instruction's requirements.

(2) Constraint Attention Analysis

The researchers proposed a "constraint attention" metric to quantify how much attention the model pays to constraint-related tokens in the instruction. They found that using CoT significantly reduces the model's attention to these constraint tokens.
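To make the metric concrete, here is a minimal sketch of how a constraint-attention-style score could be computed with the HuggingFace transformers library. The model name, the example prompt, and the keyword matching used to locate constraint tokens are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch of a constraint-attention-style measurement, assuming a
# HuggingFace causal LM. Model name, prompt, and keyword-based constraint
# matching are illustrative assumptions, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumption: any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Summarize the text in exactly two sentences, all lowercase: ..."
constraint_words = ["exactly", "two", "sentences", "lowercase"]

enc = tokenizer(prompt, return_tensors="pt")
input_ids = enc["input_ids"][0].tolist()

# Locate the positions of constraint-related tokens in the prompt.
constraint_token_ids = {
    tid
    for word in constraint_words
    for tid in tokenizer(" " + word, add_special_tokens=False)["input_ids"]
}
constraint_positions = [i for i, tid in enumerate(input_ids) if tid in constraint_token_ids]

with torch.no_grad():
    out = model(**enc, output_attentions=True)

# Average, over layers and heads, the attention mass that the last token
# places on constraint positions; lower values mean the constraints are
# getting less focus.
attn = torch.stack(out.attentions)       # (layers, batch, heads, seq, seq)
last_token_attn = attn[:, 0, :, -1, :]   # (layers, heads, seq)
score = last_token_attn[..., constraint_positions].sum(dim=-1).mean().item()
print(f"constraint attention score: {score:.4f}")
```

In the paper's analysis, the interesting quantity is how this kind of score changes once a CoT trigger is added to the prompt, so in practice the measurement would be run on both prompt variants and compared.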


3. How Can AI "Think" Better? Four Mitigation Strategies Compared

To address this issue, researchers proposed and evaluated four mitigation strategies:

(1) Few-shot in-context learning: guides the model by placing carefully selected examples before the instruction. Its effectiveness is limited, however, by token-length constraints and example bias.

(2) Self-reflection: the model first generates a preliminary answer along with its reasoning, then performs a second pass to reflect on and improve that answer (a minimal sketch of this two-pass setup follows the list). It works well on simple instructions (IFEval) but worse on complex ones, and the two forward passes make it more expensive to run.

(3) Self-selected thinking: the model decides for itself whether explicit reasoning is needed. This performs well on ComplexBench, but analysis shows that models often overuse thinking even when it is unnecessary.

(4) Classifier-selected thinking: an external binary classifier decides whether CoT should be applied. This achieved the best overall performance on both benchmarks, but it requires training a dedicated classifier for each target model.
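As a rough illustration of the self-reflection strategy in (2), the sketch below wraps two calls to a generic generate function: the first pass drafts an answer with reasoning, and the second pass re-checks the draft against the instruction's constraints. The prompt wording and the generate callable are assumptions for illustration, not the paper's exact prompts.

```python
# A hedged sketch of the self-reflection strategy: draft first, then audit
# the draft against every constraint. `generate` stands in for any LLM
# completion call (prompt string in, completion string out).
from typing import Callable

def self_reflect(instruction: str, generate: Callable[[str], str]) -> str:
    # Pass 1: draft an answer, reasoning allowed.
    draft = generate(
        f"{instruction}\n\nThink step by step, then give your answer."
    )
    # Pass 2: re-read the instruction, check the draft against each
    # constraint (format, length, wording), and output a corrected answer.
    revised = generate(
        "Original instruction:\n"
        f"{instruction}\n\n"
        "Draft answer:\n"
        f"{draft}\n\n"
        "Check the draft against every constraint in the instruction "
        "(format, length, wording). Output only the corrected final answer."
    )
    return revised

# Usage: self_reflect(my_instruction, my_llm_call). Note the doubled compute
# cost, which is the drawback the study points out.
```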

The results show that classifier-selected thinking significantly improves instruction following in most cases, nearly restoring the performance achieved without CoT.
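And here is what classifier-selected thinking could look like in a minimal form: a small binary classifier over instruction embeddings decides whether to append a CoT trigger before querying the target model. The embedding model, the toy labels, and the trigger phrases are illustrative assumptions; in the study, such labels would come from benchmark runs against the specific target model.

```python
# A minimal sketch of classifier-selected thinking: a binary classifier
# predicts whether CoT is likely to help for a given instruction, and a CoT
# trigger is added only when it does. Embedding model, labels, and trigger
# phrases are illustrative assumptions, not the paper's exact setup.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any sentence encoder works

# Toy training data: instructions labeled 1 if CoT helped the target model
# on that instruction, 0 if it hurt (real labels would come from benchmark runs).
train_instructions = [
    "Answer in valid JSON with keys 'name' and 'age'.",
    "Write a limerick about databases, at most 30 words, all lowercase.",
    "Plan a three-day itinerary, then format it as a markdown table.",
    "Reply with exactly one word.",
]
cot_helped = [1, 0, 1, 0]

clf = LogisticRegression().fit(embedder.encode(train_instructions), cot_helped)

def build_prompt(instruction: str) -> str:
    """Prepend a CoT trigger only when the classifier predicts it will help."""
    use_cot = clf.predict(embedder.encode([instruction]))[0] == 1
    if use_cot:
        return instruction + "\n\nLet's think step by step before answering."
    return instruction + "\n\nAnswer directly without showing your reasoning."

print(build_prompt("List three colors, separated by commas, no other text."))
```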


4. The Future of AI "Thinking": Selective Thinking May Be Key

This study systematically reveals, for the first time, a surprising phenomenon: having AI perform explicit Chain-of-Thought reasoning may impair its ability to follow instructions. This finding is significant for the field of AI, especially in building more reliable instruction-following models.

The researchers suggest a simple decision process: choose the strategy based on the complexity of the instruction. For simple tasks, self-reflection or classifier-selected thinking works better; for complex tasks, self-selected thinking or classifier-selected thinking is more effective.

It is worth noting that this study focuses only on instruction-following tasks; the impact of thinking on other capabilities remains to be explored. Even so, it reveals a critical blind spot in how these models "think" and provides practical mitigation strategies.

This study reminds us: In the field of AI, more thinking does not always mean better results. For Large Language Models, knowing when and how to think may be more important than simply increasing the amount of thinking.

In the future, we may see more AI systems with selective thinking, able to decide intelligently when to think deeply and when to answer directly, achieving better performance across a wide range of tasks.

Paper Title: When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs

Paper Link: https://arxiv.org/abs/2505.11423
