The Sky Has Fallen! Apple Just Proved: DeepSeek, o3, Claude and Other "Reasoning" Models Lack True Reasoning Ability

Latest Major Research: Apple Doesn't Consider Reasoning Models a Significant Breakthrough Over Standard LLMs

In its latest research, "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity," Apple questions the widespread assumption that large language models already possess true logical thinking, that is, genuine "reasoning ability." What Apple's researchers observed was not a cognitive breakthrough but an illusion: these models merely create the impression of thinking, without actually having a stable, understandable thought process.

The core criticism of the research is that the most advanced reasoning models available today suffer a 'cliff-like' collapse when faced with truly complex problems, exhibiting a counter-intuitive 'thinking degradation' phenomenon, to the point that they cannot even 'copy the homework' when handed a step-by-step algorithm.

The 'Traps' of Existing Evaluation Methods: Why Is a New Experimental Ground Needed?

Currently, evaluating AI's reasoning capabilities primarily relies on benchmark tests such as mathematics (e.g., MATH, AIME) and programming. However, researchers point out two major drawbacks to this approach:

Data Contamination: The problems and solutions for these benchmark tests are highly likely to already exist extensively in the models' training data. Models might simply be 'memorizing' answers rather than genuinely 'reasoning' them out. Data from the paper shows that models performed worse on the updated AIME25 dataset than on the older AIME24, which is precisely the opposite of human performance, strongly suggesting interference from data contamination on evaluation results.

Lack of Insight into the 'Thinking Process': The correctness of the final answer doesn't tell us how the model thought, or whether its chain of thought was efficient, rigorous, or riddled with redundancy and errors.

To escape these 'traps,' Apple's research team turned to a more 'pristine' experimental ground: a controlled puzzle-solving environment.

Ingenious Experimental Design: Quantifying AI's Thinking Ability Through 'Puzzles'

The research team selected four classic puzzles with stable logical structures but controllable complexity:

Tower of Hanoi: Tests recursion and planning abilities.

Checker Jumping: Tests sequential planning.

River Crossing Problem: Tests constraint satisfaction and planning abilities.

Blocks World: Tests planning and state management.

By changing the parameters of the puzzles (e.g., the number of disks in Tower of Hanoi, the number of people in River Crossing), researchers could precisely control the combinatorial complexity of the problems. At the same time, they used simulators to verify whether each step generated by the model was valid, allowing them to analyze its complete 'thought trajectory' in depth.
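
To make "verifying each step with a simulator" concrete, here is a minimal sketch of a move checker for the Tower of Hanoi. The move representation ((from_peg, to_peg) pairs) and the function name are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch of a Tower of Hanoi move checker; moves are assumed to be
# (from_peg, to_peg) index pairs. Illustrative only, not the paper's code.

def validate_hanoi(n_disks: int, moves: list[tuple[int, int]]) -> bool:
    """Return True if `moves` legally transfers every disk from peg 0 to peg 2."""
    # Peg 0 starts with disks n_disks..1, largest at the bottom, smallest on top.
    pegs = [list(range(n_disks, 0, -1)), [], []]
    for src, dst in moves:
        if not pegs[src]:
            return False  # illegal: moving from an empty peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return False  # illegal: placing a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1))  # solved: everything on peg 2


# Complexity is dialed up simply by increasing n_disks; the optimal solution
# for n disks takes 2**n - 1 moves.
print(validate_hanoi(3, [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]))  # True
```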

As illustrated: researchers not only evaluated the final answer but also extracted and analyzed the intermediate steps within <think> tags to gain insight into the model's thought process.
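
As a rough illustration of that extraction step, the snippet below pulls the text out of <think> tags and gives a crude length measure. The regex convention and the whitespace-based count are simplifying assumptions; the paper measures actual model tokens.

```python
# Illustrative sketch: extract the content of <think>...</think> from a model
# response and estimate how much "thinking" it contains.
import re

def extract_thinking(response: str) -> str:
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    return match.group(1).strip() if match else ""

def thinking_length(response: str) -> int:
    # Whitespace split as a stand-in for a real tokenizer count.
    return len(extract_thinking(response).split())

reply = "<think>Move disk 1 to peg 2, then disk 2 to peg 1...</think>Final answer: ..."
print(thinking_length(reply))  # a rough measure of reasoning effort
```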

Core Findings

Through extensive testing of a series of cutting-edge models (including Claude 3.7 Sonnet, DeepSeek-R1, and their corresponding non-thinking versions), the research yielded several groundbreaking findings:

Finding One: Three Performance Zones, Where 'Thinking' Isn't Always Superior

When comparing 'thinking' models (Large Reasoning Models, or LRMs) with their standard LLM counterparts, researchers identified three distinct performance zones:

Low Complexity Zone: For simple problems, standard models were actually more accurate and efficient. The 'thinking' of LRMs appeared redundant and cumbersome.

Medium Complexity Zone: As problems became more difficult, the advantage of LRMs generating longer chains of thought began to emerge, with their performance surpassing standard models.

High Complexity Zone: When problem complexity exceeded a certain threshold, both types of models 'collapsed' entirely, with accuracy dropping to zero. This indicates that the 'thinking' mechanism merely delayed the onset of failure, without solving the fundamental capability bottleneck.

As shown in Figure 5: the comparison between thinking and non-thinking models at low, medium, and high complexity clearly illustrates these three distinct zones.

Finding Two: The 'Collapse Point' of 'Thinking' and Counter-intuitive 'Effort Degradation'

This is one of the most astonishing findings of the entire paper. The research shows:

Accuracy Plunge: All tested LRMs saw their accuracy sharply decline to zero once problem complexity reached a certain threshold.

Counter-intuitive 'Thinking Degradation': Even stranger, as they approached this 'collapse point,' the computational effort models used for thinking (i.e., the number of thinking tokens generated) did not increase with problem difficulty; instead, it began to significantly decrease.

As illustrated: As complexity increased (horizontal axis), model accuracy (top row) eventually dropped to zero. Concurrently, their thinking length (bottom row, measured in tokens), after reaching a peak, decreased rather than increased, as if they actively 'gave up thinking' when faced with difficult problems.

This suggests that models have an intrinsic 'scaling limit': when they 'anticipate' that a problem is too difficult to solve, they opt to 'give up' and reduce their thinking effort, even though ample token budget remains available.

Finding Three: Thought Trajectories Reveal Inefficiency and Struggle

By analyzing the models' generated 'thinking processes,' researchers uncovered deep-seated issues in their thought patterns:

Overthinking Simple Problems: When solving simple puzzles, models often found the correct answer in very early steps but then continued to generate large amounts of redundant or even erroneous explorations, wasting significant computational resources.

Early Fixation on Complex Problems: When faced with difficult problems, if a model made an early error, it often stubbornly continued exploring down the wrong path and struggled to self-correct, ultimately leading to failure. (A sketch of this kind of trace analysis follows below.)
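
A hedged sketch of how such trace analysis might work: scan the thinking text for candidate move sequences, check each one with a puzzle simulator, and record where in the trace the first correct solution appears. The bracketed-list convention and the check callback are assumptions made here for illustration, not the authors' tooling.

```python
# Sketch of locating candidate solutions inside a thinking trace. A correct
# candidate found early, followed by lots of further text, suggests
# "overthinking"; no correct candidate at all suggests fixation on a wrong path.
import re
from typing import Callable

def solution_positions(trace: str, check: Callable[[str], bool]) -> list[tuple[float, bool]]:
    """Return (relative position in trace, is_correct) for each candidate
    move list found in the trace; 0.0 is the start of the trace, 1.0 the end."""
    results = []
    # Hypothetical convention: candidates appear as bracketed move lists,
    # e.g. "[(0, 2), (0, 1), ...]".
    for match in re.finditer(r"\[[^\]]*\]", trace):
        position = match.start() / max(len(trace), 1)
        results.append((position, check(match.group(0))))
    return results
```

Plotting these positions against problem complexity is roughly how the paper visualizes where correct and incorrect solutions turn up inside the traces.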

Finding Four: Perplexing Inability to 'Execute'

The researchers also conducted a crucial experiment, the results of which further exposed the fragility of the models' reasoning capabilities.

Can't Even 'Copy the Homework': In the Tower of Hanoi task, researchers provided a complete, error-free solution algorithm directly in the prompt and asked the model simply to execute it (the kind of algorithm involved is sketched below). The results showed no improvement: the model still collapsed at the same complexity point. This indicates that the bottleneck lies not only in 'planning' and 'finding' solutions, but also in a fundamental inability to execute and verify basic symbolic, logical steps.

Extremely Imbalanced Capabilities: Claude 3.7 Sonnet could correctly solve Tower of Hanoi instances requiring hundreds of moves (an n-disk instance needs 2^n - 1 moves), yet failed early on a River Crossing problem that needs only 11 moves. This strongly suggests that the model's 'reasoning ability' may depend heavily on patterns common in its training data (Tower of Hanoi is a textbook classic), rather than on a general, transferable logical reasoning capability.
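
For reference, the explicit recipe involved in that 'just execute it' experiment is of the kind shown below: the textbook recursive Tower of Hanoi procedure. This is a sketch of the standard algorithm, not necessarily the exact pseudocode used in the paper's prompt.

```python
# The textbook recursive Tower of Hanoi procedure. Given this, producing the
# move list is purely mechanical, yet the paper reports that models still
# collapsed at the same complexity point even with such an algorithm provided.

def hanoi(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> list[tuple[int, int]]:
    """Return the complete move list for transferring n disks from src to dst."""
    if n == 0:
        return []
    return (hanoi(n - 1, src, dst, aux)     # park the top n-1 disks on aux
            + [(src, dst)]                  # move the largest disk to dst
            + hanoi(n - 1, aux, src, dst))  # stack the n-1 disks back on top

print(len(hanoi(8)))  # 255 moves, i.e. 2**8 - 1
```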

In Conclusion

Apple's research clearly throws cold water on previous assumptions.

The current 'thinking' mechanism of LRMs resembles a complex heuristic search or pattern matching, rather than human-like, generalizable logical reasoning. These models encounter a dual collapse in performance and 'thinking effort' when dealing with problems of high combinatorial complexity, which likely stems from fundamental limitations in their architecture.

Finally, Apple strongly recommends:

The current evaluation paradigm for large models urgently needs reform: we must move beyond evaluation methods that rely on potentially contaminated benchmark tests and final answer accuracy, shifting towards more controllable and in-depth process analysis to truly understand the boundaries of AI capabilities.

What are your thoughts on this research?

Reference:

https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf
