Machine Heart Report
Editor: Panda
A few days ago, Apple's paper "The Illusion of Thinking" drew significant attention and stirred ongoing controversy. It investigated whether today's "reasoning models" can truly "reason," and concluded that they cannot.
The paper stated: "Our findings show that state-of-the-art LRMs (e.g., o3-mini, DeepSeek-R1, Claude-3.7-Sonnet-Thinking) still fail to develop generalizable problem-solving abilities—their accuracy ultimately collapses to zero in different environments when a certain level of complexity is reached."
However, the paper's research methodology also drew considerable skepticism. For instance, one of our readers argued that it is not very reasonable to add irrelevant content to math-problem prompts, observe that large models then answer incorrectly more often, and on that basis question whether large models can reason.
Renowned LLM critic Gary Marcus also published an article on the study, once again criticizing LLMs; in it, he addressed seven common replies to the Apple paper:
https://garymarcus.substack.com/p/seven-replies-to-the-viral-apple
Humans have difficulty with complex problems and memory demands.
Large Reasoning Models (LRMs) cannot solve this problem because too many output tokens are required.
This paper was written by an intern.
Larger models may perform better.
These systems can solve these difficult problems with code.
The paper has only four examples, at least one of which (Tower of Hanoi) is imperfect.
This paper is not new; we already know these models have poor generalization abilities.
For more details, please refer to the report: "Questioning DeepSeek-R1, Claude Thinking Cannot Reason at All! Has Apple's Controversial Paper Flipped?"
Now an even stronger challenge to this study has arrived: "The Illusion of the Illusion of Thinking". Yes, you read that correctly: that is the title of this rebuttal paper from Anthropic and Open Philanthropy! It points out three key flaws in Apple's paper:
The Tower of Hanoi experiments systematically exceeded the models' output token limits at the reported failure points, and the models explicitly acknowledged these limits in their outputs;
The automated evaluation framework used by the Apple authors failed to distinguish between reasoning failures and practical constraints, leading to misclassification of model capabilities;
Most worryingly, because the boat capacity is too small, their "River Crossing" benchmark includes mathematically impossible instances for N ≥ 6, yet models were scored as failures for not solving these inherently unsolvable problems.
The paper is short, only 4 pages including references. More interestingly, the Anthropic author is listed as C. Opus, i.e., Claude Opus. The other author, Alex Lawsen, is a "Senior Project Specialist in AI Governance and Policy" at Open Philanthropy and previously taught Mathematics and Physics at a UK Sixth Form College. (A Sixth Form College is a type of college in the UK education system for students aged 16 to 19, a crucial stage after secondary education and before higher education.)
https://x.com/lxrjl/status/1932499153596149875
So this is actually a paper co-authored by an AI and a human, with the AI as first author.
Paper Title: The Illusion of the Illusion of Thinking
Paper Address: https://arxiv.org/pdf/2506.09250v1
Next, let's look at the specific content of this rebuttal.
1 Introduction
Shojaee et al. (2025) claim to have found fundamental limitations of Large Reasoning Models (LRMs) through systematic evaluation of planning puzzles. Their central finding, that model accuracy "collapses" to zero beyond certain complexity thresholds, has significant implications for AI reasoning research.
However, our analysis shows that these apparent failures stem from experimental design choices, not inherent limitations of the models.
2 Models Can Identify Output Constraints
Apple's original study overlooked a critical observation: models can actively recognize when they are approaching their output limits. 𝕏 user @scaling01 recently ran a replication showing that in the Tower of Hanoi experiment, the model explicitly stated, "This pattern continues, but to avoid excessive length, I will stop here." This indicates that the model understood the solution pattern but chose to truncate its output because of practical limits.
https://x.com/scaling01/status/1931817022926839909
This mischaracterization of model behavior as "reasoning collapse" reflects a broader problem with automated evaluation systems: they fail to account for the model's own awareness and decision-making. When evaluation frameworks cannot distinguish between "unable to solve" and "choosing not to enumerate exhaustively," they may misjudge a model's fundamental capabilities.
2.1 Consequences of Rigid Evaluation
This evaluation limitation can lead to other analytical errors. Consider the following statistical argument: if we score a Tower of Hanoi solution token by token with no allowance for error correction, the probability of perfect execution is
P(success) = p^T
where p is the per-token accuracy and T is the total number of tokens. For T = 10,000 tokens:
p = 0.9999: P(success) < 37%
p = 0.999: P(success) < 0.005%
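As a quick sanity check of these two figures, the compound probability can be computed directly. A minimal sketch in Python, using the per-token accuracies and the 10,000-token length quoted above:

```python
# Probability that T consecutive tokens are all correct, assuming each token
# is independently correct with probability p: P(success) = p ** T.
T = 10_000

for p in (0.9999, 0.999):
    print(f"p = {p}: P(success) = {p ** T:.4%}")

# Approximate output:
#   p = 0.9999: P(success) = 36.7861%   (just under 37%)
#   p = 0.999:  P(success) = 0.0045%    (under 0.005%)
```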
In fact, prior work ("Faith and Fate: Limits of Transformers on Compositionality") has argued that this kind of "statistical inevitability" is a fundamental limitation of LLM scaling, but that argument assumes models cannot recognize and adapt to their own limitations, an assumption contradicted by the evidence above.
3 Unsolvable Problems
In the "River Crossing" experiment, the evaluation problem was significantly complicated. Shojaee et al. tested instances with N ≥ 6 participants/entities but used a boat capacity of only b = 3. However, it is well-known in the research community that the missionaries and cannibals puzzle (and its variants) are unsolvable when N > 5 and b = 3, as detailed in the paper "River Crossing Problems: Algebraic Approach", arXiv:1802.09369.
By automatically counting these impossible instances as failures, Apple's researchers inadvertently exposed the pitfalls of purely programmatic evaluation. Models received zero scores not because of reasoning failures, but because they correctly recognized an unsolvable problem; this is equivalent to penalizing a SAT solver for returning "unsatisfiable" on an unsatisfiable formula.
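The unsolvability claim itself is easy to check mechanically. Below is a minimal sketch in Python of the classic missionaries-and-cannibals formulation cited above (the function name and exact rule set are illustrative assumptions, not the evaluation code used by either paper): a breadth-first search over bank states that reports whether any valid crossing sequence exists.

```python
from collections import deque
from itertools import product

def river_crossing_solvable(n: int, boat: int) -> bool:
    """n missionaries and n cannibals must cross with a boat carrying 1..boat
    people; cannibals may never outnumber missionaries on either bank
    (checked after each crossing). Returns True iff some sequence works."""

    def bank_ok(m: int, c: int) -> bool:
        return m == 0 or m >= c

    start, goal = (n, n, 0), (0, 0, 1)   # (missionaries_left, cannibals_left, boat_side)
    seen, queue = {start}, deque([start])
    while queue:
        m_left, c_left, side = queue.popleft()
        if (m_left, c_left, side) == goal:
            return True
        # People available on the bank where the boat currently is.
        m_here = m_left if side == 0 else n - m_left
        c_here = c_left if side == 0 else n - c_left
        sign = -1 if side == 0 else 1    # a crossing moves people off this bank
        for dm, dc in product(range(m_here + 1), range(c_here + 1)):
            if not 1 <= dm + dc <= boat:
                continue
            nm, nc = m_left + sign * dm, c_left + sign * dc
            state = (nm, nc, 1 - side)
            if state not in seen and bank_ok(nm, nc) and bank_ok(n - nm, n - nc):
                seen.add(state)
                queue.append(state)
    return False

for n in range(2, 8):
    print(f"N = {n}, b = 3: solvable = {river_crossing_solvable(n, boat=3)}")
```

With a boat capacity of 3, this search should report solvable instances only up to N = 5 and none beyond, matching the boundary cited from arXiv:1802.09369.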
4 Physical Token Limits Lead to Apparent Collapse
Returning to the Tower of Hanoi analysis, we can quantify the relationship between problem size and token demand.
(Rules of the Tower of Hanoi: move all disks from the starting peg to the target peg, moving one disk at a time and never placing a larger disk on top of a smaller one.)
Apple's evaluation format required the model to write out the complete sequence of moves, and an N-disk instance needs 2^N − 1 moves, so the required token count grows exponentially with N. If each move takes roughly 5 tokens:
T(N) ≈ 5 × (2^N − 1)
Given the allocated token budgets (64,000 for Claude-3.7-Sonnet and DeepSeek-R1, 100,000 for o3-mini), the largest instance whose full move list even fits within the budget is roughly:
N_max ≈ log2(budget / 5), i.e., about N = 13 for a 64,000-token budget and N = 14 for a 100,000-token budget.
The so-called "collapse" reported in the original paper beyond these scales is therefore consistent with these token constraints rather than with a loss of reasoning ability.
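To make the arithmetic concrete, here is a small Python sketch; the 5-tokens-per-move figure and the two budgets are the values quoted above, and the helper names are illustrative:

```python
# Token demand for writing out a full Tower of Hanoi move list:
# an N-disk instance takes 2**N - 1 moves at roughly 5 tokens per move.
def tokens_needed(n_disks: int, tokens_per_move: int = 5) -> int:
    return tokens_per_move * (2 ** n_disks - 1)

def max_enumerable_disks(budget: int, tokens_per_move: int = 5) -> int:
    """Largest N whose complete move list fits inside the output budget."""
    n = 0
    while tokens_needed(n + 1, tokens_per_move) <= budget:
        n += 1
    return n

for model, budget in [("Claude-3.7-Sonnet / DeepSeek-R1", 64_000),
                      ("o3-mini", 100_000)]:
    n_max = max_enumerable_disks(budget)
    print(f"{model}: budget {budget:,} -> N_max = {n_max} "
          f"({tokens_needed(n_max):,} tokens for the move list alone)")

# Expected: N_max = 13 for the 64,000-token budget and N_max = 14 for the
# 100,000-token budget, before a single token is spent on reasoning.
```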
5 Recovering Performance Using an Alternative Representation
To examine whether the failures reflected reasoning limitations or formatting constraints, the AI author and Alex Lawsen ran a preliminary test of the same models on the N = 15 Tower of Hanoi problem using a different output representation:
Prompt: Solve the Tower of Hanoi problem with 15 disks. Output a Lua function that prints the answer when called.
Result: All tested models (Claude-3.7-Sonnet, Claude Opus 4, OpenAI o3, Google Gemini 2.5) showed very high accuracy and used fewer than 5000 tokens.
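For intuition, the program such a prompt asks for is tiny. Here is a hypothetical Python analogue of the requested Lua function (not taken from either paper): a recursive solver whose source stays the same size regardless of N, yet prints all 2^15 − 1 = 32,767 moves when run.

```python
def solve_hanoi(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> None:
    """Print the complete move list for an n-disk Tower of Hanoi instance."""
    if n == 0:
        return
    solve_hanoi(n - 1, src, dst, aux)        # park the top n-1 disks on the spare peg
    print(f"move disk {n}: {src} -> {dst}")  # move the largest remaining disk
    solve_hanoi(n - 1, aux, src, dst)        # stack the n-1 disks back on top of it

solve_hanoi(15)  # prints 2**15 - 1 = 32,767 moves
```

Emitting a program like this costs only a few hundred output tokens, which is exactly why the alternative representation sidesteps the budget that a fully written-out move list would exhaust.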
Below is a test result shared by 𝕏 user @janekm:
https://x.com/janekm/status/1933481060416799047
6 Re-evaluating the Original Paper's Complexity Claims
Apple's authors used "compositional depth" (minimum number of steps) as a complexity metric, but this essentially conflated mechanical execution with problem-solving difficulty:
Problem complexity is not solely determined by the length of the solution
While the Tower of Hanoi requires an exponential number of moves, the decision at each step is trivial, O(1). River Crossing problems require far fewer moves, but each move must satisfy complex constraints and be found by search. This explains why models might be capable of completing Tower of Hanoi problems with over 100 moves, yet fail to solve River Crossing problems requiring just 5 moves.
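To illustrate what "O(1) per step" means here, the classic iterative rule for the Tower of Hanoi determines every move with a couple of constant-time checks. A minimal sketch in Python (the peg labels and function name are illustrative):

```python
def hanoi_next_moves(n: int, src: int = 0, aux: int = 1, dst: int = 2):
    """Yield (disk, from_peg, to_peg) for n disks using the classic iterative
    rule: the smallest disk moves cyclically on every odd step, and every even
    step has exactly one other legal move, so each decision is O(1) even though
    the full solution contains 2**n - 1 moves."""
    pegs = {src: list(range(n, 0, -1)), aux: [], dst: []}
    # The smallest disk cycles through the pegs; its direction depends on the
    # parity of n so that the tower ends up on dst.
    cycle = [src, dst, aux] if n % 2 == 1 else [src, aux, dst]
    pos = 0  # index in `cycle` of the peg currently holding disk 1
    for step in range(1, 2 ** n):
        if step % 2 == 1:                 # odd step: advance the smallest disk
            frm, pos = cycle[pos], (pos + 1) % 3
            to = cycle[pos]
        else:                             # even step: the only other legal move
            a, b = (p for p in cycle if p != cycle[pos])
            if not pegs[a] or (pegs[b] and pegs[b][-1] < pegs[a][-1]):
                frm, to = b, a
            else:
                frm, to = a, b
        disk = pegs[frm].pop()
        pegs[to].append(disk)
        yield disk, frm, to

for move in hanoi_next_moves(4):          # 15 moves, each chosen in constant time
    print("move disk %d: peg %d -> peg %d" % move)
```

River Crossing offers no such local rule: a legal-looking move can lead to a dead end, which is why it calls for the kind of state-space search sketched in Section 3 above.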
7 Conclusion
Shojaee et al.'s results show only that models cannot output more tokens than their limits allow, that programmatic evaluation can overlook both genuine model capabilities and flaws in the puzzles themselves (such as unsolvable instances), and that solution length is a poor predictor of problem difficulty. These are valuable engineering insights, but they do not support claims about fundamental reasoning limitations.
Future research should:
Design evaluation methods that distinguish between reasoning ability and output constraints;
Verify the solvability of puzzles before evaluating model performance;
Use complexity metrics that reflect computational difficulty rather than just solution length;
Consider multiple solution representations to distinguish between algorithmic understanding and execution.
The question is not whether Large Reasoning Models (LRMs) can reason, but whether our evaluation methods can distinguish between reasoning and text generation.
What do netizens think?
This paper has likewise attracted a lot of attention, and the reactions have been mostly positive.
A reader asked how the two authors collaborated; the answer, it turns out, was just chatting.
https://x.com/lxrjl/status/1932557168278188517
Perhaps we can call this a "vibe paper," as CMU PhD Behnam Mohammadi jokingly put it :')
https://x.com/OrganicGPT/status/1932502854960366003
However, dissenting opinions naturally still exist.
What are your thoughts on this?
© THE END
Please contact this official account for authorization to reprint
Submissions or inquiries: liyazhou@jiqizhixin.com