Apple's AI research team has truly messed up this time!
A recent paper they published has drawn widespread criticism from the AI community, and the reason, of all things, is major flaws in its testing methodology.
See previous article: Apple Declares Reasoning Models Dead! Google CEO: Forget AGI, Master AJI First
After reproducing Apple's Tower of Hanoi test from their paper, researcher Lisan al Gaib discovered a startling fact: the models didn't fail due to poor reasoning ability, but because of output token limits!
It's important to note that the Tower of Hanoi problem requires at least 2^N - 1 steps to solve, and the output format demands 10 tokens per step plus some fixed content.
What does this mean?
For Sonnet 3.7 (128k output limit), DeepSeek R1 (64k), and o3-mini (100k), when the number of discs exceeds 13, the accuracy of all models drops to 0%. Not because they can't solve it, but because they physically cannot output that much content!
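A quick back-of-the-envelope check makes that concrete. The sketch below is not Lisan al Gaib's code: the 10-tokens-per-step figure comes from the description above, and the fixed overhead is an assumed round number.

```python
# Rough token-budget check: minimum output needed to list every move of an
# N-disc Tower of Hanoi, assuming ~10 tokens per move plus a small fixed
# overhead (both figures are rough assumptions, not measured values).
TOKENS_PER_MOVE = 10
FIXED_OVERHEAD = 500  # restating the problem, formatting, etc. (assumed)

LIMITS = {"Sonnet 3.7": 128_000, "DeepSeek R1": 64_000, "o3-mini": 100_000}

for n_discs in range(10, 16):
    moves = 2 ** n_discs - 1                    # optimal solution length
    needed = moves * TOKENS_PER_MOVE + FIXED_OVERHEAD
    fits = [name for name, cap in LIMITS.items() if needed <= cap]
    print(f"{n_discs} discs: {moves:>6} moves, ~{needed:>7} tokens "
          f"-> fits within: {', '.join(fits) or 'none'}")
```

By 14 discs the required output blows past all three limits, and DeepSeek R1's 64k budget is already gone at 13, so even a perfectly executed solution cannot be written down in full.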
Even more ironically, as the problem scale increases, the models' responses become very human-like. They directly state: "Due to too many moves, I will explain the solution rather than listing all 32,767 steps."
This is as absurd as asking a mathematician to write a million numbers on an A4 sheet and then claiming they're bad at math!
Lisan al Gaib also tried breaking the problem down into smaller chunks, having the model execute only 5 steps at a time.
The result?
Testing with Gemini 2.0 Flash showed that decomposition actually made performance worse.
The model would lose its place in the algorithm partway through and end up repeating certain steps.
Although the Tower of Hanoi is theoretically stateless (the optimal move at each step only depends on the current state), the model needs historical records to know where it is in the execution.
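To make "stateless" concrete: from the current arrangement of discs alone, the next optimal move can be recomputed every time, with no memory of earlier moves. Here is a minimal sketch of that rule (my illustration, not Lisan al Gaib's setup; the dict-of-disc-to-peg representation is an arbitrary choice):

```python
def next_move(position, n_discs, target="C"):
    """Return the next optimal move (disc, from_peg, to_peg) toward stacking
    all discs on `target`, using only the current position.

    `position` maps disc number (1 = smallest) to its peg ("A"/"B"/"C").
    Returns None when every disc is already on the target peg.
    """
    pegs = {"A", "B", "C"}
    for disc in range(n_discs, 0, -1):      # largest disc not yet in place
        src = position[disc]
        if src == target:
            continue                        # already where it belongs
        # Smaller discs must first be parked on the spare peg,
        # then `disc` itself can move to the target.
        spare = (pegs - {src, target}).pop()
        sub_move = next_move(position, disc - 1, spare)
        return sub_move if sub_move else (disc, src, target)
    return None                             # everything already on target


# Example: 3 discs, all on peg A; the optimal first move is disc 1 -> C.
print(next_move({1: "A", 2: "A", 3: "A"}, 3))   # (1, 'A', 'C')
```

In principle, then, each 5-step chunk could be re-derived from the board position alone; in practice, the models evidently lose track of where they are instead.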
The study also revealed an interesting phenomenon: token usage peaks when there are 9-11 discs.
Why?
Because this is precisely the tipping point where models start saying "I'm not going to write down 2^n_disks - 1 steps".
Before this, the models weren't performing step-by-step reasoning either.
For smaller problems with 5-6 discs, some reasoning process could still be observed. But beyond that scale, they essentially: restate the problem → restate the algorithm → print steps. By 10-11 discs, they began refusing to output all steps.
The most outrageous part is the conclusion of Apple's paper.
They claimed that the Tower of Hanoi was harder than other tests due to training data issues. But Lisan al Gaib pointed out:
This is complete nonsense!
Models clearly recited the algorithm in their chain of thought; some even wrote it out as code. And while the Tower of Hanoi needs an exponential number of steps (2^n - 1) where the other games need only a quadratic or linear number, that alone doesn't make it inherently harder to reason about.
The single-step difficulty of different games varies; difficulty cannot be simply judged by the number of steps!
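The gap in raw step counts is easy to see with a little arithmetic. In the sketch below, the quadratic and linear columns are placeholders for how the article characterizes the other games, not their exact formulas:

```python
# Minimum solution length as a function of problem size n: the exponential
# Tower of Hanoi versus generic quadratic/linear puzzles (the latter two are
# stand-ins for the article's description of the other games).
for n in (5, 10, 15, 20):
    hanoi = 2 ** n - 1
    quadratic = n ** 2
    linear = n
    print(f"n={n:>2}: hanoi={hanoi:>9,}  quadratic={quadratic:>4}  linear={linear:>3}")
```

A million trivial moves at 20 discs says more about output length than about reasoning depth.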
Other researchers also joined the chorus of criticism.
Shin Megami Boson bluntly stated that this paper "sucks ass": he achieved 100% accuracy at a complexity level Apple had rated at 0%.
And he did it with an even weaker model!
His results chart "looks like nothing" because it is simply a flat line at 100% accuracy.
He concluded: "They tried to screw in a bolt with a hammer, and then wrote a paper saying hammers are actually quite limited at fastening things."
What angers and disappoints me the most is that Apple seems to be trying to prove AI has problems, rather than using AI to improve user experience.
Pliny the Liberator (@elder_plinius)'s criticism was spot on:
Until Siri can do more than successfully create a calendar event on the fourth try, I will not read any AI research papers coming out of that gigantic stale donut in Cupertino.
He continued:
If I were the CEO of Apple, and I saw my team publish a paper that only focuses on documenting the limitations of current methods, I would fire everyone involved on the spot. Who the hell cares about that. Go figure out how to break them!
Luci Dreams (@Luci_Drea) quipped:
"We don't have good AI, so look at your AI's flaws, don't have too much fun."
Chris Fry (@Chrispyfryz) questioned:
Seriously, what exactly are they doing over there?
R (@rvm0n_) stated:
I cannot understand how they messed up so badly.
Freedom_Aint_Free (@baianoise)'s analogy was even more precise:
This is like Kia engineers writing a paper saying Toyota cars can't run 2 million miles without major overhauls.
Ben Childs (@Ben_Childs) humorously remarked:
Look, Apple does have AI, and it's great. She just goes to another high school. You wouldn't know her.
SPUDNIK (@tuber_terminal) mimicked Siri's speech recognition errors:
"Okay, so you want me to create an apple soft gun on spoon day at six ham? Should I create it?"
Apple is being 'Cooked' by Tim Cook himself—these researchers are spending time proving AI has problems instead of improving the user experience.
Do you think Cook should fire these people?
Additionally, I used AI to collect AI news from across the internet, then used AI to select, review, translate, and summarize it before publishing it in the 'AGI Hunt' knowledge planet.
This is an AI news feed that is purely informational and unemotional (not a recommendation feed, no courses sold, no lecturing, no life advice, just information).
Welcome to join! You are also welcome to join the group chat with over 2000 members.