Yu Yang from Aofei Si
Quantum Bit | Official Account QbitAI
How long does it take a large model to go from solving only 2% of a collection of extremely difficult math problems to scoring 22%, even surpassing the average human team?
The answer, which has surprised even mathematicians, is now in:
7 months.
This happened on FrontierMath, the renowned benchmark designed to stump large models, and it has sparked heated discussion along with a new question:
How did large models achieve this?
FrontierMath: a benchmark of 300 math problems, ranging in difficulty from advanced undergraduate material to problems that even Fields Medalists find challenging.
The latest development: Epoch AI, the organization behind FrontierMath, invited 14 mathematicians to analyze in depth 29 raw reasoning records produced by o3-mini-high while it tackled these problems.
They found:
o3-mini-high by no means solves problems through rote memorization; rather, it demonstrates an extremely strong reserve of knowledge;
o3-mini-high's reasoning relies more on intuition than on precise proof.
At the same time, they also identified the current limitations of large models, such as a lack of creativity and depth of understanding.
The official summary is as follows:
o3-mini-high can be summarized as a knowledgeable but intuition-driven reasoning engine, one that lacks the creativity and formality of professional mathematicians and often rambles verbosely.
Intuition-Based Inductive Reasoning Engine
Specifically, across the 29 reasoning records, o3-mini-high reached the correct conclusion 13 times; the remaining 16 attempts failed.
First, let's look at how o3-mini-high succeeded.
Mathematicians found that a key factor is o3-mini-high's extreme erudition.
It correctly laid out the mathematical background of the problems, including some very advanced concepts.
Neither general knowledge related to the problems nor understanding of the problems themselves was a bottleneck for o3-mini-high.
This is not to say that o3-mini-high relies on rote memorization.
On the contrary, mathematicians found that even when problems deliberately concealed the techniques required to solve them, o3-mini-high could still bring the correct theorems to bear and make progress:
In roughly two-thirds of the problems, it scored at least 3 out of 5 on retrieval of relevant mathematical literature.
Another interesting finding is that, compared to precise derivations, o3-mini-high appears to rely more on intuition, exhibiting a "curiosity like that of a mathematician."
One mathematician pointed out:
The model's thought process is somewhat informal. Initial ideas are often stated roughly, the language is not rigorous, and some corner cases are handled in ways that would not meet the standards of a mathematics paper.
In other words, o3-mini-high often does not formalize and rigorously argue through a problem the way mathematicians do; instead, it skips over long chains of steps and guesses the final answer directly.
For example, in one problem, mathematicians found that o3-mini-high arrived at a correct conjecture through informal reasoning, but it did not prove this conjecture and directly used it to solve the problem.
Although the final answer was correct, from the mathematicians' perspective, this was "cheating."
Why? The official view is that the reason is not simply "model laziness": one mathematician noted that, when necessary, the model is not afraid of calculating or writing code, even though it remains "intuition-driven" overall.
One possibility is that during the pre-training phase, the model was not fed enough training data regarding "formal reasoning."
Model Limitations
Writing out the solution and then jumping straight to the answer does remind one of a certain someone...
Ahem. In fact, this lack of formal precision is exactly why o3-mini-high fails to solve problems in many cases.
For example, sometimes o3-mini-high's overall approach is correct, yet it fails because it cannot make the final, critical connection.
In a partition theory problem, it was only one step away from the answer. The problem setter pointed out:
If it could sum the outputs from n=0 to [edited], the answer would be correct.
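The exact problem and the upper bound are redacted, but the shape of that missing last step is easy to illustrate. Below is a minimal sketch, not taken from the article: it assumes the "outputs" are ordinary partition numbers p(n), computed via Euler's pentagonal number recurrence, and the bound N is purely hypothetical.

```python
# Hedged sketch: the actual FrontierMath problem is not public and its
# upper bound is redacted, so N here is purely hypothetical. This shows
# the shape of the missing final step: compute a sequence of outputs
# (here, the ordinary partition numbers p(n)) and sum them from n = 0 to N.

def partition_numbers(N):
    """Return [p(0), ..., p(N)] via Euler's pentagonal number recurrence."""
    p = [0] * (N + 1)
    p[0] = 1
    for n in range(1, N + 1):
        total, k = 0, 1
        while True:
            g1 = k * (3 * k - 1) // 2   # generalized pentagonal numbers
            g2 = k * (3 * k + 1) // 2
            if g1 > n:
                break
            sign = 1 if k % 2 else -1
            total += sign * p[n - g1]
            if g2 <= n:
                total += sign * p[n - g2]
            k += 1
        p[n] = total
    return p

N = 10                              # hypothetical bound
print(sum(partition_numbers(N)))    # p(0) + ... + p(10) = 139
```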
In more cases, o3-mini-high's ideas were far from the correct solution.
More importantly, mathematicians believe that o3-mini-high's biggest limitations are a lack of creativity and depth of understanding:
The model is like a well-read graduate student, able to rattle off many research results and researchers. This is impressive at first glance, but experts quickly discover that the student has not deeply digested the material, merely recited it.
The model's behavior follows a similar pattern: it is good at identifying relevant material, but unable to extend or apply that knowledge in novel ways.
Another mathematician involved in the research pointed out:
o3-mini-high only tried to apply a few of its favorite ideas.
Once these ideas were exhausted, it made no real progress.
Even:
For AI, solving an 8th-grade math olympiad problem (requiring new ideas) might be more difficult than calculating the number of points on a hyperelliptic curve over a large finite field.
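The comparison is less odd than it sounds: counting points over a finite field is mechanical in a way that finding a genuinely new idea is not. As a hedged illustration (not from the article), here is a brute-force count of affine points on a hyperelliptic curve y^2 = f(x) over a small prime field F_p; the curve and field are arbitrary choices, and real computations over large fields rely on far more sophisticated algorithms, but the task stays systematic rather than creative.

```python
# Hedged sketch: brute-force point count on y^2 = f(x) over F_p.
# The curve y^2 = x^5 + x + 1 and the prime p = 101 are arbitrary
# illustrative choices; over a large field this naive O(p) loop is
# infeasible, but the problem remains one of systematic computation.

def affine_point_count(f_coeffs, p):
    """Count affine solutions (x, y) of y^2 = f(x) over F_p,
    with f_coeffs listed from the constant term upward."""
    def f(x):
        return sum(c * pow(x, i, p) for i, c in enumerate(f_coeffs)) % p

    count = 0
    for x in range(p):
        v = f(x)
        if v == 0:
            count += 1                      # single point with y = 0
        elif pow(v, (p - 1) // 2, p) == 1:  # Euler's criterion: v is a square
            count += 2                      # two points, y and -y
    return count

# y^2 = x^5 + x + 1 over F_101 (coefficients: constant term first)
print(affine_point_count([1, 1, 0, 0, 0, 1], 101))
```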
In addition, "hallucination" is also an issue.
Analysis shows that roughly 75% of the reasoning records contain model "hallucinations":
o3-mini-high often misremembers mathematical terms and formulas, and it also shows signs of fabrication when invoking code libraries and online search tools.
So, can o3-mini-high really reason like a human mathematician?
Let's look at the mathematicians' ratings:
(1 = completely unlike a human; 5 = indistinguishable from a human mathematician)
Overall, it still has to be judged case by case. The official take is that o3-mini-high's abilities cut both ways. On the one hand, it seems able to reason about problems like a human, showing curiosity and exploring different approaches.
On the other hand, it lacks creativity and formality, tends to "overthink," runs verbose, and occasionally shows self-doubt: repeating sentences it has already finished, redoing the same mathematical operations...
"Surpassing most mathematics graduate students in the world"
Why models like o3-mini-high cannot put their rich mathematical knowledge to better use remains a question for further research.
But in any case, going from 2% to 22% in 7 months is already enough to amaze mathematicians.
In fact, between the launch of FrontierMath in September 2024 and May 2025, the organizers pitted 8 human "math teams" against large models, and FrontierMath itself has kept climbing in difficulty.
From Levels 1-3, which cover undergraduate, graduate, and research-level challenges, it has now moved into Level 4: problems that are challenging even for professional mathematicians.
In mid-May, Epoch AI also held an in-person meeting, inviting 30 renowned mathematicians to design problems they could solve but that would stump AI.
And the performance of the large models left mathematicians dumbfounded.
For example, Ken Ono, a mathematician at the University of Virginia, posed a "Ph.D.-level" number theory problem. In just 10 minutes, o4-mini produced a correct and interesting solution.
Ken Ono stated:
I don't want to fuel panic. But in some respects, large language models are already outperforming most of the world's best graduate students in mathematics.
Mathematicians are beginning to wonder if AI can tackle "Level 5" problems, i.e., problems that even the best mathematicians have not yet solved—
"If artificial intelligence reaches this level, the role of mathematicians will undergo a huge change."
References:
[1] https://epoch.ai/gradient-updates/beyond-benchmark-scores-analysing-o3-mini-math-reasoning
[2] https://epoch.ai/gradient-updates/is-ai-already-superhuman-on-frontiermath
[3] https://www.scientificamerican.com/article/inside-the-secret-meeting-where-mathematicians-struggled-to-outsmart-ai/
— End —