The World's Top 30 Mathematicians Secretly Convened to Combat AI and Were Blown Away on the Spot: "It's Close to a Mathematical Genius"


XinZhiYuan Report

Editor: Aeneas Hao Kun

【XinZhiYuan Executive Summary】Recently, 30 of the world's top mathematicians took action themselves, mounting an "encirclement and suppression" campaign against OpenAI's o4-mini at UC Berkeley. After two straight days of posing professor-level problems, the whole group was left stunned on the spot! Some said outright that this AI has genuinely approached the level of a mathematical genius. AGI, once thought to be far off, now feels just a stone's throw away...

How strong is AI at mathematics?

Just recently, 30 world-renowned mathematicians gathered at UC Berkeley for a secret mathematics conference, hoping to outsmart an AI.

After bombarding this AI with professor-level problems for two consecutive days, researchers were astonished to find that it could solve some of the world's most difficult solvable problems!

One mathematician was immediately convinced, stating that these models have approached the level of a mathematical genius.


Top mathematicians were impressed

On a weekend in mid-May, a secret mathematics summit quietly convened.

Thirty of the world's top mathematicians faced off against a reasoning chatbot, which was tasked with solving problems specifically designed by the experts.

As seen at the beginning, the mathematicians were thoroughly convinced.

The robot participating in this challenge was OpenAI's o4-mini, which is capable of extremely complex reasoning.

Of course, it is not the only model in the world with this capability; Google's Gemini 2.5 Flash also possesses similar abilities.

Why is o4-mini so strong at solving math problems?

Because it was trained on specialized mathematical datasets and refined with heavier reinforcement learning from human feedback (RLHF), it can push deeper into complex mathematical problems than traditional LLMs.


Epoch AI, delving deeper

After training o4-mini, OpenAI has been closely monitoring its problem-solving abilities.

To track o4-mini's progress, OpenAI previously commissioned the non-profit organization Epoch AI to design 300 math problems with unpublished solutions, specifically to test large models.

The highlight of these problems was that, since their solutions had not yet been published, they could not possibly exist in the training data.

Indeed, when Epoch AI used these problems, which were completely different from the training data, to test several reasoning models, almost all of them failed.

Even the best-performing model had a solution rate of less than 2%.

Are LLMs truly incapable of doing math? Epoch AI did not give up exploring.

In September 2024, Epoch AI hired Elliot Glazer, who had just earned his Ph.D. in mathematics, to participate in a new benchmark project codenamed FrontierMath.


The purpose of this project was to collect new math problems of varying difficulty levels. T1-T3 covered challenges at undergraduate, graduate, and research levels, respectively.

The result was that o4-mini was surprisingly impressive.

By February 2025, Glazer discovered that o4-mini was able to solve approximately 20% of the problems!

Then, in May of this year, Epoch AI also held a competition, inviting about 40 math elites, divided into 8 teams, each consisting of subject experts and outstanding undergraduates.

They were to engage in the ultimate showdown with AI on the FrontierMath benchmark proposed by Terence Tao and others.

The competition consisted of 23 problems, with a time limit of 4.5 hours. The experiment ultimately showed:

o4-mini-medium solved about 22% of the problems, beating the human teams' average of 19%.

However, every problem o4-mini solved was also cracked by at least one team of mathematicians, and collectively the human teams solved about 35% of the problems.


The results showed that o4-mini defeated six teams in total, demonstrating astonishing potential in the field of mathematics.


The T4-level test begins

Subsequently, Glazer began work on the fourth tier of testing (T4): this time, the goal was to find 100 problems that would be extremely challenging even for professional mathematicians.

Globally, there are very few people who can propose such problems, let alone provide solutions.

To this end, he recruited top mathematicians worldwide, requiring them to sign non-disclosure agreements and to communicate only through encrypted messaging apps such as Signal.

He worried that messages sent through conventional channels such as email might be scraped by LLM crawlers and unintentionally become training data, polluting the entire test set.

Due to the exceptionally strict approach, the project's progress was initially very slow.

To accelerate progress, Glazer pushed Epoch AI to host this offline meeting on May 17th (Saturday) and 18th (Sunday).

At the meeting, mathematicians would finalize the last batch of highest-level math problems.


Racking their brains, determined to stump the AI

Ken Ono, a mathematician at the University of Virginia and the conference leader and judge, divided the 30 attendees into groups of six.


During the two-day conference, these top scholars had to compete with each other to see who could design problems that they could solve but that would stump the AI reasoning robot.

The reward for this project was also very attractive.

For every problem o4-mini failed to solve, the problem setter would receive a $7500 reward.

The outcome, however, was something no one expected: o4-mini delivered a devastating blow to the mathematicians!

Late on Saturday night, all the mathematicians in the room felt utterly defeated – o4-mini's unexpected mathematical talent directly rendered the entire group's efforts futile.

Ono posed a problem: an open question in number theory recognized by experts in his field, and an excellent test at the level of a Ph.D. student.

He confidently handed it to o4-mini, and within the next ten minutes he was in for a shock.

o4-mini worked through the complete solution fluidly and in real time, displaying its reasoning process as it went.

It first spent two minutes retrieving and thoroughly understanding the relevant literature, then wrote on the screen that, for learning purposes, it wanted to first try a simplified "toy" version.

A few minutes later, it wrote that it was ready to tackle the harder original problem.

Another five minutes passed, and o4-mini provided a correct yet playfully smug answer.

Ono described: It started to become triumphant, even adding a line, "No citation needed, for this mysterious number was calculated by me!"


Mathematicians disheartened: I thought AGI would never arrive

A disheartened Ono quickly logged onto Signal early Sunday morning to brief all attendees on the situation.

Ono said: "I had absolutely no idea what it would be like to contend with such an LLM. I have never seen such powerful reasoning in a model before. This is clearly how scientists work. It's terrifying."

Ultimately, the team still managed to find 10 problems that stumped the robot, but the AI's astonishing capabilities still left all researchers in awe.

Ono felt that working with it was like collaborating with a "powerful partner." Yang-Hui He, a mathematician at the London Institute for Mathematical Sciences and one of the pioneers of AI applications in mathematics, said: "This is something a top-notch graduate student can do – no, in fact, it does more."


Moreover, o4-mini's speed was astonishing. It far surpassed professional mathematicians; what human experts would take weeks or even months to complete, it accomplished in mere minutes.

Not only that, but o4-mini's progress this time also sounded an alarm for humanity.

Both Ono and He worry that people may come to rely too heavily on the results o4-mini provides.

"Proof methods include induction, proof by contradiction, and now there's intimidation," said Yang-Hui He.

"When someone speaks with enough authority, people feel awe. I think o4-mini has mastered the essence of intimidation proof, because it speaks every sentence with unquestionable confidence."

As the meeting neared its end, the entire team began to ponder the future of mathematicians.

The discussion shifted to the unavoidable T5 – problems that even the most top-tier mathematicians cannot solve.

If, ultimately, AI reaches that level, then clearly, the role of mathematicians will undergo a dramatic change.

At that point, mathematicians might shift to only posing problems and interacting with reasoning robots, guiding them to discover new mathematical truths, much like professors guide graduate students.

Therefore, Ono predicts that fostering creativity in higher education will be key to ensuring the continued legacy of mathematics as a discipline.

"I've always told my colleagues that the idea that AGI will never come, that it's just a computer, is completely wrong," Ono said.

"I don't want to sensationalize, but in many ways, these LLMs have surpassed the vast majority of our world's best graduate students."


Terence Tao: Already knew it

In fact, Terence Tao has long been aware of AI's extraordinary ability in mathematical research.

Recently, he has been frequently sharing updates on social platforms, reporting on the astonishing progress of AI in solving math problems.

For instance, just a few days ago, he shared this news:

An 18-year-old open mathematical problem saw its record broken three times in just 30 days by AlphaEvolve working in collaboration with humans!


On June 2nd, Fan Zheng's latest paper published on arXiv once again pushed the sum-difference set exponent θ record up by 0.000027, from 1.173050 to 1.173077.

0.000027 – a microscopic increment, yet it pushed the ceiling of additive combinatorics up by another inch.


Paper link: https://arxiv.org/abs/2506.01896

Such rapid and continuous progress is inseparable from the cooperation between mathematicians and AI (AlphaEvolve).

This breakthrough astonished Terence Tao: "To me, this is a fascinating example."

Terence Tao believes this demonstrates how highly computer-assisted, moderately computer-assisted, and traditional "paper-and-pencil" methods will interact in future mathematical research.

Each of these paradigms has its advantages and disadvantages.

For example, current AlphaEvolve finds it extremely difficult to utilize the asymptotic constructions used in subsequent papers; on the other hand, without AlphaEvolve's brute-force search, human methods would find it very difficult to discover these points of improvement.

And last month, Terence Tao also collaborated with AI to tackle the classic "ε-δ" limit problem in analysis.


GitHub Copilot performs quite well in helping beginners and handling basic tasks.

It can help users quickly get started with the Lean language, provide syntax hints, and intelligently complete basic definitions and declarations.

In simpler proofs, such as the sum theorem for function limits, Copilot can accurately predict the proof structure and key steps, acting like a capable assistant.

However, when proofs become complex, Copilot's shortcomings become apparent.

For example, when dealing with the difference and product theorems for function limits, it struggles with complex algebraic derivations and finding appropriate mathematical lemmas (such as those related to absolute values).

Copilot sometimes experiences "hallucinations," generating non-existent strategies or making low-level errors that cause the proof process to become messy.

At this point, Terence Tao had to personally intervene to correct errors, or even completely take over the proof.
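To give a concrete sense of the kind of proof discussed above, here is a hedged sketch of the ε-δ sum theorem for function limits in Lean 4 with Mathlib. The names `HasLimitAt` and `limit_add` are illustrative, not Mathlib's actual API, and the proof simply follows the standard ε/2 argument; this is a sketch of the exercise, not Tao's own formalization.

```lean
import Mathlib

-- Illustrative ε-δ definition of "f tends to L as x tends to a".
def HasLimitAt (f : ℝ → ℝ) (a L : ℝ) : Prop :=
  ∀ ε > 0, ∃ δ > 0, ∀ x : ℝ, 0 < |x - a| ∧ |x - a| < δ → |f x - L| < ε

-- Sum theorem: if f → L₁ and g → L₂ at a, then f + g → L₁ + L₂.
theorem limit_add {f g : ℝ → ℝ} {a L₁ L₂ : ℝ}
    (hf : HasLimitAt f a L₁) (hg : HasLimitAt g a L₂) :
    HasLimitAt (fun x => f x + g x) a (L₁ + L₂) := by
  intro ε hε
  -- Give each function an ε/2 budget and take the smaller δ.
  obtain ⟨δ₁, hδ₁, h₁⟩ := hf (ε / 2) (half_pos hε)
  obtain ⟨δ₂, hδ₂, h₂⟩ := hg (ε / 2) (half_pos hε)
  refine ⟨min δ₁ δ₂, lt_min hδ₁ hδ₂, fun x ⟨hx0, hxδ⟩ => ?_⟩
  have hf' := h₁ x ⟨hx0, lt_of_lt_of_le hxδ (min_le_left _ _)⟩
  have hg' := h₂ x ⟨hx0, lt_of_lt_of_le hxδ (min_le_right _ _)⟩
  -- Triangle inequality closes the argument.
  calc |f x + g x - (L₁ + L₂)|
      = |(f x - L₁) + (g x - L₂)| := by congr 1; ring
    _ ≤ |f x - L₁| + |g x - L₂| := abs_add _ _
    _ < ε / 2 + ε / 2 := add_lt_add hf' hg'
    _ = ε := by ring
```

This is the "simpler proof" class of task the article says Copilot handled well; the difference and product theorems require choosing δ to control nonlinear terms, which is where the assistance reportedly broke down.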


But in any case, the current development of LLMs brings us closer to Terence Tao's prediction:

By 2026, AI, combined with search and symbolic mathematics tools, will become a trusted co-author in mathematical research.

References:

https://www.scientificamerican.com/article/inside-the-secret-meeting-where-mathematicians-struggled-to-outsmart-ai/


Main Tag: Artificial Intelligence

Sub Tags: Mathematical Research, Large Language Models, Human-AI Collaboration, AI Capabilities

