“My colleagues are really saying these models are approaching mathematical genius”
By Lyndie Chiou
Edited by Clara Moskowitz
In mid-May a secret math meeting convened over a weekend in Berkeley, California. Thirty of the world’s most renowned mathematicians gathered there, some traveling from as far away as the U.K. They squared off against a “reasoning” chatbot, feeding it problems they had designed to test its mathematical prowess. After throwing two days’ worth of professor-level problems at the bot, the researchers were astonished to find it could answer some of the world’s hardest-to-solve questions. “My colleagues are really saying these models are approaching mathematical genius,” says Ken Ono, a mathematician at the University of Virginia and a leader and judge at the meeting.
The chatbot is powered by o4-mini, a so-called reasoning large language model (LLM) that OpenAI trained to carry out highly complex reasoning. Google’s equivalent, Gemini 2.5 Flash, has similar capabilities. Like the LLMs that power earlier versions of ChatGPT, o4-mini learns to predict the next word in a sequence. Compared with those earlier LLMs, however, o4-mini and its peer models are lighter-weight and more nimble: they are trained on specialized datasets and shaped by stronger reinforcement learning from human feedback. The approach lets the chatbot dig into complex math problems more deeply than traditional LLMs can.
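To make the next-word idea concrete, here is a toy sketch in Python. The four-word vocabulary, the scores, and the greedy decoding rule are invented purely for illustration; a real LLM performs the same kind of computation over a vocabulary of tens of thousands of tokens and billions of learned parameters:

```python
import math

# Toy sketch of next-token prediction. A real LLM assigns a score (logit)
# to every token in a large vocabulary; the words and scores below are
# invented for this illustration.
vocab = ["theorem", "proof", "banana", "lemma"]
logits = [2.1, 3.4, -1.0, 1.7]

# Softmax turns raw scores into a probability distribution.
exps = [math.exp(score) for score in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Greedy decoding: pick the single most probable next token.
best = max(range(len(vocab)), key=lambda i: probs[i])
print({word: round(p, 3) for word, p in zip(vocab, probs)})
print("predicted next token:", vocab[best])
```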
To track o4-mini’s progress, OpenAI had previously commissioned Epoch AI, a nonprofit that benchmarks LLMs, to design 300 math questions whose solutions had not yet been published. Even traditional LLMs can correctly answer many complex math questions. But when Epoch AI posed these unpublished problems, which could not have appeared in any model’s training data, to several such LLMs, the most successful of them solved fewer than 2 percent of the questions, revealing how little genuine reasoning those models could do. o4-mini, however, would prove to be different.
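For a rough sense of how such a benchmark is scored, consider the sketch below: it compares a model’s answers against held-back reference answers and reports the solve rate. The problem IDs, answers, and exact-match grading here are hypothetical and do not reflect Epoch AI’s actual format or grading procedure:

```python
# Hypothetical scoring loop for a held-out math benchmark. Problem IDs,
# answers, and the exact-match criterion are invented for illustration.
reference_answers = {"p001": "42", "p002": "3/8", "p003": "0"}
model_answers = {"p001": "42", "p002": "0.375", "p003": "1"}

solved = sum(
    1 for pid, answer in reference_answers.items()
    if model_answers.get(pid) == answer  # naive exact-match grading
)
print(f"solve rate: {solved}/{len(reference_answers)} "
      f"({solved / len(reference_answers):.0%})")
# At a sub-2-percent solve rate, a model would answer at most
# 5 of 300 such problems correctly.
```

Note that naive exact matching marks the mathematically correct “0.375” wrong against the reference “3/8,” which is one reason real benchmarks need more careful grading than this sketch shows.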
Epoch AI hired Elliot Glazer, who had recently earned his Ph.D. in mathematics, to lead a new benchmarking collaboration called FrontierMath, which launched in September 2024. The project gathered new problems of varying difficulty, the first three tiers covering undergraduate-, graduate- and research-level challenges. By February 2025 Glazer found that o4-mini could solve about 20 percent of the problems. He then moved on to a fourth tier: 100 problems that were challenging even for academic mathematicians. Only a small group of people in the world is capable of designing such problems, let alone answering them. The mathematicians who participated had to sign nondisclosure agreements requiring them to communicate only through the messaging app Signal; other channels, such as traditional e-mail, could be scanned by an LLM and inadvertently fed into its training, polluting the dataset.
The group made slow but steady progress in finding problems. To speed things up, Epoch AI held an in-person meeting on Saturday, May 17, and Sunday, May 18, at which participants would finalize the last batch of challenging problems. Ono split the 30 attendees into six-person teams. Over the course of two days, the academics competed against one another to design problems that they could solve but that would stump the AI reasoning bot. The mathematician behind each problem o4-mini failed to solve would receive a $7,500 reward.
By that Saturday night, Ono was frustrated with the bot, whose surprising mathematical abilities were impeding the team’s progress. “I proposed a problem that an expert in our field would consider an open problem in number theory—a good Ph.D.-level problem,” he says. He set o4-mini to work on it. For the next 10 minutes, Ono watched in awe as the bot demonstrated its solution in real time, showing its reasoning process. For the first two minutes, the bot looked up and mastered the relevant literature in the field. Then it wrote on screen that it wanted to try solving a simpler “toy” version of the problem first, to learn. A few minutes later it wrote that it was finally ready to tackle the harder problem, and five minutes after that, o4-mini presented a correct but cheeky solution. “It started to get very playful,” says Ono, who is also a freelance math consultant for Epoch AI. “At the end it also wrote, ‘No citation needed, for this mysterious number was computed by me!’”
After the defeat, Ono hopped on Signal early Sunday morning to inform the other contestants. “I didn’t expect to be competing with an LLM like this,” he says. “I’ve never seen this kind of reasoning in a model. This is what scientists are supposed to do. It’s terrifying.”
Although the team eventually succeeded in finding 10 problems that stumped the bot, the researchers were shocked by how far AI had progressed in just one year. Ono likened it to working with a “powerful partner.” Yang-Hui He, a mathematician at the London Institute for Mathematical Sciences and an early pioneer in the application of AI in mathematics, says: “This is equivalent to what a very good graduate student would do—in fact, more.”
The bot was also far faster than a professional mathematician, completing in minutes work that could take a human expert weeks or months.
The showdown with o4-mini was exciting, but its progress was also alarming. Ono and He worry that o4-mini’s results could be trusted too readily. “There are inductive proofs, there are contrapositive proofs, and there are intimidation proofs,” He says. “If you say something authoritatively enough, people get scared. I think o4-mini has mastered the intimidation proof; it says everything with such confidence.”
At the close of the meeting, the group began to contemplate what the future might hold for mathematicians. Discussion turned to the inevitable “fifth tier” of problems: those that even the best mathematicians cannot solve. If AI reaches that level, the role of mathematicians will change dramatically. They might shift to posing questions and interacting with reasoning bots to help discover new mathematical truths, much as a professor works with graduate students. Ono therefore predicts that nurturing creativity in higher education will be key to passing mathematics on to future generations.
“I’ve been telling my colleagues it’s a serious mistake to say AGI is never going to come, that it’s just a computer,” Ono says. “I don’t want to contribute to the panic, but in many ways, these large language models are already outperforming most of the world’s best graduate students.”
Inside the secret math meeting where researchers struggled to outsmart AI: https://www.scientificamerican.com/article/inside-the-secret-meeting-where-mathematicians-struggled-to-outsmart-ai/