XinZhiYuan Report
Editor: Peach
[XinZhiYuan Summary] GPT-5 passed the "Gödel Test" for the first time, cracking three major combinatorial optimization conjectures! It even independently overturned an original conjecture and provided a new effective solution, stunning OpenAI research scientists on the spot.
AI welcomes a historic moment!
GPT-5 successfully cracked three major conjectures and passed the "Gödel Test."
OpenAI scientist Sebastien Bubeck exclaimed that top Ph.D. students often spend several days solving such open problems.
Unlike previous attempts, this research, led by the University of Haifa and Cisco, challenged AI directly with "open mathematical conjectures" for the first time.
Paper link: https://arxiv.org/pdf/2509.18383
In the paper, the team designed five test tasks in the field of "combinatorial optimization," providing one or two literature references as context for each task.
For three relatively simple problems, GPT-5 provided near-perfect solutions, demonstrating its strong logical reasoning ability.
Surprisingly, in Conjecture 2, it not only successfully solved the problem but also derived an effective solution different from what the researchers anticipated, overturning the original conjecture.
This breakthrough marks a critical leap for leading AI, moving from "learning mathematics" to "truly doing mathematics."
It is clear that AI is making substantive contributions to mathematical discovery, foreshadowing a profound paradigm shift in scientific research expected in the 2030s.
AI Challenges the "Gödel Test," Exceeding Terence Tao's Imagination
Previously, Terence Tao shared his experience collaborating with OpenAI o1, vividly describing it as "mentoring a mediocre, but not entirely incompetent graduate student."
In his view, while LLMs could gradually arrive at solutions after extensive prompting, they were incapable of independently generating crucial conceptual ideas.
However, after one or two iterations, combined with tools, the AI could reach the level of a "competent graduate student."
OpenAI and Google both claim that their frontier LLMs can achieve IMO gold-medal performance without external tools.
But the IMO, however challenging, is ultimately a competition designed for high school students.
In the latest paper, the research focus is different: having AI handle more advanced mathematical conjectures, i.e., the "Gödel Test."
These conjectures require not just problem-solving ability but also the integration of background knowledge and innovative thinking.
Therefore, the researchers selected problems from submodular maximization, a sub-field of combinatorial optimization. These problems are specific, well-motivated, and narrow enough in scope to serve as a controlled test of mathematical reasoning.
Unlike Tao's experiment, the team did not provide extensive prompts or guidance.
In the paper, they meticulously designed five major conjectures.
They only gave a minimal description for each problem, along with 1-2 references.
The difficulty was calibrated so that strong undergraduate or graduate students would be expected to solve all of the problems within a day, and most problems had clear conjectured answers and known solution paths.
GPT-5's task was to generate a complete proof based on limited input.
This simulates a real research scenario where mathematicians often start with few clues and explore independently.
In the test, GPT-5 showed both strengths and weaknesses. Let's look at the specific problem-solving capabilities.
GPT-5 Cracks Three Conjectures
Conjecture 1: Maximizing Monotone + Non-monotone Submodular Functions over a Convex Polytope
The problem here is to maximize the sum of two interacting objectives:
one component G only grows as more is selected (monotone), while the other component H may first rise and then fall (non-monotone), and the chosen solution must stay inside a bounded convex feasible region.
GPT-5's approach was to apply the continuous Frank-Wolfe idea, starting from zero, moving a small step in the direction that yields the most gain at that moment, and using a "mask" to ensure it stays within bounds.
It substituted H for the "concave function" in the reference paper, derived a recurrence relation, and finally obtained a split guarantee:
at least a (1−1/e) ≈ 63% fraction of G(o), plus a 1/e ≈ 37% fraction of H(o) (63% as well if H happens to be monotone), up to a small error term that scales linearly with the step-size parameter ε.
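For intuition, here is a minimal Python sketch of this kind of masked Frank-Wolfe loop. The toy objective, the budget constraint, and every function name below are illustrative assumptions made for the demo; this is not the construction analyzed in the paper.

```python
import numpy as np

def toy_objective_grad(x, a, B):
    """Gradient of a toy smooth objective F(x) = a.x - 0.5 * ||B x||^2 (demo only)."""
    return a - B.T @ (B @ x)

def frank_wolfe_with_mask(a, B, budget, eps=0.01):
    """Small steps toward the currently best direction; the (1 - x) mask
    keeps every coordinate inside the [0, 1] box."""
    n = len(a)
    x = np.zeros(n)                                  # start from the all-zeros point
    for _ in range(int(1 / eps)):
        grad = toy_objective_grad(x, a, B)
        # Linear maximization oracle for "pick at most `budget` coordinates":
        # take the top-`budget` coordinates whose gradient is positive.
        v = np.zeros(n)
        top = np.argsort(-grad)[:budget]
        v[top[grad[top] > 0]] = 1.0
        # Masked step: multiplying by (1 - x) guarantees x stays in [0, 1]^n.
        x = x + eps * v * (1 - x)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.uniform(0.5, 1.5, size=8)                # toy linear reward
    B = rng.normal(scale=0.3, size=(8, 8))           # toy interaction matrix
    print("fractional solution:", np.round(frank_wolfe_with_mask(a, B, budget=3), 3))
```

The sketch stops at a fractional point inside the box; in the analyses discussed here, such a point would typically still need further processing, which is omitted.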
Conjecture 2: A Bi-criteria Algorithm under p-system constraints
This problem allows the solution's value to be nearly optimal, a (1−ε) fraction of the optimum, while slightly violating feasibility by a relaxation factor g(ε). The goal is to keep g(ε) as small as possible under the broader class of p-system constraints.
GPT-5 proposed a simple yet effective procedure: in each round, based on the current solution, it performs another greedy selection for the "most valuable items within the constraint," and finally merges the results from several rounds.
The key to the proof is that in each round, the gap to optimality shrinks to a p/(p+1) fraction of its previous value, so after several rounds the gap decays exponentially. Thus, performing ℓ ≈ ln(1/ε)/ln((p+1)/p) rounds pushes the value up to a (1−ε) fraction of the optimum.
This means the relaxation factor is g_p(ε)=⌈ln(1/ε)/ln((p+1)/p)⌉.
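To make the relaxation factor concrete, the short Python check below simply evaluates that recursion for a few values of p and ε; the function name is ours and the snippet is not taken from the paper.

```python
import math

def rounds_needed(p, eps):
    """Number of greedy rounds l such that (p / (p + 1)) ** l <= eps."""
    return math.ceil(math.log(1 / eps) / math.log((p + 1) / p))

for p in (1, 2, 5):
    for eps in (0.1, 0.01):
        l = rounds_needed(p, eps)
        residual = (p / (p + 1)) ** l          # remaining gap after l rounds
        print(f"p={p}, eps={eps}: g_p(eps) = {l} rounds, residual gap = {residual:.4f}")
```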
Unexpectedly, in Conjecture 2, GPT-5 even derived a different approximation guarantee. After verification, this overturned the original conjecture, and GPT-5 provided a valid solution.
Conjecture 3: Maximization of γ-Weakly DR-Submodular Functions under Convex Constraints
This conjecture relaxes the continuous version of "diminishing marginal returns" with an intensity parameter γ (γ=1 is the standard case; smaller γ means weaker diminishing returns).
GPT-5 again used Frank-Wolfe: at each step it solves a linear subproblem along the gradient, advances by a small step, and controls the discretization error using smoothness.
A core step was scaling the key inequality of the classic proof by γ, generalizing the well-known 1−1/e approximation ratio to 1−e^{−γ}, up to an adjustable error term on the order of L/(2K) (where K is the number of iterations and L is the smoothness constant).
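The snippet below gives a hedged numeric illustration of that arithmetic, assuming the standard per-step recursion in which each of the K iterations closes a γ/K fraction of the remaining gap; the smoothness value L used here is an arbitrary placeholder.

```python
import math

L_smooth = 1.0  # assumed smoothness constant, purely for illustration
for gamma in (1.0, 0.5, 0.25):
    for K in (10, 100, 1000):
        # If each of the K steps closes a gamma/K fraction of the remaining gap,
        # a (1 - gamma/K)**K fraction survives; since that is at most exp(-gamma),
        # the approximation ratio tends to 1 - exp(-gamma) as K grows.
        ratio_K = 1 - (1 - gamma / K) ** K
        ratio_limit = 1 - math.exp(-gamma)
        error = L_smooth / (2 * K)               # additive error knob
        print(f"gamma={gamma}, K={K}: ratio ≈ {ratio_K:.4f} "
              f"(limit {ratio_limit:.4f}), additive error ≈ {error:.4f}")
```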
In the researchers' view, the conclusion and the main body of the reasoning were reliable.
However, GPT-5 unnecessarily assumed the condition of "downward closure" and was slightly inconsistent about the detail that "sum of step sizes = 1."
It is evident that when a problem has a single, clear line of reasoning, GPT-5 performs well: it produced nearly correct proofs for three of the five problems.
However, when different proofs need to be combined, such as in problems 4 and 5, GPT-5 struggled.
In Conjecture 5, GPT-5 did identify the same algorithm hypothesized by the authors, but the analysis was incorrect.
The researchers later realized that the proof was possible but harder than anticipated.
This points to GPT-5's main current shortcoming: it lacks the "integrative reasoning" needed to combine several lines of argument into one proof.
Still, compared with earlier models, GPT-5 shows clear mathematical improvement in specialized areas like combinatorial optimization, occasionally demonstrating small innovations.
Author Introductions
Moran Feldman
Moran Feldman is a professor in the Department of Computer Science at the University of Haifa.
Prior to this, he served on the faculty of the Open University of Israel and was a postdoctoral researcher at EPFL, supervised by Professor Ola Svensson.
Amin Karbasi
Amin Karbasi is the Head of AI at Cisco Foundation, previously Chief Scientist at Robust Intelligence, a professor at Yale University, and an engineer at Google.