Just Now, GPT-5 Passes the "Gödel Test" for the First Time! Cracking Three Major Mathematical Conjectures

XinZhiYuan Report

Editor: Peach

[XinZhiYuan Summary] GPT-5 passed the "Gödel Test" for the first time, cracking three major combinatorial optimization conjectures! It even independently overturned an original conjecture and provided a new effective solution, stunning OpenAI research scientists on the spot.

AI welcomes a historic moment!

GPT-5 successfully cracked three major conjectures and passed the "Gödel Test."

OpenAI scientist Sebastien Bubeck exclaimed that top Ph.D. students often spend several days solving such open problems.

Unlike previous attempts, this research, led by the University of Haifa and Cisco, challenged AI directly with "open mathematical conjectures" for the first time.

Paper link: https://arxiv.org/pdf/2509.18383

In the paper, the team designed five test tasks in the field of "combinatorial optimization," providing one or two literature references as context for each task.

For three relatively simple problems, GPT-5 provided near-perfect solutions, demonstrating its strong logical reasoning ability.

Surprisingly, in Conjecture 2, it not only successfully solved the problem but also derived an effective solution different from what the researchers anticipated, overturning the original conjecture.

This breakthrough marks a critical leap for leading AI, moving from "learning mathematics" to "truly doing mathematics."

It is clear that AI is making substantive contributions to mathematical discovery, previewing a profound paradigm shift in scientific research anticipated for the 2030s.

AI Challenges the "Gödel Test," Exceeding Terence Tao's Imagination

Previously, Terence Tao shared his experience collaborating with OpenAI o1, vividly describing it as "mentoring a mediocre, but not entirely incompetent graduate student."

In his view, while LLMs could gradually arrive at solutions after extensive prompting, they were incapable of independently generating crucial conceptual ideas.

However, after one or two iterations, combined with tools, the AI could reach the level of a "competent graduate student."

OpenAI and Google both claim that their cutting-edge LLMs can win an IMO Gold Medal without external tools.

But those competition problems are, after all, designed for high school students.

In the latest paper, the research focus is different: having AI handle more advanced mathematical conjectures, i.e., the "Gödel Test."

These conjectures require not just problem-solving ability but also the integration of background knowledge and innovative thinking.

Therefore, the researchers selected problems from submodular maximization, a sub-field of combinatorial optimization. These problems are concrete, well motivated, and narrowly scoped, making them a controlled test of mathematical reasoning ability.

Unlike Tao's experiment, the team did not provide extensive prompts or guidance.

In the paper, they meticulously designed five major conjectures.

They gave only a minimal description of each problem, along with one or two references.

The difficulty was calibrated so that strong undergraduates or graduate students would be expected to solve every problem within a day, and most problems had a clear conjecture and a known solution path.

GPT-5's task was to generate a complete proof based on limited input.

This simulates a real research scenario where mathematicians often start with few clues and explore independently.

In the test, GPT-5 showed both strengths and weaknesses. Let's look at the specific problem-solving capabilities.

GPT-5 Cracks Three Conjectures

Conjecture 1: Maximizing Monotone + Non-monotone Submodular Functions over a Convex Polytope

The problem asks to maximize the sum of two interacting objectives:

One objective, G, only grows as more items are added (monotone), while the other, H, may first rise and then fall (non-monotone), and the chosen solution must stay inside a convex feasible region that caps how much can be selected.

GPT-5's approach was to apply the continuous Frank-Wolfe idea, starting from zero, moving a small step in the direction that yields the most gain at that moment, and using a "mask" to ensure it stays within bounds.

It replaced the "concave function" position in the reference paper with H, derived a recurrence relation, and finally obtained a split guarantee:

At least a (1−1/e) ≈ 63% fraction of G(o) is achieved, plus a 1/e ≈ 37% fraction of H(o) (if H is also monotone, this too rises to 63%), minus a small error term that shrinks linearly with the step-size parameter ε.
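
As an illustration only, here is a minimal sketch of that Frank-Wolfe style loop, assuming hypothetical oracles grad_G and grad_H (gradients of the two objectives) and lp_oracle (a linear-maximization oracle over the polytope); this is not the paper's algorithm, just the "small masked steps from zero" idea described above.

```python
import numpy as np

def frank_wolfe_sketch(grad_G, grad_H, lp_oracle, n, eps=0.01):
    """Sketch of the masked Frank-Wolfe / continuous-greedy idea:
    start at 0, repeatedly take a small step toward the feasible point
    with the largest linearized gain, masking so x stays in [0, 1]^n."""
    x = np.zeros(n)
    for _ in range(int(round(1.0 / eps))):
        direction = lp_oracle(grad_G(x) + grad_H(x))  # best feasible direction right now
        x = x + eps * (1.0 - x) * direction           # masked step: no coordinate can exceed 1
    return x
```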

Conjecture 2: A Bi-criteria Algorithm under p-system constraints

This problem allows the value to be almost optimal, at least a (1−ε) fraction, while slightly violating feasibility by a relaxation factor g(ε). The goal is to keep g(ε) as small as possible under the broader class of p-system constraints.

GPT-5 proposed a simple yet effective procedure: in each round, based on the current solution, it performs another greedy selection for the "most valuable items within the constraint," and finally merges the results from several rounds.
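
A rough sketch of that multi-round procedure, under assumptions of my own (a value oracle f for the monotone submodular objective and a membership oracle is_feasible for the p-system; the round count comes from the analysis below), might look like this:

```python
import math

def multi_round_greedy(f, ground_set, is_feasible, p, eps):
    """Sketch of the repeated-greedy idea: each round greedily builds a new
    feasible set, measuring marginal gains on top of everything chosen so far,
    and the rounds are merged into one bi-criteria solution."""
    rounds = math.ceil(math.log(1.0 / eps) / math.log((p + 1) / p))
    merged = set()
    for _ in range(rounds):
        current, remaining = set(), set(ground_set)
        while remaining:
            # marginal gains of elements that can still be added feasibly this round
            gains = {e: f(merged | current | {e}) - f(merged | current)
                     for e in remaining if is_feasible(current | {e})}
            if not gains:
                break
            best = max(gains, key=gains.get)
            remaining.discard(best)
            if gains[best] > 0:
                current.add(best)
        merged |= current
    return merged  # union of at most `rounds` feasible sets
```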

The key to the proof is that each round multiplies the gap to optimality by p/(p+1), so after several rounds the gap decays exponentially. Thus, performing ℓ ≈ ln(1/ε)/ln((p+1)/p) rounds pushes the value up to a (1−ε) fraction of the optimum.

This means the relaxation factor is g_p(ε)=⌈ln(1/ε)/ln((p+1)/p)⌉.
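
Spelling that last step out (a reconstruction of the arithmetic, with S_ℓ the merged solution after ℓ rounds and OPT an optimal feasible set; notation mine, not the paper's):

```latex
% Each round multiplies the gap to optimality by p/(p+1), so after \ell rounds
\[
  f(\mathrm{OPT}) - f(S_\ell) \;\le\; \Big(\tfrac{p}{p+1}\Big)^{\ell} f(\mathrm{OPT}).
\]
% Demanding a (1-\varepsilon)-fraction of the optimum therefore requires
\[
  \Big(\tfrac{p}{p+1}\Big)^{\ell} \le \varepsilon
  \iff
  \ell \ge \frac{\ln(1/\varepsilon)}{\ln\!\big(\tfrac{p+1}{p}\big)},
  \qquad\text{hence}\qquad
  g_p(\varepsilon) = \Big\lceil \tfrac{\ln(1/\varepsilon)}{\ln((p+1)/p)} \Big\rceil .
\]
```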

[Figure: part of GPT-5's solution process]

Unexpectedly, in Conjecture 2, GPT-5 derived an approximation guarantee different from the conjectured one. Once verified, this result overturned the original conjecture, and GPT-5's alternative solution proved valid.

Conjecture 3: Maximization of γ-Weak DR Submodular Functions with Convex Constraints

This conjecture relaxes the continuous version of "diminishing marginal returns" with an intensity parameter γ (γ=1 is the standard case; smaller γ means weaker diminishing returns).

GPT-5 again used Frank-Wolfe: stepwise solving a "linear subproblem along the gradient," moving forward with small steps, and controlling the discretization error based on smoothness.

A core step was scaling the key inequality in the classic proof by γ, thereby generalizing the famous 1−1/e approximation ratio to 1−e^{−γ}, minus an adjustable error term of order L/(2K) (where L is the smoothness constant and K is the number of iterations).
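
A plausible reconstruction of that bookkeeping (standard Frank-Wolfe analysis with step size 1/K, starting from x_0 = 0, with o the optimum, F the objective, and L the smoothness constant; notation assumed here, not quoted from the paper):

```latex
% Per-iteration progress, with the classic inequality scaled by gamma:
\[
  F(x_{k+1}) \;\ge\; F(x_k) + \frac{\gamma}{K}\big(F(o) - F(x_k)\big) - \frac{L}{2K^2}.
\]
% Unrolling the recurrence over K iterations:
\[
  F(x_K) \;\ge\; \Big(1 - \big(1 - \tfrac{\gamma}{K}\big)^{K}\Big) F(o) - \frac{L}{2K}
         \;\ge\; \big(1 - e^{-\gamma}\big) F(o) - \frac{L}{2K}.
\]
```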

In the researchers' view, the conclusion and the main body of the reasoning were reliable.

However, GPT-5 unnecessarily assumed that the feasible region is downward closed, and was slightly inconsistent about the detail that the step sizes must sum to 1.

It is evident that if the problem has a clear, single reasoning path, GPT-5 performs well—it provided almost correct proofs for three out of five questions.

However, when different proof techniques need to be combined, as in Conjectures 4 and 5, GPT-5 struggled.

In Conjecture 5, GPT-5 did identify the same algorithm hypothesized by the authors, but the analysis was incorrect.

The researchers later realized that the proof was possible, but harder than anticipated.

Compared to earlier models, GPT-5 shows clear mathematical improvement in specialized fields like combinatorial optimization, occasionally demonstrating small innovations.

At the same time, its difficulty with combining arguments highlights its current main shortcoming: a lack of "integrative reasoning" ability.

Author Introductions

Moran Feldman

Moran Feldman is a professor in the Department of Computer Science at the University of Haifa.

Prior to this, he served on the faculty of the Open University of Israel and was a postdoctoral researcher at EPFL, supervised by Professor Ola Svensson.

Amin Karbasi

Amin Karbasi is the Head of AI at Cisco Foundation, previously Chief Scientist at Robust Intelligence, a professor at Yale University, and an engineer at Google.

References:

https://arxiv.org/abs/2509.18383

https://x.com/tunedgradient/status/1970955153361850606
