Meng Chen from Aofeisi | QbitAI Official Account
AI has successfully found a Linux security vulnerability, and it is a kernel-level zero-day.
Just now, OpenAI's president reposted the experimental results of independent researcher Sean Heelan: using the o3 model, he found a remote zero-day vulnerability in the Linux kernel's SMB implementation.
What's even more surprising is that the entire process didn't use any complex tools – no scaffolding, no agent framework, no tool calls, just the o3 API itself.
The vulnerability is numbered CVE-2025-37899, a use-after-free vulnerability in the SMB 'logoff' command handler.
According to the author, this is the first publicly discussed vulnerability of its kind discovered by a large language model.
After seeing the discovery process, some netizens remarked that they had expected an elaborate experimental setup, when in reality it was just a matter of concatenating a pile of code and having o3 check it 100 times.
Hopefully, other white-hat hackers have started checking other critical operating systems this way.
OpenAI Chief Research Officer Mark Chen commented: reasoning models like o3 are starting to empower deep technical work and meaningful scientific discovery. Over the next year, results like this will become increasingly common.
AI Not Only Finds Vulnerabilities, But Also Assists in Fixing Them
Sean Heelan is an independent researcher focused on large model-based vulnerability research and automated exploit generation.
He was originally manually checking for vulnerabilities in the Linux kernel's KSMBD (kernel-mode SMB3 protocol implementation) and wanted to temporarily stay away from large model-related tool development.
But after o3 was released, he couldn't resist testing it: "Since I already have these vulnerabilities, why not see if o3 can find them?"
He first used a vulnerability he manually discovered, CVE-2025-37778, as a test. This was a use-after-free vulnerability in the Kerberos authentication path, which served as a good benchmark for testing large model capabilities because:
It is part of the Linux kernel's remote attack surface
It is not simple; discovering it requires understanding how to trigger specific session states, multiple execution paths of authentication functions, and cross-function object lifecycle management.
But it's also not extremely complex; the author himself could understand the approximately 3300 lines of code involved within 10 minutes.
Before starting the experiment, one question needed to be clarified: which code should be presented to the large model?
Ideally, the large model would read the entire codebase and output results, but in practice, due to context window limitations, this is not yet feasible.
Ultimately, Heelan provided the LLM with the code for the "session setup" command handler, including all functions it called, up to a call depth of 3 (the depth required to contain the code needed for inferring the vulnerability), as well as complete context such as command handlers, connection establishment/teardown, and request dispatch.
The prompt told o3 to look specifically for use-after-free vulnerabilities, described KSMBD's architecture, and stressed avoiding false positives; each experiment was run 100 times and the success rate recorded.
The relevant code and prompts have been open-sourced on GitHub.
Heelan also tested Claude 3.7 and Claude 3.5 using the same process for comparison, and o3's performance was eye-opening:
OpenAI o3: found the vulnerability in 8 out of 100 runs
Claude 3.7: found it in 3 out of 100 runs
Claude 3.5: found it in 0 out of 100 runs
Interestingly, o3's and Claude's output styles were also completely different.
o3 generated vulnerability reports that looked like they were written by humans, with clear structure and prominent key points.
Claude was more like a stream of consciousness or a work log, rich in detail but slightly verbose.
What happened next was even more exciting.
Heelan wanted to test o3's performance with a larger amount of code, so he provided it with the code for all SMB command handlers, approximately 12,000 lines of code.
Although the success rate of finding the original vulnerability dropped to about 1% with this much code, among the run results o3 reported a completely new vulnerability.
Heelan himself was not aware of this vulnerability previously. It was also caused by the deallocation of sess->user, but this time in the session logoff handler.
o3's own explanation of this vulnerability is as follows:
When a ksmbd worker thread is still executing a request that uses sess->user, another thread handling the same session's SMB2 LOGOFF frees that structure. There is no synchronization mechanism to protect this pointer, so the first thread will dereference freed memory – a classic use-after-free vulnerability that can lead to kernel memory corruption and arbitrary code execution.
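To make the shape of that race concrete, here is a minimal user-space sketch. It uses hypothetical, simplified names and structures rather than the real ksmbd code: one thread is still using sess->user when another thread, handling a LOGOFF on the same session, frees it, with nothing synchronizing the two.

```c
/* Illustrative sketch only: hypothetical, simplified structures,
 * NOT the actual ksmbd source. Two threads share one session; the
 * worker keeps using sess->user while the logoff thread frees it. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct user    { char name[32]; };
struct session { struct user *user; };   /* nothing guards 'user' */

static void *worker_thread(void *arg)    /* an in-flight SMB request */
{
    struct session *sess = arg;
    usleep(1000);                        /* still doing work... */
    /* If the logoff thread has already freed sess->user, this
     * dereferences freed memory: the use-after-free o3 described. */
    printf("worker sees user: %s\n",
           sess->user ? sess->user->name : "(null)");
    return NULL;
}

static void *logoff_thread(void *arg)    /* SMB2 LOGOFF, same session */
{
    struct session *sess = arg;
    free(sess->user);                    /* freed while worker may use it */
    sess->user = NULL;                   /* too late: no synchronization */
    return NULL;
}

int main(void)
{
    struct session sess;
    sess.user = malloc(sizeof(*sess.user));
    strcpy(sess.user->name, "alice");

    pthread_t a, b;
    pthread_create(&a, NULL, worker_thread, &sess);
    pthread_create(&b, NULL, logoff_thread, &sess);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```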
Even more interestingly, the researcher found that o3 not only finds vulnerabilities but in some cases provides more complete fixes than humans.
For the previous Kerberos authentication vulnerability, Heelan's initial proposed fix was simply to nullify the pointer:
```c
if (sess->state == SMB2_SESSION_VALID) {
	ksmbd_free_user(sess->user);
	sess->user = NULL;  /* add this line */
}
```
However, o3's report pointed out that this fix was insufficient because the SMB protocol allows two different connections to "bind" to the same session, and another thread might still access this pointer during the brief window between deallocation and nullification.
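The direction a more robust fix could take is sketched below. This is a hypothetical illustration with made-up names, not the actual kernel patch: the idea is that the code which uses sess->user and the code which frees it must be ordered by some shared synchronization (here a mutex standing in for a session lock), so a worker on a bound connection can never observe the pointer mid-teardown.

```c
/* Hypothetical sketch of the fix direction o3's critique points to,
 * NOT the actual upstream patch. Both the reader and the teardown
 * path take the same lock before touching sess->user. */
#include <pthread.h>
#include <stdlib.h>

struct user    { char name[32]; };
struct session {
    pthread_mutex_t lock;    /* stands in for the session's lock */
    struct user *user;
};

/* Reader side: any handler that needs sess->user */
void use_session_user(struct session *sess)
{
    pthread_mutex_lock(&sess->lock);
    if (sess->user) {
        /* ... safely use sess->user here ... */
    }
    pthread_mutex_unlock(&sess->lock);
}

/* Teardown side: free and clear under the same lock */
void drop_session_user(struct session *sess)
{
    pthread_mutex_lock(&sess->lock);
    free(sess->user);
    sess->user = NULL;
    pthread_mutex_unlock(&sess->lock);
}
```

In real kernel code this would more likely be handled with reference counting on the user object rather than a coarse lock, but the underlying point from o3's report is the same: nulling the pointer after freeing it does not, by itself, close the race.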
Heelan ultimately admitted that if o3 had been used to find and fix the original vulnerability, it "theoretically" would have done a better job than he did himself.
The qualifier "theoretically" is added because the current false positive rate of AI is a bit high, making it difficult for humans to meticulously review every report from o3.
However, he also believes that with technological advancement, this rate will only decrease.
Heelan concluded his report with:
In the space of program analysis capabilities, large models are closer to human-level performance than anything we have seen before. Considering creativity, flexibility, and generality, LLMs resemble human code auditors far more than symbolic execution, abstract interpretation, or fuzzing do.
He specifically emphasized that if you are engaged in security research, you should start paying close attention now:
Expert researchers will not be replaced but will become more efficient.
For code issues within 10,000 lines, o3 has a considerable probability of solving or helping to solve them.
Although there is still a signal-to-noise ratio issue of approximately 1:50, it is already worth investing time and effort.
However, some have also seen the risks:
What if malicious actors use AI's capabilities to find similar vulnerabilities and attack systems?
— End —