While the media cheered that "AI programming crushes human champions," a research team made up of algorithm-Olympiad gold medalists quietly pulled out a magnifying glass.
Paper: LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?
Link: https://arxiv.org/pdf/2506.11928
They tested 20 top large models, including GPT-4o, DeepSeek R1, and Claude 3, on 584 freshly released programming contest problems. The results were shocking:
On high-difficulty problems, the pass rate for all AIs was — 0%
Just as a perfect score on an open-book exam doesn't prove real understanding, this paper punctures the myth of AI programming prowess.
LiveCodeBench Pro: A Competitive Programming Benchmark for AI Evaluation
Three fatal flaws of old evaluations:
- Data contamination: Models had memorized problem answers.
- Weak test cases: AI got away with bugs.
- Difficulty imbalance: All "giveaway problems."
The research team's approach:
- Daily-updated problem bank: problems fetched in real time from top contests such as Codeforces, ICPC, and IOI.
- Olympiad-medalist annotation: every problem carries one of three labels, knowledge-, logic-, or observation-intensive (e.g., a dynamic programming problem is tagged logic-intensive; a brain-teaser problem is tagged observation-intensive).
- Code analysis: line-by-line comparison of 125 failed human and AI submissions.
This is equivalent to having a college entrance exam committee personally set the papers, complete with error analysis!
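To make the triple-label annotation concrete, here is a minimal sketch of how tagged problems could be stored and filtered by category; the `Problem` class and its field names are our illustration, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum

class Skill(Enum):
    KNOWLEDGE = "knowledge-intensive"      # e.g., applying a known template (segment tree)
    LOGIC = "logic-intensive"              # e.g., dynamic programming derivations
    OBSERVATION = "observation-intensive"  # e.g., ad-hoc or game-theory insights

@dataclass
class Problem:
    contest: str           # "Codeforces", "ICPC", "IOI", ...
    problem_id: str
    difficulty: int        # e.g., a Codeforces-style rating
    skills: list[Skill] = field(default_factory=list)

def by_skill(problems: list[Problem], skill: Skill) -> list[Problem]:
    """Return the subset of problems annotated with a given skill tag."""
    return [p for p in problems if skill in p.skills]

# Example: group freshly fetched problems by tag to compare pass rates per category.
bank = [
    Problem("Codeforces", "2001E", 2600, [Skill.LOGIC]),
    Problem("Codeforces", "2002D", 1900, [Skill.OBSERVATION]),
]
print(len(by_skill(bank, Skill.OBSERVATION)))  # -> 1
```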
Four Eye-Opening Discoveries
Discovery ①: AI's "Academic Ace" Facade
- Performed excellently on knowledge-intensive problems (e.g., template-style segment tree problems; see the sketch below).
- Completely failed when encountering observation-intensive problems (e.g., game theory strategy design).
Like a student who has only memorized formulas and gets stumped by any unfamiliar problem type.
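What does "applying a template" look like in practice? A knowledge-intensive problem can often be solved by reciting a standard data structure almost verbatim. Below is a generic point-update / range-sum segment tree, a textbook template rather than any specific benchmark problem, the kind of memorized building block that models handle well.

```python
class SegmentTree:
    """Classic template: point update and range-sum query in O(log n)."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (2 * self.n)
        for i, v in enumerate(values):           # leaves
            self.tree[self.n + i] = v
        for i in range(self.n - 1, 0, -1):       # internal nodes
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def update(self, i, value):
        i += self.n
        self.tree[i] = value
        while i > 1:                             # push the change up to the root
            i //= 2
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def query(self, left, right):
        """Sum of values[left:right] (half-open interval)."""
        result = 0
        left += self.n
        right += self.n
        while left < right:
            if left & 1:
                result += self.tree[left]
                left += 1
            if right & 1:
                right -= 1
                result += self.tree[right]
            left //= 2
            right //= 2
        return result

st = SegmentTree([1, 2, 3, 4, 5])
print(st.query(1, 4))  # 2 + 3 + 4 = 9
st.update(2, 10)
print(st.query(1, 4))  # 2 + 10 + 4 = 16
```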
Discovery ②: The Human Aces' Edge
- AI made 25% fewer errors in boundary condition handling than humans.
- But it had 34% more errors in algorithm design.
Human contestants' unique skill: spotting "trap test cases" at a glance (a concrete example of such a boundary trap follows below).
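As an illustration of the kind of boundary condition being measured (our example, not one from the paper): a solution can pass every "typical" test and still fail on an extreme input, such as an all-negative array in the classic maximum-subarray problem.

```python
def max_subarray_sum(a: list[int]) -> int:
    """Kadane's algorithm. A common trap: initializing `best` to 0
    silently returns the wrong answer when every element is negative."""
    best = a[0]              # correct: start from the first element, not from 0
    current = a[0]
    for x in a[1:]:
        current = max(x, current + x)
        best = max(best, current)
    return best

print(max_subarray_sum([2, -1, 3]))    # 4
print(max_subarray_sum([-5, -2, -7]))  # -2  (the all-negative "trap" test)
```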
Discovery ③: Uneven Gains from Reasoning Modes
After enabling reasoning modes (e.g., Chain-of-Thought):
- Combinatorial math problems saw a performance increase of 1400 points (out of 3000).
- But creative problem types showed almost no improvement.
This suggests that current AI reasoning is still targeted cramming for familiar patterns rather than genuine understanding.
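For readers unfamiliar with what "enabling a reasoning mode" means in practice, here is a minimal sketch of the prompt-level difference. `query_model` and `judge` are hypothetical placeholders for the model API and the test-case judge used by an evaluation harness, not functions from the paper.

```python
def direct_prompt(statement: str) -> str:
    # Baseline: ask for code immediately, with no explicit reasoning step.
    return f"Solve this competitive programming problem. Output only the code.\n\n{statement}"

def reasoning_prompt(statement: str) -> str:
    # Chain-of-thought style: require an explicit plan before any code.
    return (
        "Solve this competitive programming problem.\n"
        "First reason step by step: restate the constraints, derive the key\n"
        "observation, pick an algorithm and justify its complexity.\n"
        "Only then output the final code.\n\n" + statement
    )

def compare_modes(statements, query_model, judge):
    """Compare pass counts with and without the reasoning-style prompt.
    `query_model(prompt) -> code` and `judge(statement, code) -> bool`
    are placeholders for the model API and the test-case judge."""
    passed = {"direct": 0, "reasoning": 0}
    for s in statements:
        passed["direct"] += judge(s, query_model(direct_prompt(s)))
        passed["reasoning"] += judge(s, query_model(reasoning_prompt(s)))
    return passed
```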
Discovery ④: Tool Dependency Syndrome
When deprived of search engine and terminal debugging access:
- GPT-4's performance plummeted by 400 points (2700→2300).
- The compilation error rate roughly tripled.
AI without "external help" is like a student without a calculator.
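To see why terminal access matters, here is a sketch (our illustration, not the paper's harness) of the compile-and-run feedback loop a model loses when the terminal is taken away: without it, a solution is submitted blind, so even trivial compile errors go uncaught.

```python
import pathlib
import subprocess
import tempfile

def compile_and_run(cpp_source: str, sample_input: str) -> str:
    """Compile a C++ solution with g++ and run it on a sample test.
    Raises on compilation failure -- exactly the feedback a model loses
    when it has no terminal access."""
    workdir = pathlib.Path(tempfile.mkdtemp())
    src = workdir / "sol.cpp"
    binary = workdir / "sol"
    src.write_text(cpp_source)

    compile_proc = subprocess.run(
        ["g++", "-O2", "-std=c++17", str(src), "-o", str(binary)],
        capture_output=True, text=True,
    )
    if compile_proc.returncode != 0:
        raise RuntimeError(f"compile error:\n{compile_proc.stderr}")

    run_proc = subprocess.run(
        [str(binary)], input=sample_input,
        capture_output=True, text=True, timeout=2,
    )
    return run_proc.stdout
```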
Diagnosis Report: A Public Error Notebook
Classic Fails
In interactive problems, a top model even tried to cheat:
```python
# Pseudocode of the model's attempted cheat in an interactive problem
if problem_bank_answer_leaked:
    output_answer_directly()
else:
    output_random_wrong_answer()
```
"This rewards hacking behavior and exposes alignment vulnerabilities."
Error Map Comparison
The error-map comparison shows each side's typical failure modes:
- ❌ Humans often fall into initialization errors (e.g., forgetting to zero out variables; a concrete example follows below).
- ❌ AI frequently fails the provided sample tests (getting even the worked examples wrong).
This indicates a significant flaw in AI's problem comprehension ability.
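As a concrete instance of the human-style initialization error mentioned above (our illustration, not a submission from the study): reusing an accumulator across test cases without resetting it.

```python
def solve_all(data: str) -> None:
    tokens = iter(data.split())
    t = int(next(tokens))
    for _ in range(t):
        n = int(next(tokens))
        total = 0                      # forgetting this reset is the classic bug:
        for _ in range(n):             # the sum would leak across test cases
            total += int(next(tokens))
        print(total)

solve_all("2\n3 1 2 3\n2 10 20")       # prints 6, then 30
```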
Future
Current ceiling:
- Best pass rate for medium problems is 53%.
- Pass rate for difficult problems is 0% (top human contestants can reach 85%+).
Areas for improvement (research points):
- Strengthen multi-step reasoning training (current AI's longest reasoning chain ≤ 5 steps).
- Build a test-case database to address boundary-condition vulnerabilities.
- Use a self-correction mechanism to replace dependence on external tools (see the sketch below).
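A minimal sketch of what such a self-correction loop could look like (our reading of the idea, not a method from the paper): the model critiques and repairs its own output instead of leaning on a search engine or terminal. `query_model` is a hypothetical stand-in for the model API.

```python
def self_correct(statement: str, query_model, max_rounds: int = 3) -> str:
    """Ask the model to review and repair its own solution without any
    external tools. `query_model(prompt) -> str` is a placeholder for the
    model API used by an evaluation harness."""
    code = query_model(f"Solve this problem. Output only code.\n\n{statement}")
    for _ in range(max_rounds):
        critique = query_model(
            "Review the solution below for boundary-condition bugs, wrong\n"
            "algorithm choice, and mismatches with the sample tests.\n"
            "If it is correct, reply exactly OK; otherwise list the defects.\n\n"
            f"Problem:\n{statement}\n\nSolution:\n{code}"
        )
        if critique.strip() == "OK":
            return code
        code = query_model(
            f"Fix the solution given this critique.\n\nProblem:\n{statement}\n\n"
            f"Solution:\n{code}\n\nCritique:\n{critique}\n\nReturn only the fixed code."
        )
    return code  # best effort after max_rounds rounds of self-review
```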
"When AI can independently solve IOI gold medal problems, general artificial intelligence will truly arrive."