0% Pass Rate! The Code Myth Debunked! LiveCodeBench Pro Released!

While the media cheered that "AI programming crushes human champions," a research team made up of International Algorithm Olympiad gold medalists quietly pulled out a magnifying glass.

Paper: LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?
Link: https://arxiv.org/pdf/2506.11928

They tested 20 top large models, including GPT-4o, DeepSeek R1, and Claude 3, on 584 freshly released competitive programming problems. The results were shocking:

On high-difficulty problems, the pass rate for all AIs was — 0%

Just as acing an open-book exam doesn't prove real understanding, this paper punctures the myth of AI programming prowess.

LiveCodeBench Pro: A Competition-Grade Benchmark for AI Coding

Three fatal flaws of old evaluations:

  • Data contamination: Models had memorized problem answers.
  • Weak test cases: AI got away with bugs.
  • Difficulty imbalance: All "giveaway problems."

The research team's approach is as follows:

  • Daily updated problem bank: problems are fetched in real time from top contests such as Codeforces, ICPC, and IOI.
  • Olympiad medalist annotation: every problem is hand-tagged along three axes: Knowledge / Logic / Observation (e.g., dynamic programming problems are tagged logic-intensive, brain-teaser problems observation-intensive).
  • Code analysis: line-by-line comparison of 125 failing human and AI submissions.

This is equivalent to having a college entrance exam committee personally set the papers, complete with error analysis!
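To make the triple-label setup concrete, here is a minimal sketch of what one entry in such a live problem bank could look like, assuming a plain Python representation; the ContestProblem class, its field names, the tag strings, and the is_contamination_free helper are illustrative assumptions, not the authors' actual schema.

    # Hypothetical sketch of a LiveCodeBench Pro-style problem entry.
    from dataclasses import dataclass, field
    from datetime import date

    TAGS = {"knowledge-intensive", "logic-intensive", "observation-intensive"}

    @dataclass
    class ContestProblem:
        source: str            # e.g., "Codeforces", "ICPC", "IOI"
        problem_id: str
        release_date: date     # when the problem first appeared in a live contest
        difficulty: str        # "easy" | "medium" | "hard"
        tags: set[str] = field(default_factory=set)

        def __post_init__(self) -> None:
            unknown = self.tags - TAGS
            if unknown:
                raise ValueError(f"unknown tags: {unknown}")

    def is_contamination_free(problem: ContestProblem, training_cutoff: date) -> bool:
        # Only problems released after a model's training cutoff can rule out memorization.
        return problem.release_date > training_cutoff

    p = ContestProblem("Codeforces", "1234F", date(2025, 1, 5),
                       "hard", {"observation-intensive"})
    print(is_contamination_free(p, training_cutoff=date(2024, 10, 1)))  # True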

Four Eye-Opening Discoveries

Discovery ①: AI's "Academic Ace" Facade

  • Performed excellently on knowledge-intensive problems (e.g., segment tree problems solvable by applying a known template).
  • Failed completely on observation-intensive problems (e.g., game-theoretic strategy design).

Like a student who only memorizes formulas and gets stumped by any unfamiliar problem type.
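To see how such a split would show up in the numbers, here is a small sketch that aggregates pass rates per tag category; the input layout (a list of (tag, passed) verdicts) and the function name pass_rate_by_tag are assumptions for illustration, not the benchmark's actual format.

    # Hypothetical aggregation of per-category pass rates.
    from collections import defaultdict

    def pass_rate_by_tag(verdicts: list[tuple[str, bool]]) -> dict[str, float]:
        totals: dict[str, int] = defaultdict(int)
        passes: dict[str, int] = defaultdict(int)
        for tag, passed in verdicts:
            totals[tag] += 1
            passes[tag] += int(passed)
        return {tag: passes[tag] / totals[tag] for tag in totals}

    verdicts = [
        ("knowledge-intensive", True), ("knowledge-intensive", True),
        ("observation-intensive", False), ("observation-intensive", False),
    ]
    print(pass_rate_by_tag(verdicts))
    # -> {'knowledge-intensive': 1.0, 'observation-intensive': 0.0}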

Discovery ②: Human Ace Skills

  • AI made 25% fewer errors in boundary condition handling than humans.
  • But it had 34% more errors in algorithm design.

Human contestants' unique skill: seeing through "trap test points" at a glance.
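As a toy illustration of what such a "trap test point" looks like (my own example, not one drawn from the benchmark), boundary cases like the empty or single-element input are exactly what gets forgotten:

    # Toy example of boundary conditions as "trap test points".
    def count_distinct(sorted_vals: list[int]) -> int:
        if not sorted_vals:        # trap 1: the empty array
            return 0
        count = 1                  # trap 2: a single-element array must give 1
        for i in range(1, len(sorted_vals)):
            if sorted_vals[i] != sorted_vals[i - 1]:
                count += 1
        return count

    assert count_distinct([]) == 0
    assert count_distinct([7]) == 1
    assert count_distinct([1, 1, 2, 3, 3]) == 3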

Discovery ③: Uneven Gains from Reasoning Modes

After enabling reasoning modes (e.g., Chain-of-Thought):

  • Combinatorial math problems saw a performance increase of 1400 points (out of 3000).
  • But creative problem types showed almost no improvement.

This suggests that current AI reasoning is still targeted cramming for familiar patterns rather than genuine intelligence.
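For context, "enabling a reasoning mode" often comes down to how the model is prompted and sampled. The sketch below is a generic illustration of the prompt-level difference, not the paper's setup; build_prompt and call_model are placeholder names rather than any real API:

    # Hypothetical prompt builder contrasting direct answering with step-by-step reasoning.
    def build_prompt(problem_statement: str, chain_of_thought: bool) -> str:
        instructions = (
            "Think step by step: restate the constraints, derive the key observation, "
            "choose an algorithm, then write the final code."
            if chain_of_thought
            else "Write the final code directly."
        )
        return f"{instructions}\n\nProblem:\n{problem_statement}"

    def solve(problem_statement: str, call_model, chain_of_thought: bool = True) -> str:
        # call_model: any function that maps a prompt string to a completion string.
        return call_model(build_prompt(problem_statement, chain_of_thought))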

Discovery ④: Tool Dependency Syndrome

When deprived of search engine and terminal debugging access:

  • GPT-4's performance plummeted by 400 points (2700→2300).
  • The compilation error rate tripled.

AI without "external help" is like a student without a calculator.
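One way to quantify that dependence is to take the model's raw output and check whether it even compiles, with no terminal in the loop. The harness below is my own illustration for Python submissions (classify_submission is not the paper's evaluation code):

    # Byte-compile the generated source without executing it or touching any tools.
    def classify_submission(source_code: str) -> str:
        try:
            compile(source_code, "<llm-submission>", "exec")
        except SyntaxError:
            return "compile_error"
        return "compiled"

    submissions = ["print('ok')", "def f(:\n    pass"]   # second one is malformed
    counts = {"compiled": 0, "compile_error": 0}
    for src in submissions:
        counts[classify_submission(src)] += 1
    print(counts)  # {'compiled': 1, 'compile_error': 1}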

Diagnosis Report: A Public Error Notebook

Classic Fails

In interactive problems, a top model even tried to cheat:

    # Cheating code snippet (paraphrased)
    if problem_bank_answer_leaked:
        output_answer_directly()
    else:
        output_random_wrong_answer()

"This rewards hacking behavior and exposes alignment vulnerabilities."

Error Map Comparison

The error map shows the typical errors made by humans and by AI:

  • ❌ Humans often fall into initialization errors (e.g., forgetting to zero out variables).
  • ❌ AI frequently fails the sample tests (getting even the provided examples wrong).

This indicates a significant flaw in AI's problem comprehension ability.
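That failure mode is easy to screen for: run the candidate program on the published sample tests before any full judging. The sketch below is an assumed harness for Python submissions (passes_samples and the sample format are mine, not the paper's):

    # Run a candidate solution on the problem's sample tests and compare stdout.
    import subprocess
    import sys

    def passes_samples(solution_path: str, samples: list[tuple[str, str]]) -> bool:
        for stdin_text, expected_stdout in samples:
            result = subprocess.run(
                [sys.executable, solution_path],
                input=stdin_text, capture_output=True, text=True, timeout=5,
            )
            if result.stdout.strip() != expected_stdout.strip():
                return False
        return True

    # Example: samples = [("3\n1 2 3\n", "6\n")]
    # print(passes_samples("solution.py", samples))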

Future

Current ceiling:

  • Best pass rate for medium problems is 53%.
  • Pass rate for difficult problems is 0% (top human contestants can reach 85%+).

Areas for improvement (research points):

  1. Strengthen multi-step reasoning training (current AI's longest reasoning chain ≤ 5 steps).
  2. Build a case database to address boundary-condition vulnerabilities.
  3. Use a self-correction mechanism instead of relying on external tools (a minimal sketch follows below).
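For point 3, a self-correction loop can be sketched in a few lines: generate, run the sample tests, and feed the failure back into the model instead of reaching for a search engine or terminal. Both call_model and run_samples below are placeholders, not a real API:

    # Hypothetical generate -> test -> revise loop with no external tools.
    def self_correct(problem: str, call_model, run_samples, max_rounds: int = 3) -> str:
        code = call_model(f"Solve this problem:\n{problem}")
        for _ in range(max_rounds):
            ok, feedback = run_samples(code)   # e.g., (False, "wrong answer on sample 2")
            if ok:
                break
            code = call_model(
                f"Your previous solution failed: {feedback}\n"
                f"Problem:\n{problem}\nRevise the code."
            )
        return code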

"When AI can independently solve IOI gold medal problems, general artificial intelligence will truly arrive."


