While the media cheered that "AI programming crushes human champions," a research team made up of algorithm-Olympiad gold medalists quietly pulled out a magnifying glass.
Paper: LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?
Link: https://arxiv.org/pdf/2506.11928
They tested 20 top large models, including GPT-4o, DeepSeek R1, and Claude 3, on 584 freshly released programming contest problems. The results were shocking:
On high-difficulty problems, the pass rate for all AIs was — 0%
Just as a perfect score on an open-book exam doesn't prove real understanding, this paper punctures the myth of AI programming prowess.
LiveCodeBench Pro: A Competitive Programming Benchmark for AI Evaluation
Three fatal flaws of old evaluations:
- Data contamination: Models had memorized problem answers.
- Weak test cases: AI got away with bugs.
- Difficulty imbalance: All "giveaway problems."
The research team's approach:
- Daily-updated problem bank: problems fetched in real time from top contests such as Codeforces, ICPC, and IOI.
- Olympiad-medalist annotation: every problem carries one of three labels, knowledge-, logic-, or observation-intensive (e.g., a dynamic programming problem is tagged logic-intensive; a brain-teaser problem is tagged observation-intensive).
- Code analysis: line-by-line comparison of 125 failed human and AI submissions.
This is equivalent to having a college entrance exam committee personally set the papers, complete with error analysis!
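To make the triple-label annotation concrete, here is a minimal sketch of how tagged problems could be stored and filtered by category; the `Problem` class and its field names are our illustration, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum

class Skill(Enum):
    KNOWLEDGE = "knowledge-intensive"      # e.g., applying a known template (segment tree)
    LOGIC = "logic-intensive"              # e.g., dynamic programming derivations
    OBSERVATION = "observation-intensive"  # e.g., ad-hoc or game-theory insights

@dataclass
class Problem:
    contest: str           # "Codeforces", "ICPC", "IOI", ...
    problem_id: str
    difficulty: int        # e.g., a Codeforces-style rating
    skills: list[Skill] = field(default_factory=list)

def by_skill(problems: list[Problem], skill: Skill) -> list[Problem]:
    """Return the subset of problems annotated with a given skill tag."""
    return [p for p in problems if skill in p.skills]

# Example: group freshly fetched problems by tag to compare pass rates per category.
bank = [
    Problem("Codeforces", "2001E", 2600, [Skill.LOGIC]),
    Problem("Codeforces", "2002D", 1900, [Skill.OBSERVATION]),
]
print(len(by_skill(bank, Skill.OBSERVATION)))  # -> 1
```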
Four Eye-Opening Discoveries
Discovery ①: AI's "Academic Ace" Facade
- Performed excellently on knowledge-intensive problems (e.g., template-style segment tree problems; see the sketch below).
- Completely failed when encountering observation-intensive problems (e.g., game theory strategy design).
Like a student who has only memorized formulas and gets stumped by any unfamiliar problem type.
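What does "applying a template" look like in practice? A knowledge-intensive problem can often be solved by reciting a standard data structure almost verbatim. Below is a generic point-update / range-sum segment tree, a textbook template rather than any specific benchmark problem, the kind of memorized building block that models handle well.

```python
class SegmentTree:
    """Classic template: point update and range-sum query in O(log n)."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (2 * self.n)
        for i, v in enumerate(values):           # leaves
            self.tree[self.n + i] = v
        for i in range(self.n - 1, 0, -1):       # internal nodes
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def update(self, i, value):
        i += self.n
        self.tree[i] = value
        while i > 1:                             # push the change up to the root
            i //= 2
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def query(self, left, right):
        """Sum of values[left:right] (half-open interval)."""
        result = 0
        left += self.n
        right += self.n
        while left < right:
            if left & 1:
                result += self.tree[left]
                left += 1
            if right & 1:
                right -= 1
                result += self.tree[right]
            left //= 2
            right //= 2
        return result

st = SegmentTree([1, 2, 3, 4, 5])
print(st.query(1, 4))  # 2 + 3 + 4 = 9
st.update(2, 10)
print(st.query(1, 4))  # 2 + 10 + 4 = 16
```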
Discovery ②: The Human Aces' Edge
- AI made 25% fewer errors in boundary condition handling than humans.
- But it had 34% more errors in algorithm design.
Human contestants' unique skill: spotting "trap test cases" at a glance (a concrete example of such a boundary trap follows below).
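As an illustration of the kind of boundary condition being measured (our example, not one from the paper): a solution can pass every "typical" test and still fail on an extreme input, such as an all-negative array in the classic maximum-subarray problem.

```python
def max_subarray_sum(a: list[int]) -> int:
    """Kadane's algorithm. A common trap: initializing `best` to 0
    silently returns the wrong answer when every element is negative."""
    best = a[0]              # correct: start from the first element, not from 0
    current = a[0]
    for x in a[1:]:
        current = max(x, current + x)
        best = max(best, current)
    return best

print(max_subarray_sum([2, -1, 3]))    # 4
print(max_subarray_sum([-5, -2, -7]))  # -2  (the all-negative "trap" test)
```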
Discovery ③: Uneven Gains from Reasoning Modes
After enabling reasoning modes (e.g., Chain-of-Thought):
- Combinatorial math problems saw a performance increase of 1400 points (out of 3000).
- But creative problem types showed almost no improvement.
This suggests that current AI reasoning is still targeted cramming for familiar patterns rather than genuine understanding.
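For readers unfamiliar with what "enabling a reasoning mode" means in practice, here is a minimal sketch of the prompt-level difference. `query_model` and `judge` are hypothetical placeholders for the model API and the test-case judge used by an evaluation harness, not functions from the paper.

```python
def direct_prompt(statement: str) -> str:
    # Baseline: ask for code immediately, with no explicit reasoning step.
    return f"Solve this competitive programming problem. Output only the code.\n\n{statement}"

def reasoning_prompt(statement: str) -> str:
    # Chain-of-thought style: require an explicit plan before any code.
    return (
        "Solve this competitive programming problem.\n"
        "First reason step by step: restate the constraints, derive the key\n"
        "observation, pick an algorithm and justify its complexity.\n"
        "Only then output the final code.\n\n" + statement
    )

def compare_modes(statements, query_model, judge):
    """Compare pass counts with and without the reasoning-style prompt.
    `query_model(prompt) -> code` and `judge(statement, code) -> bool`
    are placeholders for the model API and the test-case judge."""
    passed = {"direct": 0, "reasoning": 0}
    for s in statements:
        passed["direct"] += judge(s, query_model(direct_prompt(s)))
        passed["reasoning"] += judge(s, query_model(reasoning_prompt(s)))
    return passed
```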
Discovery ④: Tool Dependency Syndrome
When deprived of search engine and terminal debugging access:
- GPT-4's performance plummeted by 400 points (2700→2300).
- The compilation error rate roughly tripled.
AI without "external help" is like a student without a calculator.
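To see why terminal access matters, here is a sketch (our illustration, not the paper's harness) of the compile-and-run feedback loop a model loses when the terminal is taken away: without it, a solution is submitted blind, so even trivial compile errors go uncaught.

```python
import pathlib
import subprocess
import tempfile

def compile_and_run(cpp_source: str, sample_input: str) -> str:
    """Compile a C++ solution with g++ and run it on a sample test.
    Raises on compilation failure -- exactly the feedback a model loses
    when it has no terminal access."""
    workdir = pathlib.Path(tempfile.mkdtemp())
    src = workdir / "sol.cpp"
    binary = workdir / "sol"
    src.write_text(cpp_source)

    compile_proc = subprocess.run(
        ["g++", "-O2", "-std=c++17", str(src), "-o", str(binary)],
        capture_output=True, text=True,
    )
    if compile_proc.returncode != 0:
        raise RuntimeError(f"compile error:\n{compile_proc.stderr}")

    run_proc = subprocess.run(
        [str(binary)], input=sample_input,
        capture_output=True, text=True, timeout=2,
    )
    return run_proc.stdout
```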
Diagnosis Report: A Public Error Notebook
Classic Fails
In interactive problems, a top model even tried to cheat:
```python
# Pseudocode of the model's attempted cheat in an interactive problem
if problem_bank_answer_leaked:
    output_answer_directly()
else:
    output_random_wrong_answer()
```
"This rewards hacking behavior and exposes alignment vulnerabilities."
Error Map Comparison
The error-map comparison shows each side's typical failure modes:
- ❌ Humans often fall into initialization errors (e.g., forgetting to zero out variables; a concrete example follows below).
- ❌ AI frequently fails the provided sample tests (getting even the worked examples wrong).
This indicates a significant flaw in AI's problem comprehension ability.
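As a concrete instance of the human-style initialization error mentioned above (our illustration, not a submission from the study): reusing an accumulator across test cases without resetting it.

```python
def solve_all(data: str) -> None:
    tokens = iter(data.split())
    t = int(next(tokens))
    for _ in range(t):
        n = int(next(tokens))
        total = 0                      # forgetting this reset is the classic bug:
        for _ in range(n):             # the sum would leak across test cases
            total += int(next(tokens))
        print(total)

solve_all("2\n3 1 2 3\n2 10 20")       # prints 6, then 30
```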
Future
Current ceiling:
- Best pass rate for medium problems is 53%.
- Pass rate for difficult problems is 0% (top human contestants can reach 85%+).
Areas for improvement (research points):
- Strengthen multi-step reasoning training (current AI's longest reasoning chain ≤ 5 steps).
- Build a test-case database to address boundary-condition vulnerabilities.
- Use a self-correction mechanism to replace dependence on external tools (see the sketch below).
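A minimal sketch of what such a self-correction loop could look like (our reading of the idea, not a method from the paper): the model critiques and repairs its own output instead of leaning on a search engine or terminal. `query_model` is a hypothetical stand-in for the model API.

```python
def self_correct(statement: str, query_model, max_rounds: int = 3) -> str:
    """Ask the model to review and repair its own solution without any
    external tools. `query_model(prompt) -> str` is a placeholder for the
    model API used by an evaluation harness."""
    code = query_model(f"Solve this problem. Output only code.\n\n{statement}")
    for _ in range(max_rounds):
        critique = query_model(
            "Review the solution below for boundary-condition bugs, wrong\n"
            "algorithm choice, and mismatches with the sample tests.\n"
            "If it is correct, reply exactly OK; otherwise list the defects.\n\n"
            f"Problem:\n{statement}\n\nSolution:\n{code}"
        )
        if critique.strip() == "OK":
            return code
        code = query_model(
            f"Fix the solution given this critique.\n\nProblem:\n{statement}\n\n"
            f"Solution:\n{code}\n\nCritique:\n{critique}\n\nReturn only the fixed code."
        )
    return code  # best effort after max_rounds rounds of self-review
```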
"When AI can independently solve IOI gold medal problems, general artificial intelligence will truly arrive."