New AI Utopia Reports
Editor: Yuan Yu, So Sleepy
[New AI Utopia Guide] DeepCode, open-sourced by HKU Professor Huang Chao's team, is the first system to surpass machine learning PhDs from 8 top universities, including Cambridge and Berkeley, on PaperBench's paper-reproduction coding tasks, and it also leads advanced commercial code agents such as Claude Code and Cursor.
In the AI field, academic papers often carry the most cutting-edge breakthroughs in algorithms, model architectures, etc.
But truly understanding a paper's core ideas and successfully reproducing its algorithms and experimental results is often a huge challenge.
The main bottleneck is the lack of 'key implementation details'!
In reality, paper authors often abstract complex algorithmic logic into a few lines of mathematical formulas, omitting the core details that actually determine success or failure: specific hyperparameter ranges, tricky adjustments during training, data preprocessing steps, network initialization strategies, and so on.
The absence of these key implementation points creates a huge gap between theory and practice.
Even senior researchers often struggle with it.
How to solve it?
Recently, HKU Professor Huang Chao's team open-sourced DeepCode, a powerful AI tool that tackles exactly this problem.
It not only analyzes paper content and understands algorithm logic but also automatically generates runnable code.
DeepCode Demo
DeepCode Visual Interaction Interface
In benchmark tests, DeepCode excels in reproduction success rate and code quality, surpassing machine learning PhDs from top universities on multiple metrics.
DeepCode has attracted wide attention since its first release, v1.0.0, in July this year, topping the GitHub Trending list and earning nearly 8,000 stars (as of November 1).
Open-source link: https://github.com/HKUDS/DeepCode
Leading Against Four Types of Baselines
The researchers compared DeepCode against four groups of baselines: human experts, state-of-the-art commercial code agents, scientific code agents, and LLM-based agents.
The results show DeepCode achieved the highest scores across all four.
Surpassing Human Experts for the First Time: 75.9% vs 72.4%
In OpenAI's PaperBench benchmark, DeepCode's overall accuracy is 75.9%, exceeding the human expert group's 72.4%.
PaperBench at a glance:
· Dataset: OpenAI's official standardized evaluation
· Scale: full reproduction of 20 ICML 2024 papers
· Granularity: 8,316 independently scorable components
· Scoring: SimpleJudge hierarchical weighted system
· Complexity: end-to-end, from paper text to executable code
The scoring scheme is illustrated below.
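SimpleJudge's hierarchical weighted scoring can be pictured as a rubric tree: leaf requirements are judged individually, then aggregated upward as weighted averages. Below is a minimal illustrative sketch; the `RubricNode` structure and the toy weights are our own assumptions, not OpenAI's exact implementation.

```python
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    """One node in a PaperBench-style rubric tree (hypothetical structure)."""
    weight: float
    score: float = 0.0                  # leaves: judged score in [0, 1]
    children: list["RubricNode"] = field(default_factory=list)

def aggregate(node: RubricNode) -> float:
    """Leaves return their judged score; inner nodes take a weighted average."""
    if not node.children:
        return node.score
    total = sum(c.weight for c in node.children)
    return sum(c.weight * aggregate(c) for c in node.children) / total

# Toy rubric: one paper scored on two requirements of unequal weight.
paper = RubricNode(weight=1.0, children=[
    RubricNode(weight=3.0, score=1.0),  # e.g. "training loop implemented"
    RubricNode(weight=1.0, score=0.5),  # e.g. "ablation partially reproduced"
])
print(aggregate(paper))                 # 0.875
```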
To ensure scientific rigor, the team established a high-quality human expert baseline.
The team first set strict qualification standards for the human experts.
All participants are current or graduated machine learning PhDs from 8 top research universities:
UC Berkeley, Cambridge, CMU, Columbia, Cornell, Purdue, TU Wien, and UMass Amherst.
Screening was rigorous: resume pre-screening and academic verification, a standardized machine-learning theory test, a practical evaluation of Git and software-engineering skills, and end-to-end skill validation on paper reproduction tasks.
This ensures every participant commands the full pipeline from theory to code.
Working conditions: NVIDIA A10 GPUs as standard (some A100s), a flexible 4-week development window, unlimited access to ChatGPT and Copilot, and 3 attempts per paper scored as best-of-3.
The results show that on complex tasks requiring deep understanding and long development cycles, DeepCode delivers higher code quality and accuracy than human experts, even when those experts are assisted by AI tools.
Surpassing expert-level reproduction makes this a milestone for autonomous scientific software engineering.
Ahead of Existing AI Coding Tools: 84.8% vs 58.7%
On the same benchmark, the team randomly sampled 5 of the 20 papers for a systematic comparison with mainstream commercial code agents.
DeepCode leads clearly at 84.8%, ahead of Claude Code (58.7%) by 26.1 percentage points.
For fairness, all agents were driven by the latest models, such as Claude 4.5 Sonnet (thinking) and GPT-5 Codex (high).
The gap therefore comes from DeepCode's multi-agent architecture design, not from the underlying model.
DeepCode also leads among scientific code agents and LLM-based agents: against the best scientific code agent, PaperCoder (51.1%), DeepCode scores 73.5%, a 22.4-point lead.
This validates the team's multi-module design of planning, hierarchical decomposition, code generation, and iterative debugging over simple pipelines.
Against the best LLM-based agent (43.3%), DeepCode (73.5%) leads by 30.2 points.
This shows that for complex reproduction tasks, sophisticated agent scaffolding, rather than longer inference or bigger models, is the key.
DeepCode's Three Core Capabilities
Paper2Code (Paper → Code). Input: an academic paper PDF. Output: a production-grade implementation with a full test suite and detailed documentation.
This is DeepCode's core strength: it automatically parses complex mathematics, understands the algorithmic logic, and generates high-quality runnable code, helping researchers quickly reproduce state-of-the-art algorithms, validate their own innovations, and accelerate research.
Paper2Code
Text2Web (Idea → Web). Input: natural-language interface requirements and functional expectations. Output: a responsive frontend with a modern UI and complete interaction logic.
DeepCode understands intent, adapts layouts for mobile automatically, and generates UI that matches the design spec, making it a fit for rapid prototyping, MVP development, and turning ideas into working products.
Frontend Implementation
Text2Backend (Requirement → Service). Input: backend functional requirements and business-logic descriptions. Output: high-performance APIs, optimized database design, and a scalable architecture.
DeepCode automatically selects a suitable technology stack, weighs performance and security, and supports cloud-native deployment, making it a fit for rapid microservice development, legacy refactoring, and enterprise digital transformation.
DeepCode Core Tech Framework
DeepCode adopts a systematic three-stage framework that decomposes complex code generation into architecture blueprint construction, code implementation, and automated verification, with multi-agent collaboration carrying a document all the way to executable code.
Stage 1: Architecture Blueprint Construction. Long documents are converted into a structured blueprint through hierarchical splitting, multi-agent deep analysis, and blueprint fusion, tackling the core difficulty of long-document understanding.
In the multi-agent analysis, a Concept agent and an Algorithm agent dig into different dimensions of the document in parallel, producing both a global view and fine-grained detail.
A Code Planner then fuses their outputs, aligns the high-level architecture with low-level specifications, and resolves inconsistencies.
The result is a complete blueprint that guides code generation, as sketched below.
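A minimal sketch of how such a multi-agent blueprint stage might be orchestrated; the `call_llm` stand-in, the agent prompts, and the blueprint format are our own assumptions, not DeepCode's actual API.

```python
import json

def call_llm(task: str, content: str) -> str:
    # Stand-in for a chat-completion call; a real system plugs in a provider.
    return json.dumps({"task": task, "summary": content[:60]})

def build_blueprint(paper_chunks: list[str]) -> dict:
    # Two specialist agents read the same chunks along different dimensions:
    # one extracts concepts and components, the other algorithmic detail.
    concept_notes = [call_llm("extract concepts and components", c)
                     for c in paper_chunks]
    algo_notes = [call_llm("extract algorithms and hyperparameters", c)
                  for c in paper_chunks]
    # A planner agent fuses both views into one structured blueprint,
    # aligning high-level architecture with low-level specs.
    return json.loads(
        call_llm("merge analyses into a file/class/function blueprint",
                 json.dumps({"concepts": concept_notes,
                             "algorithms": algo_notes})))
```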
Stage 2: Automated Code Construction. Starting from the blueprint, the repository is built systematically, with a dual mechanism addressing cross-file consistency and domain-knowledge gaps in large codebases; see the sketch below.
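One plausible way to enforce cross-file consistency, sketched under our own assumptions (the blueprint format and the `generate_file` stand-in are hypothetical): emit files in dependency order, so each file is generated with its finished dependencies in context.

```python
from graphlib import TopologicalSorter

def build_repo(blueprint: dict[str, list[str]]) -> dict[str, str]:
    """`blueprint` maps each file path to the file paths it depends on."""
    repo: dict[str, str] = {}
    # Walk files in topological order so every file sees its deps' code.
    for path in TopologicalSorter(blueprint).static_order():
        context = {dep: repo[dep] for dep in blueprint.get(path, [])}
        repo[path] = generate_file(path, context)
    return repo

def generate_file(path: str, context: dict[str, str]) -> str:
    # Stand-in for an LLM code-generation call conditioned on `context`.
    return f"# {path}: generated with {len(context)} dependencies in context\n"

repo = build_repo({"model.py": [], "train.py": ["model.py"]})
print(list(repo))  # ['model.py', 'train.py']
```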
Stage 3: Dynamic Verification and Optimization. Multi-level quality assurance combines static analysis with dynamic execution for dual verification, covering everything from structural integrity to functional correctness, and the feedback drives a self-improving loop, as sketched below.
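A minimal sketch of a static-plus-dynamic verification loop of this kind; the `repair` stand-in represents an LLM-driven fix and is our own assumption.

```python
import pathlib
import py_compile
import subprocess
import sys
import tempfile

def verify_and_fix(source: str, max_rounds: int = 3) -> str:
    """Alternate static and dynamic checks, feeding failures to a repairer."""
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / "candidate.py"
        for _ in range(max_rounds):
            path.write_text(source)
            try:
                py_compile.compile(str(path), doraise=True)    # static check
            except py_compile.PyCompileError as err:
                source = repair(source, str(err))
                continue
            run = subprocess.run([sys.executable, str(path)],  # dynamic check
                                 capture_output=True, text=True, timeout=60)
            if run.returncode == 0:
                return source                                  # both checks pass
            source = repair(source, run.stderr)
    return source

def repair(source: str, error: str) -> str:
    # Stand-in for an LLM repair call conditioned on the error message.
    return source
```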
Challenges & Thoughts on AI Coding
Current AI coding tools are good at completion and simple tasks, but fall short on complex tasks that demand deep understanding.
Scientific paper reproduction is a typical example: it requires grasping the mathematics, translating abstractions into code, and getting the technical details right.
DeepCode's progress shows that a specialized architecture can succeed within a specific domain, but general deep understanding remains limited.
How to better grasp complex business logic and technical requirements is still an open question.
· From Assistive Tool to Development Partner: AI tools are evolving from code completion toward full development support.
DeepCode's requirements-analysis → generation → verification flow represents this trend.
But new issues follow: how do developers keep control as AI gains more autonomy? How do we ensure generated code fits team norms and architecture?
These questions need to be answered through both technology and practice.
· The Practicality of Vibe Coding: it lowers the barrier to entry and brings more developers in.
Challenges remain, though: how to guarantee code quality and consistency? How to keep code maintainable long-term when developers pay less attention to low-level details? How to balance security and stability against speed?
DeepCode's verification mechanism offers one idea, but the industry still has exploring to do.
About the Authors
Li Zongwei
Li Zongwei (b. 1999) is a PhD student at HKU advised by Professor Huang Chao, focusing on the frontier of LLM agents. His work was selected among the most influential papers of CIKM 2024. He is a core contributor to the open-source DeepCode project, which has about 8,000 GitHub stars.
Li Zhonghang
Li Zhonghang (b. 1998) is a visiting PhD student at HKU working on LLM agents and smart cities. He is the first author of UrbanGPT, with work selected among the influential papers of KDD 2024 and ICDE 2022. He is a core contributor to DeepCode (~8,000 stars).
Guo Zirui
Guo Zirui (b. 2000) is a PhD student at HKU working on RAG and agents. He is the first author of the open-source LightRAG and RAG-Anything projects, which together have over 32,000 stars, with LightRAG now a mainstream graph-based RAG framework.
Huang Chao
Huang Chao is the team's advisor at HKU, working on LLMs, agents, and graph machine learning, with over 13,000 Google Scholar citations. His team's open-source projects, including LightRAG, RAG-Anything, and DeepCode, have collected more than 70,000 stars and topped GitHub Trending 50 times.