Can Models Truly "Reflect on Code"? Beihang University Releases Repository-Level Understanding and Generation Benchmark, Refreshing the LLM Understanding Evaluation Paradigm

Large Language Models (LLMs) have made significant progress in code understanding and generation: they can provide intelligent feedback across programming languages, detect potential bugs, and update code snippets based on human instructions. Code reflection, the ability of an LLM to examine and revise its own prior responses, significantly improves development efficiency and makes programming more accessible.

Although benchmarks like HumanEval and LiveCodeBench evaluate code generation and real-world relevance, existing work overlooks the practical scenario of modifying code within a repository.

Considering the challenges of improving reflection capabilities and avoiding data contamination in dynamic benchmarks, this paper introduces LiveRepoReflection, a challenging benchmark for evaluating code understanding and generation capabilities in a multi-file repository context. It comprises 1,888 rigorously screened test cases across 6 programming languages, ensuring diversity, correctness, and high difficulty.

The team also built RepoReflection-Instruct, a large-scale, quality-filtered instruction-tuning dataset, and used it to train RepoReflectionCoder through a two-round dialogue process of code generation followed by error-driven fixes. An accompanying leaderboard evaluates more than 40 LLMs, giving a comprehensive picture of their performance on repository-based code reflection.

Image: Model evaluation flowchart

Project Homepage: http://livereporeflection.github.io/

Paper Title: Turning the Tide: Repository-based Code Reflection

Evaluation Data: https://github.com/LiveRepoReflection/LiveRepoReflection

Code Link: https://github.com/LiveRepoReflection/LiveRepoReflection-Project

Image: Background illustration

Background – Repository-Level Code Reflection Task

Image: Comparison of code reflection and traditional code generation tasks

Compared with traditional code generation, repository-level code reflection poses more complex challenges. As the figure above shows, instead of simply generating code from scratch, the model must understand multi-file dependencies and systematically modify its code in response to compilation or runtime errors; a minimal sketch of this generate-test-repair loop appears after the checklist below.

Through its carefully designed evaluation process, the LiveRepoReflection benchmark tests whether models can:

1. Understand complete repository structures and multi-file dependencies

2. Make targeted code modifications based on error messages

3. Maintain consistent high performance across different programming languages
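To make the task concrete, the evaluation loop can be sketched in a few lines. This is only a reading of the task setup, not the benchmark's actual harness: generate_edits, apply_edits, and run_repo_tests are hypothetical helpers, and the two-attempt budget mirrors the Pass@1/Pass@2 protocol described later.

```python
# A minimal sketch of the repository-level reflection loop (a reading of the
# task setup, NOT the benchmark's actual harness). generate_edits, apply_edits,
# and run_repo_tests are hypothetical helpers: the first queries an LLM with the
# repository files, the task, and any error feedback; the second writes the
# proposed multi-file edits back into the in-memory repo; the third builds the
# repo and runs its unit tests in a sandbox, returning (passed, error_log).
def reflect_on_repo(repo_files: dict, task: str,
                    generate_edits, apply_edits, run_repo_tests,
                    max_attempts: int = 2) -> bool:
    feedback = None
    for _ in range(max_attempts):
        edits = generate_edits(repo_files, task, feedback)  # attempt 1 gets no feedback
        repo_files = apply_edits(repo_files, edits)         # edits may touch several files
        passed, feedback = run_repo_tests(repo_files)       # environment setup, compile, test
        if passed:
            return True                                     # solved on attempt 1 or 2
    return False
```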

Image: Automated dynamic pipeline construction flowchart

Automated Dynamic Pipeline Construction

To avoid data contamination and benchmark overfitting, the authors propose an automated dynamic data construction pipeline:

1. Extract repository code from sources like Exercism, and optimize repository file structure

2. Collect code snippets from public sources such as GitHub, Hugging Face, and Stack Overflow

3. Filter data by programming language

4. Use a randomly selected "creative" LLM to generate program themes and definitions from seed data

5. Use multiple "reasoning" LLMs to generate unit tests and reference solutions

6. Cross-execute every unit-test and solution pair to validate it, filter out anomalous cases, and retain the tests with the lowest pass rates together with the solutions with the highest pass rates (a sketch of this filtering step follows the list)

7. Package all content into the final repository structure
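The cross-execution step (step 6) can be summarized in a short function. This is a minimal sketch based on the description above, not the authors' pipeline code: run_in_sandbox is a hypothetical helper that runs one candidate solution against one candidate unit-test file, and the keep_tests / keep_solutions counts are illustrative assumptions.

```python
# Sketch of cross-execution filtering: hard (low pass-rate) tests are kept as
# the benchmark's unit tests, and high pass-rate solutions as reference answers.
from itertools import product

def cross_filter(tests: dict, solutions: dict, run_in_sandbox,
                 keep_tests: int = 1, keep_solutions: int = 1):
    # Run every (test, solution) combination once in the sandbox.
    results = {(t, s): run_in_sandbox(tests[t], solutions[s])
               for t, s in product(tests, solutions)}
    # Pass rate of a test = fraction of solutions it lets through (lower = harder).
    test_rate = {t: sum(results[t, s] for s in solutions) / len(solutions) for t in tests}
    # Pass rate of a solution = fraction of tests it passes (higher = more likely correct).
    sol_rate = {s: sum(results[t, s] for t in tests) / len(tests) for s in solutions}
    hardest_tests = sorted(test_rate, key=test_rate.get)[:keep_tests]
    best_solutions = sorted(sol_rate, key=sol_rate.get, reverse=True)[:keep_solutions]
    return hardest_tests, best_solutions
```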

Image: Data construction pipeline illustration

This process ensures the benchmark's high quality and dynamic update capability, effectively preventing models from overfitting to specific datasets.

At the same time, the repository structure was standardized as much as possible, with a canonical layout provided for each of the six supported languages: Python, Java, Rust, C++, Go, and JavaScript.

Image: Supported programming languages

Image: LiveRepoReflection benchmark overview

LiveRepoReflection Evaluation Benchmark

To ensure the quality and difficulty of LiveRepoReflection, the research team adopted a rigorous screening process:

1. Executable program screening: The 100,000 repository cases produced by the automated pipeline were run in a sandbox, covering environment setup, compilation, and testing. Cases that every LLM could pass, or that took more than 180 seconds to run, were excluded, retaining 10,000 high-difficulty, high-correctness cases.

2. Difficulty screening: 10 mainstream strong-reasoning LLMs were run on each case, with each LLM allowed one repair attempt. Based on the resulting pass rates, cases were categorized as "easy," "medium," or "hard," and 2,300 high-quality, high-difficulty, high-diversity cases were retained (see the bucketing sketch after this list).

3. Manual annotation: 8 graduate students inspected each case in a complete code-execution sandbox to confirm the soundness of the task, the environment configuration, the file structure, the reference answers, and the correctness of the unit tests, ultimately retaining 1,888 test cases.
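As a concrete illustration of the difficulty screening, the bucketing can be expressed as a short function. This is a sketch of one reading of the step, not the authors' code; the pass-rate thresholds are illustrative assumptions.

```python
# A minimal difficulty-bucketing sketch: each case is assigned to a bucket based
# on how many of the 10 evaluator LLMs solve it (thresholds are assumptions).
def bucket_by_difficulty(case_results: dict) -> dict:
    """case_results maps case id -> list of pass/fail outcomes, one per evaluator LLM."""
    buckets = {}
    for case_id, outcomes in case_results.items():
        pass_rate = sum(outcomes) / len(outcomes)
        if pass_rate >= 0.7:
            buckets[case_id] = "easy"
        elif pass_rate >= 0.3:
            buckets[case_id] = "medium"
        else:
            buckets[case_id] = "hard"
    return buckets

# Example: keep only the hard cases for the benchmark.
# hard_cases = [cid for cid, level in bucket_by_difficulty(results).items() if level == "hard"]
```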

Image: Test case screening flowchart

Image: LiveRepoReflection vs Aider Polyglot benchmark comparison

Compared to the existing Aider Polyglot benchmark, LiveRepoReflection has achieved significant improvements in multiple dimensions: the number of problems has increased by more than 8 times, richer problem descriptions and example contexts are provided, and each repository contains more files on average, more realistically simulating the complex structure of actual codebases.

Image: RepoReflectionCoder training flowchart

RepoReflectionCoder Training

To train a high-performance RepoReflectionCoder, the research team constructed the RepoReflection-Instruct instruction corpus:

1. High-quality data screening: From approximately 500,000 code examples generated by the automated pipeline, strict rejection sampling retained only the high-quality examples meeting five criteria:

- At least one unit-test file
- At least one reference-answer file
- The number of code-signature files matches the number of reference-answer files
- The environment configuration file is consistent with the declared programming language
- Standardized file naming and extensions

2. Quality scoring mechanism: A weighted scoring function was used to score each repository case, combining indicators of executability, novelty, difficulty, code style, and inverse perplexity.

3. Data decontamination: Candidate texts whose similarity to the test set exceeded 0.8 were efficiently filtered out using the MinHash algorithm and an LSH index, keeping the training data and the test set cleanly separated (see the sketch after this list).

4. Multi-turn interactive generation: Four top-tier models were used to simulate over 840,000 rounds of coding dialogues, including direct generation (40%), error-driven fixes (40%), style standardization (10%), and dialogue summarization (10%).
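The MinHash/LSH decontamination step can be sketched with the datasketch library. The library choice, whitespace tokenization, and the num_perm=128 setting are assumptions; only the 0.8 similarity threshold comes from the text above.

```python
# A minimal decontamination sketch: drop training candidates whose estimated
# Jaccard similarity to any test case exceeds the threshold.
from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128):
    """Build a MinHash signature over whitespace tokens (tokenization is an assumption)."""
    m = MinHash(num_perm=num_perm)
    for token in set(text.split()):
        m.update(token.encode("utf-8"))
    return m

def decontaminate(train_texts, test_texts, threshold=0.8):
    """Return only the training items that do not collide with any test item."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    for test_id, text in test_texts.items():
        lsh.insert(test_id, minhash_of(text))      # index the test set once
    return {train_id: text
            for train_id, text in train_texts.items()
            if not lsh.query(minhash_of(text))}    # keep items with no near-duplicate
```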

Image: RepoReflectionCoder multi-turn interactive training diagram

Experimental Results

Image: Experimental results table 1

Image: Experimental results table 2

Image: Experimental results table 3

Experiments evaluated over 40 LLMs, including GPT-4.5, Claude-3.7, the OpenAI o-series / GPT-4 series, Gemini, the Qwen series, and Grok, using four key metrics (a minimal computation sketch follows the list):

1. Pass@1: The proportion of coding tasks completed by the LLM on the first attempt

2. Pass@2: Success rate of the second attempt after reviewing failed code and error messages

3. Fix Weight (FW): Relative contribution of error diagnosis and correction in successful second attempts

4. Well-formedness (WF): Percentage of a model's responses that strictly adhere to the editing format specified in the system prompt
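For clarity, here is a minimal sketch of how Pass@1, Pass@2, and WF could be computed from per-task evaluation records. The record fields are hypothetical, Pass@2 is assumed to count a task as solved if either attempt passes, and Fix Weight is omitted because it depends on the paper's exact definition.

```python
# A minimal sketch (not the authors' evaluation harness) of computing Pass@1,
# Pass@2, and Well-formedness from per-task records. Each record is a
# hypothetical dict with boolean fields: "first_pass", "second_pass"
# (the retry after error feedback), and "well_formed" (edit format followed).
def summarize(records: list) -> dict:
    n = len(records)
    pass1 = sum(r["first_pass"] for r in records) / n
    # Assumption: Pass@2 counts a task as solved if either attempt succeeds.
    pass2 = sum(r["first_pass"] or r["second_pass"] for r in records) / n
    wf = sum(r["well_formed"] for r in records) / n
    return {"Pass@1": pass1, "Pass@2": pass2, "WF": wf}

# Example:
# summarize([{"first_pass": False, "second_pass": True, "well_formed": True}])
```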

Image: Model performance comparison chart 1

Image: Model performance comparison chart 2

Results show:

1. Leading closed-source models consistently performed best in one-shot and post-feedback accuracy.

2. Open-source models lagged but showed similar relative improvements when a second attempt was allowed.

3. All systems found Python tasks to be the simplest, while C++ and Rust were the most challenging.

4. Almost every model had a well-formedness rate exceeding 90%.

5. RepoReflectionCoder significantly outperformed the base Qwen2.5-Coder but still lagged behind top closed-source performers.

Compared with the Aider Polyglot benchmark, LiveRepoReflection is more challenging: almost all models scored lower on the new benchmark than on the old one, confirming its higher difficulty and its value for evaluating real-world code generation and repair capabilities.

Image: Conclusion and Outlook

Conclusion and Outlook

This study introduces LiveRepoReflection, a high-difficulty benchmark for multi-file code repository understanding and generation. It ensures the diversity, correctness, and challenging nature of test cases through automated processes and manual verification. Simultaneously, the research team built the RepoReflection-Instruct dataset and trained RepoReflectionCoder, achieving significant performance improvements in repository-based code reflection capabilities.

Experimental results indicate that LiveRepoReflection can accurately and effectively measure models' reflection and repair capabilities in cross-file dependency and iterative repair scenarios, providing a solid foundation for subsequent research. Although there is still room for improvement in model performance, this study sets a new standard for multi-file repository code understanding and generation.

Authors and Institutions

Portrait of Zhang Wei

Zhang Wei, a first-year Ph.D. student jointly trained by Beihang University and Shanghai AI Laboratory, focuses on code intelligence and previously interned at Tongyi Qianwen.

Portrait of Yang Jian

Yang Jian, Associate Professor at Beihang University's School of Computer Science, has published over 20 first/corresponding author papers in international journals/conferences such as ICLR, NeurIPS, and ACL, with over 8,000 Google Scholar citations, and served as an area chair for international conferences like NeurIPS and ACL. He joined Qwen as an AliStar and actively promotes open-source large code models.

Portrait of Li Zhoujun

Li Zhoujun, Professor at Beihang University's School of Computer Science, Director of the Information Security Department, and Deputy Director of the Institute of Intelligent Information Processing. He is a member of the first Cybersecurity Disciplinary Review Group of the State Council Academic Degrees Committee, Vice Chairman of the Language Intelligence Professional Committee of the Chinese Association for Artificial Intelligence, and founder and chief scientist of Shenzhen SmartThink.

Main Tag: Large Language Models

Sub Tags: Code Reflection, Code Generation, Code Understanding, Benchmarking

