Submitted by ExpeRepair Team
QbitAI Official Account
AI has learned to fix bugs like humans!
"I just fixed this bug last week." "Why is this error back again?" "Why are new hires making the same mistake?"...
If you're a programmer, these frustrating scenarios probably sound all too familiar.
Existing AI repair tools are like goldfish with a seven-second memory: every time they encounter a problem, they start from scratch. The greatest advantage of human engineers, by contrast, is precisely their ability to quickly draw solutions from past experience.
And now, AI has learned this advantage:
ExpeRepair is a repository-level defect repair system with "dual memory" that simulates two memory modes of human cognition:
- Episodic memory: stores historical repair cases (e.g., "how to specifically fix a security vulnerability in the Sympy project")
- Semantic memory: refines high-level repair strategies (e.g., "when handling resource leaks, files must be closed and handles released simultaneously")
When a new problem is encountered, ExpeRepair simultaneously activates both types of memory: retrieving similar cases from episodic memory as a reference, and extracting general strategies from semantic memory to guide decision-making, dynamically generating tailored repair solutions.
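The two memory stores described above can be sketched as simple data structures. This is an illustrative sketch only; the class and field names are hypothetical stand-ins, not ExpeRepair's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodicCase:
    """One concrete repair trajectory (hypothetical fields)."""
    problem: str       # original issue description
    test_script: str   # reproduction test used
    patch: str         # the code change that was applied
    verified: bool     # whether the patch passed validation

@dataclass
class SemanticInsight:
    """One distilled, high-level repair strategy in natural language."""
    strategy: str

@dataclass
class DualMemory:
    episodic: list = field(default_factory=list)  # specific cases, pass or fail
    semantic: list = field(default_factory=list)  # generalizable strategies

mem = DualMemory()
mem.episodic.append(EpisodicCase(
    problem="ZeroDivisionError in sympy simplify",
    test_script="def test_repro(): ...",
    patch="guard the denominator before dividing",
    verified=True,
))
mem.semantic.append(SemanticInsight(
    "When handling resource leaks, close files and release handles together."
))
```

Both successful and failed trajectories go into the episodic store, so later retrieval can surface negative examples as well as positive ones.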
On the authoritative SWE-Bench Lite benchmark, ExpeRepair topped the list with a 60.33% fix rate:
This research was proposed by a team from the Institute of Software, Chinese Academy of Sciences. More details are below.
Unveiling ExpeRepair's 'Ultimate Brain'
1. Dual Memory System: Learning Like Humans
ExpeRepair simulates two memory modes of the human brain:
1) Episodic Memory:
Records concrete repair cases as complete repair trajectories (problem description, test scripts, patch code, verification results). Both successful and failed cases are stored, providing positive as well as negative experience.
2) Semantic Memory:
Distills abstract repair strategies, automatically summarizing high-level experience via an LLM, such as "when handling file operations, both permission checks and resource release must be considered simultaneously."
2. Dynamic Knowledge Update: Getting Smarter with Use
The memory system continuously evolves:
1) Additions: adds new experiences when new problem patterns are discovered
2) Merges: optimizes the expression of similar experiences
3) Eliminations: deletes outdated or contradictory experiences
At the same time, it keeps the memory bank concise (no more than 15 core strategies).
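The add/merge/eliminate cycle with a size cap can be sketched as plain list bookkeeping. In ExpeRepair these decisions are made by an LLM; the function below is a hypothetical stand-in that only illustrates the three operations and the cap of 15 strategies mentioned above.

```python
MAX_STRATEGIES = 15  # cap on semantic memory size, per the article

def update_semantic_memory(strategies, new_strategy, similar_to=None, outdated=()):
    """Illustrative add / merge / eliminate cycle for semantic memory."""
    # Eliminate: drop outdated or contradictory strategies
    strategies = [s for s in strategies if s not in outdated]
    if similar_to is not None and similar_to in strategies:
        # Merge: replace a similar strategy with the refined expression
        strategies[strategies.index(similar_to)] = new_strategy
    else:
        # Add: a genuinely new problem pattern was discovered
        strategies.append(new_strategy)
    # Keep the memory bank concise
    return strategies[:MAX_STRATEGIES]

bank = ["close files after use", "validate user input"]
bank = update_semantic_memory(bank, "close files and release handles together",
                              similar_to="close files after use")
```

After the merge, the bank still has two entries, with the first one refined in place.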
3. Intelligent Retrieval: Precisely Matching Historical Experience
When encountering new problems, it "learns" from historical solutions like an experienced engineer:
1) Quickly matches similar historical problems via the BM25 algorithm
2) Extracts the top-3 most relevant repair cases
3) Generates dynamic prompts by combining abstract strategies
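The retrieval step above can be sketched with a minimal Okapi BM25 scorer. This is a self-contained stand-in: ExpeRepair's actual implementation and tokenization may differ, and the example corpus is invented for illustration.

```python
import math
from collections import Counter

def bm25_top_k(corpus, query, k=3, k1=1.5, b=0.75):
    """Score documents with Okapi BM25 and return the indices of the top k."""
    docs = [doc.lower().split() for doc in corpus]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    # document frequency of each term
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]

# hypothetical historical problem descriptions
history = [
    "fix file handle leak in parser",
    "security vulnerability in sympy expression eval",
    "off-by-one error in range iteration",
    "unclosed file descriptor leaks memory",
]
top3 = bm25_top_k(history, "file handle leak", k=3)
```

The returned indices point at the top-3 most relevant past cases, which would then be spliced into the dynamic prompt together with the semantic-memory strategies.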
Deconstructing the ExpeRepair Repair Process
ExpeRepair decomposes complex automated program repair problems into three tasks: test generation, patch generation, and patch validation.
It employs two agents, namely the test agent and the patch agent, to handle these tasks collaboratively, similar to how human developers categorize and solve software problems.
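The three-task decomposition can be sketched as a short control loop. The agent classes and their methods here are hypothetical stand-ins for the LLM-backed test agent and patch agent; only the overall flow matches the article's description.

```python
class TestAgent:
    """Stand-in: returns a reproduction test as a pass/fail predicate."""
    def generate_reproduction_tests(self, issue, repo):
        return lambda patch: "fix" in patch  # toy criterion for illustration

class PatchAgent:
    """Stand-in: proposes candidate patches as plain strings."""
    def generate_patches(self, issue, repo, repro_test):
        return ["fix: guard divisor", "refactor only"]

def expe_repair(issue, repo, test_agent, patch_agent):
    # Task 1: test generation - reproduce the reported problem
    repro_test = test_agent.generate_reproduction_tests(issue, repo)
    # Task 2: patch generation - propose candidate fixes
    candidates = patch_agent.generate_patches(issue, repo, repro_test)
    # Task 3: patch validation - keep candidates that pass the test
    return [p for p in candidates if repro_test(p)]

surviving = expe_repair("ZeroDivisionError", ".", TestAgent(), PatchAgent())
```

With these toy stubs, only the candidate that actually "fixes" the issue survives validation.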
Test Generation
Reproduction tests are crucial for proving the existence of a problem and for verifying the correctness of candidate patches.
Existing methods typically generate reproduction test scripts based solely on the problem description, which has two main limitations:
(1) Frequent execution failures due to missing dependencies, configurations, or specific environment setups;
(2) Insufficient problem reproduction, as such scripts often narrowly target symptoms described in the problem description without capturing the broader context of the failure.
ExpeRepair addresses these limitations by enabling the test agent to iteratively generate and optimize reproduction tests based on dynamic feedback and accumulated memory from past repair trajectories.
Specifically, the agent first retrieves the most relevant demonstration cases from episodic memory, which directly addresses limitation (1): when the current test execution encounters a failure (e.g., missing libraries or configuration errors), demonstration cases related to the same or similar failures provide concrete examples of how these failures were successfully handled in the past. Then, the agent extracts all summary natural language insights from semantic memory, which capture generalizable high-level strategies distilled from previous repairs. For example: "When testing security-sensitive functions, implement comprehensive test cases to verify correct handling of malicious inputs, edge cases, and potential attack vectors, ensuring robust validation and appropriate error responses." These insights help address limitation (2) by expanding the agent's reasoning beyond merely reproducing the stated symptoms.
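The iterative loop just described can be sketched as follows. `llm` and `run` are hypothetical callables standing in for the LLM-backed test agent and the repository's execution environment; the word-overlap retrieval is a toy substitute for the BM25 step.

```python
def retrieve_top_k(episodic, issue, k=3):
    """Toy retrieval: rank past cases by word overlap with the issue."""
    words = set(issue.lower().split())
    return sorted(episodic,
                  key=lambda c: -len(words & set(c.lower().split())))[:k]

def generate_reproduction_test(issue, episodic, semantic, llm, run, max_iters=3):
    """Draft a reproduction test, execute it, and refine on failure."""
    cases = retrieve_top_k(episodic, issue, k=3)   # similar past repairs
    script = llm(issue, cases, semantic, feedback=None)
    for _ in range(max_iters):
        ok, feedback = run(script)                 # dynamic execution feedback
        if ok:
            return script, True
        # failure output (e.g. a missing library) drives the next revision
        script = llm(issue, cases, semantic, feedback=feedback)
    return script, False

# toy stand-ins: the second draft fixes the failure reported by the first run
def llm(issue, cases, insights, feedback):
    return "test_v2" if feedback else "test_v1"

def run(script):
    return (script == "test_v2", "ImportError: missing lib")

script, ok = generate_reproduction_test(
    "div by zero", ["old div case"], ["test edge cases"], llm, run)
```

The loop terminates as soon as a draft executes successfully, mirroring how demonstration cases and insights steer each revision.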
Patch Generation
After successfully reproducing the problem, ExpeRepair initiates the patch generation process to produce candidate repair solutions that can resolve the fault while preserving existing functionality.
Before generating a patch, it is crucial to first determine where the problem lies. ExpeRepair employs a hierarchical localization strategy: it first identifies suspicious files from the problem description and the repository structure, then pinpoints the specific faulty code lines within those files.
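The two-stage localization can be sketched as below. The scoring functions are hypothetical stand-ins for LLM judgments, and the file limit of 5 is an assumed parameter for illustration.

```python
def localize(issue, repo_files, score_file, score_line, max_files=5):
    """Hierarchical localization sketch: rank files, then flag lines.

    repo_files maps file names to lists of source lines.
    """
    # Stage 1: rank files by relevance to the problem description
    suspicious = sorted(repo_files,
                        key=lambda f: -score_file(issue, f))[:max_files]
    # Stage 2: within each suspicious file, flag the suspect lines (1-based)
    return {f: [i for i, line in enumerate(repo_files[f], 1)
                if score_line(issue, line)]
            for f in suspicious}

files = {"a.py": ["x = 1/0", "print(x)"], "b.py": ["pass"]}
loc = localize("ZeroDivisionError", files,
               score_file=lambda iss, f: f == "a.py",
               score_line=lambda iss, line: "1/0" in line)
```

Here the trivial scorers flag line 1 of `a.py` as the fault location.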
As previous research has shown, LLM-generated patches are often incomplete, because fixing a repository-level problem typically requires coordinating changes across multiple parts of the codebase or performing a series of modification operations.
To address this issue, ExpeRepair iteratively consults episodic and semantic memory during patch generation, just as in test generation, and uses them to generate and refine patches. The patch agent follows the same process as the test agent: retrieving relevant demonstration cases and extracting reflective insights.
Patch Validation
When a candidate patch successfully passes the reproduction tests, it is not immediately adopted as the final patch.
This is because the team empirically observed that reproduction scripts often focus narrowly on the specific symptoms in the problem description, which can lead to misjudgments: a patch may pass these limited tests yet fail in broader scenarios. To mitigate this, ExpeRepair prompts the patch agent to revise and harden the patch, covering edge-case handling, regression risks, and compliance with language-specific best practices.
Next, ExpeRepair augments the reproduction test suite with additional validation tests: the test agent is asked to create tests for boundary conditions and extreme cases. This reduces the risk of narrow, fragile fixes that only patch over reported surface symptoms without comprehensively addressing the underlying defect.
When selecting the final patch, ExpeRepair first runs all candidate patches against an extended test suite containing both reproduction and validation tests. The candidates and their test results are then submitted to a dedicated review agent, which selects the final patch based on criteria such as correctness and adherence to best practices.
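The final-selection step can be sketched as follows. `review_agent` is a hypothetical callable standing in for the dedicated review LLM; here a trivial pass-count rule substitutes for its judgment.

```python
def select_final_patch(candidates, repro_tests, validation_tests, review_agent):
    """Run every candidate against the extended suite, then let the
    review agent pick a winner from the results."""
    suite = repro_tests + validation_tests  # extended test suite
    # map each candidate patch to how many tests it passes
    results = {p: sum(t(p) for t in suite) for p in candidates}
    # the review agent sees candidates plus test results and selects one
    return review_agent(results)

# toy stand-ins for tests and candidate patches
tests = [lambda p: "guard" in p, lambda p: "close" in p]
patches = ["guard divisor and close file", "guard divisor only"]
best = select_final_patch(patches, tests[:1], tests[1:],
                          review_agent=lambda r: max(r, key=r.get))
```

With these stubs, the candidate that passes both the reproduction test and the validation test is chosen.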
Practical Results
Below are some experimental results.
On SWE-Bench Lite, ExpeRepair, using Claude-3.5 Sonnet + o4-mini, achieved a 48.3% fix success rate, surpassing other methods using similar models.
Using Claude-4 + o4-mini, it achieved a 60.3% fix success rate, ranking first on SWE-Bench Lite.
Experiments also showed that removing either episodic memory or semantic memory led to a significant performance degradation.