AI automatically fixes bugs with a 44% resolution rate, the strongest result to date among open-source models worldwide!
A new open-source model from Ant Group has surpassed all other open-source solutions on SWE-bench Lite, with performance comparable to closed-source models.
The specific performance on SWE-bench Lite is as follows:
Ranked first among all Open Weight Models;
Ranked sixth among all Open Source Systems;
Overall ranked 14th;
Outperforms "KGCompass," previously the best open-source solution on the leaderboard, by 7.33 percentage points.
They pioneered the integration of a code repository graph modality into large models (Code Graph Model, CGM), enabling large language models to directly understand code graphs for more efficient bug fixing and code completion.
This completely eliminates reliance on black-box models (such as GPT-4 or Claude 3.7) and on complex Agent workflows, enabling more controllable, transparent, and secure software engineering automation.
Moreover, CGM is entirely based on open-source models. It's well-known that open-source models typically do not perform well on SWE-bench, with almost all SOTA solutions previously relying on closed-source models. CGM, based on the Qwen model, achieves performance comparable to closed-source models.
CGM requires only 4 steps to quickly locate and generate patches, eliminating the complex orchestration process found in Agent solutions, significantly boosting efficiency.
Enabling AI to Truly Understand Large Codebases
Since the rise of large models, AI programming has taken off, performing exceptionally well on small tasks like writing a single function. On benchmarks such as HumanEval, for instance, many models exceed 90% accuracy.
However, real software engineering is far more complex than "writing a function." Tasks such as bug fixing and feature enhancement typically require cross-file, cross-module operations and demand that the model understands the complex structure, dependencies, and class inheritance systems within a project.
The current mainstream approach often uses Agent-based solutions built on closed-source models. These can simulate human programmer behavior, such as observing code, calling tools, and engaging in multi-turn interactions to complete tasks.
However, these methods also have several issues:
Uncontrollable behavior paths, prone to accumulating reasoning errors;
Reliance on closed-source models like GPT-4 and Claude, making private deployment or customization difficult;
High engineering costs and lower efficiency.
At the same time, current open-source model solutions struggle to achieve SOTA-level results.
Therefore, the research team asked: can repository-level tasks be solved using only open-source models, without relying on Agents? CGM grew out of this question.
🔍Deep Integration of Graph Structure and Large Models
CGM adopts a cross-modal modeling approach similar to Vision-Language Models (VLM). It combines the text understanding capabilities of traditional LLMs with the structural graph of a code repository, forming a graph-language multimodal model. The core of the model integrates two modalities:
Graph modality: the repository is built into a structured graph with 7 node types (such as functions, classes, files, and packages) and edges representing dependencies like calls, contains, and inherits (a toy sketch of such a graph follows this list);
Language modality: User-input natural language descriptions and code prompts, driving the model to generate patches or answers.
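To make the graph modality concrete, here is a minimal Python sketch of how such a repository graph could be represented. The node and edge kind names are illustrative assumptions (the article explicitly names functions, classes, files, packages, and REPO/PACKAGE levels; the full 7-type schema is in the paper), not the released data format.

```python
from dataclasses import dataclass, field

# Illustrative node/edge kinds; only FILE, CLASS, FUNCTION, PACKAGE, and REPO
# are named in the article -- the remaining kinds here are assumptions.
NODE_KINDS = {"REPO", "PACKAGE", "FILE", "CLASS", "FUNCTION", "ATTRIBUTE", "TEXTFILE"}
EDGE_KINDS = {"contains", "calls", "inherits", "imports"}

@dataclass
class Node:
    node_id: int
    kind: str          # one of NODE_KINDS
    name: str          # e.g. "utils.py" or "Helper.run"
    text: str = ""     # source snippet attached to the node

@dataclass
class Edge:
    src: int
    dst: int
    kind: str          # one of EDGE_KINDS

@dataclass
class CodeGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, src: int, dst: int, kind: str) -> None:
        assert kind in EDGE_KINDS
        self.edges.append(Edge(src, dst, kind))

# Example: a file that contains a class, which contains a method.
g = CodeGraph()
g.add_node(Node(0, "FILE", "utils.py", text="class Helper: ..."))
g.add_node(Node(1, "CLASS", "Helper", text="class Helper:\n    def run(self): ..."))
g.add_node(Node(2, "FUNCTION", "Helper.run", text="def run(self): ..."))
g.add_edge(0, 1, "contains")
g.add_edge(1, 2, "contains")
```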
The model takes code graphs and text prompts as input; structure and semantics are then aligned across the two modalities inside the LLM.
The specific structural integration method is as follows:
A small encoder (CodeT5+) encodes each node and compresses it into a single "node token"; each node's text is first split into chunks of up to 512 tokens for encoding.
An adapter (a two-layer MLP) maps the encoded node representations into the LLM's input embedding space. This effectively extends the LLM's context by a factor of 512, allowing it to handle far larger repository contexts.
A graph-aware attention mask replaces the LLM's original causal attention, so that attention only operates between adjacent nodes. Much like the message-passing mechanism in GNNs, this lets the LLM directly perceive and exploit the structural dependencies in code (a minimal sketch of the adapter and the mask follows below).
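Below is a minimal PyTorch sketch of two of the mechanisms described above: the two-layer MLP adapter and a graph-aware attention mask that permits attention only between a node and its graph neighbors (plus itself). All dimensions, names, and details are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Two-layer MLP mapping encoder node embeddings into the LLM's
    input embedding space (dimensions are illustrative)."""
    def __init__(self, enc_dim: int = 256, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, node_embeddings: torch.Tensor) -> torch.Tensor:
        return self.mlp(node_embeddings)  # (num_nodes, llm_dim)

def graph_aware_mask(num_nodes: int, edges: list) -> torch.Tensor:
    """Boolean mask over node tokens: True marks allowed attention.
    Only self-attention and attention along graph edges are kept,
    replacing the causal mask for the node-token portion of the input."""
    mask = torch.eye(num_nodes, dtype=torch.bool)
    for src, dst in edges:
        mask[src, dst] = True
        mask[dst, src] = True  # treat edges as undirected for attention
    return mask

# Toy usage: 3 node tokens, edges 0-1 and 1-2.
adapter = Adapter()
node_tokens = adapter(torch.randn(3, 256))      # -> (3, 4096)
mask = graph_aware_mask(3, [(0, 1), (1, 2)])    # nodes 0 and 2 cannot attend to each other
print(mask)
```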
✏️Two-Stage Training: Structural Understanding + Problem Generalization
Based on this model architecture, the team trained the LLM in two stages to enable it to understand the topological structure of code graphs.
Stage One: Subgraph Reconstruction Pre-training
To train CGM to effectively capture the semantic and structural information of code graphs, the team designed a "Graph-to-Code" task. Subgraphs are randomly sampled from large code graphs (limiting node count to control output code length), and the model needs to reconstruct the original code snippets based on these input subgraphs (which only contain node types and connection relationships, not complete code content).
A hierarchical method then preserves the structural consistency and readability of the reconstructed code. Repository context is concatenated following topological order and line numbers: high-level nodes (e.g., REPO, PACKAGE) are placed at the beginning of the output sequence or file; file nodes are ordered by topological sorting; and nodes within a file (e.g., CLASS, FUNCTION) are concatenated in line-number order.
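The ordering rule can be illustrated with a small sketch: files are laid out by a topological sort of their dependencies, and nodes inside each file are concatenated by line number. The toy file names, dependency edges, and snippets below are invented for illustration; this is not the team's actual data pipeline.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Toy dependency structure: each file maps to the files it depends on,
# i.e. files that must appear before it in the reconstructed sequence.
file_deps = {
    "pkg/__init__.py": set(),
    "pkg/utils.py": {"pkg/__init__.py"},
    "pkg/main.py": {"pkg/utils.py"},
}

# Toy inner nodes per file, tagged with their starting line numbers.
inner_nodes = {
    "pkg/__init__.py": [],
    "pkg/utils.py": [(10, "def helper(): ..."), (1, "import os")],
    "pkg/main.py": [(3, "class App: ...")],
}

def reconstruct_sequence() -> str:
    parts = ["# REPO: pkg"]                              # high-level node goes first
    for path in TopologicalSorter(file_deps).static_order():
        parts.append(f"# FILE: {path}")                  # files in topological order
        for _, snippet in sorted(inner_nodes[path]):     # inner nodes by line number
            parts.append(snippet)
    return "\n".join(parts)

print(reconstruct_sequence())
```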
Stage Two: Noise-Enhanced Fine-tuning
This stage fine-tunes CGM using real GitHub issue-fix patch data.
The model learns to generate code patches from two inputs: (i) a relevant code subgraph; (ii) a text prompt indicating the files the ground-truth patch actually modifies. To improve robustness, noise was deliberately introduced into 10% of the prompts: a prompt might include an irrelevant file that does not need modification, or omit at least one crucial file that should have been modified. Introducing this controlled noise during training helps the model generalize to scenarios where the input information is incomplete or contains interference.
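A hedged sketch of how such controlled noise could be injected into the file hints of a training prompt is shown below; the 10% rate comes from the article, while the function name, the even split between the two noise types, and the other details are assumptions.

```python
import random

def add_prompt_noise(relevant_files, distractor_files, noise_rate=0.10, rng=None):
    """With probability `noise_rate`, perturb the file hint given to the model:
    either insert one irrelevant file or drop one genuinely relevant file."""
    rng = rng or random.Random()
    files = list(relevant_files)
    if rng.random() < noise_rate:
        if rng.random() < 0.5 and distractor_files:
            files.append(rng.choice(distractor_files))   # add an irrelevant file
        elif files:
            files.remove(rng.choice(files))              # omit a relevant file
    return files

# Toy usage
print(add_prompt_noise(["src/core.py", "src/io.py"], ["README.md"],
                       rng=random.Random(0)))
```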
📎Inference Stage: Graph-RAG Framework Replaces Agent
Finally, to further enhance practical applicability, the team built a lightweight, Agent-free framework around CGM called Graph-RAG.
It replicates the human programmer's bug-fixing workflow but is more efficient than existing Agent solutions.
The core modules are further streamlined from 10 down to 4: Rewriter → Retriever → Reranker → Generator (the CGM model); a minimal sketch of this pipeline follows the list below.
Rewriter: Rewrites problem descriptions, extracts keywords and relevant files;
Retriever: Extracts connected subgraphs from the code graph through semantic and structural retrieval;
Reranker: Ranks retrieval results, selects the most critical files for generation;
Generator: Combines the subgraph and prompt to generate the final fix code.
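The four-stage flow can be sketched as a plain function chain with no Agent loop. Everything below except the stage names and their order is a placeholder: the real Rewriter, Retriever, Reranker, and Generator involve an LLM, semantic plus structural retrieval over the code graph, and the CGM model itself.

```python
def rewriter(issue_text: str) -> dict:
    """Rewrite the issue: extract keywords and candidate files (stubbed)."""
    return {"keywords": issue_text.lower().split()[:5], "files": []}

def retriever(query: dict, code_graph: dict) -> list:
    """Retrieve a connected subgraph via semantic + structural search (stubbed)."""
    return [n for n in code_graph["nodes"]
            if any(k in n["text"].lower() for k in query["keywords"])]

def reranker(subgraph_nodes: list, top_k: int = 5) -> list:
    """Keep the files most likely to need modification (stubbed scoring)."""
    return sorted(subgraph_nodes, key=lambda n: len(n["text"]), reverse=True)[:top_k]

def generator(subgraph_nodes: list, issue_text: str) -> str:
    """CGM would consume the subgraph plus prompt and emit a patch; stubbed here."""
    files = ", ".join(n["name"] for n in subgraph_nodes)
    return f"<patch for: {files}>"

def graph_rag(issue_text: str, code_graph: dict) -> str:
    query = rewriter(issue_text)
    subgraph = retriever(query, code_graph)
    ranked = reranker(subgraph)
    return generator(ranked, issue_text)

# Toy repository graph and issue
toy_graph = {"nodes": [{"name": "utils.py", "text": "def parse_date(s): ..."},
                       {"name": "app.py", "text": "def main(): ..."}]}
print(graph_rag("parse_date crashes on empty string", toy_graph))
```

The point of the sketch is that the control flow is fixed and single-pass: unlike an Agent, which decides its next tool call at runtime, the pipeline always runs the same four stages once.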
Building on this design, CGM achieved leading results across multiple benchmarks. The details are as follows.
Experimental Results
The research team systematically evaluated CGM's performance on multiple mainstream benchmarks, covering two main task categories: (1) code repair and (2) code completion.
Repository-level Code Repair
On the SWE-bench Lite leaderboard, CGM ranked first among open-weight models with a score of 44.00%.
On SWE-bench Verified, CGM reached 50.40%, an improvement of 10.20 percentage points over the best open-source baseline.
For Java projects, CGM achieved 14.29% on SWE-bench-java Verified, 4.4 percentage points above the best open-source baseline.
These results indicate that CGM can handle large-scale, repository-level bug fixing tasks across different languages and projects, demonstrating strong structural understanding and generalization capabilities.
Repository-level Code Completion
In complex code generation tasks, CGM also significantly outperforms open-source models of comparable size on ComplexCodeEval and CrossCodeEval, especially excelling in scenarios requiring cross-file reasoning and completion.
Additionally, the research team deployed CGM on different base models (CodeLlama-7B and DeepSeek-Coder-7B) and compared it with recent RAG systems. The results show that CGM has good versatility, can adapt to various base models, and performs better than traditional RAG methods.
In summary, CGM does not rely on complex Agent systems. It is the first to integrate the code graph modality into large models, enabling AI to "truly understand a project" by grasping the complex dependencies between text and code within a repository, much like humans do.
More importantly, it can be implemented based on open-source models and is not limited to specific models. This provides a flexible, transparent, and controllable solution for enterprises and developers.
🚀Finally, CGM's technical paper, core code, model weights, and training data have all been open-sourced. Interested readers can explore the details at the links below:
https://arxiv.org/abs/2505.16901
https://github.com/codefuse-ai/CodeFuse-CGM
https://huggingface.co/codefuse-ai/CodeFuse-CGM-72B
https://huggingface.co/datasets/codefuse-ai/CodeGraph
😎The team's previous work:
Code LLM Survey: Awesome-Code-LLM (TMLR)
https://github.com/codefuse-ai/Awesome-Code-LLM
Previous Research on Graph+LLM: GALLa (ACL 2025)
https://github.com/codefuse-ai/GALLa
Efficient Attention Architecture: Rodimus (ICLR 2025)
https://arxiv.org/abs/2410.06577
Code Multi-task Fine-tuning Framework: MFTCoder (KDD 2024)