A Self-Improving Coding Agent

Coding agents have become one of the hottest topics of 2025, and both academia and industry are searching for more effective ways to build them.

Historical experience in machine learning suggests that hand-designed solutions are eventually replaced by learned ones. This raises a natural question: can an agent autonomously modify and improve its own code, discovering new prompting schemes or tools without a human designing and implementing them?

In 2024, the paper "Automated Design of Agentic Systems" (Hu et al., 2024) was the first to use a meta-agent to optimize agent implementations, launching the field of Automated Design of Agentic Systems (ADAS). However, that work stopped short of true "self-improvement" because it involved two separate agents: a target agent that executes tasks and a meta-agent that improves the target agent.

Researchers from the University of Bristol and iGent AI believe that a fully self-referential approach, in which the meta-agent programs itself, is achievable today and offers a viable alternative to this two-agent setup.


Paper Title: A SELF-IMPROVING CODING AGENT

Paper Link: https://arxiv.org/pdf/2504.15228

Code Address: https://github.com/MaximeRobeyns/self_improving_

Specifically, this research contributes the following:

The Self-Improving Coding Agent (SICA) eliminates the distinction between meta-agent and target agent, enabling it to edit its own codebase and self-improve in terms of cost, speed, and benchmark performance.

Self-referential agents can effectively improve their own implementation. The researchers found that, even while accounting for safety constraints and resource efficiency, performance on a fixed random subset of SWE Bench Verified problems improved from 17% to 53%.

The researchers shared the implementation of the Self-Improving Coding Agent (SICA) with the community. SICA is implemented in standard Python without domain-specific languages, providing a reference agent framework for building new SICA systems and for those seeking to post-train LLMs for tool usage and other agent tasks.

Method Overview

SICA's main operational loop is similar to the ADAS loop described in Hu et al. (2024). In particular, both SICA and ADAS maintain an archive of previous agents and their benchmark results.

In the first step, SICA selects the best-performing agent so far from the archive to serve as the meta-agent, and instructs it to review the archive, identify an improvement strategy, and implement it.

Algorithm 1 illustrates this high-level scheme:

[Algorithm 1: the SICA self-improvement loop]

Note that this differs from ADAS, which has a fixed meta-agent and therefore does not select the meta-agent from the archive (the archive in ADAS is for target agents, not meta-agents).

Secondly, both ADAS and SICA evaluate the new agent on a suite of evaluation benchmarks and store the results in the archive.
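
To make this loop concrete, here is a minimal Python sketch of the high-level scheme. The names (ArchiveEntry, run_agent, run_benchmarks) are illustrative placeholders, not the actual SICA API.

```python
from dataclasses import dataclass


@dataclass
class ArchiveEntry:
    codebase: str   # snapshot of (or path to) the agent's code at this iteration
    utility: float  # benchmark utility achieved by that agent version


def run_benchmarks(codebase: str) -> float:
    """Placeholder: run the benchmark suite for this agent and return its utility."""
    raise NotImplementedError


def run_agent(codebase: str, task: str, context: list) -> str:
    """Placeholder: run the agent defined by `codebase` on `task`; returns the new codebase."""
    raise NotImplementedError


def self_improvement_loop(initial_codebase: str, iterations: int) -> list[ArchiveEntry]:
    """Sketch of SICA's outer loop: the best agent so far edits its own code."""
    archive = [ArchiveEntry(initial_codebase, run_benchmarks(initial_codebase))]
    for _ in range(iterations):
        # 1. Select the best-performing agent from the archive to act as the meta-agent.
        meta = max(archive, key=lambda entry: entry.utility)
        # 2. The meta-agent reviews the archive and edits its own codebase.
        new_codebase = run_agent(
            codebase=meta.codebase,
            task="Review the archive of past agents and their benchmark results, "
                 "identify an improvement, and implement it in this codebase.",
            context=archive,
        )
        # 3. Benchmark the new agent and store it (and its score) in the archive.
        archive.append(ArchiveEntry(new_codebase, run_benchmarks(new_codebase)))
    return archive
```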

The researchers define the "best" performing agent using a utility function that captures common expectations of a good agent: a normalized benchmark performance score p_score ∈ [0, 1], wall-clock time in seconds p_time, and dollar cost p_cost. The basic utility is computed as

U_basic = w_score · p_score + w_cost · (1 − min(p_cost, c_max) / c_max) + w_time · (1 − min(p_time, t_max) / t_max)

where the researchers set the coefficients to w_score = 0.5, w_cost = 0.25, and w_time = 0.25. The worst-case cost per problem is c_max = 10 USD, and a timeout of t_max = 300 seconds is enforced, after which the agent run is cancelled. To still provide partial reward for work done before the timeout, the final utility is computed as follows, with a timeout penalty τ = 0.5:

U_final = U_basic · (1 − τ · 1[timeout]), where 1[timeout] is 1 if the run hit the timeout and 0 otherwise.

It is important to note that, since no weight updates are performed, this numerical score is used only to select the meta-agent and the base agent for the next iteration.
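
Read literally, the two formulas above could be implemented as follows. This is a hypothetical sketch with our own variable names, assuming the timeout penalty simply scales the basic utility.

```python
def utility(p_score: float, p_cost: float, p_time: float, timed_out: bool,
            w_score: float = 0.5, w_cost: float = 0.25, w_time: float = 0.25,
            cost_max: float = 10.0, time_max: float = 300.0, tau: float = 0.5) -> float:
    """Sketch of the utility above: reward benchmark score, penalize cost and time,
    and apply the timeout penalty tau if the 300-second limit was hit."""
    u_basic = (
        w_score * p_score
        + w_cost * (1.0 - min(p_cost, cost_max) / cost_max)
        + w_time * (1.0 - min(p_time, time_max) / time_max)
    )
    return u_basic * (1.0 - tau) if timed_out else u_basic
```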

The paper first introduces the initial coding agent, then the benchmark-running framework, and shows how this framework naturally allows posing a self-referential task, namely improving the coding agent itself.

The structure of the agent's context is crucial: it contains not only prompts but also, for example, the contents of open files. Figure 3 shows the context structure of the initial coding agent.

[Figure: structure of the initial coding agent's context]

First comes the system prompt, which contains the agent's definition and lists the tools and callable sub-agents available to it. The end of the system prompt includes system information, such as instructions on how to exit the agent loop and return to the calling process.

Next is the "core prompt," set as the first user message in the chat template. It contains the problem statement specified by the caller (which may be a user invoking the agent, or an agent invoking a sub-agent). Here the researchers also insert a view of the files the agent has open and a representation of the current working directory.

Finally, the rest of the LLM context consists of the running assistant messages. These include the agent's chain of reasoning about the problem, tool calls and their responses, sub-agent calls and their responses, and any notifications or comments from an asynchronous oversight system.
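
As an illustration of this layout (not the paper's actual code), the context can be assembled as a standard chat-template message list, with the core prompt as the first user message and everything after it appended as assistant-side events:

```python
def build_context(system_prompt: str, problem_statement: str,
                  open_files: dict[str, str], cwd: str,
                  assistant_events: list[dict]) -> list[dict]:
    """Hypothetical assembly of the agent context described above."""
    # Core prompt: the caller's problem statement plus a view of open files and the cwd.
    file_view = "\n\n".join(
        f"<open_file path='{path}'>\n{body}\n</open_file>"
        for path, body in open_files.items()
    )
    core_prompt = f"{problem_statement}\n\nWorking directory: {cwd}\n\n{file_view}"
    return [
        {"role": "system", "content": system_prompt},  # agent, tool, and sub-agent definitions
        {"role": "user", "content": core_prompt},      # first user message from the caller
        *assistant_events,                              # reasoning, tool calls, overseer notes
    ]
```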

The LLM context is laid out this way so that new content can be appended while preserving the KV cache, which reduces latency and cost between tool calls. For the base agent, a simple scheme is used in which file edits are appended as diffs and, once enough diffs have accumulated, periodically re-merged into the stale file view. Agents can write more sophisticated KV-cache-preserving schemes to reduce costs further.
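
A minimal sketch of that diff-accumulation idea is shown below. The class name, the threshold of five diffs, and the tag format are our own assumptions; the point is that appending diffs keeps the earlier context (and its KV cache) intact, and the full file view is only rebuilt occasionally.

```python
import difflib


class CachedFileView:
    """Append edits as diffs so the earlier prompt prefix (and its KV cache) stays valid,
    re-merging into a fresh file snapshot only after several diffs have accumulated."""

    def __init__(self, path: str, content: str, merge_every: int = 5):
        self.path = path
        self.snapshot = content        # the (possibly stale) full-file view held in the context
        self.current = content         # the true, up-to-date file content
        self.pending_diffs: list[str] = []
        self.merge_every = merge_every

    def apply_edit(self, new_content: str) -> str:
        """Record an edit and return the text to append to the agent's context."""
        diff = "".join(difflib.unified_diff(
            self.current.splitlines(keepends=True),
            new_content.splitlines(keepends=True),
            fromfile=self.path, tofile=self.path,
        ))
        self.current = new_content
        self.pending_diffs.append(diff)
        if len(self.pending_diffs) >= self.merge_every:
            # Re-merge: refresh the stale snapshot (invalidating the cache from here on).
            self.snapshot, self.pending_diffs = self.current, []
            return f"<refreshed_file path='{self.path}'>\n{self.snapshot}\n</refreshed_file>"
        return diff
```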

Experimental Results

The researchers first report runs on benchmarks with relatively long-horizon tasks. They chose SWE Bench Verified, which exercises problem decomposition, code navigation, and fast, efficient file editing, and selected a fixed random subset of 50 problems from it. They also randomly sampled 50 problems from LiveCodeBench, which resemble competitive-programming problems and typically involve more theoretical reasoning. In addition, the researchers included two "synthetic" benchmark tasks that are defined and curated entirely within the agent's codebase.

Most of the agent system was run on Anthropic's Sonnet 3.5 (v2), with only a "reasoning" sub-agent using o3-mini. Results are shown in Figure 3; more detailed iteration-level performance data can be found in Table 1.

[Figure: benchmark performance across SICA iterations]

The results show encouraging progress. On the file-editing benchmark, the agent initially could only overwrite whole files, which was slow and token-inefficient, but it quickly implemented self-described "intelligent" diff-based and range-based editing tools. On the symbol-navigation task, the agent implemented a simple AST-based symbol locator in the 9th iteration (perhaps exploiting the fact that the benchmark is primarily Python-based), yielding a solid improvement that also carried over to other tasks. LiveCodeBench scores improved slightly as well, though less markedly than on some of the other benchmarks.
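
The paper does not reproduce the agent's locator code, but to illustrate why an AST-based approach is natural for a Python-heavy benchmark, here is a small, hypothetical symbol locator built on Python's standard ast module:

```python
import ast


def find_symbol(source: str, name: str, filename: str = "<file>") -> list[tuple[int, str]]:
    """Return (line number, node kind) for every function, class, or assignment
    that binds `name` in the given Python source."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if node.name == name:
                hits.append((node.lineno, type(node).__name__))
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == name:
                    hits.append((node.lineno, "Assign"))
    return hits
```

Given a file's source and a symbol name, this returns the line numbers and node kinds of every matching definition, which is enough for an agent to jump its file view to the relevant location.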

The researchers also evaluated how effectively the self-referential agent system improves performance in domains that demand stronger reasoning. They ran another experiment using two question-answering tasks from the benchmark set; Figure 4 shows the results. The improvements here are smaller and highlight an important interaction between the base model and the scaffolding built around it.

In Figure 4, the base agent system in iteration 0 uses Sonnet 3.5, and the "reasoning" sub-agent uses o3-mini. On its own, o3-mini scores 87% on AIME and 79% on GPQA Diamond (at "high" reasoning effort), whereas the full agent system achieved an average score of only 76% across these two benchmarks.

[Figure 4: agent performance on AIME and GPQA Diamond across iterations]

By inspecting the trajectories, the researchers found that in many runs the main agent simply delegated the problem to the o3-mini-based reasoning agent, without fully exploiting the rest of the agent system. For "reasoning models" such as o1, o3-mini, or DeepSeek-R1, the researchers suspect that adding coarse scaffolding components designed to induce reasoning behavior (like those introduced in iterations 4 and 6 in Figure 4) may actually interrupt the chain of thought of models that were not trained for such an agent setting, leading to degraded performance.

For more research details, please refer to the original paper.
