Recently, research on and discussion of AI self-evolution/self-improvement have become increasingly frequent.
Earlier this month, we compiled and reported on some of this work, including the "Darwin-Gödel Machine (DGM)" from Sakana AI and the University of British Columbia, CMU's "Self-Reward Training (SRT)", Shanghai Jiao Tong University's "MM-UPT", a continual self-improvement framework for multimodal large models, and "UI-Genie", a self-improvement framework from The Chinese University of Hong Kong and vivo. See our earlier article: "Will the Father of LSTM's 22-Year-Old Vision Come True? A Week's Burst of AI 'Self-Evolution' Papers: Is a New Trend Emerging?"
Since then, related research has continued to emerge, with the following image showcasing some examples:
Just a few days ago, OpenAI CEO and well-known 𝕏 personality Sam Altman, in his blog post "The Gentle Singularity," envisioned a future where AI/intelligent robots achieve self-improvement. He wrote: "We'll have to make the first few million humanoid robots the old-fashioned way, but after that, they'll be able to operate entire supply chains to build more robots, and those robots can then build more chip fabrication facilities, data centers, and so on."
Soon after, an 𝕏 user @VraserX claimed that an OpenAI insider had revealed the company was already internally running recursively self-improving AI. This tweet sparked widespread discussion – some found it unsurprising, while others questioned the authenticity of this so-called "OpenAI insider."
https://x.com/VraserX/status/1932842095359737921
Regardless, AI is indeed moving toward self-evolution.
MIT's recently published "Self-Adapting Language Models" is one of the latest examples, proposing a method for LLMs to update their own weights: SEAL🦭, or Self-Adapting LLMs. In this framework, LLMs can generate their own training data (self-editing) and update weights based on new inputs. This self-editing is learned through reinforcement learning, with the reward being the improved downstream performance of the updated model.
Paper Title: Self-Adapting Language Models
Paper Address: https://arxiv.org/pdf/2506.10943
Project Page: https://jyopari.github.io/posts/seal
Code Address: https://github.com/Continual-Intelligence/SEAL
This paper sparked widespread debate upon its release. On Hacker News, some users commented that while the self-editing method is clever, it doesn't yet constitute a "continuously self-improving agent."
The first author of the paper, Adam Zweiger, also provided a similar explanation on 𝕏:
Others suggested that this indicates we are approaching the so-called "event horizon" – a concept that also appeared in the very first sentence of Sam Altman's "The Gentle Singularity" blog, though Altman's phrasing was more aggressive: "we have already crossed the event horizon." Simply put, the event horizon refers to an irreversible tipping point, beyond which humanity will inevitably enter a phase of profound transformation, such as the path to superintelligence.
Of course, some are also wary and concerned about self-improving AI.
Let's now look at the results obtained from this popular research paper.
Self-Adapting Language Models (SEAL)
The SEAL framework allows a language model to self-improve when it encounters new data, by generating its own synthetic data and optimization parameters (self-editing).
The model is trained to generate these self-edits (SE) directly as tokens, conditioned on the data provided in its context.
Self-edit generation is learned through reinforcement learning: the model is rewarded when the self-edits it generates, once applied, improve its performance on the target task.
Therefore, SEAL can be understood as an algorithm containing two nested loops: an outer RL loop for optimizing self-edit generation; and an inner update loop that uses the generated self-edits to update the model through gradient descent.
This method can be seen as an instance of meta-learning: the model meta-learns how to generate self-edits that make its own subsequent update effective.
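To make this nested structure concrete, here is a minimal Python sketch of the two loops. It is only an illustration of the description above, not the authors' released code, and every helper callable (generate_self_edit, sft_update, evaluate, update_policy) is a hypothetical placeholder.

```python
from typing import Any, Callable, List, Sequence, Tuple

def seal_training(
    model: Any,
    tasks: Sequence[Tuple[Any, Any]],                 # (C, tau) task instances
    generate_self_edit: Callable[[Any, Any], Any],    # (model, C) -> SE
    sft_update: Callable[[Any, Any], Any],            # (model, SE) -> adapted model (inner loop)
    evaluate: Callable[[Any, Any], float],            # (adapted model, tau) -> reward r
    update_policy: Callable[[Any, List[Tuple[Any, Any, float]]], Any],  # outer RL step
    num_outer_iters: int = 2,
) -> Any:
    """Skeleton of SEAL's two nested loops: an outer RL loop over self-edit
    generation and an inner supervised fine-tuning loop that applies each edit."""
    for _ in range(num_outer_iters):                  # outer RL loop
        rollouts: List[Tuple[Any, Any, float]] = []
        for context, eval_task in tasks:
            self_edit = generate_self_edit(model, context)   # model writes its own training data
            adapted = sft_update(model, self_edit)           # inner loop: gradient descent on SE
            reward = evaluate(adapted, eval_task)            # did the update help on tau?
            rollouts.append((context, self_edit, reward))
        model = update_policy(model, rollouts)               # reinforce edits that helped
    return model
```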
General Framework
Let θ denote the parameters of the language model LM_θ. SEAL operates on a single task instance (C, τ), where C is the context containing task-related information, and τ defines the downstream evaluation used to assess model adaptation.
For example, in a knowledge integration task, C is a passage intended to be integrated into the model's internal knowledge, and τ is a set of questions and answers about that passage. In a few-shot learning task, C contains few-shot demonstrations of a new task, and τ consists of query inputs and their ground-truth outputs.
Given C, the model generates a self-edit SE (whose form varies by domain) and updates its own parameters through supervised fine-tuning: θ′ ← SFT(θ, SE).
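For intuition, here is a minimal sketch of what such an inner-loop update could look like with Hugging Face transformers and peft, since the paper applies lightweight LoRA updates in this step for efficiency. The model name, LoRA settings, and bare-bones training loop below are illustrative choices, not the paper's exact configuration.

```python
# A minimal sketch of the inner-loop update theta' <- SFT(theta, SE).
# Model name, LoRA settings, and the loop itself are illustrative, not the paper's config.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

def sft_on_self_edit(model_name: str, self_edit: str, steps: int = 10, lr: float = 1e-4):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    # Wrap the base model with low-rank adapters so each inner update stays lightweight.
    lora_cfg = LoraConfig(r=16, lora_alpha=32,
                          target_modules=["q_proj", "v_proj"],
                          task_type="CAUSAL_LM")
    model = get_peft_model(model, lora_cfg)
    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr)

    batch = tokenizer(self_edit, return_tensors="pt")
    model.train()
    for _ in range(steps):
        # Standard causal-LM (next-token) loss on the self-edit tokens.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return model  # this plays the role of LM_{theta'}
```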
The team used reinforcement learning to optimize the self-edit generation process: the model performs an action (generates SE), then receives a reward r based on LM_θ′'s performance on τ, and updates its policy to maximize the expected reward:
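The objective (equation (1) in the paper) can be written roughly as follows; this is a reconstruction from the description above, where D denotes the distribution over task instances (C, τ) and minimizing the loss corresponds to maximizing the expected reward:

$$
\mathcal{L}^{\mathrm{RL}}(\theta_t) := -\,\mathbb{E}_{(C,\tau)\sim\mathcal{D}}\!\left[\mathbb{E}_{\mathrm{SE}\sim\mathrm{LM}_{\theta_t}(\cdot \mid C)}\big[\, r(\mathrm{SE}, \tau, \theta_t) \,\big]\right] \tag{1}
$$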
However, unlike in standard reinforcement learning, here the reward assigned to a given action depends on the model parameters θ at the time the action is taken (because θ is updated to θ′ and then evaluated).
As a result, the underlying reinforcement learning state must include the policy's parameters and is given by (C, θ), even though the policy's observation is limited to C (placing θ directly in the context is not feasible).
This means that (state, action, reward) triplets collected with an earlier version of the model, θ_old, may be stale and mismatched with the current model θ_current. The team therefore adopted an on-policy approach, in which self-edits SE are sampled from the current model and, crucially, rewards are also computed with the current model.
The team experimented with various on-policy methods, such as Group Relative Policy Optimization (GRPO) and Proximal Policy Optimization (PPO), but found training to be unstable.
Ultimately, they chose ReST^EM from the DeepMind paper "Beyond human data: Scaling self-training for problem-solving with language models," which is a simpler method based on filtered behavioral cloning – i.e., "Rejection Sampling + SFT."
ReST^EM can be viewed as an Expectation-Maximization (EM) procedure: the E-step samples candidate outputs from the current model policy, and the M-step reinforces, via supervised fine-tuning, only those samples that receive a positive reward. This method optimizes an approximation of objective (1) under the following binary reward:
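Spelled out (again a reconstruction from the description, corresponding to equation (2) in the paper):

$$
r(\mathrm{SE}, \tau, \theta_t) =
\begin{cases}
1, & \text{if adapting with } \mathrm{SE} \text{ improves performance on } \tau \\
0, & \text{otherwise}
\end{cases} \tag{2}
$$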
More precisely, optimizing (1) requires computing the gradient of the objective with respect to θ_t. However, in this setup the reward term r(SE, τ, θ_t) depends on θ_t and is not differentiable. To address this, the team treats the reward as fixed with respect to θ_t. With this approximation, for a mini-batch of N contexts with M sampled self-edits per context, the Monte Carlo estimator becomes:
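Reconstructed from the variables described just below (with T_ij denoting the token length of self-edit SE_ij, a symbol introduced here for clarity), the estimator, equation (4) in the paper, takes roughly this form:

$$
\nabla_{\theta_t}\mathcal{L}^{\mathrm{RL}} \approx -\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} r_{ij}\sum_{s=1}^{T_{ij}} \nabla_{\theta_t}\log p_{\theta_t}\!\left(y_s^{(i,j)} \,\middle|\, y_{<s}^{(i,j)},\, C_i\right) \tag{4}
$$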
where p_θ_t denotes the model's autoregressive distribution and y_s^(i,j) is the s-th token of self-edit SE_ij, the j-th sample for context C_i. Since sequences with r = 0 can be dropped from (4), the team shows that under the binary reward (2) (with a stop-gradient applied to the reward term), ReST^EM optimizes (1) simply by running SFT on the good self-edits. Algorithm 1 presents SEAL's training loop.
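Putting the pieces together, one outer iteration of this ReST^EM-style loop can be sketched as follows. The helper callables are hypothetical placeholders, and the number of samples per context is an illustrative value rather than the paper's setting.

```python
from typing import Any, Callable, List, Sequence, Tuple

def rest_em_iteration(
    model: Any,
    tasks: Sequence[Tuple[Any, Any]],       # (C, tau) task instances
    sample_self_edits: Callable,            # (model, C, k) -> k self-edits sampled on-policy
    adapt_and_score: Callable,              # (model, SE, tau) -> binary reward in {0, 1}
    sft_on_pairs: Callable,                 # (model, [(C, SE), ...]) -> updated model
    samples_per_context: int = 4,           # illustrative value, not the paper's setting
) -> Any:
    """One outer ReST^EM iteration: sample self-edits, keep only those with a
    positive reward, then fine-tune the self-edit policy on the kept samples."""
    kept: List[Tuple[Any, Any]] = []
    for context, eval_task in tasks:                    # E-step: on-policy sampling
        for self_edit in sample_self_edits(model, context, samples_per_context):
            if adapt_and_score(model, self_edit, eval_task) > 0:   # binary reward (2)
                kept.append((context, self_edit))
    # M-step: supervised fine-tuning on the positively rewarded (context, self-edit)
    # pairs only, i.e. "SFT on good self-edits".
    return sft_on_pairs(model, kept)
```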
Finally, they also note that while this paper's implementation uses a single model to generate and learn from self-edits, these roles can also be separated. In such a "teacher-student" form, the student model would be updated using edits proposed by another teacher model. The teacher model would then be trained via reinforcement learning to generate edits that maximize the student's learning effectiveness.
Instantiating SEAL for Specific Domains
With the theory in place, the team also developed instances of SEAL. Specifically, they chose two domains: knowledge integration and few-shot learning.
In knowledge integration, the goal is to effectively integrate information provided in an article into the model's weights. The image below shows the relevant setup.
The image below shows the setup for few-shot learning.
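To give a flavor of what a self-edit actually contains in each domain: in knowledge integration it is free-form text derived from the passage (for example, restated facts or implications) that the inner loop then fine-tunes on, while in few-shot learning it specifies which data augmentations and optimization settings to use for the test-time update. The snippet below only illustrates these two shapes; the exact prompts and fields are defined in the paper and differ from this sketch.

```python
# Illustrative shapes of a self-edit in the two domains. Field names and content are
# hypothetical stand-ins; the paper's actual prompts and tool options differ.

# Knowledge integration: the self-edit is free-form text derived from the passage
# (e.g., restated facts or implications), which the inner loop fine-tunes on.
knowledge_self_edit = (
    "Implications of the passage:\n"
    "1. <a fact restated from the passage>\n"
    "2. <an implication that follows from the passage>\n"
)

# Few-shot learning (ARC): the self-edit selects data augmentations and
# optimization settings for the test-time training run.
fewshot_self_edit = {
    "augmentations": ["rotate_90", "flip_horizontal"],
    "learning_rate": 1e-4,
    "epochs": 3,
}
```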
For a more detailed description of these two instantiations, please refer to the original paper. Below, let's look at SEAL's actual performance.
Experimental Results
Few-shot Learning
The model used in the experiment was Llama-3.2-1B-Instruct, with ARC (the Abstraction and Reasoning Corpus) as the benchmark. The methods compared include ICL (In-Context Learning), TTT + Self-Editing (without reinforcement learning), and Oracle TTT. The results are shown in the table below.
As can be seen, SEAL significantly improved the adaptation success rate compared to the baselines: 72.5%, versus 20% when using self-edits from the base model without reinforcement learning training and 0% with no adaptation. However, performance still falls short of Oracle TTT, indicating room for further improvement.
Knowledge Integration
Knowledge integration used a larger Qwen2.5-7B, aiming to integrate new factual content from SQuAD articles. The methods compared here include the base model, the model trained only on articles, the model trained on articles + synthetic data, and the model trained on articles + GPT-4.1 synthetic data. The results are shown in the table below.
As can be seen, SEAL's accuracy surpassed the baselines in both the single-passage (n = 1) and continued-pretraining (n = 200) settings.
Even synthetic data generated by the base Qwen2.5-7B model (before any reinforcement learning) already yielded a notable improvement, from 32.7% to 39.7% and 41.0% in the two settings respectively; reinforcement learning then brought further gains, to 47.0% and 43.8%.
Figure 4 shows the accuracy after each outer reinforcement-learning iteration.
As can be seen, two iterations were enough for SEAL to surpass the setup using GPT-4.1 data; subsequent iterations showed diminishing returns, indicating that the policy quickly converges to an editing format that distills passages into easily learnable atomic facts (see the qualitative example in Figure 5).
In this example, we can see how reinforcement learning leads to the generation of more detailed self-edits, resulting in better performance. While the progress is clear in this example, in other cases, the differences between iterations can sometimes be more subtle.
Additionally, the team discussed some limitations of the SEAL framework in the paper, including catastrophic forgetting, computational overhead, and context-dependent evaluation. Please refer to the original paper for details.
Finally, a small survey: when do you think true self-evolving AI will be realized?