Synced Review
Editor: Panda
Recently, research and discussion on AI self-evolution and self-improvement has been appearing at a rapidly increasing pace.
At the beginning of this month, we compiled and reported on some of this work, including the "Darwin-Gödel Machine (DGM)" from Sakana AI in collaboration with the University of British Columbia and other institutions, CMU's "Self-Rewarding Training (SRT)," the continuous self-improvement framework for multimodal large models "MM-UPT" proposed by Shanghai Jiao Tong University and other institutions, and the "UI-Genie" self-improvement framework from The Chinese University of Hong Kong in collaboration with vivo and other institutions. See the earlier article: "LSTM Father's 22-Year-Old Vision Coming True? A Burst of AI 'Self-Evolution' Papers in a Single Week, New Trends Emerging?"
Since then, related research has continued to emerge. The following image collage shows some examples:
A few days ago, OpenAI CEO and noted 𝕏 influencer Sam Altman even envisioned a future in which AI and intelligent robots improve themselves, writing in his blog post "The Gentle Singularity": "We will have to make the first few million humanoid robots in the traditional way, but after that, they will be able to operate the entire supply chain to make more robots, and those robots can then build more chip manufacturing facilities, data centers, and so on."
Shortly after, 𝕏 user @VraserX posted that an OpenAI insider claimed the company was already running AI capable of recursive self-improvement internally. The tweet sparked widespread discussion: some said it was unsurprising, while others questioned whether this so-called "OpenAI insider" was genuine.
https://x.com/VraserX/status/1932842095359737921
Regardless, AI is indeed moving towards self-evolution.
MIT's "Self-Adapting Language Models," released yesterday, is one of the latest examples, proposing a method that allows LLMs to update their own weights: SEAL🦭, or Self-Adapting LLMs. In this framework, LLMs can generate their own training data (self-editing) and update weights based on new inputs. This self-editing can be learned through reinforcement learning, with the reward being the improved downstream performance of the updated model.
Paper Title: Self-Adapting Language Models
Paper URL: https://arxiv.org/pdf/2506.10943
Project Page: https://jyopari.github.io/posts/seal
Code URL: https://github.com/Continual-Intelligence/SEAL
This paper sparked widespread discussion upon its release. On Hacker News, one user commented that the self-editing method is very clever, but that it cannot yet be called a "continually self-improving agent."
The paper's first author, Adam Zweiger, also provided a similar explanation on 𝕏:
Others suggested that this indicates we are approaching the so-called event horizon—a concept that also appeared in the very first sentence of Sam Altman's "The Gentle Singularity" blog, though Altman was more aggressive, stating that "we have already crossed the event horizon." Simply put, an event horizon refers to an irreversible tipping point, beyond which humanity will inevitably enter a phase of profound transformation, such as the path to superintelligence.
Of course, some are also wary and concerned about self-improving AI.
Next, let's look at the findings of this popular research paper.
Self-Adapting Language Models (SEAL)
The SEAL framework allows language models to self-improve when encountering new data by generating their own synthetic data and optimizing parameters (self-editing).
The model is trained to generate these self-edits (SE) directly as tokens, conditioned on the data provided in its context.
Self-edit generation is learned through reinforcement learning, where the model is rewarded when the generated self-edits improve the model's performance on the target task after application.
Therefore, SEAL can be understood as an algorithm with two nested loops: an outer reinforcement-learning loop that optimizes self-edit generation, and an inner update loop that uses the generated self-edits to update the model via gradient descent.
This method can be seen as an instance of meta-learning: the model learns, in a meta-learning fashion, how to generate effective self-edits.
General Framework
Let θ denote the parameters of the language model LM_θ. SEAL operates on a single task instance (C, τ), where C is a context containing information relevant to the task, and τ defines the downstream evaluation used to assess the model's adaptation.
For example, in a knowledge integration task, C is a passage whose content is to be integrated into the model's internal knowledge, and τ is a set of questions about that passage together with their corresponding answers. In a few-shot learning task, C contains few-shot demonstrations of a new task, and τ consists of query inputs and their ground-truth outputs.
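For the knowledge-integration case, a single task instance (C, τ) might be represented roughly as follows; this is a purely illustrative sketch, and the field names and example passage are ours, not taken from the released code:

# A hypothetical representation of one SEAL task instance (C, tau) for knowledge integration.
# Field names and the example passage are illustrative only.
task_instance = {
    # C: the passage whose content should be folded into the model's weights
    "context": "Apollo 11 was the first crewed mission to land humans on the Moon, in July 1969.",
    # tau: the downstream evaluation -- questions about the passage plus reference answers,
    # which the adapted model must later answer *without* the passage in its context
    "evaluation": [
        {"question": "Which mission first landed humans on the Moon?", "answer": "Apollo 11"},
        {"question": "In what year did the first crewed Moon landing take place?", "answer": "1969"},
    ],
}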
Given C, the model generates a self-edit SE (whose form varies by domain) and updates its own parameters via supervised fine-tuning: θ′ ← SFT(θ, SE).
The team used reinforcement learning to optimize the self-edit generation process: the model performs an action (generates SE), then receives a reward r based on LM_θ′'s performance on τ, and updates its policy to maximize the expected reward:
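In rough form (a reconstruction from the description above, with D denoting the distribution of task instances, so the notation may differ slightly from the paper, which states an equivalent loss to be minimized):

$$
\max_{\theta}\; \mathbb{E}_{(C,\tau)\sim\mathcal{D}}\Big[\, \mathbb{E}_{SE\sim \mathrm{LM}_{\theta}(\cdot\mid C)}\big[\, r(SE,\tau,\theta) \,\big] \Big]
$$

This expected-reward objective is what is referred to as (1) below.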
However, unlike standard reinforcement learning settings, in this setup, the reward assigned to a given action depends on the model parameters θ at the time of action execution (because θ will be updated to θ′ and then evaluated).
Thus, the underlying reinforcement learning state must include the policy's parameters and be given by (C, θ), even if the policy's observations are limited to C (placing θ directly in context is not feasible).
This means that (state, action, reward) triplets collected with an older version of the model, θ_old, might be outdated and inconsistent with the current model θ_current. The team therefore adopted an on-policy approach: self-edits SE are sampled from the current model, and, crucially, rewards are also computed with the current model.
The team experimented with various on-policy methods, such as Group Relative Policy Optimization (GRPO) and Proximal Policy Optimization (PPO), but found training to be unstable.
Ultimately, they chose ReST^EM, from the DeepMind paper "Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models," a simpler method based on filtered behavior cloning, i.e., "rejection sampling + SFT."
ReST^EM can be viewed as an Expectation-Maximization (EM) process: the E-step samples candidate outputs from the current model policy, and the M-step reinforces only those samples that receive a positive reward through supervised fine-tuning. This method optimizes an approximation of objective (1) under the following binary reward:
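In rough form (a reconstruction; the exact improvement criterion is specified in the paper):

$$
r(SE,\tau,\theta_t) =
\begin{cases}
1, & \text{if applying } SE \text{ to } \mathrm{LM}_{\theta_t} \text{ improves performance on } \tau,\\
0, & \text{otherwise.}
\end{cases}
$$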
More precisely, optimizing (1) requires computing its gradient, but in this setting the reward term r(SE, τ, θ_t) depends on θ_t and is non-differentiable. To address this, the team treated the reward as fixed with respect to θ_t. With this approximation, for a mini-batch containing N contexts and M sampled self-edits per context, the Monte Carlo estimator becomes:
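(a reconstruction consistent with the definitions given below, where T_ij denotes the number of tokens in SE_ij)

$$
\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} r_{ij} \sum_{s=1}^{T_{ij}} \nabla_{\theta_t} \log p_{\theta_t}\!\left( y_s^{(i,j)} \,\middle|\, y_{<s}^{(i,j)},\, C_i \right)
$$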
where p_θ_t denotes the model's autoregressive distribution and y_s^(i,j) is the s-th token of self-edit SE_ij, i.e., the j-th sample for context C_i. Since sequences with r = 0 can be ignored in (4), it follows that under the binary reward (2), with a stop-gradient applied to the reward term, ReST^EM optimizes (1) simply by running SFT on the good self-edits. Algorithm 1 presents SEAL's training loop.
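To make the loop concrete, here is a minimal, self-contained Python sketch of this kind of training procedure; the class and function names are ours and the components are toy stubs, not the paper's released implementation. It samples several candidate self-edits per context, applies each one with an inner SFT update, keeps only those whose adapted model improves on τ, and then fine-tunes the current model to generate the surviving self-edits.

import copy
import random

# --- Toy stand-ins (a real system would use an LLM, an SFT trainer, and a task evaluator) ---

class LM:
    """Minimal stand-in for a language model with trainable parameters."""
    def __init__(self):
        self.params = 0.0

    def generate_self_edit(self, context):
        # In SEAL this is autoregressive sampling of a self-edit SE given context C.
        return f"synthetic training data derived from: {context} ({random.random():.3f})"

def apply_self_edit(model, self_edit):
    """Inner loop: theta' <- SFT(theta, SE), i.e. fine-tune on the self-edit's content."""
    adapted = copy.deepcopy(model)
    adapted.params += 0.01                              # placeholder for a few gradient steps
    return adapted

def reinforce_generation(model, context_edit_pairs):
    """M-step: fine-tune the model to *generate* the kept self-edits given their contexts."""
    updated = copy.deepcopy(model)
    updated.params += 0.01 * len(context_edit_pairs)    # placeholder update
    return updated

def evaluate(model, tau):
    """Stub evaluation of the (adapted) model on downstream task tau; returns an accuracy."""
    return random.random()

# --- Outer RL loop with ReST^EM-style filtered behavior cloning ---

def seal_training_loop(model, tasks, num_iterations=3, samples_per_context=4):
    for _ in range(num_iterations):
        kept = []                                       # (C, SE) pairs that earned reward r = 1
        for context, tau in tasks:
            baseline = evaluate(model, tau)
            for _ in range(samples_per_context):        # E-step: sample candidate self-edits
                se = model.generate_self_edit(context)
                adapted = apply_self_edit(model, se)
                reward = 1 if evaluate(adapted, tau) > baseline else 0
                if reward == 1:
                    kept.append((context, se))
        model = reinforce_generation(model, kept)       # M-step: SFT on the good self-edits
    return model

if __name__ == "__main__":
    toy_tasks = [("passage about topic A", ["Q&A about A"]),
                 ("passage about topic B", ["Q&A about B"])]
    seal_training_loop(LM(), toy_tasks)

A real implementation would replace these stubs with actual fine-tuning and evaluation of an LLM; the sketch only mirrors the control flow described above.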
Finally, they also noted that while this paper's implementation uses a single model to generate and learn from self-edits, these roles can also be separated. In such a "teacher-student" setup, the student model would be updated using edits proposed by another teacher model. The teacher model would then be trained via reinforcement learning to generate edits that maximize the student's learning effectiveness.
Instantiating SEAL for Specific Domains
With the theory in place, the team also built instances of SEAL. Specifically, they chose two domains: knowledge integration and few-shot learning.
The goal of knowledge integration is to effectively incorporate the information provided in a passage into the model's weights. The figure below shows the corresponding setup.
The figure below shows the setup for few-shot learning.
For a more detailed description of these two instantiations, please refer to the original paper. Now let's look at SEAL's actual performance.
Experimental Results
Few-Shot Learning
The model used in this experiment was Llama-3.2-1B-Instruct, with ARC as the benchmark. The compared methods included ICL (in-context learning), TTT (test-time training) + self-editing without reinforcement learning, and Oracle TTT. The results are shown in the table below.
As can be seen, SEAL substantially improved the adaptation success rate over the baselines: 72.5%, versus 20% when using the base model's self-edits without reinforcement learning training and 0% with no adaptation. Its performance is still below Oracle TTT, however, indicating that the new method has room for further improvement.
Knowledge Integration
For knowledge integration, a larger Qwen2.5-7B was used, with the goal of integrating new factual content from SQuAD articles. The methods compared here included the base model, a model trained only on articles, a model trained on articles + synthetic data, and a model trained on articles + GPT-4.1 synthetic data. The results are shown in the table below.
It can be seen that in both the single-passage (n = 1) and continued-pretraining (n = 200) settings, SEAL's accuracy exceeded the baselines.
Training on synthetic data generated by the base Qwen2.5-7B model already brought a significant improvement, from 32.7% to 39.7% and 41.0% in the two settings, respectively; reinforcement learning then pushed performance further (47.0% and 43.8%).
Figure 4 shows accuracy after each iteration of the outer reinforcement-learning loop.
It can be seen that two iterations were sufficient for SEAL to surpass the GPT-4.1 data setting; subsequent iterations showed diminishing returns, indicating that the policy quickly converged to an editing style that distills passages into easily learnable atomic facts (see the qualitative example in Figure 5).
In this example, it can be seen how reinforcement learning leads to the generation of more detailed self-edits, resulting in better performance. While progress is evident in this example, differences between iterations can sometimes be more subtle in other examples.
Additionally, the team discussed some limitations of the SEAL framework in the paper regarding catastrophic forgetting, computational overhead, and context-dependent evaluation. Please refer to the original paper for details.
Finally, a small survey: when do you think truly self-evolving AI will be realized?
© THE END
Please contact this official account for authorization to reproduce.
Submissions or media inquiries: liyazhou@jiqizhixin.com