Machine Learning Report
Editor: Zhang Qian
Enabling AI to achieve self-evolution has always been a human dream.
As early as 2003, AI pioneer and father of LSTM Jürgen Schmidhuber proposed a concept called the "Gödel Machine": a recursively self-improving program that rewrites its own code whenever it can formally prove that the new version is better. However, the idea remained purely hypothetical.
In recent years, research on model self-learning and evolution has gradually increased. Many researchers' goals are shifting from simply "training models" to "enabling models to learn and evolve independently." Google's recently released AlphaEvolve is a significant representative in this area.
In the past week, progress in this direction has been particularly abundant. Several papers on enabling LLMs (or agents) to train themselves have appeared on arXiv, including a "Darwin Gödel Machine" directly inspired by the Gödel Machine concept. The self-evolution of AI models may well be accelerating.
In this article, we will detail several recent papers, which are:
The "Darwin Gödel Machine (DGM)" by Sakana AI in collaboration with the University of British Columbia and other institutions: DGM uses foundation models and open-ended algorithms to create and evaluate new AI agents, capable of reading and modifying its own Python codebase for self-improvement, and assessing the effectiveness of changes by evaluating performance on coding benchmarks. Experiments show that DGM can continuously self-improve and transfer across different models and programming languages.
Carnegie Mellon University's "Self-Rewarding Training (SRT)": proposes an online self-training reinforcement learning algorithm called "Self-Rewarding Training," designed to enable large language models to self-supervise and train using their own judgment signals, thereby improving performance without external labels.
Shanghai Jiao Tong University and other institutions' continuous self-improvement framework for multi-modal large models, "MM-UPT": achieves continuous self-improvement of multi-modal large models in a completely unsupervised setting through the reinforcement learning framework GRPO. They proposed a concise and efficient framework: MM-UPT (Multi-Modal Unsupervised Post-Training), and verified its effectiveness on multiple image-text mathematical reasoning benchmarks.
The self-improvement framework "UI-Genie" by The Chinese University of Hong Kong in collaboration with vivo and other institutions: aims to address two core challenges in GUI agents: first, the difficulty of verifying trajectory results, and second, the difficulty of obtaining high-quality training data at scale. To address these two challenges, the research team proposed a reward model and a self-improvement pipeline, respectively.
Darwin Gödel Machine: Enabling AI to Self-Improve by Rewriting its Own Code
Paper Title: Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents
Paper Link: https://arxiv.org/abs/2505.22954
Blog: https://sakana.ai/dgm/
A long-term goal of AI research is to create AI systems capable of continuous learning. One compelling path toward this goal is to let AI improve itself by rewriting its own code, including the code responsible for learning. This idea, proposed decades ago by Jürgen Schmidhuber, is known as the "Gödel Machine": a hypothetical self-improving AI that recursively rewrites its own code once it has mathematically proved that a better strategy exists. It has since become a core concept in meta-learning (i.e., "learning to learn").
While the theoretical Gödel Machine ensures provable beneficial self-modifications, its realization relies on an impractical assumption: AI must be able to mathematically prove that code modifications will bring net benefits before implementing changes.
To address this problem, Sakana AI, in collaboration with Jeff Clune's lab at the University of British Columbia, proposed a more feasible solution: utilizing the principles of open-ended algorithms like Darwinian evolution to empirically test and find improvements that enhance performance.
They named this achievement the "Darwin Gödel Machine (DGM)." The DGM system leverages foundation models to propose code improvements and incorporates recent innovations in open-ended algorithms to search an ever-growing library of diverse, high-quality AI agents. Experiments show that the more compute DGM receives, the more it improves itself. Given the clear trend that learning-based AI systems eventually surpass hand-designed ones, DGM may soon outperform AI systems designed by hand.
The first DGM is a coding agent that can:
read and modify its own code;
evaluate whether modifications improve performance;
open-endedly explore the AI design space.
The Darwin Gödel Machine is a self-improving programming agent that enhances its performance on programming tasks by rewriting its own code. It can make various self-improvements, including: adding patch validation steps, improving the file-viewing tool, enhancing editing tools, generating and ranking multiple solutions to select the best one, and keeping a record of past attempts (including why they failed) when making new modifications.
The Darwin Gödel Machine, by applying open-ended exploration principles, gradually builds an expanding library of agents. This system continuously creates new agents and rates them by alternately performing self-modifications and evaluating downstream tasks.
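To make this loop concrete, here is a minimal sketch of an archive-based self-improvement cycle of the kind described above. Both helper functions are hypothetical placeholders standing in for foundation-model calls and benchmark harnesses, not the authors' actual implementation.

```python
import random

# Minimal sketch of a DGM-style, archive-based self-improvement loop.
# Both helpers are hypothetical placeholders, not the actual DGM implementation.

def propose_code_modification(agent_code: str) -> str:
    # Placeholder: in DGM, a foundation model reads the agent's source and proposes a rewrite.
    return agent_code + "\n# (proposed modification)"

def evaluate_on_benchmark(agent_code: str) -> float:
    # Placeholder: in DGM, the candidate agent is run on coding benchmarks such as SWE-bench.
    return random.random()

def self_improve(initial_agent_code: str, iterations: int = 10) -> list[tuple[str, float]]:
    # The archive keeps *all* discovered agents, not just the current best;
    # this is what makes the search open-ended rather than purely greedy.
    archive = [(initial_agent_code, evaluate_on_benchmark(initial_agent_code))]
    for _ in range(iterations):
        # Pick a parent from the archive (here: simple tournament selection by score).
        candidates = random.sample(archive, k=min(3, len(archive)))
        parent_code, _ = max(candidates, key=lambda pair: pair[1])
        # Self-modification step followed by empirical evaluation of the child.
        child_code = propose_code_modification(parent_code)
        archive.append((child_code, evaluate_on_benchmark(child_code)))
    return archive

best_code, best_score = max(self_improve("# agent v0"), key=lambda pair: pair[1])
print(f"best score in archive: {best_score:.2f}")
```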
On SWE-bench, DGM automatically improved its performance from 20.0% to 50.0%. On Polyglot, its performance jumped from an initial 14.2% to 30.7%, far exceeding Aider, a representative hand-designed agent. These substantial gains demonstrate DGM's ability to discover and implement beneficial changes to its own code.
Self-Rewarding Training for Models: Potential, Collapse, and Mitigation Strategies
Paper Title: Can Large Reasoning Models Self-Train?
Paper Link: https://arxiv.org/abs/2505.21444
Project Address: https://self-rewarding-llm-training.github.io/
Code Address: https://github.com/tajwarfahim/srt
Dataset: https://huggingface.co/collections/ftajwar/self-rewarding-llm-training-6835218091832c3664176553
Reinforcement learning with verifiable rewards has significantly enhanced the reasoning capabilities of large language models, especially in mathematics and coding. However, this method relies on human-created ground-truth verifiers, making the generation of reward signals for each problem costly and limited. In this work, the research team asks the following questions:
Can reasoning models self-train using only their own feedback, without access to ground truth labels?
Can the performance of self-training reach the level of reinforcement learning trained with ground truth labels?
Can self-training continue indefinitely? Will its improvements eventually be limited?
Which strategies can effectively sustain a model's self-training?
Self-Rewarded Training (SRT)
Inspired by previous research on consistency-based self-improvement, the research team introduced a simple yet effective self-training reinforcement learning methodology called Self-Rewarded Training (SRT). This method evaluates correctness during reinforcement learning training through consistency among multiple solutions generated by the model, thereby providing self-supervised signals without labeled data.
SRT overview: in standard reinforcement learning with verifiable rewards (RLVR), reward signals come from a ground-truth verifier. SRT instead drops the verifier, estimates the correct answer by majority voting over the model's own generations, and uses this surrogate reward signal to train the model.
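A minimal sketch of how such a consistency-based reward can be computed from a batch of sampled solutions, assuming the final answer has already been extracted from each generation (the function names are illustrative, not the authors' code):

```python
from collections import Counter

def srt_style_rewards(answers: list[str]) -> tuple[str, list[float]]:
    """Self-reward via consistency: answers matching the majority vote get reward 1, others 0.

    `answers` are the final answers extracted from several solutions sampled for the
    *same* prompt; no ground-truth label or external verifier is used anywhere.
    """
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if answer == pseudo_label else 0.0 for answer in answers]
    return pseudo_label, rewards

# Example: five sampled solutions to one math problem.
label, rewards = srt_style_rewards(["42", "42", "17", "42", "9"])
print(label, rewards)  # -> 42 [1.0, 1.0, 0.0, 1.0, 0.0]
```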
SRT Matches RL Performance in Early Training Stages
The research team empirically demonstrated that, in the early stages of training, SRT achieves performance comparable to standard reinforcement learning trained explicitly on gold-standard answers, with test sets including AMC, AIME24, and AIME25. However, they found that performance eventually collapses, as shown by the training curve on the DAPO dataset in the rightmost plot.
Self-training inevitably collapses
The research team analyzed the training dynamics of SRT when trained on the challenging DAPO dataset.
These findings indicate that the model learns to maximize self-assigned rewards by producing consistent (see second image above) but incorrect (see far left image above) answers. Manual inspection confirmed this: after collapse, the model's output degenerates into random token sequences with a fixed, prompt-independent answer (e.g., "the answer is 1"). This behavior has a simple and precise theoretical basis:
The reinforcement learning optimization problem defined by the SRT objective explicitly encourages consistency among outputs, regardless of their correctness. Therefore, the optimal strategy under this objective degenerates to producing the same answer regardless of the input, thereby artificially maximizing the reward. Continuously self-training on this proxy objective naturally drives the model towards this trivial solution, especially when this solution is simpler than solving the actual task.
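One way to make this argument concrete (our notation, not the paper's): if each sampled answer is rewarded for agreeing with the batch majority, then a policy that ignores the prompt and always emits one fixed answer attains the maximum possible reward.

```latex
% Self-consistency reward for n answers y_1,...,y_n sampled for prompt x:
%   r(y_i) = 1[ y_i = mode(y_1,...,y_n) ]
\[
  J(\pi) \;=\; \mathbb{E}_{x}\,\mathbb{E}_{y_{1:n}\sim\pi(\cdot\mid x)}
  \left[\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\big[\,y_i=\operatorname{mode}(y_{1:n})\,\big]\right]
  \;\le\; 1 .
\]
% A constant policy \pi_c(y|x) = 1[y = c] makes every sample equal to c, so every
% indicator is 1 and J(\pi_c) = 1: the global maximum, regardless of whether the
% fixed answer c is ever correct.
```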
Mitigation strategies may be effective
The research team proposed some strategies to mitigate reward hacking, laying the foundation for effective methods to sustain continuous model improvement in the future.
(i) Early stopping: a small validation set can reliably detect the model's peak performance and prevent collapse during self-training. Across all held-out sets the peak occurs at nearly the same point in training, so any one of them works for early stopping.
(ii) Self-training with offline-generated labels: an effective approach is to generate pseudo-labels from a stable, previously fixed checkpoint rather than from the evolving policy. Doing so stabilizes training while achieving performance comparable to SRT.
(iii) Self-training with curriculum learning: the research team hypothesized that collapse happens faster on more challenging datasets, which is consistent with their empirical findings. The intuition is that on harder data, the model is more likely to abandon its pre-trained knowledge in favor of optimizing self-consistency instead of actually learning to solve the underlying task. They leveraged this hypothesis with a curriculum learning strategy, identifying the "easiest" subset of the DAPO dataset based on (a) pass rate and (b) the frequency of the majority-vote answer (see the paper for details).
Performance on these curriculum subsets reached levels comparable to standard reinforcement learning training using ground truth labels on the entire DAPO dataset. These promising results suggest that curriculum learning strategies may further extend the benefits of SRT, opening up exciting avenues for future research.
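As an illustration of criterion (b), one label-free way to rank problems is by how strongly the model's own samples agree with each other and to keep the most consistent ones. This is a sketch under our own assumptions; the paper's exact selection procedure (which also uses pass rate) may differ.

```python
from collections import Counter

def majority_agreement(answers: list[str]) -> float:
    """Fraction of sampled answers that match the most frequent answer."""
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)

def select_easy_subset(samples_per_problem: dict[str, list[str]], keep_ratio: float = 0.3) -> list[str]:
    """Keep the problems whose sampled answers agree the most (a proxy for 'easy')."""
    ranked = sorted(samples_per_problem,
                    key=lambda pid: majority_agreement(samples_per_problem[pid]),
                    reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]

# Example: problem "p1" has high self-agreement, "p2" has low self-agreement.
samples = {"p1": ["4", "4", "4", "4"], "p2": ["7", "3", "9", "11"]}
print(select_easy_subset(samples, keep_ratio=0.5))  # -> ['p1']
```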
MM-UPT: Continuous Self-Evolution for Multi-Modal Large Models
Paper Title: Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
Paper Link: https://arxiv.org/abs/2505.22453
Project Code: https://github.com/waltonfuture/MM-UPT
In recent years, multi-modal large language models have made significant progress in tasks such as visual question answering and image-text reasoning. However, further improving these already powerful foundation models typically relies on high-quality human-annotated data for supervised fine-tuning or reinforcement learning, which raises serious cost and scalability concerns. Prior work has explored unsupervised post-training, but most existing pipelines are complex, hard to iterate on, and use data inefficiently.
In this paper, the authors for the first time explored continuous self-improvement of multi-modal large models in a completely unsupervised setting through the reinforcement learning framework GRPO. They proposed a concise and efficient framework: MM-UPT (Multi-Modal Unsupervised Post-Training), and verified its effectiveness on multiple image-text mathematical reasoning benchmarks.
The core idea of MM-UPT primarily consists of two key points:
GRPO in reinforcement learning provides stable and efficient online policy optimization capabilities;
Majority voting can generate pseudo-labels for model outputs on unlabeled data, driving self-optimization.
The entire process is as follows:
Given an image and a question, the model generates multiple candidate answers;
The majority vote is used to select the most frequent answer as the "pseudo-label" for the current input;
This "pseudo-label" is used to calculate the reward, guiding the model to update according to the GRPO policy;
This entire process does not require any external supervision signals or ground truth answers, allowing the model to perform reinforcement learning based on its own "consensus" behavior, thereby achieving continuous performance improvement.
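A minimal sketch of the reward step in this loop, under our own assumptions: each candidate answer is rewarded by agreement with the majority-vote pseudo-label, and the rewards are normalized within the group in GRPO style (answer generation and the actual policy update are omitted).

```python
from collections import Counter
import statistics

def grpo_advantages_from_majority_vote(answers: list[str]) -> list[float]:
    """Turn majority-vote agreement into group-relative advantages (GRPO-style).

    Each of the G candidate answers gets reward 1 if it matches the majority-vote
    pseudo-label, 0 otherwise; advantages are the rewards normalized within the group.
    No ground-truth label is involved at any point.
    """
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if answer == pseudo_label else 0.0 for answer in answers]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero when all rewards agree
    return [(r - mean) / std for r in rewards]

# Example: four candidate answers to one (image, question) pair.
print(grpo_advantages_from_majority_vote(["12", "12", "12", "8"]))
```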
The authors conducted extensive experiments on four multi-modal mathematical reasoning benchmarks (MathVision, MathVista, We-Math, MathVerse). The results in Table 1 show:
Using the standard training sets but without any human-annotated answers, MM-UPT improves the accuracy of Qwen2.5-VL-7B on MathVista from 66.3% to 72.9%;
It surpasses previous unsupervised self-improvement methods (such as Genixer, STIC, SRLM, etc.);
Its performance even rivals supervised GRPO.
Having shown unsupervised training on standard datasets with their answers withheld, the authors then explored a more challenging question: can the model improve itself by generating its own training data? To this end, MM-UPT introduces two simple synthetic data generation strategies:
In-Context Synthesizing
Given an image, the original question, and the original answer, the model generates a new question. The generated question is structurally similar to the original question, equivalent to semantic rephrasing or conditional substitution for data augmentation.
Direct Synthesizing
Only the image is provided, and the model generates questions entirely from the image content. This yields more diverse questions but also carries some risk of hallucination. Whichever strategy is used to generate questions, MM-UPT applies majority voting to produce pseudo-labels that drive the model's reinforcement learning updates. An illustrative sketch of the two prompting strategies follows.
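The prompts for the two strategies might look roughly like the following. The wording is entirely our own illustration, not the paper's actual templates.

```python
# Hypothetical prompt templates for the two synthesis strategies (not the paper's exact wording).

IN_CONTEXT_SYNTHESIS_PROMPT = (
    "Here is an image, an original question about it, and its answer.\n"
    "Original question: {question}\n"
    "Original answer: {answer}\n"
    "Write ONE new question about the same image that tests the same kind of "
    "reasoning, e.g. by rephrasing the question or changing its conditions."
)

DIRECT_SYNTHESIS_PROMPT = (
    "Look at the image and write ONE math reasoning question that can be "
    "answered using only the information visible in the image."
)

def build_synthesis_prompt(question: str | None = None, answer: str | None = None) -> str:
    """Use the in-context template when an original QA pair is available, else the direct one."""
    if question is not None and answer is not None:
        return IN_CONTEXT_SYNTHESIS_PROMPT.format(question=question, answer=answer)
    return DIRECT_SYNTHESIS_PROMPT

print(build_synthesis_prompt("What is the area of the triangle?", "6"))
```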
The results in Table 2 show that even if the training data is entirely generated by the model itself, MM-UPT can still significantly improve multi-modal reasoning capabilities, and even surpass data using original questions in some tasks. This indicates that multi-modal large models have a certain potential for "self-questioning + self-optimization," providing a solid foundation for future paradigms that rely on AI to generate training corpora for self-evolution.
Why is MM-UPT effective? The authors explain its effectiveness with a simple example. Consider a binary classification problem on which the model answers correctly with probability $p > 1/2$. Independently sample $n$ answers from the model and take the most frequent one as the pseudo-label. Let the random variable $X$ denote the number of correct answers among the $n$ samples; then the probability that the majority vote is correct is

$$P(X > n/2) = \sum_{k > n/2} \binom{n}{k}\, p^{k} (1-p)^{\,n-k}.$$

Since $p > 1/2$, this probability exceeds $p$ for any odd $n \ge 3$.
That is: majority voting is more reliable than a single prediction. This is the rationale behind using majority voting as pseudo-labels in MM-UPT – it can construct an effective self-supervised reward signal. However, the authors also point out boundary conditions: when the model lacks prior knowledge of the task (e.g., on difficult datasets like ThinkLite-11K), majority voting can instead reinforce incorrect predictions, leading to performance degradation.
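A quick numeric check of this inequality (illustrative only): with per-sample accuracy p = 0.6 and n = 5 samples, the exact binomial tail already pushes majority-vote accuracy above 0.6.

```python
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """P(more than half of n independent answers are correct), each correct with probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

print(majority_vote_accuracy(0.6, 5))  # ≈ 0.683 > 0.6, matching the argument above
```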
Overall, MM-UPT provides a self-improvement method for the post-training phase of multi-modal large models that does not require human annotation or external reward models, demonstrating the potential of reinforcement learning in unsupervised scenarios. Future research can explore combining stronger self-evaluation mechanisms (such as LLM-as-a-Judge), complex reward design, etc., to further expand the capabilities of the MM-UPT framework.
UI-Genie: A New Framework for Efficient Self-Improvement of GUI Agents
Paper Title: UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents
Paper Link: https://arxiv.org/abs/2505.21496
Project Address: https://github.com/Euphoria16/UI-Genie
In this paper, the research team introduces a self-improvement framework called UI-Genie, aiming to address two core challenges in GUI agents: first, the difficulty of verifying trajectory results, and second, the difficulty of obtaining high-quality training data at scale. To address these two challenges, the research team proposed a reward model and a self-improvement pipeline, respectively.
This reward model, UI-Genie-RM, adopts an image-text interleaved architecture that efficiently processes historical context and unifies action-level and task-level rewards. Built around it, the UI-Genie framework can:
Eliminate manual annotation through iterative synthetic trajectory generation
Co-evolve agent and reward model through self-improvement loops
Generate high-quality datasets without human intervention
To support the training of UI-Genie-RM, the research team developed carefully designed data generation strategies, including rule-based validation, controlled trajectory corruption, and hard negative mining.
To address the second challenge, the research team designed a self-improvement pipeline that progressively enhances the capabilities of the agent and reward model through reward-guided exploration and result validation in dynamic environments, thereby expanding the range of complex GUI tasks that can be solved.
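A high-level sketch of what one such self-improvement generation could look like. Every component below (the environment rollout, the reward model, the fine-tuning step) is a hypothetical stand-in for the paper's actual pipeline, not the released code.

```python
import random

# Hypothetical sketch of one UI-Genie-style self-improvement generation.

def run_in_environment(agent, task):
    """Placeholder: roll out the GUI agent on a task and return the trajectory."""
    return {"task": task, "actions": ["tap", "scroll"], "done": random.random() > 0.5}

class RewardModel:
    def score_trajectory(self, task, trajectory) -> float:
        """Placeholder for UI-Genie-RM: score a trajectory from its interleaved image-text history."""
        return random.random()

def finetune(model, data):
    """Placeholder for a supervised fine-tuning step on newly collected data."""
    return model

def self_improvement_generation(agent, reward_model, tasks, num_rollouts=8, threshold=0.8):
    agent_data, reward_data = [], []
    for task in tasks:
        # Reward-guided exploration: sample several trajectories per task.
        rollouts = [run_in_environment(agent, task) for _ in range(num_rollouts)]
        scored = [(traj, reward_model.score_trajectory(task, traj)) for traj in rollouts]
        # Trajectories the reward model judges successful become new agent training data.
        agent_data += [(task, traj) for traj, score in scored if score >= threshold]
        # All scored trajectories (positives and hard negatives) feed the reward model,
        # so the agent and the reward model co-evolve across generations.
        reward_data += [(task, traj, score >= threshold) for traj, score in scored]
    return finetune(agent, agent_data), finetune(reward_model, reward_data)

# One generation over two toy tasks.
agent, rm = self_improvement_generation(object(), RewardModel(), ["open settings", "send a message"])
```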
For model training, the research team generated the UI-Genie-RM-517k and UI-Genie-Agent-16k datasets, which include the first reward-specific dataset for GUI agents and demonstrate that high-quality synthetic trajectories can be produced without human annotation.
UI-Genie Dataset Statistics. UI-Genie-RM-517k is the first reward dataset specifically for GUI agents, while UI-Genie-Agent-16k contains synthetic trajectories without human annotation.
Experimental results show that after three generations of self-improvement iterations on data and models, UI-Genie has achieved industry-leading levels in multiple GUI agent benchmarks. The research team has open-sourced the complete framework implementation and generated datasets to promote further research in this field.
Performance comparison of UI-Genie, Qwen2.5-VL, and UI-TARS on three benchmarks.
There are many other papers on model self-improvement. If you are also doing related research, please leave a comment to recommend your work.