AI Math Ability Skyrockets 100%, Self-Evolution Nears RL Limits! CMU's New Work Overturns Perceptions
Copyright notice: Reproduced from Xinzhiyuan. Copyright belongs to the original author. Shared for academic purposes; if this infringes your rights, please leave a message and it will be removed.
Editors: Taozi, Xiniu

[Introduction] Data depletion is becoming a new bottleneck in AI development. The CMU team proposes SRT, a method that lets LLMs self-evolve without human annotation. In the early stages of training, SRT iteratively improves mathematical and reasoning ability, with performance approaching that of conventional reinforcement learning.

The biggest stumbling block on the road to AGI is that internet data is running out.
Reasoning models such as DeepSeek-R1 and OpenAI's o-series no longer rely solely on human-annotated 'standard answers'; they break the deadlock through RL.
But here is the problem: LLMs still need humans to design the 'correctness signal' that guides training.
If a problem is so hard that even humans do not know the answer, these models are stuck.
To address this, CMU researchers, working with independent researchers, have introduced a new method called 'Self-Reward Training' (SRT), a kind of 'secret sauce' for AI self-cultivation.
Paper address: https://arxiv.org/pdf/2505.21444

The core idea is to let LLMs use their own 'self-consistency' as an intrinsic supervision signal, generating rewards with which to optimize themselves.
Put simply, the AI acts like a philosopher examining its own answers and asking: is this derivation logically consistent? Are there any flaws?
It then scores itself by the 'degree of self-consistency' of its answers and uses that score to keep improving.
Crucially, SRT needs no human-annotated data at all and can be applied naturally to 'test-time training.'
The experimental results are striking: in the early stages of training, SRT performs on par with standard reinforcement learning trained on ground-truth answers.
The research team's code is publicly available.

Address: https://github.com/tajwarfahim/srt
Self-Reward Training: AI's Secret Sauce for Self-Cultivation

Without external supervision, a model has to generate its own supervision signal.
Intuitively, if a model can pick out the higher-quality answers among several that it generates, that identified improvement can serve as a training signal.
This situation arises naturally in problems with a positive 'generation-verification gap,' such as mathematics, logical reasoning, and code generation.
A simple but effective way to exploit this gap is majority voting.
Experiments show that the majority-voted answer is more accurate than a single answer sampled from the model.
In this paper's setup, majority voting consists of three steps:
1. Sample multiple answers for each prompt;
2. Group the answers by their parsed final solution;
3. Use the most common solution (the mode) as the estimate of the true answer.
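The three majority-voting steps can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the `parse_final_answer` helper and its 'Answer:' convention are assumptions made for the example, and the sampled responses are hard-coded rather than drawn from a model.

```python
from collections import Counter


def parse_final_answer(response: str) -> str:
    """Hypothetical parser: treat the text after the last 'Answer:' as the final solution."""
    return response.rsplit("Answer:", 1)[-1].strip()


def majority_vote(responses: list[str]) -> tuple[str, float]:
    """Return the most common parsed answer and the fraction of samples that agree with it."""
    parsed = [parse_final_answer(r) for r in responses]
    answer, count = Counter(parsed).most_common(1)[0]
    return answer, count / len(parsed)


# Step 1: several sampled responses for one prompt (hard-coded instead of calling a model).
samples = [
    "2 + 2 = 4. Answer: 4",
    "Adding gives four. Answer: 4",
    "I think it is five. Answer: 5",
    "Answer: 4",
]
# Steps 2 and 3: group by parsed answer and take the mode as the estimated true answer.
estimate, agreement = majority_vote(samples)
print(estimate, agreement)
```

The agreement fraction returned alongside the answer is not required by the three steps, but it is a convenient label-free confidence signal.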
Self-Evolution Method SRT

The research team frames the model's self-improvement as a reinforcement learning task.
In this process the labels are not fixed; they are generated dynamically from the evolving model's majority-voting results.
Put simply, the model 'votes' for its own best answer and uses those answers as guidance to improve step by step.
Each round of reinforcement learning can be understood as the following steps:
1. Sample a mini-batch of prompts, then use the current model to generate n candidate answers for each prompt.
2. Use 'majority voting' to find the most common answer for each prompt and treat it as a temporary 'ground truth' (pseudo-label).
3. Check whether each generated answer agrees with the majority-voted answer; if it does, assign it a reward, expressed as r(y) = 1[answer(y) = y_majority].
4. Update the model once using this batch of data and the computed rewards.

Because the reward is defined purely through the model's self-consistency, the method plugs easily into common reinforcement learning algorithms such as PPO, RLOO, REINFORCE, and REINFORCE++.
In addition, since RL training typically already samples 16 to 64 answers per prompt, SRT adds no extra computational burden compared with label-based algorithms.
As long as the majority-voted answer is slightly better than the model's individual generations at each reinforcement learning iteration, this repeated self-reward keeps providing a useful guidance signal and helps the model improve.
The prospect of model self-improvement is exciting, but there are limits: the self-generated reward is only a proxy for underlying correctness.
Such a proxy reward can trigger 'reward hacking,' in which the model, to maximize the reward it assigns itself, produces answers that are increasingly self-consistent but possibly incorrect.

Overall, this research makes four contributions:
1. It proposes a simple and effective self-training reinforcement learning method, Self-Reward Training (SRT). The method uses the consistency among multiple solutions generated by the model to estimate correctness during reinforcement learning, providing a self-supervision signal without labeled data.
2. It shows experimentally that, in the early stages of training, SRT performs on par with standard reinforcement learning trained on ground-truth answers.
3. It analyzes the limitations of self-generated rewards, showing that the model's reward function initially correlates with correctness but can degrade until it reflects only confidence rather than true accuracy, which leads to reward hacking.
4. It proposes strategies to mitigate reward hacking, laying the groundwork for future methods of continual model improvement.
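The reward r(y) = 1[answer(y) = y_majority] from the training round described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `srt_rewards` and the toy batch are invented for the example, answers are assumed to be already parsed, and the policy update itself is left out.

```python
from collections import Counter


def srt_rewards(parsed_answers: list[str]) -> tuple[str, list[float]]:
    """Compute the pseudo-label and the binary self-consistency reward for one prompt.

    Implements r(y) = 1[answer(y) = y_majority] over the n sampled answers.
    """
    pseudo_label, _ = Counter(parsed_answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in parsed_answers]
    return pseudo_label, rewards


# Hypothetical mini-batch: prompt id -> parsed final answers of n = 4 samples.
batch = {
    "prompt_a": ["12", "12", "15", "12"],
    "prompt_b": ["x=3", "x=2", "x=3", "x=3"],
}

for prompt_id, answers in batch.items():
    label, rewards = srt_rewards(answers)
    # In a real run, (prompt, sample, reward) triples would now feed the RL update
    # (for example PPO or RLOO); that step is omitted here.
    print(prompt_id, label, rewards)
```

Note that no ground-truth label appears anywhere in the reward computation, which is what lets the same code run on unlabeled data.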
Experimental Results

What are the advantages and limitations of the newly proposed SRT algorithm?
To find out, the researchers ran a series of studies on the Qwen2.5-Math-7B model, built around four core questions:
1. How effective is SRT compared with standard reinforcement learning based on true labels? Can it generalize to unseen problems?
2. Can self-improvement keep raising performance iteratively, or does it have an inherent upper limit?
3. Which underlying factors influence the effectiveness of self-improvement?
4. How well does SRT work in practice when it is used to improve performance at test time?
Self-Training Based on Majority Voting

As shown in Figure 2, on the MATH and AIME training sets, the self-supervised SRT method achieves results comparable to reinforcement learning based on true labels, without needing any true-label signal.
Notably, the pass@1 scores in Figure 2 are all evaluated on a held-out test set, which indicates that the self-training process generalizes robustly beyond the training distribution.

The results on the DAPO dataset, however, are more complicated.
When training on DAPO, the researchers found that SRT's test-set performance initially improved at a rate comparable to standard RL based on true answers.
But at around 400-600 training steps SRT reached its peak and then began to decline, while standard RL trained on true labels kept improving.

Overall, the study found a striking and unexpected trend: even without any annotated samples, SRT's performance curve closely tracked that of RL based on ground-truth answers in the early stages of training.
Within statistical error, SRT's peak test pass@1 scores on the MATH and AIME'83-AIME'23 datasets were essentially on par with supervised RL.
On the more challenging DAPO dataset, SRT still reached 75% of RL's final performance.
Furthermore, across all three training sets, SRT's peak performance was approximately a 100% relative improvement over the base model.
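To make the 'approximately 100% relative improvement' figure concrete, relative improvement can be read as follows. The symbols for the base model's and the peak pass@1 are notation introduced here for illustration; they are not taken from the paper.

```latex
\[
\text{relative improvement}
  = \frac{p_{\text{peak}} - p_{\text{base}}}{p_{\text{base}}},
\qquad
\text{relative improvement} = 100\%
  \iff p_{\text{peak}} = 2\,p_{\text{base}}.
\]
```

Under this reading, a roughly 100% relative gain means the peak pass@1 is about double the base model's pass@1.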
Analysis of Anomalous Phenomena After the SRT Performance Peak

After SRT reached its performance peak on the DAPO training set (see Figure 2), the researchers observed that its test accuracy began to deteriorate significantly.
In fact, a similar and clear performance collapse also occurred when training on the MATH-12k dataset for more than two epochs.
The authors give a simple and precise theoretical explanation for this behavior:
The reinforcement learning problem defined by the SRT objective explicitly rewards consistency among outputs, regardless of whether they are correct.
The optimal policy under this objective is therefore to produce the same response whatever the input, which artificially earns the maximum possible reward.
It is thus natural to expect that continued training under such a proxy objective can lead to this degenerate solution, especially when optimizing the proxy is easier than learning to solve the actual task.
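A small sketch shows why the consistency objective admits a degenerate optimum. The two 'policies' below are hard-coded answer lists invented for illustration, and the true answer is used only to compare accuracy; it never enters the reward.

```python
from collections import Counter


def mean_srt_reward(parsed_answers: list[str]) -> float:
    """Average self-consistency reward: fraction of samples matching the majority answer."""
    _, count = Counter(parsed_answers).most_common(1)[0]
    return count / len(parsed_answers)


def accuracy(parsed_answers: list[str], truth: str) -> float:
    """Fraction of samples matching the true answer (needs a label, used only for comparison)."""
    return sum(a == truth for a in parsed_answers) / len(parsed_answers)


# Hypothetical samples for one prompt whose true answer is "42".
honest_policy = ["42", "42", "41", "42", "40", "42", "42", "39"]  # mostly right, some disagreement
collapsed_policy = ["0"] * 8  # same output every time, always wrong

for name, answers in [("honest", honest_policy), ("collapsed", collapsed_policy)]:
    print(name, mean_srt_reward(answers), accuracy(answers, "42"))
```

The collapsed policy earns the maximum reward of 1.0 with zero accuracy, while the partly correct policy earns less, which is exactly the incentive the explanation above describes.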
Test-Time Self-Improvement

One appealing application of self-training is improving model accuracy through test-time training.
Applying SRT at test time is very simple: treat the unlabeled test set as a training dataset and run SRT on it directly.
The researchers then compared majority-voting performance after SRT test-time training with performance without any test-time training.
As shown in Figure 4, under the maj@32 metric, test-time training with SRT brought a limited but still noticeable improvement over the conventional baseline of applying majority voting directly to the base model's outputs.
Furthermore, on larger test datasets, the gain over the base model's majority voting was even more pronounced.
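For readers unfamiliar with the maj@k metric mentioned above, here is a minimal sketch of how it can be computed. The function name, the toy answers, and the labels are assumptions for illustration, and k = 4 is used only to keep the example short.

```python
from collections import Counter


def maj_at_k(samples_by_prompt: dict[str, list[str]], truth: dict[str, str], k: int) -> float:
    """maj@k: accuracy of the majority answer over the first k samples of each prompt."""
    correct = 0
    for prompt, answers in samples_by_prompt.items():
        voted = Counter(answers[:k]).most_common(1)[0][0]
        correct += voted == truth[prompt]
    return correct / len(samples_by_prompt)


# Hypothetical parsed answers (k = 4 here for brevity; the article reports maj@32).
samples = {"t1": ["2", "2", "3", "2"], "t2": ["x", "y", "y", "z"], "t3": ["5", "6", "6", "6"]}
truth = {"t1": "2", "t2": "x", "t3": "6"}
print(maj_at_k(samples, truth, k=4))
```

Labels are needed only to score the metric; the test-time training itself uses none.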
Why Doesn't Test-Time Training Cause Performance Collapse?

Interestingly, manual inspection of the model's outputs after test-time training showed that, although its predictions for almost every test prompt had degenerated into a single response (the optimal behavior under the SRT objective), test accuracy remained high.
The researchers speculate that the stability of test-time self-training comes down to a critical difference in dataset scale.
The AIME24 test set, for example, contains only 30 samples to self-improve on.
With so few samples, the model quickly converges to stable majority-voted answers on them by reinforcing specific CoT derivations.
Once it has converged, SRT no longer receives a meaningful gradient signal with which to update the parameters, so test-time performance naturally stabilizes.

By contrast, during regular training on large-scale datasets, a continuous stream of new samples keeps pushing the model to over-optimize for consistency.
Under these conditions the model tends to adopt an over-simplified generalization strategy (generating identical answers) and ultimately collapses, producing a single prediction unrelated to the prompt.
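The vanishing training signal after convergence can be illustrated with a leave-one-out baseline of the kind used in RLOO. This is an assumption chosen for the sketch; the article does not say which advantage estimator the authors used in these runs.

```python
def leave_one_out_advantages(rewards: list[float]) -> list[float]:
    """RLOO-style advantage: each reward minus the mean reward of the other samples.

    Requires at least two samples per prompt.
    """
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Before convergence: samples disagree, so rewards differ and advantages are non-zero.
print(leave_one_out_advantages([1.0, 1.0, 0.0, 1.0]))
# After convergence: every sample matches the majority answer, so all advantages are zero
# and a policy-gradient update weighted by them leaves the parameters unchanged.
print(leave_one_out_advantages([1.0, 1.0, 1.0, 1.0]))
```

On a 30-prompt test set this all-agree state is reached quickly for every prompt, which fits the stabilization described above.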
Can Large Model Collapse Be Avoided?

As noted above, the optimization objective of Self-Reward Training (SRT) can deliver significant gains early on but eventually triggers model collapse.
To address this, the researchers explored the following complementary strategies for tackling collapse and raising the ceiling of self-training performance:
1. Early stopping strategy: use a small amount of labeled validation data to monitor the model and stop training in time to prevent collapse;
2. Algorithmic strategy: reduce the risk of collapse at its root by using pseudo-labels generated by a stable base model rather than by a continuously updated model;
3. Data-driven curriculum learning strategy: go beyond simple early stopping and improve performance through a progressive learning mechanism.
Early Stopping Strategy

In the experiments, even a small amount of labeled validation data was enough to identify the performance peak during self-training and so avoid the risk of model collapse.
As shown in Figure 6, by monitoring training on the DAPO dataset and evaluating on multiple test sets, the authors observed a key phenomenon:
The performance peaks on the different held-out test sets all appeared at similar training steps.
This regularity means that any one of these held-out sets can be used to make the early stopping decision.
Specifically, the vertical dashed line in Figure 6 marks the early stopping point chosen using only 1% of the DAPO data as a validation set; at that point, the model's performance on all other evaluation datasets remained close to optimal.
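A minimal sketch of such an early-stopping rule is given below. The patience-based criterion, the function name, and the validation scores are all assumptions for illustration; the article states only that a small labeled validation set was used to find the peak.

```python
def early_stopping_step(val_scores: list[float], patience: int = 2) -> int:
    """Return the index of the checkpoint to keep.

    Stops once the validation score has failed to improve for `patience`
    consecutive evaluations and returns the index of the best score seen so far.
    """
    best_idx = 0
    for i, score in enumerate(val_scores):
        if score > val_scores[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            break  # no improvement for `patience` evaluations: stop training here
    return best_idx


# Hypothetical pass@1 on a small labeled validation split, one value per evaluation.
history = [0.20, 0.28, 0.33, 0.36, 0.35, 0.31, 0.22, 0.10]
print(early_stopping_step(history))
```

Because the peaks on different held-out sets coincide, the checkpoint selected on the small validation split should also be near-optimal on the other benchmarks.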
Algorithmic Strategy

The root cause of model collapse is that SRT (Self-Reward Training) emphasizes consistency over correctness: even when the output is wrong, agreement among the model's own outputs keeps being reinforced.
To address this, the researchers proposed a simple and effective fix: generate pseudo-labels from a stable, fixed checkpoint rather than from the continuously updated policy.
In practice, they used the Qwen2.5-Math-7B base model to generate pseudo-labels by majority voting, then stored these offline labels for use in subsequent reinforcement learning training.
Figure 7 shows that using such offline labels not only improved training stability significantly but also achieved performance comparable to SRT.
This finding has an important implication: dynamically updating pseudo-labels during training (online labeling) does not necessarily bring a significant advantage and may instead be a source of training instability.
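The difference between online and offline pseudo-labels can be sketched as follows. The function names and toy answers are assumptions for illustration; sampling from the frozen base checkpoint is replaced by hard-coded parsed answers.

```python
from collections import Counter


def build_offline_labels(base_samples: dict[str, list[str]]) -> dict[str, str]:
    """Majority-vote the frozen base model's answers once and store one pseudo-label per prompt."""
    return {
        prompt: Counter(answers).most_common(1)[0][0]
        for prompt, answers in base_samples.items()
    }


def offline_rewards(prompt: str, policy_answers: list[str], labels: dict[str, str]) -> list[float]:
    """Reward current-policy samples against the stored label, not against their own majority."""
    return [1.0 if a == labels[prompt] else 0.0 for a in policy_answers]


# Hypothetical parsed answers sampled once from the fixed base checkpoint.
base_samples = {"q1": ["7", "7", "9", "7"], "q2": ["a", "b", "b", "b"]}
labels = build_offline_labels(base_samples)  # computed once, then reused for all of training

# Later in training the policy has drifted toward "9" on q1; the stored label does not follow it.
print(labels)
print(offline_rewards("q1", ["9", "9", "9", "7"], labels))
```

Since the label no longer moves with the policy, agreeing with itself on a new answer earns the policy nothing, which removes the shortcut that leads to collapse.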
Curriculum Learning Strategy

The researchers also put forward a key hypothesis: model collapse happens faster when training on more challenging datasets.
The underlying mechanism is that, faced with very difficult data, the model is more likely to abandon its pre-trained knowledge and earn reward by optimizing self-consistency instead of actually learning to solve the task.
Based on this hypothesis, the researchers applied curriculum learning by filtering out the 'easiest' subset of the DAPO dataset for training.
Specifically, they kept the easiest third of the prompts, ranked by either of the following two metrics:
1. Base model pass rate (requires true labels)
2. Majority voting frequency (does not require true labels)
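The label-free variant of this filter (majority voting frequency) can be sketched as follows. The function names and the six-prompt pool are assumptions for illustration, and base-model sampling is replaced by hard-coded parsed answers.

```python
from collections import Counter


def majority_frequency(parsed_answers: list[str]) -> float:
    """Label-free difficulty proxy: share of samples that agree with the majority answer."""
    return Counter(parsed_answers).most_common(1)[0][1] / len(parsed_answers)


def easiest_third(samples_by_prompt: dict[str, list[str]]) -> list[str]:
    """Keep the third of prompts with the highest majority-vote frequency."""
    ranked = sorted(
        samples_by_prompt,
        key=lambda p: majority_frequency(samples_by_prompt[p]),
        reverse=True,
    )
    keep = max(1, len(ranked) // 3)
    return ranked[:keep]


# Hypothetical base-model samples (parsed answers) for six prompts.
pool = {
    "p1": ["3", "3", "3", "3"],  # frequency 1.00
    "p2": ["5", "5", "5", "6"],  # frequency 0.75
    "p3": ["1", "2", "1", "2"],  # frequency 0.50
    "p4": ["8", "9", "7", "6"],  # frequency 0.25
    "p5": ["4", "4", "2", "1"],  # frequency 0.50
    "p6": ["0", "0", "0", "1"],  # frequency 0.75
}
print(easiest_third(pool))
```

Ranking by base model pass rate would use the same selection logic with a different key, at the cost of requiring true labels.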
As shown in Figure 8, training on these easier subsets significantly delayed the onset of reward hacking, allowing the model to keep improving over multiple training epochs.
Notably, with the curriculum learning strategy, the model's final performance was comparable to standard reinforcement learning trained with true labels on the entire DAPO dataset.
These results suggest that curriculum learning could further extend how far SRT remains effective, opening new directions for future research.
Main Tag: Artificial Intelligence
Sub Tags: Large Language Models, Mathematical Reasoning, Self-Improvement, Reinforcement Learning
Original URL: https://mp.weixin.qq.com/s/PVi5J3pX9IdcwKgl0CXihA