The MLNLP community is a well-known machine learning and natural language processing community in China and abroad, whose members include NLP master's and doctoral students, university faculty, and industry researchers. The community's vision is to promote exchange and progress among academia, industry, and enthusiasts in natural language processing and machine learning, especially for beginners.
Source | Synced Review
The first author of this article, Yiping Wang, is a Ph.D. student at the University of Washington; his advisor and corresponding author, Shaolei Du, is an Assistant Professor at the University of Washington; the other two corresponding authors, Yelong Shen and Shuohang Wang, are Principal Researchers at Microsoft GenAI.
Recently, large language models (LLMs) have made significant progress in reasoning, especially on complex mathematical tasks. One key method driving this progress is Reinforcement Learning with Verifiable Reward (RLVR), which provides a 0-1 outcome reward based on whether the final answer to a math problem is correct. However, much of the research effort has focused on improving existing reinforcement learning algorithms (such as PPO and GRPO), while the data used in RLVR remains relatively underexplored.
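To make the 0-1 outcome reward concrete, here is a minimal Python sketch of a verifiable reward function for math problems. The \boxed{} extraction and exact string match are simplifications of the author's setup: real RLVR pipelines typically use more robust answer parsing and symbolic equivalence checking, and the function names here are hypothetical.

```python
# Minimal sketch of a 0-1 verifiable outcome reward for math problems.
# extract_boxed_answer and the exact-match check are illustrative simplifications.
import re

def extract_boxed_answer(text: str) -> str | None:
    """Return the content of the last \\boxed{...} in the model output, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def outcome_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the final answer matches the reference answer, else 0.0."""
    predicted = extract_boxed_answer(model_output)
    return 1.0 if predicted is not None and predicted == ground_truth.strip() else 0.0

# Example usage:
print(outcome_reward("... so the answer is \\boxed{42}.", "42"))  # 1.0
print(outcome_reward("... so the answer is \\boxed{41}.", "42"))  # 0.0
```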
Recently, researchers from the University of Washington, Microsoft, and other institutions explored an important question: How much data is actually needed in RLVR to achieve good performance?
They discovered a surprising phenomenon: using just a single mathematical training example can significantly improve the model's performance on a wide range of mathematical reasoning tasks!
Paper Title: Reinforcement Learning for Reasoning in Large Language Models with One Training Example
Paper URL: https://arxiv.org/abs/2504.20571
Code URL: https://github.com/ypwang61/One-Shot-RLVR
W&B Experiment Log: https://wandb.ai/yipingwanguw/verl_few_shot?nw=nwuseryipingwang22
X(Twitter): https://x.com/ypwang61/status/1917596101953348000
The paper found that using only one training example in RLVR (referred to as 1-shot RLVR) improves the performance of Qwen2.5-Math-1.5B on MATH500 from 36.0% to 73.6%, and that of Qwen2.5-Math-7B from 51.0% to 79.2%.
This is comparable to RLVR on a 1.2k-example dataset (called DSR-sub, which contains this one example). Using two training examples even slightly surpasses the result with the 1.2k DSR-sub dataset and is comparable to RLVR with the 7.5k-example MATH training set. These gains are observed across six common mathematical reasoning benchmarks.
The reasoning gains elicited by 1-shot RLVR with a single mathematical training example even extend to non-mathematical reasoning tasks, such as ARC-Easy/Challenge.
Background Introduction
In this work, the paper uses three loss terms: a policy gradient loss, a KL divergence loss, and an entropy loss. The policy loss follows the GRPO formulation and corresponds to the 0-1 outcome reward for solving the math problem; the KL loss is used to maintain the model's language quality on general tasks; and the entropy loss (with a negative coefficient) encourages the model to generate more diverse reasoning patterns.
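As a rough illustration of how these three terms might combine, below is a simplified sketch for one group of rollouts of a single prompt. It omits GRPO's importance-ratio clipping, uses a crude per-token KL and entropy approximation, and the coefficient values are placeholders; it is not the paper's actual implementation.

```python
# Illustrative sketch (not the paper's training code) of the three loss terms:
# GRPO-style policy gradient loss + KL loss + entropy loss with a negative coefficient.
import torch

def rlvr_loss(logprobs, ref_logprobs, rewards, kl_coef=1e-3, ent_coef=1e-3):
    """
    logprobs, ref_logprobs: (G, T) log-probs of the sampled tokens under the
        current policy and a frozen reference model, for G rollouts of length T.
    rewards: (G,) 0-1 outcome rewards, one per rollout in the group.
    """
    # GRPO-style advantage: normalize the rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)          # (G,)

    # Policy-gradient term: raise log-probs of tokens from above-average rollouts.
    pg_loss = -(adv.unsqueeze(1) * logprobs).mean()

    # KL term against the reference model (a simple per-token approximation),
    # used to preserve general language quality.
    kl_loss = (logprobs - ref_logprobs).mean()

    # Crude entropy estimate from the sampled tokens' negative log-probs.
    entropy = (-logprobs).mean()

    # The entropy term enters with a negative coefficient, so minimizing the
    # total loss increases entropy and encourages more diverse reasoning paths.
    return pg_loss + kl_coef * kl_loss - ent_coef * entropy
```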
For data selection, the researchers use a metric called the historical variance score to rank the data in the pool (the aforementioned 1.2k DSR-sub dataset), prioritizing examples whose training accuracy varies the most over the course of training. However, the paper emphasizes that this selection criterion is not necessarily optimal; it is used mainly to illustrate the phenomenon clearly. Moreover, 1-shot RLVR is effective even on examples with lower historical variance scores, suggesting that it may be a fairly general phenomenon.
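A minimal sketch of what such a ranking could look like, assuming per-example training accuracy has been logged at several checkpoints of a preliminary RLVR run on the pool (e.g., DSR-sub); the paper's exact scoring details may differ.

```python
# Hypothetical sketch of ranking examples by "historical variance score":
# examples whose training accuracy fluctuates most across checkpoints come first.
import numpy as np

def rank_by_historical_variance(acc_history: np.ndarray) -> np.ndarray:
    """
    acc_history: (num_examples, num_checkpoints) array, where entry (i, t) is
        the average rollout accuracy of example i at training checkpoint t.
    Returns example indices sorted from highest to lowest accuracy variance.
    """
    variance = acc_history.var(axis=1)   # historical variance per example
    return np.argsort(-variance)         # descending order

# Example: 4 examples tracked over 5 checkpoints.
history = np.array([
    [0.0, 0.2, 0.5, 0.8, 1.0],   # accuracy rises a lot -> high variance
    [1.0, 1.0, 1.0, 1.0, 1.0],   # always solved        -> zero variance
    [0.0, 0.0, 0.0, 0.1, 0.0],   # almost never solved  -> low variance
    [0.3, 0.6, 0.4, 0.9, 0.7],
])
print(rank_by_historical_variance(history))  # [0 3 2 1]
```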
In addition, the researchers found that the examples on which 1-shot RLVR works very well are not particularly difficult: the initial model already solves them with some nontrivial probability.
Experimental Observations
Through 1-shot RLVR, the paper also found many interesting phenomena:
(1) Saturation then generalization: In 1-shot RLVR, training accuracy on the single training example quickly approaches 100%, yet performance on downstream tasks continues to improve as training progresses. (The paper later explains that because the entropy loss encourages diverse exploration, training accuracy stays slightly below 100%, so a nonzero policy gradient is maintained throughout training; a small numerical sketch of this effect follows the list of observations below.)
Meanwhile, during this saturation-then-generalization process, overfitting appears relatively late: obviously garbled output mixed with correct answers only emerges after the single example has been rolled out more than a million times. Even at that point, the model's reasoning outputs on downstream tasks remain normal and perform well.
(2) 1-shot RLVR is effective on many math examples and generalizes well. The paper tested more than a dozen examples, and most of them yield improvements of close to or above 30% on MATH500. Moreover, a single training example from one math topic (such as geometry) can simultaneously improve performance on other topics (such as algebra and number theory).
(3) More self-reflection: The 1-shot RLVR training process also shows increasing response length, as reported in prior work such as DeepSeek-R1. More importantly, the paper observes an increasing frequency of self-reflection-related vocabulary in the model's outputs on downstream tasks (a simple frequency count of this kind is sketched after this list).
(4) 1-shot RLVR works across different models and algorithms. The researchers tested several models (Qwen2.5-Math-1.5B/7B, Llama-3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B) and different RL algorithms (GRPO, PPO), and observed significant improvements in all cases. Notably, the data used here was selected based on the historical variance score computed with Qwen2.5-Math-1.5B, indicating that examples selected with one model can transfer to others.
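To make observation (1) more concrete, the following toy computation shows why a fully saturated group of rollouts provides essentially no GRPO learning signal, and hence why keeping some diversity (for example via the entropy term) matters. This is an illustrative simplification under the group-normalized advantage described earlier, not the paper's analysis code.

```python
# Toy illustration: when every rollout in a group is correct (all rewards = 1),
# the group-normalized advantage collapses to ~0 and the policy gradient vanishes;
# a little diversity in outcomes keeps the gradient signal alive.
import torch

def grpo_advantage(rewards: torch.Tensor) -> torch.Tensor:
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

all_correct = torch.tensor([1.0, 1.0, 1.0, 1.0])   # saturated group
mixed       = torch.tensor([1.0, 1.0, 1.0, 0.0])   # one differing rollout

print(grpo_advantage(all_correct))  # ~[0, 0, 0, 0] -> no learning signal
print(grpo_advantage(mixed))        # [0.5, 0.5, 0.5, -1.5] -> gradient still flows
```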
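For observation (3), counting reflection-related words in downstream outputs is straightforward to implement; the keyword list below is an assumption for illustration rather than the paper's exact vocabulary.

```python
# Sketch of measuring self-reflection vocabulary frequency in model outputs.
# REFLECTION_WORDS is a hypothetical keyword list, not the paper's exact set.
REFLECTION_WORDS = ("rethink", "recheck", "recalculate", "re-examine", "wait")

def reflection_frequency(responses: list[str]) -> float:
    """Average number of reflection keywords per response."""
    counts = [
        sum(r.lower().count(w) for w in REFLECTION_WORDS)
        for r in responses
    ]
    return sum(counts) / max(len(counts), 1)

# Example usage on two hypothetical responses:
print(reflection_frequency([
    "Let me recheck the computation ... wait, I should recalculate step 2.",
    "The answer is 12.",
]))  # 1.5
```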
Ablation Studies and Analysis
The paper further analyzes where the gains from 1-shot RLVR come from. By removing the other loss terms, it finds that the improvement stems primarily from the policy gradient loss and is not significantly related to the KL divergence loss or weight decay. Therefore, although the saturation-then-generalization phenomenon resembles "grokking" (both show good downstream generalization even after overfitting), there are still important differences, since grokking is strongly influenced by regularization methods such as weight decay.
In addition, the paper highlights the importance of encouraging exploration. For example, adding an appropriate amount of entropy loss on top of the policy gradient loss further improves 1-shot RLVR, especially in the saturation-then-generalization regime. As an additional observation, training with only the entropy loss for a small number of steps can, surprisingly, also improve model performance, and relatedly, 1-shot RLVR still yields partial improvements even when the training example's label is incorrect. The authors are still investigating the reasons for this phenomenon.
Summary and Discussion
The performance of 1-shot RLVR on mathematical tasks supports the conclusion of several earlier papers that the base models used for RLVR often already possess strong reasoning capabilities; this paper further shows that such capabilities can potentially be unlocked with very little data.
The paper argues that these phenomena should encourage the community to reflect further on recent progress in RLVR and to examine its internal mechanisms. They also offer insights on questions such as how to design better data selection algorithms for RLVR, how to understand 1-shot RLVR and the saturation-then-generalization phenomenon, how to better encourage exploration, and how to extend few-shot RLVR to other tasks and applications.
Technical Exchange Group Invitation
Scan the QR code to add the assistant on WeChat, and include a note in the format Name-School/Company-Research Direction (e.g., Xiaozhang-Harbin Institute of Technology-Dialogue Systems) to apply to join technical exchange groups on topics such as natural language processing and PyTorch.
About Us
The MLNLP community is a grassroots academic community jointly established by machine learning and natural language processing scholars from China and abroad. It has grown into a well-known community in the field, aiming to promote progress among academia, industry, and enthusiasts in machine learning and natural language processing.
The community provides an open exchange platform for practitioners' further study, employment, and research. Everyone is welcome to follow and join us.