The MLNLP community is a well-known machine learning and natural language processing community in China and abroad, covering NLP master's and doctoral students, university faculty, and industry researchers.
The community's vision is to promote communication and progress between academia, industry, and enthusiasts in natural language processing and machine learning at home and abroad, especially the progress of beginner students.
Source | Zhihu
Author | 还可以
Let's summarize the progress of RL-reasoning in the past few months, along with some minor tidbits concerning the release dates of important papers.
This summary certainly has shortcomings and omissions, and I hope everyone can supplement and point them out.
We roughly divide this period into the rise, calm, and setbacks of RL-reasoning. We might focus more on the descriptions of the calm and setbacks!
Rise
1. GRPO's Make-RL-Great-Again
Using rule-based outcome rewards as the feedback signal and abandoning PRM (process rewards for intermediate steps), GRPO achieved exciting results. (See also the REINFORCE++ reproductions, ReMax, PRIME, and other related work.)
Much of the subsequent related work modifies the GRPO objective:
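For reference, here is a minimal sketch of the GRPO objective with group-normalized outcome rewards (a paraphrase of the formulation in the DeepSeekMath paper, not an exact transcription):

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta)=
\mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\Big(\min\big(r_{i,t}(\theta)\hat{A}_{i,t},\;
\mathrm{clip}\big(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\hat{A}_{i,t}\big)
-\beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta\,\|\,\pi_{\mathrm{ref}}\big]\Big)\right],

r_{i,t}(\theta)=\frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,o_{i,<t})},
\qquad
\hat{A}_{i,t}=\frac{R_i-\mathrm{mean}(\{R_j\}_{j=1}^{G})}{\mathrm{std}(\{R_j\}_{j=1}^{G})}
```

Most of the modifications below touch the clipping range, the advantage normalization, or the per-response length factor in this objective.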
2. DAPO's simple and effective trick stacking
DAPO adds four small tricks on top of GRPO.
Clip-Higher:
The authors decouple the upper and lower clipping ranges and raise the upper bound, which gives low-probability tokens more room to be explored.
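A minimal sketch of the decoupled clipping in code (illustrative only; the epsilon values and tensor shapes are assumptions, not DAPO's released implementation):

```python
import torch

def clip_higher_surrogate(logp_new, logp_old, advantages,
                          eps_low=0.2, eps_high=0.28):
    """Decoupled PPO-style clipping in the spirit of DAPO's Clip-Higher.

    eps_high > eps_low widens only the upper clip bound, so low-probability
    tokens with positive advantage can be pushed up more aggressively.
    """
    ratio = torch.exp(logp_new - logp_old)                    # per-token importance ratio
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    loss = -torch.min(ratio * advantages, clipped * advantages)
    return loss.mean()
```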
Dynamic Sampling:
Existing RL algorithms run into vanishing gradients on prompts where every rollout is correct (accuracy 1) or every rollout is wrong (accuracy 0), since the group-normalized advantage is then zero. DAPO filters out such prompts with a dynamic sampling strategy, ensuring that every prompt in a batch carries a valid gradient signal.
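A minimal sketch of that filter (assuming 0/1 outcome rewards per rollout; names are illustrative):

```python
def keep_informative_prompts(prompt_groups):
    """DAPO-style dynamic sampling sketch.

    prompt_groups: list of (prompt, rewards), where rewards are the 0/1 outcome
    rewards of the G rollouts sampled for that prompt. Groups in which every
    rollout is correct or every rollout is wrong have zero group-normalized
    advantage, so they are dropped (and, in DAPO, resampled until the batch
    is full of prompts with a usable gradient signal).
    """
    kept = []
    for prompt, rewards in prompt_groups:
        accuracy = sum(rewards) / len(rewards)
        if 0.0 < accuracy < 1.0:          # mixed outcomes -> non-zero advantage
            kept.append((prompt, rewards))
    return kept
```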
Token-Level:
GRPO uses a sample-level loss, so tokens in long responses contribute less to the overall loss; DAPO switches to a token-level policy-gradient loss.
In practice, though, this trick had already been adopted by most people beforehand.
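To make the difference concrete, here is a minimal sketch of the two aggregation schemes (token_losses is a [batch, seq] tensor of per-token losses and mask marks response tokens; both names are assumptions):

```python
import torch

def sample_level_loss(token_losses, mask):
    """GRPO-style aggregation: average within each response first, then across
    responses, so every response counts equally regardless of its length."""
    per_sample = (token_losses * mask).sum(dim=1) / mask.sum(dim=1)
    return per_sample.mean()

def token_level_loss(token_losses, mask):
    """DAPO-style aggregation: average over all tokens in the batch, so each
    token of a long response weighs the same as a token of a short one."""
    return (token_losses * mask).sum() / mask.sum()
```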
Overlong Reward Shaping:
DAPO proposes a soft overlong-penalty mechanism: within a length-aware buffer interval, the penalty on overly long responses grows gradually, which reduces reward noise and stabilizes training.
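A minimal sketch of such a length-aware soft penalty (the threshold and buffer values are illustrative, not the paper's settings):

```python
def soft_overlong_penalty(length, max_len=20480, buffer=4096):
    """Length-aware soft penalty in the spirit of DAPO's overlong reward shaping.

    - Up to max_len - buffer tokens: no penalty.
    - Inside the buffer interval: penalty ramps linearly from 0 down to -1.
    - Beyond max_len: full penalty of -1.
    The returned value is added to the response's outcome reward.
    """
    threshold = max_len - buffer
    if length <= threshold:
        return 0.0
    if length <= max_len:
        return (threshold - length) / buffer   # linear ramp in (-1, 0]
    return -1.0
```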
3. Dr.GRPO's objective modification
The Dr.GRPO authors argue, by deriving GRPO from the policy gradient, that the objective should include neither the std normalization in the advantage nor the per-response length normalization 1/|o_i|.
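In notation matching the sketch above (a paraphrase of the Dr.GRPO change, not the paper's exact statement):

```latex
\hat{A}_{i,t}^{\mathrm{Dr.GRPO}} = R_i-\mathrm{mean}(\{R_j\}_{j=1}^{G})
\qquad \text{(no division by std)}
```

and the 1/|o_i| factor is dropped from the loss, so tokens in long responses are no longer down-weighted relative to tokens in short ones.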
A personal comment here: although the authors provide some proofs, my own reproduction actually saw performance drop, especially after removing the std.
4. GPG's simplification operation
GPG uses a plain policy-gradient method throughout, removing the remaining PPO-style mini-tricks.
As you can see, it is very much simplified. Of course, it still needs a few small fixes, such as handling GRPO's zero-advantage groups and the awkward std normalization.
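A minimal sketch of what such a stripped-down objective can look like (an illustration of a plain REINFORCE-style loss with a group-normalized advantage, not GPG's exact formulation):

```python
import torch

def plain_group_policy_gradient_loss(logp, advantages, mask):
    """REINFORCE-style loss with a group-normalized advantage and nothing else:
    no importance ratio, no clipping, no KL term, no reference model.

    logp:       [batch, seq] log-probs of the sampled tokens under the policy.
    advantages: [batch, 1] group-normalized outcome advantages, broadcast over tokens.
    mask:       [batch, seq] 1 for response tokens, 0 elsewhere.
    """
    per_token = -logp * advantages                 # -log pi(o_t) * A
    return (per_token * mask).sum() / mask.sum()
```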
The final results are shown in the figure below. The authors also note, in italics, that Dr.GRPO's modification had no effect in their experiments.
Others
Research on more efficient reasoning (the competition is fierce; at one point 10+ papers appeared on arXiv within two days): reasoning-length optimization, think or no-think.
High-quality sample screening.
Calm: Some research uncovers limitations of RL-reasoning, and some of the "improvement" methods bring rather hard-to-swallow findings.
(Limitations) RL-reasoning does not bring additional capabilities to the model.
Here, we first introduce this paper from Tsinghua University, "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?"
The authors found that the reasoning paths generated by RLVR-trained models already exist with considerable probability density in the base model's output distribution: problems the RL model can solve can also be solved by the base model, given enough samples.
Furthermore, under the Pass@k metric (intuitively, the probability of passing at least once when given k chances), at large k the capability ceiling of RL-trained models is actually lower than that of the original base model. In other words, RL mainly improves sampling efficiency.
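For reference, the standard unbiased Pass@k estimator used for this kind of comparison (the combinatorial form popularized by the Codex paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: given n sampled solutions of which c are correct, the
    expected probability that at least one of k randomly drawn samples is
    correct, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:              # fewer than k incorrect samples: always passes
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```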
I was also surprised and a little dejected by this phenomenon when I first got into RL-reasoning, but only after being reminded by others did I realize that DeepSeekMath had already reported the same situation (as shown in the figure below).
In any case, at this point, RL as a means to improve sampling efficiency is still very valuable. The following progress is a bit....
(Label-free training) No external labels needed
Let's explain in chronological order.
EMPO:
EMPO does not rely on true ground-truth labels; instead, it clusters the model's responses and assigns rewards based on the cluster each response belongs to.
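A minimal sketch of a cluster-frequency reward of this kind (a simplification: clustering here is exact match on the extracted answer, whereas EMPO clusters by semantic equivalence):

```python
from collections import Counter

def cluster_frequency_rewards(answers):
    """Label-free, EMPO-flavored reward sketch: group the sampled responses by
    their extracted final answer and reward each response with the empirical
    frequency of its group, pushing the policy toward its own low-entropy,
    majority answers. No ground-truth label is used anywhere."""
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]
```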
Model performance is shown in the figure below. Personally I feel the baseline was not trained well, but the goal of demonstrating that the method works is achieved. The method is in fact quite close to a form of entropy minimization; it is a pity it was not framed directly as entropy minimization, otherwise several later papers might never have been published.
TTRL:
TTRL builds its reward from majority voting (major@k). Compared with EMPO's clustering approach this is more limited, since it is hard to apply when there is no fixed-form answer, a case EMPO can handle; otherwise the two are almost the same.
I don't like its emphasis on test-time: in terms of cost-effectiveness the result is no better than simply doing major@k directly, and I'm more curious how it would perform when trained on a training set.
Also, an earlier DPO-style idea had already used major@k results as pseudo-positive labels, just without mentioning test-time.
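Both recipes boil down to a majority-vote pseudo-label; a minimal sketch (answer extraction and normalization omitted):

```python
from collections import Counter

def majority_vote_rewards(answers):
    """major@k pseudo-label reward sketch: treat the most frequent extracted
    answer among the k rollouts as the pseudo ground truth and reward each
    rollout 1 if it matches, 0 otherwise. Contrast this hard vote with the
    soft cluster-frequency reward sketched above for EMPO."""
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == pseudo_label else 0.0 for a in answers]
```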
Interestingly, there is a paper, "Can Large Reasoning Models Self-Train?", which is basically identical to TTRL except that it trains on the training set and tests on the test set. I wonder how reviewers will tell Li Kui from Li Gui (the real one from the lookalike)...
By this point, though, an understanding had started to emerge: the model seems not to need external answers; can it rely on its own outputs alone? But at the time the picture was not yet clear.
Entropy Minimization:
This paper studies using entropy as the training objective, split into three settings:
• EM-FT: minimize token-level entropy on unlabeled outputs sampled from the model itself (similar in spirit to SFT); a minimal sketch of the entropy term follows the list.
• EM-RL: reinforcement learning with negative entropy as the sole reward (so entropy is minimized).
• EM-INF: Inference-time logit adjustment to reduce entropy, without any training data or parameter updates.
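A minimal sketch of the token-level entropy term involved (EM-FT minimizes it directly on the model's own sampled outputs; EM-RL uses its negative as a reward; tensor names are assumptions):

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits, mask):
    """Average predictive entropy over the generated tokens.

    logits: [batch, seq, vocab] from the model on its own unlabeled outputs.
    mask:   [batch, seq] marking response tokens.
    Minimizing this value (no external labels anywhere) is the EM-FT-style
    objective; using its negative as a reward gives the EM-RL-style setup.
    """
    probs = F.softmax(logits, dim=-1)
    logprobs = F.log_softmax(logits, dim=-1)
    entropy = -(probs * logprobs).sum(dim=-1)      # [batch, seq]
    return (entropy * mask).sum() / mask.sum()
```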
One odd detail: the authors sample only 4 rollouts, which seems rather few...
Summary:
The successive appearance of these papers dealt a heavy blow to my enthusiasm for RL-reasoning, but I still maintained a considerable degree of interest. However, the appearance of the next few articles was indeed a bit unbearable.
Setbacks: What exactly did our RL learn?
Single sample
One-shot-RL:
Another very famous paper. The authors achieve good results by training on just a single example for many steps, with the example selected based on variance. This can also be read as picking data on which the model has high entropy in order to reduce the model's entropy.
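A minimal sketch of variance-based example selection in that spirit (an illustration of picking the prompt whose rollout rewards vary the most; the paper's exact scoring may differ):

```python
import statistics

def pick_high_variance_prompt(candidates):
    """Pick the prompt whose rollout rewards have the highest variance, i.e.
    the one the model is most uncertain about (a rough proxy for high entropy).

    candidates: list of (prompt, rewards), where rewards come from rollouts
    collected for that prompt, e.g. during earlier training.
    """
    return max(candidates, key=lambda item: statistics.pvariance(item[1]))[0]
```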
When I first read the paper I was also surprised, and then wondered whether it was just a formatting issue. The authors ran the corresponding experiments a week later and added them to the latest arXiv version.
It is also clear that the entropy loss plays a significant role. Unfortunately, the authors did not check what happens with entropy alone, leaving an opening for later work.
One-shot entropy minimization, only 10 steps:
Similar to the previous paper's idea: train on a single example for only 10 steps, but use entropy directly as the training signal. The performance comparison is shown below, and the results still look quite good (whispering: it feels a bit unstable, since most of the average improvement comes from AMC, and AMC is far too noisy).
A small note: isn't the generated length rather short?
The authors also examined the logit distributions. Entropy minimization increases the model's overall confidence, concentrating probability mass on a subset of tokens, so regions that were already high-probability in the original logits are stretched further out into a long tail of even higher values.
Incorrect rewards improve model performance:
This paper is more comprehensive. It studies the effects of different rewards: random rewards, incorrect rewards, format rewards, majority-vote rewards, and ground-truth (correct) rewards. Surprisingly, it shows that even random and incorrect rewards have an effect.
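To make the comparison concrete, here is an illustrative paraphrase of how those reward variants can be defined (an assumption-laden sketch, not the paper's code; answer extraction is assumed to be handled elsewhere):

```python
import random
import re
from collections import Counter

def reward(kind, response, answer=None, group_answers=None, gold=None):
    """Sketch of the reward variants compared in the spurious-rewards study.

    answer:        answer extracted from `response`
    group_answers: extracted answers of all rollouts for the same prompt
    gold:          ground-truth answer
    """
    if kind == "random":        # coin flip, ignores the response entirely
        return float(random.random() < 0.5)
    if kind == "format":        # rewards any parseable \boxed{...} answer
        return float(bool(re.search(r"\\boxed\{.+\}", response)))
    if kind == "majority":      # matches the group's majority answer
        majority, _ = Counter(group_answers).most_common(1)[0]
        return float(answer == majority)
    if kind == "incorrect":     # rewards only wrong answers (still needs gold!)
        return float(answer is not None and answer != gold)
    if kind == "correct":       # the standard RLVR outcome reward
        return float(answer == gold)
    raise ValueError(f"unknown reward kind: {kind}")
```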
That random and incorrect rewards, i.e. spurious rewards, can also improve performance makes things fairly clear: the gain comes largely from reinforcing the model's own confidence, since the model's outputs are already things it is fairly confident about.
The best part of this paper is the wide range of base-model families it covers. At first glance it seems to strike at the effectiveness of RL-reasoning, but while spurious rewards can lift Qwen or Llama, they have no effect on "cleaner" models such as OLMo, where genuine RL is still useful.
The paper also runs every reward variant under the same settings, avoiding comparisons across different papers. We can also see that even on Qwen models, RL-reasoning with correct rewards still yields a 4-5 point improvement over merely strengthening confidence.
Summary
The overall research trend looks like this: exploring and learning with external answers, then with substitute external answers, then with no external answers at all, and finally changing the model without learning any new knowledge.
Of course, whatever the variant, the model's own iteration loop is always there: it must sample from itself.
Our rollouts really are the "problem" (sarcasm): every rollout can be regarded as output the model is already fairly confident about, which is then refined further. So simply strengthening that confidence can already buy some gains.
But don't be disheartened: many of the problems uncovered above were explored in simple MATH-style scenarios and within the model's existing capabilities.