NVIDIA (ProRL) | Can RL truly enhance the reasoning capabilities of LLMs?


Today, we share a research paper from NVIDIA titled "ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models."

This article explores whether Reinforcement Learning (RL) can truly expand the reasoning capabilities of Large Language Models (LLMs), or if it merely optimizes the sampling efficiency of high-reward outputs already present in their base models. It also investigates whether continuous expansion of RL computation can reliably improve reasoning performance. The authors introduce the ProRL (Prolonged Reinforcement Learning) training method, demonstrating that effective RL methods can continuously enhance the reasoning limits of LLMs.

Key features of this method are summarized as follows:

1. Training Stability and Efficiency: ProRL achieves long-term stable training and continuous performance improvement by introducing KL divergence control, reference policy resetting, and a diverse set of tasks.

2. Outstanding Performance: The trained Nemotron-Research-Reasoning-Qwen-1.5B model consistently outperforms its base model in various Pass@k evaluations, including scenarios where the base model fails completely. On multiple benchmarks, its performance matches or even surpasses that of the larger DeepSeek-R1-Distill-Qwen-7B model.

3. Strong Generalization Capability: The model continues to improve over more than 2,000 training steps, indicating that RL training can effectively utilize additional compute and generalizes well to unseen out-of-distribution (OOD) tasks and more challenging tasks.

4. Proof that Effective RL Can Elevate LLM Reasoning Limits: It demonstrates that prolonged RL training (ProRL) can discover novel reasoning strategies that are not attainable even through extensive sampling in the base model, thereby truly expanding the model's reasoning capabilities, rather than merely optimizing existing ones.

I. Overview

Title: ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

URL: https://arxiv.org/abs/2505.24864v1

Authors: Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, Yi Dong

Institution: NVIDIA

Code: https://huggingface.co/nvidia/Nemotron-Research-Reasoning-Qwen-1.5B

1 Motivation

• There is an ongoing debate in the research community regarding whether Reinforcement Learning (RL) truly expands the reasoning capabilities of language models, or merely enhances the sampling efficiency of high-reward outputs already inherent in the base model.

• Existing RL research has limitations: evaluation relies heavily on specialized domains such as mathematics, where models are often over-trained during pre-training and post-training, which limits exploration potential; and RL training is typically terminated prematurely after only a few hundred steps, before the model has fully explored and developed new reasoning capabilities.

• This paper aims to demonstrate that through prolonged, stable RL training, models can learn entirely new reasoning strategies that are unattainable even through extensive sampling from the base model.

2 Methods

Summary:

The paper proposes ProRL (Prolonged Reinforcement Learning), a novel training method designed to extend the reasoning capabilities of large language models through prolonged, stable RL training. Its core lies in addressing entropy collapse and instability issues in RL training, and enabling deeper exploration and learning through diverse tasks and policy optimization techniques.

ProRL allows for over 2,000 steps of prolonged training with continuous performance improvement on diverse tasks, ultimately producing Nemotron-Research-Reasoning-Qwen-1.5B, a model that significantly surpasses its base model (DeepSeek-R1-Distill-Qwen-1.5B) in reasoning capability and matches or even exceeds DeepSeek-R1-Distill-Qwen-7B.

Detailed Methods and Steps:

RL Algorithm Selection: ProRL adopts DeepSeek's GRPO (Group Relative Policy Optimization). Unlike PPO, GRPO removes the value model and instead estimates the baseline from group scores, optimizing the policy by maximizing its group-relative objective.
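
As a rough illustration of the group-score baseline (a minimal sketch, not code from the paper or the verl framework; `group_advantages` is a hypothetical helper):

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages in the style of GRPO.

    rewards: shape (G,), one scalar reward per rollout sampled from the
    same prompt. The group mean replaces a learned value baseline, and the
    group standard deviation rescales the signal.
    """
    baseline = rewards.mean()
    scale = rewards.std() + eps          # avoid division by zero for uniform groups
    return (rewards - baseline) / scale  # one advantage per rollout

# Example: 4 rollouts for one prompt, two of which were judged correct.
adv = group_advantages(torch.tensor([1.0, 0.0, 1.0, 0.0]))
# Correct rollouts receive positive advantages, incorrect ones negative.
```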

What is entropy collapse, and how does ProRL mitigate it? To address the common problem of entropy collapse in RL training (the model's output distribution converges too early, which limits exploration), ProRL adopts several measures:

High Exploration Temperature: Use a higher sampling temperature during the rollout phase to encourage initial exploration.

Decoupled Clipping (referencing DAPO): Introduces DAPO's decoupled clipping mechanism, treating the upper and lower clipping bounds in the PPO objective as independent hyperparameters. Raising the upper bound increases the probability of previously unlikely tokens, encouraging wider exploration, helping maintain entropy, and reducing premature mode collapse.
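
A minimal sketch of the asymmetric clip, assuming per-token log-probabilities and advantages as inputs; the bound values are illustrative, not the paper's exact hyperparameters:

```python
import torch

def clipped_surrogate(logp_new: torch.Tensor, logp_old: torch.Tensor,
                      advantages: torch.Tensor,
                      eps_low: float = 0.2, eps_high: float = 0.3) -> torch.Tensor:
    """PPO-style surrogate with decoupled clip bounds (clip-higher).

    All inputs are per-token tensors of the same shape. A larger eps_high
    leaves more headroom to raise the probability of previously unlikely
    tokens, which helps keep entropy from collapsing.
    """
    ratio = torch.exp(logp_new - logp_old)                       # importance ratio
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)  # asymmetric bounds
    # Pessimistic (minimum) objective, as in standard PPO clipping.
    return torch.minimum(ratio * advantages, clipped * advantages)
```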

Dynamic Sampling (referencing DAPO): Filters out prompts on which the model consistently succeeds or fails (accuracy 1 or 0), focusing training on examples of intermediate difficulty to maintain diverse learning signals.
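
The filtering rule can be expressed in a few lines (a toy sketch; `keep_prompt` is a hypothetical helper, not the paper's implementation):

```python
def keep_prompt(group_rewards: list[float]) -> bool:
    """Dynamic-sampling filter: drop prompts whose rollout group is
    all-correct or all-wrong, since such groups give no gradient signal
    under a group-relative baseline."""
    accuracy = sum(group_rewards) / len(group_rewards)
    return 0.0 < accuracy < 1.0

# Keep sampling prompts until the batch is filled with informative groups.
groups = [[1, 1, 1, 1], [1, 0, 1, 0], [0, 0, 0, 0]]
batch = [g for g in groups if keep_prompt(g)]   # only the mixed group survives
```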

KL Regularization: Introduces a KL divergence penalty term into the GRPO objective function. This not only helps maintain entropy but also prevents the online policy from deviating too far from the stable reference policy, thereby stabilizing learning and mitigating overfitting to spurious reward signals.

Reference Model Reset (the reference model is updated when validation performance degrades): To prevent the KL term from dominating the loss late in training and stalling policy updates, ProRL periodically hard-resets the reference policy to the latest snapshot of the online policy (shrinking the gap between the two and thus the KL penalty) and reinitializes the optimizer state. This lets the model keep improving while retaining the benefits of KL regularization, enabling prolonged training.
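
A minimal sketch of such a reset rule, assuming a simple "no recent improvement" trigger; the function and argument names and the patience threshold are illustrative, not the paper's exact criterion:

```python
import copy
import torch

def maybe_hard_reset(policy, ref_policy, optimizer, val_scores, patience: int = 3):
    """Hard-reset the KL reference when validation stops improving.

    If none of the last `patience` validation scores beats the best earlier
    score, copy the current online policy into the reference policy and
    re-initialize the optimizer state.
    """
    if len(val_scores) > patience and max(val_scores[-patience:]) <= max(val_scores[:-patience]):
        ref_policy.load_state_dict(copy.deepcopy(policy.state_dict()))  # ref <- latest policy
        optimizer = torch.optim.AdamW(policy.parameters(), lr=2e-6)     # fresh optimizer state
    return optimizer
```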


Diverse Training Dataset Construction: A diverse and verifiable training dataset comprising 136K questions was constructed, covering five major task domains: mathematics, code, STEM, logical puzzles, and instruction following. Each task type is equipped with clear reward signals (binary or continuous) to enable reliable feedback during training and encourage generalization.
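
For instance, a binary verifiable reward for a math task might look like the following toy check (a stand-in for the paper's actual verifiers, which are more robust):

```python
import re

def math_reward(response: str, gold_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the final \\boxed{...} answer matches
    the reference string exactly, else 0.0."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == gold_answer.strip() else 0.0

print(math_reward(r"So the result is \boxed{42}.", "42"))  # 1.0
```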

What is DAPO? What key techniques does it employ?

Clip-Higher: This technique aims to enhance system diversity and prevent entropy collapse. Traditional PPO's clipping mechanism restricts policy exploration; Clip-Higher allows for more freedom to increase the probability of low-probability tokens by decoupling the upper and lower clipping bounds, thereby encouraging exploration.

Dynamic Sampling: Aims to improve training efficiency and stability. It oversamples and filters out prompts with accuracy equal to 1 or 0, retaining only prompts that yield effective gradients and keeping the number of prompts per batch stable. Before each gradient update, sampling continues until the batch is filled with prompts whose accuracy is neither 0 nor 1.

Token-Level Policy Gradient Loss: Crucial for long chain-of-thought (CoT) RL scenarios. The original GRPO algorithm computes the loss at the sample level, so the contribution of individual tokens in long responses is disproportionately diluted. Token-level policy gradient loss gives longer sequences proportionally more influence on the gradient update and lets every token respond directly to the reward signal.
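
The difference between the two aggregation schemes can be sketched as follows (illustrative only, not the paper's code):

```python
import torch

def token_level_loss(per_token_loss: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average over all valid tokens in the batch, so every token of a long
    response carries the same weight. Shapes: (batch, seq_len)."""
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1)

def sample_level_loss(per_token_loss: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Original GRPO-style aggregation: average within each sequence first,
    which dilutes the per-token contribution of long responses."""
    per_seq = (per_token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return per_seq.mean()
```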

Overlong Reward Shaping: Aims to reduce reward noise and stabilize training. Truncated, overlong samples are assigned a penalty reward by default, but this introduces noise. DAPO proposes an Overlong Filtering strategy that masks the loss of truncated samples, and a Soft Overlong Punishment mechanism that applies a length-aware penalty to responses exceeding a predefined maximum length, guiding the model away from overly long responses.
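
A length-aware penalty in this spirit might look like the sketch below (thresholds are illustrative, not the values used in the paper):

```python
def soft_overlong_penalty(length: int, max_len: int = 16384, buffer: int = 4096) -> float:
    """Length-aware penalty in the spirit of DAPO's soft overlong punishment.

    Responses shorter than (max_len - buffer) incur no penalty; inside the
    buffer zone the penalty grows linearly toward -1; responses at or beyond
    max_len receive the full penalty.
    """
    if length <= max_len - buffer:
        return 0.0
    if length < max_len:
        return (max_len - buffer - length) / buffer   # from 0 down toward -1
    return -1.0

# Added on top of the task reward to steer the model away from responses
# that approach the context limit.
```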

What are the details of the experimental setup?

• RL training was conducted using the verl framework.

• AdamW optimizer was used with a learning rate of 2e-6.

• Training was performed on 48 NVIDIA H100-80GB nodes, totaling approximately 16k GPU hours.

• Training progress was closely monitored through a mixed validation set. When validation performance stagnated or declined, a hard reset of the reference model and optimizer was executed.

• For most of the training, response length was limited to 8k tokens to maintain conciseness and stable generation. In the final stage, the context window was increased to 16k tokens.

3 Conclusion

RL Indeed Expands Reasoning Boundaries: Prolonged, stable reinforcement learning (ProRL) enables language models to learn novel reasoning strategies and solutions that do not exist in their base models.


Effectiveness of ProRL: The ProRL-trained model (Nemotron-Research-Reasoning-Qwen-1.5B) significantly outperforms its base model on various tasks including mathematics, coding, STEM, logical puzzles, and instruction following, and in some cases, achieves or exceeds the performance of larger or domain-specific models.


Reasoning Improvement Correlates with Initial Capability and Training Duration: The extent of improvement in the model's reasoning boundaries is closely related to the base model's initial capability on the task and the duration of RL training. RL yields greater improvements in areas where the base model performs weakly, and continued training allows RL to explore and fill new solution spaces.


4 Limitation

High Computational Resource Requirements: The prolonged RL training process involved in ProRL requires significant computational resources, which may pose a barrier for smaller organizations or researchers with limited budgets.

Scalability Issues: While successful on a 1.5B parameter model, it is unclear whether the method can effectively scale to much larger models (e.g., tens or hundreds of billions of parameters), where the demand for computational resources would be even more significant.

Training Process Complexity: ProRL relies on periodic hard resets of the reference policy and optimizer to maintain training stability, which adds complexity to the training process and may lead to inconsistent results compared to more stable training methods.

Limited Task Scope: Although the evaluation covers diverse domains, the training dataset still represents only a subset of all possible reasoning tasks. While the model shows promising generalization to some out-of-distribution tasks, there is no guarantee of similar improvements across all reasoning domains not explicitly trained on.

II. Summary

Conclusion 1: ProRL proves that RL effectively expands LLM reasoning boundaries. Through prolonged, stable RL training, it is demonstrated that the model can discover novel reasoning strategies not present in the base model, achieving performance beyond the base model on multiple tasks, including strong generalization capabilities on OOD tasks.

Conclusion 2: ProRL ensures RL training stability and efficiency through innovative techniques. To address common entropy collapse and instability issues in RL training, ProRL introduces mechanisms such as KL divergence control, periodic reference model resets, decoupled clipping, and dynamic sampling. These techniques enable the model to continuously improve over prolonged training (over 2000 steps), effectively utilizing computational resources and laying the foundation for long-term RL application in reasoning tasks.

Main Tag: Large Language Models

Sub Tags: Reinforcement Learning, ProRL, Model Training, Reasoning

