10 Lines of Code, 15% Improvement on AIME24/25! Unveiling the Entropy Mechanism in Large Language Model Reinforcement Learning


The authors of this article are from Tsinghua University, Peking University, Shanghai AI Lab, and other institutions. The co-first authors, Cui Ganqu, Zhang Yuchen, and Chen Jiacheng, are from Shanghai AI Lab and focus on reasoning enhancement for large models. The corresponding authors are Professor Cheng Yu and Professor Zhou Bowen of Shanghai AI Lab and Assistant Professor Ding Ning of Tsinghua University.

"Nature never undertakes any change unless her interests are served by an increase in entropy." (Max Planck)

In reinforcement learning, how can we make entropy increase serve our interests?

Recently, researchers from Shanghai AI Laboratory, Tsinghua University, Peking University, UIUC, and other institutions published work revealing the mechanism behind entropy change in large-model reinforcement learning. The main findings are as follows:

They defined the entropy collapse problem in reinforcement learning and summarized an empirical formula relating entropy to performance across 4 model families and 11 models, demonstrating the importance of policy entropy in reinforcement learning.

From both theoretical and empirical perspectives, they identified the driving force behind policy entropy change during reinforcement learning: the covariance between the probability of an action (a model output token) and the advantage it receives.

Based on this, the study proposed two simple (about 10 lines of code changed) yet highly effective (+15% on AIME24/25) entropy-enhancing reinforcement learning schemes, Clip-Cov and KL-Cov, which sustain exploration throughout reinforcement learning training.


Paper Title: The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Paper Link: https://huggingface.co/papers/2505.22617

Code Repository: https://github.com/PRIME-RL/Entropy-Mechanism-of-RL

1. Entropy Collapse Problem in Large Model Reinforcement Learning

The core challenge in reinforcement learning is the exploration-exploitation trade-off: balancing the exploitation of already-rewarding behavior against the exploration of new behavior. On the exploration side, a key metric for gauging a policy's exploratory potential is policy entropy, which reflects the uncertainty in the policy's action selection. In reinforcement learning research, preventing the decay of policy entropy is considered crucial for most algorithms, and in traditional (non-LLM) settings researchers often regulate it actively through regularization.
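To make the metric concrete, here is a minimal PyTorch sketch (our own illustration, not the authors' implementation) of token-level policy entropy for a language model: the entropy of the softmax distribution at each generated position, averaged over the response.

```python
import torch
import torch.nn.functional as F

def policy_entropy(logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Mean token-level entropy of a softmax policy.

    logits:        (batch, seq_len, vocab) raw scores from the LM head
    response_mask: (batch, seq_len) 1 for generated tokens, 0 for prompt/padding
    """
    log_probs = F.log_softmax(logits, dim=-1)           # log pi(a | s)
    probs = log_probs.exp()
    token_entropy = -(probs * log_probs).sum(dim=-1)    # entropy at each position
    return (token_entropy * response_mask).sum() / response_mask.sum()
```

When this quantity collapses toward zero, the distribution at every step is nearly one-hot: the policy has effectively stopped exploring.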

For large language models, the typical behavior of policy entropy has not been studied in depth, but we observed an interesting and consistent pattern across extensive experiments: policy entropy drops sharply to near zero within the first few training steps, indicating that the policy becomes extremely deterministic. This loss of exploratory capacity leads directly to performance stagnation, with validation performance likewise hitting a bottleneck. Quantitative analysis further reveals that, without entropy intervention (such as an entropy loss or KL regularization), downstream performance (R) is entirely determined by policy entropy (H), with the fitted curve following a simple exponential function R = -a exp(H) + b, as shown in the figure below. In essence, the policy predictably trades uncertainty (entropy) for reward.

Figure 1: The entropy collapse problem in large-model reinforcement learning

We verified this on the Qwen, Mistral, LLaMA, and DeepSeek model families:

Figure 2: Entropy Collapse Across Different Model Families

This empirical rule has two important implications: (1) Much like a scaling law, the exploitation-exploration curve is fixed once the policy model and the training data are given. This allows us to predict policy performance early in reinforcement learning and to extrapolate the performance of large models from smaller ones. (2) More importantly, the equation shows that once policy entropy is depleted (H = 0, R = -a + b), the upper bound on policy performance is also fixed, so simply adding training compute may yield extremely limited gains. In short, to achieve scalable reinforcement learning, the entropy bottleneck must be broken.

Figure 3: Predicting Final Model Performance During Early Training

Figure 4: Small Model Predicting Large Model Performance
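As a concrete illustration of this fitting-and-extrapolation procedure, here is a minimal SciPy sketch (the numbers are made-up placeholders, not values from the paper): fit R = -a exp(H) + b to logged (entropy, validation score) pairs, then read off the ceiling at H = 0.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (entropy, validation score) pairs logged during RL training.
H = np.array([1.20, 0.85, 0.60, 0.42, 0.30, 0.21, 0.15])
R = np.array([0.18, 0.27, 0.33, 0.37, 0.40, 0.42, 0.43])

def entropy_performance(h, a, b):
    # Empirical law from the paper: R = -a * exp(H) + b
    return -a * np.exp(h) + b

(a, b), _ = curve_fit(entropy_performance, H, R)

# When entropy is fully depleted (H = 0), the fit gives the performance ceiling.
ceiling = -a + b
print(f"fitted a={a:.3f}, b={b:.3f}, predicted ceiling R(H=0)={ceiling:.3f}")
```

With the two fitted coefficients in hand, R(H = 0) = -a + b is the predictable performance ceiling discussed above.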

2. Relationship Between Entropy and Covariance in Large Model Reinforcement Learning

The key to solving this problem lies in understanding the underlying mechanism: why does policy entropy decrease monotonically? To this end, we analyzed the dynamics of policy entropy from both theoretical and empirical angles. The core finding is that, for LLMs with softmax policies, the entropy change between two consecutive steps is proportional to the covariance between the log-probability of an action and the change in its logit. Furthermore, in policy gradient and natural policy gradient algorithms, the logit change is proportional to the action's advantage.

Intuitively, actions that have both high advantage and high probability reduce policy entropy, whereas rare actions with high advantage increase it. This theoretical conclusion was verified experimentally: in the early stages of training, the policy exhibits high covariance on the training data, meaning it is already confident about which actions are good; it can therefore safely exploit these high-confidence trajectories, which reinforces its confidence and drives entropy down (consistent with recent work showing that minimizing entropy can improve performance). As training progresses, the covariance gradually decreases but stays positive, continuing to drag policy entropy to ever lower levels.

Formula 1: Theoretical Analysis of Entropy and Covariance
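The driving quantity can be monitored directly from rollout data. Below is a minimal PyTorch sketch (our own illustration, not the authors' code) of the batch-level covariance between the log-probability of each sampled token and its advantage:

```python
import torch

def logprob_advantage_cov(log_probs: torch.Tensor,
                          advantages: torch.Tensor,
                          response_mask: torch.Tensor) -> torch.Tensor:
    """Covariance between log pi(a_t | s_t) and the advantage A_t,
    averaged over all generated tokens in the batch.

    log_probs:     (batch, seq_len) log-probability of each sampled token
    advantages:    (batch, seq_len) advantage assigned to each token
    response_mask: (batch, seq_len) 1 for generated tokens, 0 otherwise
    """
    n = response_mask.sum()
    lp_mean = (log_probs * response_mask).sum() / n
    adv_mean = (advantages * response_mask).sum() / n
    return ((log_probs - lp_mean) * (advantages - adv_mean) * response_mask).sum() / n
```

A positive value means that high-probability tokens also tend to receive high advantage, which is exactly the regime in which the analysis predicts entropy will keep falling.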

Figure 5: Empirical Analysis of Entropy and Covariance

3. Covariance-Based Entropy-Enhancing Reinforcement Learning Schemes

We first verified experimentally that traditional entropy and KL regularization methods have little effect in the large-model setting.

Figure 6: Failure of Traditional Regularization Methods

The analysis of entropy dynamics shows that high covariance hinders the scalability of reinforcement learning, which points to a way of raising policy entropy: limit the update step size of high-covariance tokens. Based on this, we designed two entropy-control strategies, Clip-Cov and KL-Cov, which replace the clip and PPO-KL mechanisms in the loss function, respectively. Clip-Cov randomly selects a small number of high-covariance tokens and detaches their gradients:

Formula 2: Clip-Cov
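A minimal sketch of how Clip-Cov could look in code, based on our reading of the description above rather than the official implementation (see the repository for the real one); the hyperparameter names clip_frac and cov_threshold are placeholders, and the loss is written in plain REINFORCE form rather than the full PPO surrogate for brevity:

```python
import torch

def clip_cov_loss(log_probs, advantages, response_mask,
                  clip_frac=2e-3, cov_threshold=1.0):
    """Policy-gradient loss that drops a random subset of high-covariance tokens.

    log_probs:     (batch, seq_len) log pi_theta(a_t | s_t) of sampled tokens
    advantages:    (batch, seq_len) per-token advantages
    response_mask: (batch, seq_len) 1 for generated tokens, 0 otherwise
    """
    n = response_mask.sum()
    lp_c = log_probs.detach() - (log_probs.detach() * response_mask).sum() / n
    adv_c = advantages - (advantages * response_mask).sum() / n
    token_cov = lp_c * adv_c * response_mask          # per-token covariance term

    # among high-covariance tokens, randomly pick a small fraction to exclude
    high_cov = token_cov > cov_threshold
    drop = high_cov & (torch.rand_like(token_cov) < clip_frac)
    keep = response_mask * (~drop).float()

    # zeroing a token's contribution is equivalent to detaching its gradient here
    return -(log_probs * advantages * keep).sum() / n
```

The selection fraction and threshold are the knobs: tightening them removes more high-covariance tokens from the update and keeps entropy higher.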

KL-Cov is even simpler: it directly applies a KL penalty to the tokens with the largest covariance:

Formula 3: KL-Cov
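Again as a hedged sketch under the same assumptions (placeholder names k_frac and kl_coef, plain policy-gradient objective), KL-Cov might look like this: the top fraction of generated tokens ranked by covariance receives a KL penalty that pulls the current policy back toward the rollout policy, while all other tokens are updated as usual.

```python
import torch

def kl_cov_loss(log_probs, old_log_probs, advantages, response_mask,
                k_frac=2e-3, kl_coef=1.0):
    """Policy-gradient loss with a KL penalty on the highest-covariance tokens.

    log_probs:     (batch, seq_len) log pi_theta(a_t | s_t), requires grad
    old_log_probs: (batch, seq_len) log pi_rollout(a_t | s_t), no grad
    """
    n = response_mask.sum()
    lp_c = log_probs.detach() - (log_probs.detach() * response_mask).sum() / n
    adv_c = advantages - (advantages * response_mask).sum() / n
    token_cov = lp_c * adv_c * response_mask

    # select the top k_frac fraction of generated tokens by covariance
    k = max(1, int(k_frac * n.item()))
    flat_cov = token_cov.masked_fill(response_mask == 0, float("-inf")).flatten()
    kl_mask = torch.zeros_like(flat_cov)
    kl_mask[torch.topk(flat_cov, k).indices] = 1.0
    kl_mask = kl_mask.view_as(token_cov)

    pg_loss = -(log_probs * advantages * response_mask).sum() / n
    # non-negative KL estimate toward the rollout policy (Schulman's k3 estimator)
    log_ratio = log_probs - old_log_probs
    kl_est = torch.exp(log_ratio) - 1.0 - log_ratio
    return pg_loss + kl_coef * (kl_est * kl_mask).sum() / n
```

Minimizing the penalty pulls log_probs back toward old_log_probs only on the selected tokens, limiting their update step size without touching the rest of the batch.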

Experiments show that by adjusting the threshold parameter, policy entropy can be actively controlled, allowing the model to escape low-entropy traps:

Figure 7: Controlling Entropy with Clip-Cov and KL-Cov

The two methods deliver superior performance on tasks such as mathematical reasoning. On Qwen2.5-32B, we achieved a 6.4% improvement overall, and a remarkable 15% improvement on the challenging AIME24/25 benchmarks.

Figure 8: Training Dynamics of Entropy, Output Length, and Performance under Clip-Cov and KL-Cov

Figure 9: Performance of Clip-Cov and KL-Cov

This study set out to solve the policy entropy collapse problem in reinforcement learning for large language model reasoning. Through empirical analysis, we found that performance gains often come at the cost of exploration capability, and that this trade-off sets a foreseeable ceiling on further improvement. To understand the phenomenon more deeply, we analyzed entropy dynamics theoretically and proposed two simple regularization techniques, Clip-Cov and KL-Cov, which effectively curb entropy collapse by directly regulating high-covariance tokens.

Looking ahead, training compute will gradually shift from pre-training to post-training, especially reinforcement learning. On the path to scaling reinforcement learning with more compute, maintaining exploratory capacity, discovering new solution paths, and improving continuously are crucial for using that compute efficiently. Achieving truly scalable reinforcement learning, however, requires moving beyond naive entropy minimization. We hope this research offers new insight into the role entropy plays, advances the understanding, analysis, and optimization of the mechanisms underlying LLM reinforcement learning, and pushes reinforcement learning toward higher levels of intelligence.
