ByteDance Breaks the 'Entropy Curse' in LLM RL Training, Enabling Models to Learn with Certainty!

In short, the paper argues that the mistakes LLM agents make are not entirely their own fault; part of the blame lies with a flaw in the learning algorithm itself. It proposes a novel "entropy-modulated" credit assignment method that amplifies the learning signal for confident, correct decisions, penalizes confident but wrong ones more severely, and tempers updates from uncertain, exploratory steps. (The original paper title appears at the end of this article. Published on arXiv on 11 Sep 2025 by ByteDance.)

First Stage: Identifying Core Concepts

Analysis of the Paper's Motivation

Imagine you're teaching a robot to complete a complex task, like "booking the cheapest flight from Beijing to Shanghai on a specific date online." This task involves many steps: opening a browser, searching airline websites, entering departure and destination, selecting dates, comparing prices, filling in personal information, and finally clicking to pay.

Traditional training methods (reinforcement learning) suffer from a huge problem: you only give the robot a "well done" reward if it successfully buys the ticket. If it fails at any step, such as clicking the wrong date or a website failing to load, it ultimately receives a "failure" penalty.

This "winner-takes-all" reward mechanism is highly inefficient. The robot might have performed perfectly for the first 9 steps, only making a mistake in the last one, but it will perceive all 10 steps as wrong. Conversely, it might have stumbled into success by chance, with some steps being very risky and uncertain, yet it will consider all steps equally correct.

The motivation behind this paper is to solve this "coarse credit assignment" problem. The authors discovered that standard learning algorithms (policy gradients) have an inherent flaw:

  • When the model is very confident about an action (e.g., it is very sure it should click the "search" button next), the learning signal is actually weak. This means that even if this confident action is correct, it doesn't get enough reinforcement, leading to slow learning.

  • When the model is very uncertain about an action (e.g., it hesitates between multiple links), the learning signal is actually strong. If this uncertain exploration happens to lead to a good outcome, the model will over-reinforce this "gamble," leading to a very unstable learning process.

The authors' goal is to design a smarter "coach" that can dynamically adjust the strength of rewards and penalties based on the model's "confidence" at each step, thereby achieving more efficient and stable learning.
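To make this coupling concrete, here is a tiny numerical illustration (our own sketch, not code from the paper) of the underlying policy-gradient property: the gradient of a chosen action's log-probability with respect to the logits equals one_hot(action) - softmax(logits), so a confident, low-entropy distribution yields a small gradient, while an uncertain, high-entropy one yields a large gradient.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p)).sum())

def grad_logprob_norm(z, k):
    """Norm of d/dz log softmax(z)[k], which equals one_hot(k) - softmax(z)."""
    g = -softmax(z)
    g[k] += 1.0
    return float(np.linalg.norm(g))

# A confident (low-entropy) step vs. an uncertain (high-entropy) step
confident_logits = np.array([5.0, 0.0, 0.0, 0.0])   # ~98% of the mass on one action
uncertain_logits = np.array([0.3, 0.2, 0.1, 0.0])   # mass spread across actions

for name, z in [("confident", confident_logits), ("uncertain", uncertain_logits)]:
    p = softmax(z)
    print(f"{name}: entropy={entropy(p):.3f}, grad norm={grad_logprob_norm(z, k=0):.3f}")
# The output shows the coupling: the confident step has low entropy AND a weak
# learning signal; the uncertain step has high entropy AND a strong one.
```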

Analysis of the Paper's Main Contributions

  • List the paper's claimed main innovations

    1. Identified and formalized a fundamental problem: The paper is the first to explicitly point out, and to prove mathematically, that in LLM agents the magnitude of the policy gradient is intrinsically coupled with the model's output "entropy" (which can be understood as uncertainty). This is a crucial theoretical finding.

    2. Proposed the Entropy-Modulated Policy Gradient (EMPG) framework: This is a brand new learning framework designed to decouple the aforementioned problem and achieve smarter credit assignment.

    3. Introduced two core technologies: The EMPG framework comprises two key parts: "Self-Calibrating Gradient Scaling" and "Future Clarity Bonus."

  • Identify key techniques or methods supporting these innovations

    1. Self-Calibrating Gradient Scaling: This technique adjusts the strength of the learning signal based on the agent's confidence (entropy) in the current step. If the agent is confident and correct, the reward is amplified; if confident but wrong, the penalty is amplified; if uncertain, the learning signal is weakened to prevent the learning process from being disturbed by unstable exploratory behaviors.

    2. Future Clarity Bonus: This technique is an additional intrinsic motivation. It encourages the agent to take actions that make the next step clearer and less ambiguous. This guides the agent to seek more robust and predictable paths to success, not just any path that might succeed.

  • What are the significant results of the paper? The most significant result of the paper is that their method not only numerically outperforms existing baseline models but, more importantly, it solves the performance bottleneck encountered by baseline models. Experimental charts show that traditional methods' performance stagnates after training to a certain extent, as if hitting a wall. EMPG, however, helps the agent break through this bottleneck, continuously learning and improving, ultimately reaching a significantly higher performance level. This indicates that EMPG fundamentally improves learning dynamics, not just by making minor optimizations.

Identifying Difficulties in Understanding

  • Analyze which concepts/methods are key to understanding the paper

    1. Policy Gradients: This is the foundation of reinforcement learning; understanding it is necessary to grasp what the paper improves.

    2. Entropy: In this paper, entropy is the core metric for measuring model uncertainty. Understanding what high and low entropy represent is crucial.

    3. Coupling relationship between Entropy and Gradients (Proposition 1): This is the theoretical cornerstone of the paper and key to understanding its motivation.

    4. Advantage Function: This is a core component in policy gradients, and EMPG primarily operates on it.

  • Identify the most challenging parts of these concepts: The most central and challenging part is the design of the "Entropy-Modulated Advantage Function," formula (8) in the paper. This formula integrates both "Self-Calibrating Gradient Scaling" and the "Future Clarity Bonus," forming the technical core of the entire EMPG framework. Understanding how it transforms the model's "uncertainty" into concrete, computable reward adjustments is the key to comprehending this paper.

  • Determine core concepts that require detailed explanation: We will focus on explaining the Entropy-Modulated Advantage Function (the "modulated advantage"), because it perfectly embodies how the paper uses entropy to intelligently adjust learning signals and is the technical culmination of all the ideas above.

Concept Dependency Relationships

The logical chain for understanding EMPG is as follows:

  1. Starting Point: The "policy gradient" method in standard reinforcement learning suffers from uneven credit assignment when dealing with long-horizon tasks.

  2. Root Cause of the Problem: Its learning signal (gradient) magnitude is naturally tied to the model's "uncertainty (entropy)," leading to inefficient and unstable learning (theoretical finding).

  3. Solution: We must break this binding and actively "modulate" the learning signal.

  4. Core Mechanism: This is achieved by designing a new "entropy-modulated advantage function." This function consists of two parts:

    • Scaling the original success/failure signal based on the entropy of the current step.

    • Giving an additional reward based on the entropy of the next step.

  5. Final Effect: Achieving a smarter, more efficient, and more stable learning process that can break through performance bottlenecks.

Our entry point will be this core mechanism—the entropy-modulated advantage function—because it connects theory and practice, serving as their intersection.

Second Stage: In-Depth Explanation of Core Concepts

Designing a Real-Life Analogy: The Smart Rock Climbing Coach

Imagine you're a novice rock climber learning with a very smart coach. Your goal is to climb a complex rock face (complete a long-horizon task).

  • Standard Coach (Traditional RL): This coach only watches you from the bottom of the mountain with binoculars. There are only two types of feedback:

    • If you successfully reach the summit, he shouts: "Great job! Every step you took was amazing!"

    • If you fall halfway, he yells: "Terrible! Every step you took was wrong!" This feedback is obviously not very useful because it doesn't tell you which steps were critical and which were lucky.

  • Smart EMPG Coach: This coach climbs with you and observes every move you make. He not only cares if you ultimately succeed but also about your state when performing each action.

This smart coach has two unique guiding principles:

  1. Confidence-Based Feedback Adjustment: He assesses your "confidence" when grasping each handhold.

  2. Encouraging a "Clear Next Step": He rewards actions that make your next route clearer and less ambiguous.

This is the core idea of EMPG.

Establishing the Correspondence Between Analogy and Actual Technology

Mapping Analogy Elements to Technical Concepts:

| Analogy Element | Corresponding Technical Concept | Explanation |
| --- | --- | --- |
| You (the climber) | LLM agent | Both are subjects performing complex multi-step tasks. |
| Climbing to the top of the rock face | Task success (receiving positive reward) | The ultimate, sparse positive feedback. |
| Falling from the rock face | Task failure (receiving negative reward) | The ultimate, sparse negative feedback. |
| Each climbing action (grasping or stepping on a hold) | One "thought-action" step of the agent | The task is composed of a series of discrete steps. |
| Your confidence in a handhold | The model's certainty at the current step | A solid, large hold gives you full confidence; a slippery, small hold makes you hesitate. |
| A quantitative measure of confidence (higher confidence, lower value) | Step-level entropy (H_t) | Low entropy means the model is very certain, with a concentrated output distribution (like grasping a large hold); high entropy means the model is confused, with a scattered output distribution (like facing several uncertain holds). |
| The smart EMPG coach | The EMPG algorithm | Intelligently adjusts learning signals based on process information. |
| The coach's guide book | The entropy-modulated advantage function (A_t^{EMPG}) | The core rule by which the EMPG algorithm assigns feedback. |

In-Depth Technical Details

Now, let's transition from the analogy to the technology itself and see how the coach's "guide book"—the entropy-modulated advantage function—is actually written.

This core formula appears as formula (8) in the paper:

Original Mathematical Form:

A_t^{EMPG} = f(H_t) * A_hat_t + lambda * (h_next - E[h_next])

This formula looks complex, but it's essentially the mathematical expression of the coach's two guiding principles. Let's translate it:

Symbol-Replaced Version:

A step's final score = (Overall task success/failure result × Current step's confidence multiplier) + (A fixed weight × Next step's clarity bonus)

Now let's break down this "coach's manual" piece by piece (a small code sketch follows the breakdown):

  • Part One: Confidence-Based Feedback Adjustment (Self-Calibrating Gradient Scaling)

    • A_hat_t (Overall task success/failure result): This is the standard coach's feedback. Success in reaching the summit is +1; falling down is -1.

    • f(H_t) (Confidence Multiplier for the Current Step): This is the EMPG coach's first secret weapon. It is computed roughly as e^(-alpha * H_t), where H_t is the normalized entropy, and then rescaled so that its batch average is 1 (the "self-calibration" described in the Third Stage). As a result, if you grasp a very solid handhold (low entropy), f(H_t) comes out greater than 1 and amplifies the final success-or-failure result; conversely, if you grasp a very uncertain handhold (high entropy), f(H_t) comes out less than 1 and shrinks it.

  • Part Two: Encouraging a "Clear Next Step" (Future Clarity Bonus)

    • (h_next - E[h_next]) (Next step's clarity bonus): This is the coach's second secret weapon. h_next is computed similarly to f(H_t), roughly as e^(-beta * H_next), but from the entropy of the next step; subtracting its average value E[h_next] turns it into a relative bonus. If your current action makes the next step's route very clear and its choices very certain (the next step has low entropy), h_next is high and the bonus is positive.

    • lambda (A fixed weight): This is a hyperparameter used to control the importance of this "future clarity bonus."
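To see how the two principles combine, here is a minimal sketch of the modulated advantage for a single step. The function names, the exact shapes of f and h_next, and all numeric values (alpha, beta, lambda, and the batch statistics) are illustrative assumptions consistent with the description above, not the authors' reference implementation; the full batch-level procedure (normalization, self-calibration, centering) is covered in the Third Stage.

```python
import math

def confidence_multiplier(H_t, alpha=1.0, batch_mean_f=0.6):
    """f(H_t): roughly exp(-alpha * H_t), divided by an assumed batch mean
    ("self-calibration") so low-entropy steps land above 1 and high-entropy steps below 1."""
    return math.exp(-alpha * H_t) / batch_mean_f

def clarity_bonus(H_next, beta=1.0):
    """h_next: roughly exp(-beta * H_next); larger when the next step is clearer."""
    return math.exp(-beta * H_next)

def modulated_advantage(A_hat, H_t, H_next, mean_h_next=0.55, lam=0.3):
    """A_t^EMPG = f(H_t) * A_hat_t + lambda * (h_next - E[h_next])."""
    return confidence_multiplier(H_t) * A_hat + lam * (clarity_bonus(H_next) - mean_h_next)

# Confident and correct step (low entropy, successful trajectory): amplified positive score
print(modulated_advantage(A_hat=+1.0, H_t=0.1, H_next=0.2))
# Confident but wrong step (low entropy, failed trajectory): amplified negative score
print(modulated_advantage(A_hat=-1.0, H_t=0.1, H_next=0.8))
# Uncertain but lucky step (high entropy, successful trajectory): damped positive score
print(modulated_advantage(A_hat=+1.0, H_t=0.9, H_next=0.6))
```

These three calls anticipate the first three scenarios discussed in the next subsection.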

Mapping Technical Details to the Analogy

Now, let's perfectly combine the coach's behavior with this formula.

  • Scenario 1: Confident and Correct Critical Step

    • Climbing Action: You successfully reach the summit (A_hat_t = +1). Along the way, you make a very decisive move, grasping a large and solid handhold (current step low entropy).

    • Coach's Feedback (Formula Calculation): f(H_t) is greater than 1 due to low entropy, so the first part of the final score f(H_t) * A_hat_t is an amplified positive number.

    • Coach's Words: "Excellent! Not only did you succeed, but this step was firm and precise, a key to your success! We must firmly remember this action!"

    • Effect: Greatly reinforces this "confident and correct" behavior.

  • Scenario 2: Confident but Wrong Disastrous Step

    • Climbing Action: You fall down (A_hat_t = -1). The reason is that you confidently reached for a seemingly solid handhold, but it came loose (current step low entropy, but led to a bad outcome).

    • Coach's Feedback (Formula Calculation): f(H_t) is greater than 1 due to low entropy, so the first part of the final score f(H_t) * A_hat_t is an amplified negative number.

    • Coach's Words: "This was a serious mistake! You were too confident in choosing a wrong path, which directly led to failure. We must deeply reflect and never make this mistake again!"

    • Effect: Greatly penalizes this "blindly confident" error, which the paper refers to as "hallucinated confidence."

  • Scenario 3: Uncertain but Lucky Exploration

    • Climbing Action: You successfully reach the summit (A_hat_t = +1). But at one point, you hesitated greatly over several slippery handholds, finally taking a risky leap and fortunately succeeding (current step high entropy).

    • Coach's Feedback (Formula Calculation): f(H_t) is less than 1 due to high entropy, so the first part of the final score f(H_t) * A_hat_t is a reduced positive number.

    • Coach's Words: "Although you succeeded, this step was too risky, with a lot of luck involved. We celebrate the success, but let's not learn this fluke as a standard operating procedure."

    • Effect: Avoids over-rewarding unstable exploratory behaviors, making the learning process more stable.

  • Scenario 4: Farsighted Planning

    • Climbing Action: You make a move that might be difficult itself, but it puts you in a very good position, with several excellent, clear handholds available for the next step (next step has low entropy).

    • Coach's Feedback (Formula Calculation): h_next generates a positive reward value because of the low entropy of the next step, so the final score adds this positive "future clarity bonus" lambda * (h_next - E[h_next]).

    • Coach's Words: "I like this step! It's not only safe, but more importantly, it paves the way for your next move, giving you a clear view of the upcoming route. This is thoughtful climbing!"

    • Effect: Encourages the agent to plan and seek sustainable, predictable solution paths.

Limitations of the Analogy: The rock climbing coach analogy is very intuitive, but it simplifies the calculation of entropy. In actual technology, "entropy" is calculated by analyzing the probability distribution of all possible tokens generated by the model; it is a precise mathematical quantity, not just a "feeling" of confidence.
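To make that last point concrete, here is a minimal sketch (assumed array shapes and helper names, not the paper's code) of how a step-level entropy H_t could be computed: take the model's token-level predictive distribution at every generated position of a "thought-action" step and average the per-token entropies.

```python
import numpy as np

def token_entropies(logits):
    """Entropy of the model's distribution at each generated position.
    `logits` is assumed to have shape (num_tokens_in_step, vocab_size)."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)

def step_entropy(logits):
    """H_t: the average token entropy over one 'thought-action' step."""
    return float(token_entropies(logits).mean())

# Toy example: a 5-token step over a 100-entry vocabulary
rng = np.random.default_rng(0)
fake_logits = rng.normal(size=(5, 100))
print("H_t =", step_entropy(fake_logits))
```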

Summary

Through the "smart rock climbing coach" analogy, we can summarize the core idea of EMPG as follows:

It is no longer just a "referee" who only looks at results, but a "coach" who delves into the process. It uses entropy as a stethoscope to diagnose the agent's "health status" (certainty) at each step, then uses the entropy-modulated advantage function as a tool to prescribe personalized reward and punishment formulas. This formula not only treats current "symptoms" (through the confidence multiplier) but also encourages "strengthening the body" (through the future clarity bonus), ultimately cultivating a more powerful and robust LLM agent.

Third Stage: Detailed Description of the Process Steps

The entire process can be viewed as a "refinement" of raw, coarse feedback signals, ultimately yielding fine-grained, step-specific learning signals.

Input: A batch (e.g., 16) of complete interaction records (called "trajectories") between the agent and the environment. Each trajectory contains a series of "thought-action" steps and the final task result (success or failure).

Output: Updated LLM agent model parameters.

The detailed process steps are as follows (a consolidated code sketch of the whole pipeline appears at the end of this stage):

  1. Step One: Collect Raw Data and Compute Initial Feedback

    • First, the algorithm iterates through the interaction records of this batch of 16 tasks.

    • For each task (trajectory), it checks the final result. If the task is successful, it assigns an initial, uniform positive "Advantage" value to all steps in that trajectory, e.g., +1. If the task fails, it assigns a uniform negative advantage value to all steps, e.g., -1.

    • So far, this is exactly the same as traditional, coarse feedback. We obtain a raw score that treats every step "equally."

  2. Step Two: [First Pass] Calculate Uncertainty for Each Step

    • Next, the algorithm performs the first pass of refinement. It sequentially examines every "thought-action" step within these 16 tasks.

    • For a specific step, such as the agent generating text like "Thought: I should click the 'Next Page' button. Action: Click 'Next Page'," the algorithm calculates the average entropy of generating this text. Entropy is derived from the model's probability distribution over each generated token. If the model is very certain about the token to generate at each position (the probabilities are highly concentrated), the average entropy for the step is low; conversely, if the model hesitates, the entropy is high.

    • The algorithm collects all step entropy values H_t, forming a large list containing hundreds or even thousands of entropy values.

  3. Step Three: Compute "Modulation Tools"

    • With the list of all step entropy values, the algorithm now prepares two key "modulation tools": confidence multiplier f(H_t) and future clarity bonus h_next.

    • Normalized Entropy: The algorithm first performs "min-max normalization" on all collected entropy values, scaling them to the range of 0 to 1. This ensures that regardless of the model's overall confidence level, subsequent calculations have a consistent scale.

    • Calculate Confidence Multiplier: Using the normalized entropy, the algorithm calculates the corresponding confidence multiplier f(H_t) for each step. According to the f(H_t) formula, low-entropy steps receive a multiplier greater than 1, while high-entropy steps receive a multiplier less than 1. Note that f(H_t) also undergoes "self-calibration," meaning that over an entire batch, the average of all confidence multipliers is exactly 1. This prevents the learning signal from being globally amplified or reduced, merely reallocating it among steps.

    • Calculate Future Clarity Bonus: Similarly, the algorithm also calculates a potential future clarity bonus h_next for each step. This value will be used by its "previous step" in the next pass.

  4. Step Four: [Second Pass] Apply Modulation, Generate Refined Feedback

    • Now, the algorithm performs the second, and most crucial, pass. It again examines all steps one by one, this time with the goal of updating each step's "advantage value."

    • For the t-th step, the algorithm performs three key operations: (1) it retrieves the original uniform advantage value (e.g., +1 or -1) obtained in Step One; (2) it looks up that step's confidence multiplier and multiplies the two, f(H_t) * A_hat_t, so the advantage is now adjusted by the current step's confidence; (3) if step t+1 exists, it retrieves step t+1's future clarity bonus, weights it by lambda, and adds it to the t-th step's advantage value.

    • After this process, the original +1 or -1 shared by all steps is now transformed into a unique, refined new advantage value A_t^{EMPG} for each step. This value simultaneously incorporates the confidence assessment of the current step and consideration for future planning.

  5. Step Five: Final Processing and Model Update

    • Centering Treatment: To further stabilize training, the algorithm computes the average of all A_t^{EMPG} values in the batch and subtracts this average from each A_t^{EMPG}. The final advantage values are therefore centered around zero (some positive, some negative), a standard variance-reduction technique.

    • Execute Policy Update: Finally, the algorithm uses these refined final advantage values as learning signals to update the LLM agent's model parameters via the policy gradient algorithm. Behaviors at steps with large positive advantage values are strongly encouraged; behaviors at steps with large negative advantage values are strongly suppressed.

At this point, a complete EMPG training iteration is finished. Through this process, the agent no longer learns blindly based on ultimate success or failure but receives guidance from a "smart coach" who can observe the process, evaluate confidence, and encourage long-term planning.
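To tie the five steps together, here is a consolidated sketch of the batch-level pipeline, from uniform trajectory outcomes to refined per-step advantages. Shapes, hyperparameter values (alpha, beta, lambda), and helper names are assumptions for illustration; the min-max normalization, self-calibration, future clarity bonus, and final centering follow the description above rather than the authors' released code.

```python
import numpy as np

def empg_advantages(step_entropies, outcomes, alpha=1.0, beta=1.0, lam=0.3):
    """Refined per-step advantages for one batch of trajectories.

    step_entropies: list of 1-D arrays, one per trajectory (entropy H_t of each step)
    outcomes:       list of floats, +1.0 for task success / -1.0 for failure
    """
    # Step One: uniform raw advantage (+1 / -1) broadcast to every step of a trajectory
    raw_adv = [np.full(len(h), float(r)) for h, r in zip(step_entropies, outcomes)]

    # Steps Two & Three: min-max normalize all step entropies over the batch to [0, 1]
    all_h = np.concatenate(step_entropies)
    h_norm = [(h - all_h.min()) / (all_h.max() - all_h.min() + 1e-8)
              for h in step_entropies]

    # Confidence multipliers with self-calibration: batch mean rescaled to exactly 1
    f_raw = [np.exp(-alpha * h) for h in h_norm]
    f_mean = np.concatenate(f_raw).mean()
    f = [x / f_mean for x in f_raw]

    # Future clarity bonus for every step (consumed by the *previous* step)
    h_bonus = [np.exp(-beta * h) for h in h_norm]
    bonus_mean = np.concatenate(h_bonus).mean()

    # Step Four: modulate each step's advantage
    refined = []
    for adv, mult, bonus in zip(raw_adv, f, h_bonus):
        a = mult * adv
        a[:-1] += lam * (bonus[1:] - bonus_mean)   # add the bonus only where step t+1 exists
        refined.append(a)

    # Step Five: center all advantages in the batch around zero
    batch_mean = np.concatenate(refined).mean()
    return [a - batch_mean for a in refined]

# Toy usage: two trajectories, one successful and one failed
entropies = [np.array([0.2, 1.5, 0.4]), np.array([1.1, 0.3])]
print(empg_advantages(entropies, outcomes=[+1.0, -1.0]))
```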

Fourth Stage: Experimental Design and Verification Analysis

Interpretation of Main Experimental Design: Verification of Core Argument

  • Core Claim: EMPG, through intelligent, uncertainty-based credit assignment, can significantly improve the performance of LLM agents on long-horizon, sparse reward tasks and overcome the performance bottleneck of existing methods.

  • Experimental Design: To verify this claim, the authors adopted a very direct and convincing design: applying EMPG as an "enhancement module" directly on top of two current powerful baseline methods (GRPO and DAPO). Experiments were conducted on three recognized and challenging agent tasks.

  • Analysis of Design Rationality:

    • Datasets: WebShop (a simulated online shopping website with complex tasks that require following instructions, browsing pages, and extracting information, a gold standard for testing long-horizon decision-making), ALFWorld (a text-based virtual home environment combining instruction following and common-sense reasoning, testing the agent's understanding and planning abilities), and Deep Search (multi-step information retrieval and integration tasks, further split into In-domain (ID) and Out-of-domain (OOD) parts, which is crucial for verifying generalization). Together these cover typical agent scenarios such as web navigation, embodied interaction, and information retrieval, and all are recognized benchmarks in the field with sufficient challenge and representativeness.

    • Evaluation Metrics: The main metrics are Success Rate and Task Score. For these clearly defined tasks, success rate is the most direct and fair indicator of whether the agent can complete the task.

    • Baseline Methods: The comparative methods are GRPO (Group Relative Policy Optimization) and DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization). Both are methods that have recently shown excellent performance in the field of LLM reinforcement learning. The authors did not choose a weak baseline to "bully" but rather chose to improve upon strong ones. This "strong-on-strong" design makes any performance improvement more likely attributable to the EMPG module itself, rather than due to a weak baseline.

  • Main Experimental Results and Conclusions:

    • Experimental Results—as shown in Tables 1 and 2 (in the original paper), performance showed consistent and significant improvement across almost all tasks, model sizes (from 1.5B to 32B), and baseline combinations when the EMPG module was added. For example, on ALFWorld, the Qwen2.5-7B model combined with DAPO saw its success rate increase from 90.0% to 91.6%; on the more challenging WebShop, the success rate increased from 79.6% to 82.7%.

    • Conclusion—The main experiments strongly demonstrated the effectiveness and universality of EMPG. It is not a "one-off solution" that only works under specific conditions, but a reliable performance enhancer that can be widely applied to different policy optimization algorithms.

Ablation Study Analysis: Contributions of Internal Components

  • Ablation Study Design: To understand the roles played by EMPG's two core components—"Gradient Scaling" and "Future Bonus"—the authors conducted an ablation study on the Deep Search task (see the lower half of Table 2 in the original paper). They tested: ① using only gradient scaling; ② using only future bonus; ③ using both (i.e., complete EMPG).

  • Correspondence of Components to Innovations:

    • Removing "Future Bonus" and retaining only "Gradient Scaling" aims to verify the effectiveness of the innovation "adjusting feedback based on current step confidence."

    • Removing "Gradient Scaling" and retaining only "Future Bonus" aims to verify the effectiveness of the innovation "encouraging the agent to find a clear next step."

  • Experimental Results and Conclusions:

    • Using only Gradient Scaling: Model performance improved, with the most significant gains on OOD (Out-of-domain) tasks. This indicates that by decaying updates for uncertain steps, this mechanism taught the model to be more "prudent" when facing unknown situations, thereby enhancing the model's generalization ability and robustness.

    • Using only Future Bonus: Model performance also improved, performing particularly well on ID (In-domain) tasks. This shows that by rewarding predictable paths, this mechanism helped the model better learn and exploit known successful patterns in the training data.

    • Complete EMPG: Achieved the largest performance improvement, surpassing any single component.

    • Conclusion—The ablation study clearly revealed the complementarity of the two components. Gradient scaling acts like a "regularizer," responsible for exploration and generalization; the future bonus acts like an "accelerator," responsible for exploitation and mastery. The combination of the two achieves a delicate balance between exploration and exploitation, proving the completeness and synergistic effect of the EMPG design.

In-Depth/Innovative Experiment Analysis: Insights into the Method's Intrinsic Characteristics

In addition to proving "I can do it" and "all my parts are useful," the authors also designed two very clever experiments to answer "why I can do it" and "why my design is this way and not that way."

  • Experiment 1: Training Stability Analysis (KL Loss Dynamics, Figure 2 in the original paper)

    • Experimental Purpose: To intuitively demonstrate that EMPG can improve training process stability, preventing "policy collapse" phenomena in later training stages.

    • Experimental Design: The authors tracked and plotted the change curve of KL loss during training. KL loss measures the magnitude of policy changes before and after each model update. A stable, healthy training process should have a smooth KL loss that remains at a low level. Intense, frequent spikes indicate that the model is making very aggressive and unstable updates.

    • Experimental Conclusion: Figure 2 shows that the baseline DAPO model experienced drastic KL loss spikes in later training stages, indicating its policy became extremely unstable. In contrast, the EMPG-enhanced model's KL loss curve remained very smooth throughout. This strongly proves that the "self-calibrating gradient scaling" mechanism in EMPG (especially the update decay for high-entropy steps) played an effective regularization role, acting like a "stabilizer" to ensure the agent robustly converges to a high-performance policy.

  • Experiment 2: Step Entropy vs. Token Entropy Dynamics Analysis (Figure 3 in the original paper)

    • Experimental Purpose: To provide a theoretical basis for one of the paper's core design choices—calculating and using entropy at the step level of "thought-action" rather than the finer-grained token level.

    • Experimental Design: This design is very clever. The authors grouped all "steps" into quantile buckets by their initial entropy (e.g., the lowest 0-5%, then 5-10%, and so on). They then measured how much the average step entropy of each bucket changed after one round of RL updates. If the hypothesis "low-entropy steps do not need updates" held, the entropy change for the low-entropy buckets should be close to zero (a sketch of this bucketing analysis appears after this list).

    • Experimental Conclusion: Figure 3's results are surprising but significant: even steps with very low initial entropy (e.g., 15-20% quantile) showed significant changes in their entropy values after learning updates. This disproved the simple assumption that "confident steps = already learned steps." It indicates that a step that currently appears very certain might still not be optimal and requires adjustment. This finding eloquently demonstrates that one cannot simply focus only on high-entropy parts but must, like EMPG, dynamically modulate steps across the entire entropy spectrum, which is the fundamental reason for EMPG's "step-level" design.
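For completeness, here is a rough sketch of how such a bucketing analysis could be reproduced from logged training data (variable names and the toy data are assumptions; the paper's figure is computed from its own training runs): sort steps by their pre-update entropy, split them into quantile buckets, and report the mean entropy change per bucket.

```python
import numpy as np

def entropy_shift_by_quantile(entropy_before, entropy_after, num_buckets=20):
    """Mean change in step entropy per pre-update entropy quantile bucket
    (0-5%, 5-10%, ... when num_buckets=20)."""
    order = np.argsort(entropy_before)
    buckets = np.array_split(order, num_buckets)
    return [float((entropy_after[idx] - entropy_before[idx]).mean()) for idx in buckets]

# Toy stand-in for logged step entropies before/after one RL update
rng = np.random.default_rng(1)
before = rng.uniform(0.0, 2.0, size=1000)
after = before + rng.normal(0.0, 0.3, size=1000)
print(entropy_shift_by_quantile(before, after))
```

If the "confident steps are already learned" hypothesis were true, the entries for the lowest buckets would be near zero; the paper's Figure 3 shows they are not.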


    Paper Title: Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents

