DeepSeek-GRPO Importance Weight Design Flaw? Explaining Qwen3's New Reinforcement Learning Algorithm GSPO

Original source: https://zhuanlan.zhihu.com/p/1932829167801574272

1. Introduction

When GRPO is used to train larger language models, training instability can occur [1][2]. In the paper, the authors argue that this phenomenon stems from a design flaw in GRPO's importance weights: the token-level importance weights introduce high-variance noise, which is amplified by growing response lengths and the clipping mechanism, and ultimately leads to training collapse.

To address this issue, the paper proposes GSPO (Group Sequence Policy Optimization), which replaces token-level importance weights with sequence-level ones and performs clipping and optimization along the sequence dimension rather than the token dimension, in line with how the reward itself is defined.

Ultimately, GSPO resolves the stability issues of RL training for MoE models, eliminating the need for separate, complex tricks to maintain stability and simplifying the RL infrastructure.

2. Motivation

During the RL phase, we first sample a large rollout batch. To make better use of these samples, the batch is typically split into several mini-batches, each used for one gradient update. This inevitably makes the later updates off-policy, which also explains, to some extent, why PPO and GRPO rely on clipping to keep excessively off-policy samples out of the gradient computation.

Although mechanisms such as clipping mitigate the problems caused by off-policy data, the paper argues that GRPO misapplies importance weighting in the first place.

Importance Sampling. Importance sampling estimates the expectation of a function under a target distribution by weighting samples drawn from a behavior distribution:

$$\mathbb{E}_{z\sim\pi_{\mathrm{tar}}}\big[f(z)\big] = \mathbb{E}_{z\sim\pi_{\mathrm{beh}}}\left[\frac{\pi_{\mathrm{tar}}(z)}{\pi_{\mathrm{beh}}(z)}\, f(z)\right] \approx \frac{1}{N}\sum_{n=1}^{N}\frac{\pi_{\mathrm{tar}}(z_n)}{\pi_{\mathrm{beh}}(z_n)}\, f(z_n), \qquad z_n\sim\pi_{\mathrm{beh}}$$

This requires sampling multiple examples from the behavior distribution, not just one.
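
As a quick numerical illustration (a minimal sketch, not from the paper; the Gaussian target/behavior distributions and the function $f(z)=z^2$ are arbitrary choices), the importance-weighted average over many behavior samples recovers the target expectation, whereas a single weighted sample tells us almost nothing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary example: target N(1, 1), behavior N(0, 1), f(z) = z^2.
mu_tar, mu_beh, sigma = 1.0, 0.0, 1.0

def log_pdf(z, mu):
    return -0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

z = rng.normal(mu_beh, sigma, size=100_000)            # samples from the BEHAVIOR distribution
w = np.exp(log_pdf(z, mu_tar) - log_pdf(z, mu_beh))    # importance weights pi_tar(z) / pi_beh(z)

print(np.mean(w * z**2))          # ~2.0, the true E_tar[z^2] = mu_tar^2 + sigma^2
print(w[0] * z[0]**2)             # a single weighted sample is essentially arbitrary
```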

However, the importance weight in GRPO is defined at the token level as $w_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t}\mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t}\mid x, y_{i,<t})}$, i.e., the correction is performed on the distribution $\pi_{\theta_{\mathrm{old}}}(\cdot\mid x, y_{i,<t})$. Evidently, in the current setting this distribution has only a single sample (namely $y_{i,t}$), which violates the expectation-form definition of importance sampling.

If importance weights were instead applied to $\pi_{\theta_{\mathrm{old}}}(\cdot\mid x)$, i.e., if the optimization problem were considered at the sequence level, GRPO would at least have a group of samples drawn from the same distribution behind each set of importance weights. Moreover, the sequence-level view better matches the reward design, since rewards are generally assigned to the entire response.

3. Algorithm

When the problem is considered at the sequence level, the importance-sampling form of the RL objective is:

$$J(\theta) = \mathbb{E}_{x\sim D,\; y\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid x)}\left[\frac{\pi_\theta(y\mid x)}{\pi_{\theta_{\mathrm{old}}}(y\mid x)}\, r(x, y)\right]$$

This naturally aligns with the sequence-level reward definition and clarifies the meaning of the clipping mechanism (filtering out gradients from excessively off-policy sequences).
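
In code, the sequence-level ratio is just the exponential of the summed per-token log-probability differences. A minimal PyTorch sketch (the hard-coded log-probabilities are placeholders for values gathered from the two policies' outputs):

```python
import torch

# Per-token log-probs of one sampled response y under the current and old policies,
# i.e. log pi(y_t | x, y_<t) gathered at the sampled token ids (placeholder values).
logp_new = torch.tensor([-1.2, -0.8, -2.1, -0.5])
logp_old = torch.tensor([-1.3, -0.9, -2.0, -0.6])

seq_ratio = torch.exp((logp_new - logp_old).sum())   # pi_theta(y|x) / pi_theta_old(y|x)
print(seq_ratio)
```

Note that this raw ratio is a product of $|y|$ per-token ratios, so its scale swings sharply with response length; the GSPO objective below therefore length-normalizes it.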

Based on these observations, the paper proposes the GSPO algorithm, which uses the following objective:

$$J_{\mathrm{GSPO}}(\theta) = \mathbb{E}_{x\sim D,\; \{y_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid x)}\left[\frac{1}{G}\sum_{i=1}^{G}\min\Big(s_i(\theta)\,\hat{A}_i,\; \mathrm{clip}\big(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_i\Big)\right]$$

Advantages are estimated in a group-relative manner from the rewards of the $G$ responses in the group:

$$\hat{A}_i = \frac{r(x, y_i) - \mathrm{mean}\big(\{r(x, y_j)\}_{j=1}^{G}\big)}{\mathrm{std}\big(\{r(x, y_j)\}_{j=1}^{G}\big)}$$
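
A minimal sketch of this group-relative normalization (the reward values are placeholders):

```python
import torch

rewards = torch.tensor([0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0])   # r(x, y_i) for a group of G = 8 responses
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)    # hat{A}_i, one scalar per response
```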

The importance weight is defined via the length-normalized (per-token geometric mean) sequence likelihood ratio:

$$s_i(\theta) = \left(\frac{\pi_\theta(y_i\mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i\mid x)}\right)^{\frac{1}{|y_i|}} = \exp\left(\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\log\frac{\pi_\theta(y_{i,t}\mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t}\mid x, y_{i,<t})}\right)$$
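
Putting the pieces together, the following is a minimal sketch of the GSPO objective for one group of responses, assuming padded `[G, T]` tensors of per-token log-probabilities and a response mask; tensor names, shapes, and the clipping value are illustrative assumptions, not the authors' implementation.

```python
import torch

def gspo_loss(logp_new, logp_old, mask, advantages, eps=0.2):
    """Sequence-level clipped objective, negated so it can be minimized.

    logp_new, logp_old: [G, T] per-token log-probs under pi_theta and pi_theta_old
    mask:               [G, T] 1.0 for response tokens, 0.0 for padding
    advantages:         [G]    group-relative advantages hat{A}_i
    eps:                clipping range (illustrative value; a tunable hyperparameter)
    """
    lengths = mask.sum(dim=-1)                                    # |y_i|
    # Length-normalized sequence ratio s_i(theta), computed in log space for stability.
    log_s = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    s = torch.exp(log_s)

    unclipped = s * advantages
    clipped = torch.clamp(s, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage with random placeholders standing in for real model outputs.
G, T = 4, 6
logp_old = -torch.rand(G, T)
logp_new = (logp_old + 0.05 * torch.randn(G, T)).requires_grad_()
mask = torch.ones(G, T)
adv = torch.tensor([1.2, -0.3, 0.5, -1.4])

loss = gspo_loss(logp_new, logp_old, mask, adv)
loss.backward()                                                   # gradients flow through logp_new only
```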

4. Analysis

We compare the gradients of the GSPO and GRPO objectives (the clipping mechanism is omitted for clarity). The gradient of GSPO is:

$$\nabla_\theta J_{\mathrm{GSPO}}(\theta) = \mathbb{E}_{x\sim D,\; \{y_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid x)}\left[\frac{1}{G}\sum_{i=1}^{G} s_i(\theta)\,\hat{A}_i\cdot\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\nabla_\theta\log\pi_\theta(y_{i,t}\mid x, y_{i,<t})\right]$$

The gradient of GRPO is:

$$\nabla_\theta J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{x\sim D,\; \{y_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid x)}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y_i|}\sum_{t=1}^{|y_i|} w_{i,t}(\theta)\,\hat{A}_i\,\nabla_\theta\log\pi_\theta(y_{i,t}\mid x, y_{i,<t})\right]$$

The main difference between the two lies in how weights are assigned to the gradients of the token log-likelihoods. GRPO gives each token its own importance weight $w_{i,t}(\theta)$, computed from the distribution $\pi_{\theta_{\mathrm{old}}}(\cdot\mid x, y_{i,<t})$. These correction distributions differ from token to token, so clipping them all against a single threshold seems inappropriate. In contrast, GSPO assigns the same sequence-level importance weight $s_i(\theta)$ to every token of a response.

4.1 Token-level Variant

In certain scenarios (e.g., multi-turn RL), we may still want fine-grained, token-level adjustment of advantages. For this purpose, the paper proposes a variant of GSPO, GSPO-token, whose gradients are consistent with GSPO.

$$J_{\mathrm{GSPO\text{-}token}}(\theta) = \mathbb{E}_{x\sim D,\; \{y_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid x)}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\min\Big(s_{i,t}(\theta)\,\hat{A}_{i,t},\; \mathrm{clip}\big(s_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_{i,t}\Big)\right]$$

where

$$s_{i,t}(\theta) = \mathrm{sg}\big[s_i(\theta)\big]\cdot\frac{\pi_\theta(y_{i,t}\mid x, y_{i,<t})}{\mathrm{sg}\big[\pi_\theta(y_{i,t}\mid x, y_{i,<t})\big]}$$

Here, $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator: its argument participates in the computation as a constant. When $\hat{A}_{i,t} = \hat{A}_i$ for all $t$, GSPO-token is numerically identical to GSPO in the optimization objective, the clipping condition, and the theoretical gradient.
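
The stop-gradient $\mathrm{sg}[\cdot]$ maps directly to `detach()` in PyTorch. A minimal sketch (names and shapes are illustrative) of how $s_{i,t}(\theta)$ takes the numerical value of $s_i(\theta)$ while routing the gradient through each token's own likelihood:

```python
import torch

def gspo_token_ratios(logp_new, logp_old, mask):
    """Per-token ratios s_{i,t}: numerically equal to s_i, gradient flows through token t only.

    logp_new, logp_old: [G, T] per-token log-probs; mask: [G, T] response-token mask.
    """
    lengths = mask.sum(dim=-1, keepdim=True)
    log_s = ((logp_new - logp_old) * mask).sum(dim=-1, keepdim=True) / lengths
    s = torch.exp(log_s)                                  # s_i(theta), shape [G, 1]

    p_t = torch.exp(logp_new)                             # pi_theta(y_{i,t} | x, y_{i,<t})
    # sg[s_i] * pi_theta(y_{i,t}|.) / sg[pi_theta(y_{i,t}|.)]
    # value: s_i   gradient: s_i * grad log pi_theta(y_{i,t}|.)
    return s.detach() * p_t / p_t.detach()
```

Each $s_{i,t}$ evaluates to $s_i$; the construction only makes a difference when the token-level advantages $\hat{A}_{i,t}$ actually vary within a response.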

5. Experiments

5.1 Experimental Results

Starting from a cold-start model obtained by SFT on Qwen3-30B-A3B-Base, the authors trained with GSPO (without the Routing Replay strategy) and with GRPO + Routing Replay. The figure below shows the training reward curves and performance on AIME'24, LiveCodeBench, and CodeForces (Elo rating).

[Figure: training reward curves and benchmark performance (AIME'24, LiveCodeBench, CodeForces Elo) for GSPO vs. GRPO + Routing Replay]

Compared to GRPO, GSPO exhibits higher training efficiency on Qwen3.

5.2 Observations on Clipping Ratio

GSPO clips at the level of entire responses, while GRPO clips individual, excessively off-policy tokens. The figure below shows the fraction of clipped tokens for both experiments during training:

[Figure: fraction of clipped tokens during training for GSPO and GRPO]

GSPO clips more tokens but achieves higher training efficiency, suggesting that GSPO may provide more reliable and efficient learning signals than GRPO.

5.3 Effects in MoE Training

Background. When MoE models are trained with GRPO, volatility of expert activations can prevent RL training from converging. After a single gradient update, the set of activated experts can change significantly even for the same response. This volatility further amplifies the fluctuation of token-level importance weights which, as discussed above, ultimately causes the model to collapse.

Previous Method. With the Routing Replay strategy, the experts activated during rollout are cached; when computing importance weights, the cached routing decisions are replayed. This ensures that, for each token, the numerator and denominator of the importance weight are computed with the same activated sub-network. The figure below illustrates the benefit of this strategy:

[Figure: effect of the Routing Replay strategy on GRPO training of MoE models]

Effect of GSPO. However, this heuristic incurs additional memory and communication overhead and can also limit the actual capacity of the MoE model. Empirically, GSPO trains stably and more efficiently than GRPO + Routing Replay; theoretically, even when expert routing fluctuates, the MoE model still behaves as a capable language model, so the sequence likelihood it produces is far more stable than the likelihoods of individual tokens.
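
The last point can be illustrated with a toy simulation (purely illustrative; the independent Gaussian jitter on per-token log-ratios is an assumed noise model, not the paper's analysis): perturbing every token's log-ratio barely moves the length-normalized sequence ratio, while individual token ratios swing noticeably.

```python
import torch

torch.manual_seed(0)
T = 2048                                        # response length
noise = 0.1 * torch.randn(1000, T)              # per-token log-ratio jitter, e.g. from changed expert routing

token_ratio = torch.exp(noise)                  # GRPO-style per-token weights w_{i,t}
seq_ratio = torch.exp(noise.mean(dim=-1))       # GSPO-style length-normalized weight s_i

print(token_ratio.std().item())                 # roughly 0.1: individual tokens fluctuate visibly
print(seq_ratio.std().item())                   # roughly 0.002: averaging over the sequence cancels the noise
```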

References

1. Qwen Team. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388, 2025.

2. MiniMax. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. arXiv preprint arXiv:2506.13585, 2025.
