Reinforcement learning (RL) has shown great potential for improving the reasoning capabilities of large language models (LLMs). Models such as DeepSeek R1, Kimi K1.5, and Qwen 3 demonstrate how effective RL can be at enhancing LLMs' complex reasoning abilities.
However, effective reinforcement learning first requires solving a fundamental challenge: the credit assignment problem. In the context of large language models, how should the final evaluation of an entire sequence (the LLM's response) be attributed to the specific decisions (tokens) within that sequence?
The difficulty lies in the sparsity of the reward signal: explicit success or failure feedback arrives only at the end of the sequence.
Current Main Approaches
In reinforcement learning, advantage estimation methods are typically used to solve the credit assignment problem. Currently, reinforcement learning methods for large language models are mainly divided into two categories, differing in the granularity of advantage estimation.
Coarse-grained trajectory-level methods, such as the GRPO algorithm used by DeepSeek R1, compute a single advantage value for the entire sequence based only on the final reward. While efficient, this feedback is overly coarse: the model cannot be rewarded for the correct parts of an incorrect answer, nor penalized for the redundant parts of a correct answer.
At the other extreme are fine-grained token-level methods, such as classic PPO. These estimate an advantage value for every token, which requires an additional critic model to predict the state value (V-value) at each position. However, in RL for large language models, the trajectory distributions induced by different prompts vary greatly, and only a limited number of responses are sampled per prompt during training. This makes the critic hard to train effectively and leads to large errors in token-level advantage estimation.
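To make the contrast concrete, here is a minimal Python sketch (illustrative code, not the authors' implementation) of how a trajectory-level method such as GRPO assigns one group-normalized advantage to every token of a response, whereas a token-level method such as PPO would need a learned critic at every position.

```python
import numpy as np

def trajectory_level_advantages(rewards, response_lengths):
    """GRPO-style trajectory-level advantage (illustrative sketch).

    Each response sampled for the same prompt gets a single advantage:
    its final reward normalized against the group of sampled responses.
    That one value is then broadcast to every token of the response.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Every token of a response shares the same coarse advantage value.
    return [np.full(length, a) for a, length in zip(adv, response_lengths)]

# Example: four responses sampled for one prompt, only terminal rewards known.
token_advs = trajectory_level_advantages(
    rewards=[1.0, 0.0, 0.0, 1.0],
    response_lengths=[12, 30, 7, 18],
)
print([a[0] for a in token_advs])
# A token-level method (e.g., PPO) would instead query a learned critic V(s_t)
# at every token position, which is hard to fit from few samples per prompt.
```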
New SPO Framework
To overcome this bottleneck, a research team from the Institute of Software, Chinese Academy of Sciences, and City University of Hong Kong proposed the Segment Policy Optimization (SPO) framework.
Paper Title: Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models
Authors: Yiran Guo, Lijie Xu, Jie Liu, Dan Ye, Shuang Qiu
Link: https://arxiv.org/abs/2505.23564
Code Link: https://github.com/AIFrameResearch/SPO
SPO uses a medium-grained segment-level advantage estimation method. Unlike trajectory-level methods that only calculate advantage at the final step, and unlike token-level methods that calculate advantage at every step, SPO divides the generated sequence into several contiguous segments and calculates the advantage value for each segment.
This segment-level advantage estimation method has several significant advantages:
(1) Better Credit Assignment: Compared to trajectory-level methods, segment-level methods provide more localized advantage feedback, allowing the model to reward valuable parts within incorrect answers and penalize redundant or ineffective segments within correct answers.
(2) More Accurate Advantage Estimation: Compared to token-level methods, segment-level methods require fewer estimation points, thus enabling effective use of Monte Carlo (MC) sampling to obtain more accurate and unbiased advantage estimates, without needing an additional, unstable critic model.
(3) More Flexible and Adjustable: The segment division method can be arbitrarily defined and does not require semantic completeness, allowing for flexible adjustment of granularity between token-level and trajectory-level, adapting to different tasks and application scenarios.
The SPO framework mainly consists of three core components: (1) Flexible segment partitioning strategy; (2) Segment-level advantage estimation based on Monte Carlo sampling; (3) Policy optimization using segment-level advantages.
This modular design provides high flexibility to the framework, allowing different components to have different implementation strategies to suit various application scenarios.
The team further proposed two instances of the SPO framework for different reasoning scenarios: for short chain-of-thought (CoT) scenarios, SPO-chain, which uses cutpoint-based segment partitioning and chain-based advantage estimation; and for long CoT scenarios, SPO-tree, which uses a tree-structured advantage estimation method that greatly improves MC sampling efficiency.
Additionally, the team proposed a token probability-mask policy optimization method, which computes the loss only for the low-probability tokens within a segment rather than for all tokens. The authors argue that these tokens are where the model's reasoning trajectory is most likely to branch, and that they are the main contributors to the segment-level advantage. The technique can be used with both SPO-chain and SPO-tree, further sharpening credit assignment.
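As a rough sketch of how such a mask might be constructed (the threshold value and all names below are illustrative assumptions, not the paper's choices):

```python
import numpy as np

def low_prob_token_mask(token_probs, threshold=0.5):
    """Select 'critical' low-probability tokens inside a segment.

    Tokens whose sampled probability falls below `threshold` are treated as
    points where the reasoning trajectory could have branched; only they
    receive the segment-level advantage in the policy loss.
    """
    return np.asarray(token_probs) < threshold

# One segment's per-token probabilities under the sampling policy (made up).
probs = [0.98, 0.42, 0.91, 0.15, 0.99]
mask = low_prob_token_mask(probs)        # [False, True, False, True, False]
segment_advantage = 0.7                  # advantage estimated for this segment
token_advantages = np.where(mask, segment_advantage, 0.0)
print(token_advantages)                  # [0.  0.7 0.  0.7 0. ]
```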
Framework and Core Technologies
The SPO framework is primarily designed around the following three challenging problems: (1) How to partition the generated sequence into multiple segments? (2) How to accurately and efficiently estimate the advantage value for each segment? (3) How to use segment-level advantages to update the policy? SPO's three core modules address these three questions, with each module containing multiple optional strategies to suit different scenarios:
1. Segment Partition:
a) Cutpoint-based Partition: Designed for short chain-of-thought scenarios, this method places segment boundaries where the state value (V-value) is most likely to change. Boundaries are determined dynamically from token probabilities, prioritizing division at key points (cutpoints) where the model "hesitates" and might change its reasoning path, making credit assignment more precise. (In the paper's illustration, tokens marked in red are cutpoints and blue vertical bars indicate segment boundaries.)
b) Fixed Token Count Partition: Divides the sequence into segments of a fixed length, which makes it easy to organize the tree structure used for advantage estimation; designed for SPO-tree. A code sketch of both partition strategies is given below.
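The minimal Python sketch below illustrates both partition strategies; the probability threshold, segment length, and function names are illustrative assumptions rather than the paper's settings.

```python
def cutpoint_partition(token_probs, cutpoints_per_segment=5, threshold=0.9):
    """Cutpoint-based partition (sketch): treat tokens whose sampled probability
    falls below `threshold` as cutpoints where the model 'hesitates', and close
    a segment after every `cutpoints_per_segment` cutpoints.
    Returns segment boundaries as end indices (exclusive)."""
    boundaries, count = [], 0
    for i, p in enumerate(token_probs):
        if p < threshold:
            count += 1
            if count == cutpoints_per_segment:
                boundaries.append(i + 1)
                count = 0
    if not boundaries or boundaries[-1] != len(token_probs):
        boundaries.append(len(token_probs))  # final segment ends at the last token
    return boundaries

def fixed_token_partition(num_tokens, segment_len=64):
    """Fixed-token-count partition (sketch): equal-length segments, which makes
    it easy to arrange sampled continuations into a tree for SPO-tree."""
    return list(range(segment_len, num_tokens, segment_len)) + [num_tokens]

# Toy per-token probabilities under the sampling policy (made up for illustration).
probs = [0.99, 0.97, 0.62, 0.98, 0.41, 0.95, 0.99, 0.33, 0.88, 0.97,
         0.54, 0.99, 0.96, 0.27, 0.93]
print(cutpoint_partition(probs, cutpoints_per_segment=2))  # [5, 9, 14, 15]
print(fixed_token_partition(len(probs), segment_len=6))    # [6, 12, 15]
```

The `cutpoints_per_segment` parameter corresponds to the granularity knob (int2/int5/int100) examined in the experiments below.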
2. Segment Advantage Estimation:
a) Chain-based Method: In short chain-of-thought scenarios, MC sampling is cheap, so the team uses a direct segment-level advantage estimation: the state value (V-value) at each segment boundary is estimated independently via Monte Carlo rollouts, and each segment's advantage is then computed from those boundary values.
b) Tree-based Advantage Estimation: In long chain-of-thought scenarios, MC estimation is costly, so the team proposed an efficient tree-based method: sampled trajectories are organized into a tree, state values (V-values) are computed by aggregating rewards bottom-up, children of the same parent form a group, and segment advantages are computed within each group. Because the samples used for V-value estimation are simultaneously used for policy optimization, sample efficiency improves greatly. A sketch of both estimators is given below.
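As a hedged sketch of the two estimators (the notation is mine and may differ from the paper's; it assumes a sparse terminal reward and no discounting):

```latex
% Chain-based (sketch): estimate V at segment boundary s_k by averaging the
% terminal rewards of N independent rollouts continued from s_k; the segment
% advantage is the change in estimated value across the segment.
\hat{V}(s_k) = \frac{1}{N} \sum_{j=1}^{N} R(\tau_j), \quad \tau_j \sim \pi_\theta(\cdot \mid s_k),
\qquad
\hat{A}^{\text{chain}}_k = \hat{V}(s_{k+1}) - \hat{V}(s_k).

% Tree-based (sketch): leaf values are the observed rewards, internal values are
% aggregated bottom-up as the mean over children, and each child segment is
% scored relative to its parent, i.e. relative to the average of its siblings.
\hat{V}(\text{leaf}) = R(\text{leaf}),
\qquad
\hat{V}(v) = \frac{1}{|\mathrm{ch}(v)|} \sum_{c \in \mathrm{ch}(v)} \hat{V}(c),
\qquad
\hat{A}^{\text{tree}}_c = \hat{V}(c) - \hat{V}(\mathrm{parent}(c)).
```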
3. Policy Optimization Using Segment Advantages with Token Probability-mask:
After obtaining the segment-level advantages, to further improve credit assignment the team proposes token probability-mask policy optimization: during policy updates, a segment's advantage is assigned only to the low-probability (critical) tokens within that segment rather than to all tokens. This assigns reward or penalty more precisely to the key decision points, improving learning efficiency and effectiveness. The paper gives separate optimization objectives for SPO-chain and SPO-tree, which differ mainly in how the segment advantages are estimated; a sketch of how such a masked objective can be written follows.
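The following is a rough, non-authoritative sketch of a masked, segment-advantage objective of this kind, written as a standard PPO-style clipped surrogate; the paper's exact normalization and notation may differ.

```latex
% Sketch of a token-masked, segment-advantage objective (illustrative notation).
% M_k is the set of low-probability ("critical") token positions in segment k,
% r_t(\theta) is the usual importance ratio, and \hat{A}_k is the segment
% advantage from the chain- or tree-based estimator above.
J(\theta) = \mathbb{E}\!\left[
  \sum_{k} \frac{1}{|M_k|} \sum_{t \in M_k}
  \min\!\Big( r_t(\theta)\,\hat{A}_k,\;
  \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_k \Big)
\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(y_t \mid y_{<t}, x)}{\pi_{\theta_\mathrm{old}}(y_t \mid y_{<t}, x)}.
```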
Comparison with Baseline Methods
In the paper's short chain-of-thought experiments, using RhoMath1.1B as the base model and training on the GSM8K dataset, SPO-trained models achieved higher test-set accuracy than a range of baseline training algorithms.
For long chain-of-thought scenarios, using DeepSeek-R1-Distill-Qwen-1.5B as the base model and training on the MATH dataset, SPO achieved higher test-set accuracy than GRPO for the same training time.
The paper reports further comparisons for long chain-of-thought scenarios. Against models trained from the same base model (DeepSeek-R1-Distill-Qwen-1.5B) with the GRPO method (DeepScaleR, STILL-3), SPO-tree performs well across evaluations at various context lengths, even though it was trained only on the MATH dataset with a maximum context length of 4K. Notably, while DeepScaleR performs best in the 32K context-length evaluation, it performs worst at shorter context lengths (2K and 4K), even falling below the original base model. This suggests that GRPO training may not effectively optimize the model's token efficiency, producing more redundant output and therefore lower accuracy when the context length is limited.
Impact of Segment Granularity
Experiments showed that very fine granularity (int2, a segment every two cutpoints) provided only a minor improvement over medium granularity (int5), while overly coarse granularity (int100) led to a significant drop in accuracy relative to int5. This supports the effectiveness of SPO's medium-grained segment advantages.
Impact of Segment Partitioning Method
Experiments showed that in short chain-of-thought scenarios, the proposed cutpoint-based segment partitioning method performed best, superior to partitioning by newlines (VinePPO) and fixed token count partitioning (Fixed-token-count).
Token Probability Mask Ablation
Experiments showed that removing the token probability mask led to a decrease in SPO-chain accuracy. More notably, applying the token probability mask to GRPO resulted in a significant increase in its accuracy.
Impact of Different Tree Structures
Experiments showed that smaller tree structures achieved higher accuracy in the early stages of training, possibly because they cover more data samples faster. As training progressed, however, larger tree structures yielded better accuracy because they provide more accurate segment-level advantage estimates.
Summary
This work proposes an RL training framework, SPO, based on intermediate-granularity segment-level advantages, striking a better balance between token-level and trajectory-level approaches. It offers better credit assignment than trajectory-level methods and requires only a few advantage estimation points, allowing for effective and unbiased estimation using Monte Carlo methods without requiring an additional critic model.
The paper also presented two instances of SPO: SPO-chain designed for short chain-of-thought scenarios and SPO-tree designed for long chain-of-thought scenarios, experimentally demonstrating the effectiveness of the SPO framework and its two instances.