SRO Architecture Empowers Qwen-2.5-VL's Reasoning Capability, Boosting Performance by 16.8%


Reasoning models in the text domain have achieved great success, but extending similar reasoning capabilities to Multimodal Large Language Models (MLLMs) runs into several obstacles:

Insufficient Cold Start Initialization: The cold start phase of traditional multimodal models typically relies on simple visual and text pre-training datasets, which often fail to provide sufficient preparation for complex problem-solving. This initial deficiency severely hinders the activation of complex reasoning patterns in subsequent reinforcement learning stages.

Gradient Stagnation in Multimodal Reinforcement Learning: Standard Group Relative Policy Optimization (GRPO) algorithms suffer from gradient stagnation in multimodal RL, leading to unstable training and performance degradation.

Bottlenecks in Reasoning Improvement: After multimodal RL, gains in reasoning capability quickly plateau and are difficult to push further.


I. ReVisual-R1's Solution Approach

A three-stage training framework, Staged Reinforcement Optimization (SRO), is proposed. Specifically, it is divided into:

Cold Start Stage: Initialize with pure text data to establish basic language understanding capabilities.

Multimodal RL Stage: Train on multimodal samples from the GRAMMAR dataset, stabilizing optimization with Prioritized Advantage Distillation (PAD).

Text RL Stage: Fine-tune with pure text data to further enhance the model's language fluency and reasoning capabilities.

1.1 Cold Start Stage

Collect 40k pure text entries, focusing on establishing basic language understanding capabilities.

Train Qwen-2.5-VL-7B-Instruct with LLaMA Factory on this data to give the model basic reflective behavior and extended Chain-of-Thought (CoT) reasoning ability.
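To make the cold-start setup concrete, here is a minimal sketch, with hypothetical field names and file paths, of how text-only CoT entries could be packaged into the Alpaca-style JSON that LLaMA Factory's SFT stage reads; the actual prompts and data come from the paper's released dataset.

```python
import json

# Hypothetical in-memory examples standing in for the paper's 40k text entries.
cold_start = [
    {
        "question": "If 3x + 5 = 20, what is x?",
        "cot_answer": "3x = 20 - 5 = 15, so x = 5. The answer is 5.",
    },
    # ... ~40k entries in total
]

# Alpaca-style records, the format LLaMA Factory's SFT stage accepts.
records = [
    {"instruction": ex["question"], "input": "", "output": ex["cot_answer"]}
    for ex in cold_start
]

with open("cold_start_sft.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```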


1.2 Multimodal RL Stage


Extract 26k diverse multimodal samples from the GRAMMAR dataset, chosen so that the multimodal RL stage yields a genuine improvement in the model's reasoning capabilities.

Reinforcement Learning with the GRPO Algorithm

Sample Grouping: For each training question, sample a group of candidate responses, so that every group contains multiple rollouts of the same prompt.

Policy Optimization: Score each response and compute its advantage relative to the group's mean reward (instead of using a separate value model), then update the policy to improve performance on complex reasoning tasks.

Train with EasyR1, omitting the KL divergence constraint to encourage broader policy exploration; a sketch of these ingredients follows.
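For intuition, here is a minimal PyTorch sketch, not the authors' EasyR1 code, of the two GRPO ingredients described above: group-relative advantages and a clipped policy loss with the KL term dropped. Names and shapes are illustrative assumptions.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [num_prompts, group_size] scalar rewards, one row per prompt's
    group of sampled responses. Each response is normalized against its own group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_policy_loss(logprobs: torch.Tensor,
                     old_logprobs: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective over per-response log-probabilities;
    no KL penalty term, matching the setup described above."""
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage: 2 prompts, 4 sampled responses each.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```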

Principles of PAD Technology

PAD mitigates the gradient stagnation problem and improves training efficiency by computing each sample's absolute advantage, filtering out samples whose advantage is near zero according to a set threshold, and then performing prioritized sampling that favors the samples with larger advantage values during training.

For example: assume a batch of 10 samples with advantage values [0.1, 0.2, 0.3, 0.01, 0.02, 0.4, 0.5, 0.6, 0.001, 0.7]. With thresholds T_low = 0.1 and T_high = 0.6, the near-zero samples (0.001, 0.01, 0.02) and the out-of-range sample (0.7) are discarded, leaving [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]. Sampling then prioritizes the surviving samples with higher advantage values, as sketched below.
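The following Python sketch mirrors that worked example; the function name and interface are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def pad_select(advantages, t_low=0.1, t_high=0.6, k=4, rng=None):
    """Keep samples whose |advantage| lies in [t_low, t_high], then draw k of
    them with probability proportional to the advantage magnitude."""
    if rng is None:
        rng = np.random.default_rng(0)
    adv = np.abs(np.asarray(advantages, dtype=float))
    keep = np.where((adv >= t_low) & (adv <= t_high))[0]  # drop near-zero (and outlier) advantages
    if keep.size == 0:
        return keep
    probs = adv[keep] / adv[keep].sum()                   # prioritize larger advantages
    k = min(k, keep.size)
    return rng.choice(keep, size=k, replace=False, p=probs)

# The batch from the worked example above:
advs = [0.1, 0.2, 0.3, 0.01, 0.02, 0.4, 0.5, 0.6, 0.001, 0.7]
print(pad_select(advs))  # indices drawn from the six surviving samples
```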

Efficient Length Reward Function

This function adjusts the reward by calculating the deviation between the generated sequence's length and the target length. The closer the generated sequence is to the target length, the higher the reward. This mechanism effectively controls the length of generated responses, preventing overly long or short responses from negatively impacting the training process.

For example: assume a target length of 100 tokens, a generated sequence length of 120 tokens, a penalty factor α = 0.005, and a baseline reward δ = 0.5. Under the linear deviation penalty suggested by the description above (the paper gives the exact formula), the reward is r = δ − α · |L_gen − L_target| = 0.5 − 0.005 × 20 = 0.4.
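As a sanity check on those numbers, here is a tiny Python sketch of a deviation-based length reward; the linear form is an assumption inferred from the description, and the paper's exact function may differ.

```python
def length_reward(gen_len: int, target_len: int = 100,
                  alpha: float = 0.005, delta: float = 0.5) -> float:
    """Reward shrinks linearly as the generated length drifts from the target."""
    return delta - alpha * abs(gen_len - target_len)

print(length_reward(120))  # 0.5 - 0.005 * 20 = 0.4
```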

1.3 Text RL Stage

Collect 30k pure text entries, focusing on optimizing the model's language fluency and reasoning capabilities.

Train with EasyR1 for text RL, freezing the visual module so that optimization concentrates on text reasoning; this further improves the model's language fluency and reasoning capability beyond what multimodal RL alone achieves (the freezing step is sketched below).
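Below is a minimal sketch of how the visual module might be frozen before the text RL stage. The Hugging Face class and checkpoint names are real, but treating the vision encoder as the parameters whose names contain "visual" is an assumption about the checkpoint's module layout.

```python
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto"
)

frozen = 0
for name, param in model.named_parameters():
    # Assumption: the vision encoder's parameters carry "visual" in their names
    # (e.g. "visual.blocks.0..." or "model.visual..."); adjust if the layout differs.
    if name.startswith("visual.") or ".visual." in name:
        param.requires_grad_(False)
        frozen += 1

print(f"Froze {frozen} visual-encoder parameter tensors; the language model remains trainable.")
```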

Results


In MathVerse, MathVision, DynaMath, WeMath, LogicVista, AIME24, AIME25, GPQA, and MATH-500 benchmarks, ReVisual-R1 achieved an average performance of 53.1%, an improvement of 16.8 percentage points over previous open-source models.

ReVisual-R1 showed particularly significant performance improvements in challenging benchmarks like AIME24 and AIME25, reaching 44.6% and 15.4% respectively.

Summary

Although the multimodal RL stage is important, relying solely on multimodal RL can lead to "textual capability decay." The subsequent text RL stage can effectively mitigate this problem and further enhance the model's reasoning capabilities.

PAD effectively mitigates the gradient stagnation problem and improves training efficiency and model performance by filtering zero-advantage samples and prioritizing informative trajectories. It performs better than strategies using only GRPO baseline, only sample filtering, or random sampling.

The efficient length reward function keeps response length under control, preventing overly long or short responses from harming training, while maintaining stable reward accuracy and low policy entropy, which improves the model's stability and final performance.

https://huggingface.co/csfufu/Revisual-R1-final

https://arxiv.org/pdf/2506.04207

Main Tag: Reinforcement Learning

Sub Tags: Multimodal LLM, Inference Improvement, Staged Optimization, Qwen-2.5-VL

