The First Dedicated Multimodal Slow-Thinking Framework! Outperforms GPT-o1 by Nearly 7 Percentage Points as Reinforcement Learning Teaches VLMs to "Think Twice"


The MLNLP community is a well-known machine learning and natural language processing community in China and abroad, reaching NLP graduate students, university professors, and corporate researchers. Its vision is to promote exchange and progress among academia, industry, and enthusiasts of natural language processing and machine learning, especially beginners.

Source | QbitAI

Author|VL-Rethinker Team

In the field of text reasoning, "slow-thinking" models represented by GPT-o1 and DeepSeek-R1 have shown significant advantages over "fast-thinking" models (such as GPT-4o) in mathematics and science tasks, thanks to their explicit self-reflection mechanisms.

However, when the battleground shifts to multimodal reasoning, these "masters of thought" underwhelm: GPT-o1's performance on multimodal reasoning benchmarks such as MathVista and MathVerse is merely on par with fast-thinking models, and it is even surpassed by Qwen2.5-VL-72B.

Why is it that the slow-thinking ability, so effective in text reasoning, is difficult to apply in multimodal scenarios?

A research team from HKUST, the University of Waterloo, INF.AI, and the Vector Institute dug into this problem, identified two core obstacles to building slow-thinking capabilities in vision-language models (VLMs), "Vanishing Advantages" and "Rethinking Inertia," and proposed an innovative solution: VL-Rethinker.

This model successfully activates VLMs' deep reasoning and self-calibration capabilities through two key techniques: "Selective Sample Replay" and "Forced Rethinking."


1 Dual Challenges in Multimodal Reasoning: Vanishing Advantages and Rethinking Inertia

The research team found that when classical GRPO is used for the reinforcement training of large vision-language models such as Qwen2.5-VL-72B, it runs into two core challenges:

1.1 Vanishing Advantages in GRPO

In the GRPO algorithm, the advantage signal is computed by comparing the rewards of different candidate responses within the same query group. When all responses to the same question receive the same reward (e.g., all correct or all incorrect), the computed advantage is zero. The team observed that, as GRPO training of multimodal models progresses, the proportion of samples with a zero advantage signal grows markedly; they call this phenomenon "Vanishing Advantages."
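As a minimal sketch (not the team's implementation), the group-relative advantage can be written as below; when every response in a group receives the same reward, the normalized advantage collapses to zero and the group contributes no learning signal:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one query group under GRPO.

    rewards: one scalar reward per sampled response to the same query.
    When every response gets the same reward, the group yields no
    learning signal -- the "Vanishing Advantages" case described above.
    """
    r = np.asarray(rewards, dtype=np.float64)
    if r.std() < eps:                      # all correct or all incorrect
        return np.zeros_like(r)
    return (r - r.mean()) / (r.std() + eps)

print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))  # -> [0. 0. 0. 0.], no signal
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # -> mixed group, usable signal
```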

Compared with text-only reasoning, which benefits from more plentiful high-quality reasoning data, Vanishing Advantages is especially pronounced when strong multimodal models are trained with reinforcement learning.


For example, when training the Qwen2.5-VL-72B model (see the figure in the paper), the proportion of effective queries with non-zero advantage signals was about 40% at the start of training, but after only about 256 gradient update steps (16 × 16 steps) it dropped rapidly to below 20%.
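As a hedged illustration (not the authors' code), the effective-query proportion described above can be measured by counting the query groups whose sampled responses do not all receive the same reward:

```python
def effective_query_ratio(groups_of_rewards):
    """Fraction of query groups that still yield a non-zero advantage,
    i.e. groups whose sampled responses are not all correct or all wrong."""
    effective = sum(1 for rewards in groups_of_rewards if len(set(rewards)) > 1)
    return effective / len(groups_of_rewards)

# Example batch: two all-same groups, two mixed groups -> ratio 0.5.
batch = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 0], [1, 1, 0, 1]]
print(effective_query_ratio(batch))
```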

This pronounced Vanishing Advantages stems from two causes: current open-source multimodal datasets still lag text-only reasoning data in both quality and quantity, and these limitations in data quality and difficulty cause stronger models to saturate and converge prematurely.

Vanishing Advantages also has two negative effects: the sharp drop in effective samples increases gradient variance and destabilizes training, and the model stays confined to shallow reasoning paths, which suppresses the exploration of more complex reasoning strategies.

1.2 Rethinking Inertia of Multimodal Models

Unlike pure text models that spontaneously generate long reasoning chains after reinforcement training, existing VLM base models are limited by the perception-driven nature of visual modalities and the scarcity of reflection patterns in pre-training corpora. They tend to perform "fast thinking" (direct mapping of perceptual input to language output), lacking the ability to actively review and correct their reasoning process.

This "rethinking inertia" makes it difficult for standard reinforcement training to activate VLM's slow-thinking potential, becoming the second major bottleneck in advancing multimodal reasoning capabilities.

2 VL-Rethinker: Dual Engines Unlocking Multimodal Slow-Thinking Capabilities

To address the challenge of scarce high-quality open-source data, the research team curated the ViRL39K reinforcement training dataset.

The dataset draws on existing multimodal reasoning data and newly collected reasoning data, which are cleaned, validated, and rewritten to yield 38,870 high-quality multimodal reasoning questions.

This 39K dataset covers eight major themes, including logical reasoning, chart reasoning, spatial reasoning, and scientific Q&A.


It also includes fine-grained model capability labels and provides a uniform difficulty distribution for models of different capability levels.
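For readers who want to inspect the data, a minimal sketch of loading it with the Hugging Face `datasets` library follows; the split name and the printed fields are assumptions, so check the dataset card for the actual schema:

```python
from datasets import load_dataset

# ViRL39K as released by the team; the "train" split name is an assumption,
# and the field names should be read off the dataset card before use.
ds = load_dataset("TIGER-Lab/ViRL39K", split="train")
print(len(ds))        # expected to be on the order of 38,870 questions
print(ds[0].keys())   # inspect the actual schema
```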


Based on the ViRL39K training data, the research team developed VL-Rethinker, the first slow-thinking reinforcement learning framework designed specifically for multimodal scenarios. Its core consists of two innovative techniques:


2.1 Selective Sample Replay (SSR)

To address Vanishing Advantages, the research team proposed Selective Sample Replay (SSR) to dynamically focus on high-value training samples.

SSR introduces an experience replay mechanism that stores training samples with non-zero advantages and applies a value-sensitive replay strategy: it prioritizes reusing "key samples" with larger absolute advantage values (e.g., correct solutions to difficult problems, or incorrect solutions to easy problems).


This design offers two advantages. First, it effectively mitigates Vanishing Advantages, keeping the number of effective training samples stable. Second, it enables a form of online active learning: samples with larger advantages usually lie near the model's decision boundary (e.g., correct answers to harder questions), and by re-weighting these samples SSR dynamically curates what the model trains on, guiding it to focus on key samples and improving training efficiency (see the figures in the paper).
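A minimal sketch of the value-sensitive replay idea, not the authors' implementation: rollouts with non-zero advantage are kept in a buffer and replayed with probability proportional to their absolute advantage.

```python
import random

class SelectiveReplayBuffer:
    """Sketch of Selective Sample Replay (SSR): keep only rollouts with a
    non-zero advantage and replay them weighted by |advantage|."""

    def __init__(self, capacity=4096):
        self.capacity = capacity
        self.buffer = []                  # list of (sample, advantage) pairs

    def add(self, sample, advantage, eps=1e-6):
        if abs(advantage) < eps:          # zero-advantage rollouts carry no gradient
            return
        self.buffer.append((sample, advantage))
        if len(self.buffer) > self.capacity:
            self.buffer.pop(0)            # drop the oldest entry

    def replay(self, k):
        """Draw k samples, favouring those near the decision boundary
        (large |advantage|), e.g. rare correct answers to hard questions."""
        if not self.buffer:
            return []
        weights = [abs(adv) for _, adv in self.buffer]
        picks = random.choices(self.buffer, weights=weights, k=k)
        return [sample for sample, _ in picks]
```

In training, such a buffer would be used to top up each batch whenever too many freshly sampled groups have vanishing advantages.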


Currently, SSR technology has been applied in Pixel Reasoner and SkyR1V2.

2.2 Forced Rethinking

To overcome VLMs' "rethinking inertia," the research team proposed the "Forced Rethinking" mechanism: after the model generates an initial answer, a specific "rethinking trigger" text is appended, forcing the model to start a second round of reasoning. The team designed several types of triggers, covering self-verification, self-correction, and self-questioning, to guide the model toward learning diverse rethinking behaviors (visualized as a word cloud in the paper). Only the correct forced-rethinking responses are retained as training samples.


The research team found that this rejection sampling, combined with simple correctness rewards, enables the model to selectively trigger the rethinking process instead of blindly performing redundant secondary thinking for every problem, thereby achieving more efficient and intelligent "slow thinking."
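A minimal sketch of how such a forced-rethinking rollout with rejection sampling could look; `model_generate`, `is_correct`, and the trigger strings are illustrative assumptions, not the paper's exact prompts or code:

```python
import random

# Illustrative trigger strings; the paper uses several trigger types
# (self-verification, self-correction, self-questioning), but the exact
# wording here is an assumption.
RETHINK_TRIGGERS = [
    "\nWait, let me verify this answer.",
    "\nHold on, let me check whether I misread the figure.",
    "\nBut is this consistent with the question?",
]

def forced_rethinking_rollout(model_generate, prompt, is_correct):
    """Generate an initial answer, append a rethinking trigger, let the
    model continue, and keep the trajectory only if the final answer is
    correct (rejection sampling on the rethinking segment)."""
    first_pass = model_generate(prompt)
    trigger = random.choice(RETHINK_TRIGGERS)
    continuation = model_generate(prompt + first_pass + trigger)
    full_response = first_pass + trigger + continuation
    return full_response if is_correct(full_response) else None
```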

Interestingly, the rethinking ability VL-Rethinker acquires is not limited to scrutinizing its own answers; it can even help the model spot errors in the question itself. In one example from the paper, while rethinking its own reasoning, the model noticed a contradiction between its reasoning and the question, and thereby recognized that the problem statement was flawed.


3 VL-Rethinker Experimental Results


On mathematical reasoning tasks, VL-Rethinker achieved 80.4% on MathVista and 63.5% on MathVerse, both surpassing GPT-o1 (73.4% and 57.0%, respectively), and it holds a leading position on MathVision with a score of 44.9%.

On multidisciplinary understanding benchmarks, it reached 55.9% on the overall MMMU-Pro test and 38.5% on the full EMMA test, which not only sets a new state of the art (SOTA) among open-source models but also approaches the level of the OpenAI-o1 model.

The gains from reinforcement training are also significant: VL-Rethinker-72B improves on the base model Qwen2.5-VL-72B by 5.6 percentage points on MathVista and 6.3 points on MathVerse, and VL-Rethinker-7B clearly outperforms other 7B-scale reinforcement-trained VLMs across all benchmarks.

The experimental results validate the effectiveness of SSR and the potential of the "slow thinking" mode in the multimodal domain.

Paper address: https://arxiv.org/pdf/2504.08837

Project homepage: https://tiger-ai-lab.github.io/VL-Rethinker/

High-quality dataset: https://huggingface.co/datasets/TIGER-Lab/ViRL39K

Model demo: https://huggingface.co/spaces/TIGER-Lab/VL-Rethinker

