Introduction
In recent years, Rule-Based Reinforcement Fine-Tuning (RFT) has achieved significant progress in its application to Multimodal Large Language Models (MLLMs), yielding results superior to Supervised Fine-Tuning (SFT) on some models.
RFT trains models with verifiable rewards and encourages them to think before answering. Explicit thinking is considered a key factor in RFT's success, and many studies on multimodal RFT are dedicated to reproducing the "aha moment" effect.
However, for some simple visual tasks, such as spatial understanding, complex reasoning may not be necessary. Recent studies show that overthinking can negatively impact the reasoning performance on certain tasks.
Furthermore, compared to SFT, RFT typically requires more fine-tuning time because it involves generating multiple longer responses. Therefore, the role of the explicit thinking process warrants further investigation, especially from the perspectives of accuracy and training efficiency.
Against this backdrop, a research team from Shanghai AI Lab conducted an in-depth study of the thinking process in MLLMs. The team first explored the challenges of using MLLMs for closed-form classification tasks: due to limitations in their pre-training data, current MLLMs still show weak classification capabilities.
Paper Title:
Think or Not Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning
Paper Link:
https://arxiv.org/abs/2503.16188
Code Link:
https://github.com/minglllli/CLS-RL/tree/main
Although SFT has achieved significant results in aligning MLLMs with state-of-the-art classification models, obtaining large-scale labeled datasets still incurs substantial annotation cost and computational overhead. Few-shot fine-tuning has therefore become a promising alternative; it has been widely studied for contrastive vision-language models, but its use with autoregressive MLLMs remains underexplored.
Inspired by the recent successes of rule-based RFT, the team proposed CLS-RL, a reinforcement learning framework for classification fine-tuning. Unlike SFT, which guides the fine-tuning process with token-level loss, CLS-RL uses verifiable reward loss to fine-tune MLLMs and guides the model to explore diverse reasoning thoughts.
Experimental results show that CLS-RL outperforms SFT in both few-shot and base-to-new class settings across 11 datasets, achieving better results in both in-domain learning and new class generalization.
Furthermore, the research team also discovered a “free lunch” phenomenon in CLS-RL fine-tuning. In few-shot contrastive VLM fine-tuning, previous studies have shown that when a VLM is fine-tuned on a specific dataset, its performance on other datasets significantly drops, a phenomenon known as catastrophic forgetting.
However, when using CLS-RL to fine-tune on a specific dataset, the model's performance on other datasets also improved, a phenomenon referred to as the “free lunch” phenomenon. The study shows that rule-based RFT can not only alleviate the catastrophic forgetting problem in few-shot fine-tuning but also effectively teach the model fundamental knowledge of image classification.
Next, the research team revisited and discussed the role of the thinking process in rule-based RFT. Unlike the gradual increase in response length observed for mathematical problems in DeepSeek-R1, the response length in CLS-RL drops sharply at certain steps, while the accuracy reward increases significantly.
This indicates that the thinking process in classification tasks might not be as crucial as in mathematical problems, causing the model to gradually lean towards the simplest thinking method to reach the final answer. Additionally, the negative impact of overthinking on certain tasks was further verified.
Based on this, the research team proposed a new method, No-Thinking-RL, aimed at suppressing the thinking process. In No-Thinking-RL, the model is directly asked to answer questions, and the reward mechanism is adjusted accordingly, providing rewards only when the model's output exactly matches the true label.
Surprisingly, No-Thinking-RL outperformed CLS-RL in many tasks and significantly reduced training time.
The research team hypothesized that No-Thinking-RL's superiority over CLS-RL stems from the fact that explicit thinking before a verifiable answer might inhibit learning. They therefore further proposed the “Think-After-Answer” method, which places the thinking after the answer to mitigate this negative impact; subsequent experiments verify this hypothesis.
Finally, the research team evaluated No-Thinking-RL on a variety of tasks, including mathematics, spatial reasoning, and puzzle tasks, covering 2B and 7B model sizes.
The results show that for the 2B model, No-Thinking-RL outperforms RFT with thinking on all tasks, especially in mathematical tasks. For the 7B model, the performance of the three methods is similar on spatial understanding tasks, but on mathematical problems, RFT with thinking is significantly better than No-Thinking-RL.
These results indicate that smaller models (such as 2B) cannot generate high-quality thinking during fine-tuning, and low-quality thinking can hurt performance. For simple visual tasks, thinking is not crucial: No-Thinking-RL performs better than RFT with thinking on smaller models and comparably on 7B models.
Method
To optimize the model, Group Relative Policy Optimization (GRPO) was used as the Reinforcement Learning (RL) algorithm. Unlike SFT, which optimizes the model through token-level loss, RL methods like GRPO use policy gradients from reward loss for optimization, encouraging the model to explore a broader solution space for reasoning.
In this method, a group of sampled responses is used to estimate relative advantages, and a KL-divergence regularization term controls the policy's deviation from the reference model. Relative advantages are computed by normalizing the rewards within each group of sampled responses, which avoids the critic model used in PPO and yields higher computational efficiency.
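As a minimal sketch of this group-relative advantage step, assuming the standard mean/std normalization used in GRPO (the function name and PyTorch details below are illustrative, not taken from the paper's code):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize the rewards of one group of responses sampled for the same prompt.

    rewards: shape (G,), one scalar reward per sampled response.
    The normalized values serve as per-response advantages in the policy-gradient
    update, so no separate critic model is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 responses sampled for one image-question pair; only the second is correct.
rewards = torch.tensor([0.0, 1.0, 0.0, 0.0])
print(group_relative_advantages(rewards))  # ≈ [-0.5, 1.5, -0.5, -0.5]
```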
2.1 CLS-RL
The CLS-RL method introduces instruction prompts and reward functions. The instruction prompt encourages the model to think before giving an answer and to output both the thinking process and the final answer. The reward function consists of a format reward and an accuracy reward. The format reward encourages the model to correctly display the thinking process, while the accuracy reward is rule-based and checks whether the model's output matches the true label.
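For concreteness, a hedged sketch of these two reward terms is given below; the <think>/<answer> tag names and the exact string-matching rule are assumptions modeled on common rule-based RFT recipes, not the paper's released code.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response exposes its thinking and answer in the expected tags (assumed tag names)."""
    ok = re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", response, re.DOTALL)
    return 1.0 if ok else 0.0

def accuracy_reward(response: str, label: str) -> float:
    """Rule-based check: 1.0 if the extracted answer matches the true class label."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip().lower() == label.strip().lower() else 0.0

def cls_rl_reward(response: str, label: str) -> float:
    """Total CLS-RL reward: format term plus accuracy term."""
    return format_reward(response) + accuracy_reward(response, label)
```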
2.2 No-Thinking-RL
Unlike CLS-RL, the No-Thinking-RL method encourages the model to output the answer directly, avoiding the thinking process. The reward function focuses solely on accuracy; the model's output must precisely match the true label. This method significantly reduces training and inference time and is suitable for simple visual tasks that do not require complex reasoning.
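Under the same assumptions, the No-Thinking-RL reward reduces to a single exact-match check on the raw output:

```python
def no_thinking_reward(response: str, label: str) -> float:
    """1.0 only when the model's output matches the true class label
    (compared here case-insensitively after stripping whitespace, an assumption)."""
    return 1.0 if response.strip().lower() == label.strip().lower() else 0.0
```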
Classification Experiments
In this section, the authors present the results of their classification experiments, focusing on evaluating few-shot learning ability and the “free lunch” phenomenon, and analyzing the transfer performance from base tasks to new tasks and open-set classification performance.
3.1 Experimental Setup
Experimental Goal: The primary goal of this experiment is to perform classification using a closed-form approach, meaning the model needs to select the correct category from a given set of category labels. The question format in the experiment is “What object is in this photo? {instruction prompt}”, where the instruction prompt is adjusted according to the different methods.
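A rough sketch of how such a closed-form question could be assembled is shown below; the candidate-option formatting and the per-method instruction wording are illustrative assumptions rather than the paper's exact prompts.

```python
# Illustrative instruction prompts; the exact wording used in the paper may differ.
INSTRUCTION_PROMPTS = {
    "CLS-RL": ("First reason about the image inside <think> </think> tags, "
               "then give the chosen label inside <answer> </answer> tags."),
    "No-Thinking-RL": "Answer with the chosen class label only, without any explanation.",
}

def build_question(candidate_labels: list[str], method: str) -> str:
    """Closed-form classification: the model must pick one label from the candidate set."""
    options = ", ".join(candidate_labels)
    return (f"What object is in this photo? Choose one from: {options}. "
            f"{INSTRUCTION_PROMPTS[method]}")
```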
Datasets: To comprehensively evaluate the three methods (SFT, CLS-RL, and No-Thinking-RL), the authors selected 11 public classification benchmark datasets, including ImageNet, Caltech101, OxfordPets, StanfordCars, Flowers102, Food101, FGVCAircraft, SUN397, DTD, EuroSAT, and UCF101.
For the closed-form classification task, 40% of each dataset's class labels were randomly sampled as candidate options (80% for base-to-new tasks), with the true label always included.
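One possible implementation of this candidate-sampling step is sketched here; how the true label is injected when it is not drawn is an assumption.

```python
import random

def sample_candidates(all_labels: list[str], true_label: str,
                      frac: float = 0.4, seed: int = 0) -> list[str]:
    """Randomly keep a fraction of a dataset's class labels, always including the true label."""
    rng = random.Random(seed)
    k = max(1, int(len(all_labels) * frac))
    candidates = rng.sample(all_labels, k)
    if true_label not in candidates:
        # Assumption: overwrite one random draw so the true label is always an option.
        candidates[rng.randrange(k)] = true_label
    return candidates
```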
Implementation Details: All experiments were conducted on 8 A100 GPUs using the PyTorch framework. The authors used Qwen2-VL-2B-Instruct as the base model and fine-tuned all parameters. The per-GPU batch size was set to 1 with 2-step gradient accumulation, and image resolution was uniformly resized to 328×328.
3.2 Few-Shot Learning Results
Few-shot learning aims to test whether a model can effectively learn task-relevant knowledge with a very small number of samples. The authors trained SFT and CLS-RL under a 4-shot setting and compared their performance on different datasets.
From the experimental results, it can be seen that CLS-RL performs significantly better than SFT on most datasets, with a higher overall average accuracy. This indicates that rule-based reinforcement fine-tuning can help models achieve better results on downstream tasks. Although SFT outperformed CLS-RL on some datasets, CLS-RL was generally more advantageous.
Furthermore, No-Thinking-RL surpassed CLS-RL on 10 datasets, ultimately achieving an average accuracy 3.14% higher than CLS-RL. This shows that reinforcement fine-tuning without an explicit thinking process can effectively improve the model's performance and, on these downstream tasks, is superior to the variant that includes thinking.
3.3 “Free Lunch” Phenomenon
In few-shot learning, the authors also discussed the “free lunch phenomenon”. Previous research has shown that when a model is fine-tuned on a specific dataset, its performance on other datasets can drop significantly, a phenomenon known as catastrophic forgetting.
However, experimental results indicate that fine-tuning with CLS-RL and No-Thinking-RL can improve the model's performance on other datasets, even when the data distributions and class lists differ significantly.
For example, after fine-tuning the model on the SUN397 dataset, its performance on ImageNet, StanfordCars, and UCF101 datasets increased by 16.98%, 15.88%, and 11.10%, respectively. This suggests that rule-based reinforcement learning fine-tuning can help the model acquire broader classification knowledge, not just memorize information from a specific dataset.
3.4 Convergence Comparison
The authors also compared the convergence speed of CLS-RL and No-Thinking-RL. Experimental results show that No-Thinking-RL converges faster than CLS-RL. In most training steps, No-Thinking-RL has higher accuracy rewards, and its test accuracy is also significantly ahead in the early stages of training (first 30 steps).
The authors believe that the format reward introduced by CLS-RL may inject noise in the initial training phase, leading to instability in the accuracy reward.
3.5 Efficiency Comparison
Finally, the authors compared the training and inference efficiency of CLS-RL and No-Thinking-RL. The results show that CLS-RL has significantly higher time consumption during training and inference phases compared to SFT and No-Thinking-RL, because it needs to generate multiple longer responses during fine-tuning and inference.
In contrast, SFT only computes a token-level loss on the labels during fine-tuning, and No-Thinking-RL only needs to generate short, answer-only responses optimized with the accuracy reward, which significantly reduces training and inference time.
Experiments and Analysis on More Diverse Tasks
In this section, the authors present experimental results on more diverse tasks, covering spatial understanding, mathematical problems, and puzzle tasks. Experiments were conducted on 2B and 7B models. The authors first introduced the “Think-After-Answer” method and reported the corresponding experimental results.
4.1 Think-After-Answer
The authors then investigated why No-Thinking-RL performs better than CLS-RL. As shown, CLS-RL converges more slowly than No-Thinking-RL, so the authors hypothesized that explicit thinking before a verifiable answer might hinder learning and convergence. To verify this hypothesis, they proposed the “Think-After-Answer” method, which first asks the model to answer the question and then provide a brief reasoning process.
This can mitigate the negative impact of explicit thinking during the RFT process. The prompt for Think-After-Answer is: “{Question} Please output the answer first in the format <answer> </answer>, then output a brief reasoning process in the format <reason> </reason>.” The accuracy reward remains unchanged.
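A small sketch of how such an answer-first response could be parsed and checked is given below; whether Think-After-Answer also uses a format reward is not stated here, so the ordering check is purely illustrative.

```python
import re

def answer_first_format_ok(response: str) -> bool:
    """True if an <answer> block appears and is followed by a <reason> block."""
    return re.search(r"<answer>.*?</answer>\s*<reason>.*?</reason>",
                     response, re.DOTALL) is not None

def extract_answer(response: str) -> str | None:
    """The accuracy reward still looks only at the content of the <answer> block."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return m.group(1).strip() if m else None
```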
If the hypothesis holds, Think-After-Answer's convergence speed should be faster than RFT with thinking, and its final performance on some tasks should be better. Subsequent experimental results verified this hypothesis.
4.2 Results on CVBench
After two epochs of fine-tuning on the SAT dataset, the authors tested the models' performance on CVBench. The table summarizes the results for the VisualThinker-R1-Zero, Think-After-Answer, and No-Thinking-RL models.
For the 2B model, No-Thinking-RL improved accuracy by 6.4% compared to VisualThinker-R1-Zero and performed well on all subtasks. Think-After-Answer's performance was between No-Thinking-RL and VisualThinker-R1-Zero. For the 7B model, the results of the three methods were similar.
The experimental results show that in spatial understanding tasks, not thinking during the RFT process can improve performance, while RFT with thinking performs even worse on smaller models.
Furthermore, the authors also visualized the accuracy reward curves (see figure). From this, it can be seen that Think-After-Answer converges faster than RFT with thinking. The accuracy results and convergence speed verify the authors' hypothesis that explicit thinking before a verifiable answer hinders learning and convergence.
4.3 Experimental Results on Mathematical Problems
In this subsection, the authors conducted experiments on mathematical problems, where obtaining the final answer requires generating complex intermediate steps. They used the Math-40K dataset for fine-tuning and tested the fine-tuned models on MathVista and MathVision.
The results are shown in the table. For the 2B model, No-Thinking-RL performed better than RFT with thinking. This result is quite surprising, because mathematical problems usually require complex intermediate steps to arrive at the final answer. It suggests that when the base model's capability is weak, generating reasoning chains during the RFT process leads to performance degradation, and RFT with thinking performs worse than RFT without thinking.
The authors further divided MathVista into multiple subtasks and found that No-Thinking-RL outperformed RFT with thinking on all tasks except mathematical word problems (MWP).
Furthermore, the authors also found that RFT with thinking can surpass No-Thinking-RL in MWP tasks. Since problems in MWP and MathVision require extensive calculations to arrive at the final answer, directly outputting the correct answer is very difficult. For other subtasks, the computational requirements might not be as high, thus No-Thinking-RL can also perform well.
For the 7B model, RFT with thinking significantly outperformed No-Thinking-RL, indicating that when the model's reasoning capability is strong enough, the reasoning chain in RFT enhances the model's reasoning ability, thereby improving final performance.
However, the authors noted that in textbook question answering (TQA) and visual question answering (VQA) tasks, the performance of RFT with thinking and No-Thinking-RL was similar. Since these tasks typically do not require complex reasoning, similar results further indicate that thinking is not a necessary condition for RFT in certain visual tasks.
4.4 Experimental Results on Puzzle Problems
In this section, the authors present the experimental results for puzzle problems. The authors generated a training dataset containing 6.5k data points and tested the fine-tuned models on PuzzleVQA (as in-domain test) and AlgoPuzzleVQA (as out-of-domain test).
The experimental results are shown in the table. The authors found that for both the 2B and 7B models, No-Thinking-RL performed better than RFT with thinking on both the in-domain and out-of-domain tests. The reason might be that the 2B and 7B base models have weak reasoning capabilities on puzzle tasks, so the reasoning chains generated during the RFT process hinder learning. Think-After-Answer, in turn, performed significantly better than regular RFT with thinking.
Furthermore, the figure shows that the accuracy reward of Think-After-Answer converges much faster than that of RFT with thinking. All of these results further validate the authors' hypothesis that premature explicit thinking during the RFT process hinders learning.
Conclusion
This paper systematically investigates the role of explicit thinking in Rule-Based Reinforcement Fine-Tuning (RFT), proposing three different training paradigms: CLS-RL, No-Thinking-RL, and Think-After-Answer, and conducting empirical analysis on multiple visual tasks. The study found:
1. CLS-RL can effectively guide Multimodal Large Language Models (MLLMs) to perform verifiable reasoning, significantly outperforming traditional Supervised Fine-Tuning (SFT), and exhibits good transfer ability, achieving “free-lunch” style generalization on unseen datasets.
2. No-Thinking-RL further challenges the assumption that thinking is necessary. By directly outputting the answer instead of generating a chain of thought, it not only surpasses CLS-RL in performance but also significantly reduces training and inference costs.
3. Experiments on more complex tasks show that the low-quality thinking generated by smaller models (like 2B) hinders RFT convergence and performance, while in simple visual tasks, “no thinking” can even lead to better results.
4. The introduction of Think-After-Answer verifies a key hypothesis: explicit thinking before generating a verifiable answer interferes with model learning.
In summary, this research not only challenges the intuition that explicit thinking is always beneficial but also provides a new empirical basis and practical paths for designing more efficient visual reinforcement fine-tuning paradigms across different tasks and model sizes. It suggests that, in multimodal reasoning, the “timing” and “manner” of thinking matter more than whether to think at all, offering new ideas for the design of future RFT paradigms.