Amid the rapid development of large multimodal models, R1-style multimodal reasoning models have repeatedly broken through the performance bottlenecks that traditional “fast thinking” paradigms hit on complex tasks, thanks to their explicit long-chain reasoning mechanisms.
However, research has found that as the reasoning chain grows longer, these models’ visual perception ability declines markedly: they increasingly lean on linguistic priors to “fill in” what they do not actually see, and the generated content drifts further from the image itself, sometimes to the point of outright fabricated hallucinations.
This “stronger reasoning, weaker perception” paradox highlights the challenge current multimodal reasoning models face in balancing reasoning capability against perceptual accuracy.
To further verify this phenomenon, research teams from the University of California, Santa Cruz, UC Santa Barbara, and Stanford University conducted a systematic analysis.
By introducing a reasoning-length control mechanism and an interpretable attention-visualization method, the researchers found that as the reasoning chain extends, the model’s attention to image content drops significantly while its reliance on linguistic cues keeps strengthening, revealing a language-dominated visual drift.
Paper Link: https://arxiv.org/pdf/2505.21523
Project Link: https://mlrm-halu.github.io
Code Link: https://github.com/MLRM-Halu/MLRM-Halu
Building on this, the team proposed a new evaluation metric, RH-AUC, and constructed a companion diagnostic benchmark, RH-Bench, systematically quantifying for the first time the balance between a multimodal reasoning model’s reasoning capability and its visual-perception stability.
This tool not only enhances the measurability of model hallucination risks but also provides an important reference for the robust evaluation and improvement of future multimodal systems.
Visual Hallucination Amplification Caused by Reasoning Enhancement
In the evolution of today’s large multimodal models, R1-class reasoning models, by introducing an explicit long-chain reasoning process, have demonstrated strong performance on complex tasks.
However, the researchers systematically observed a widely overlooked phenomenon: as the reasoning chain grows longer, the model’s visual alignment on perceptual tasks declines significantly, and the risk of hallucination is amplified accordingly.
This trend has been clearly observed in multiple empirical comparisons.
For example, in Figure (b), researchers compared several 7B-scale multimodal models on reasoning and perceptual tasks: although models like R1-OneVision-7B showed an advantage in reasoning accuracy, their accuracy on perceptual tasks dropped to the lowest level, significantly below non-reasoning models of comparable scale (e.g., Qwen2.5-VL-7B).
This indicates that the deepening of the reasoning chain is not a “cost-free” enhancement, but rather comes at the expense of image perception ability, amplifying hallucinations.
Specifically, when a model progressively extends its linguistic chain in image-text tasks, the image evidence signals that should support the answer are quietly marginalized.
Take a typical visual question answering task: the lengthy outputs generated by reasoning models often do not genuinely refer to the image content, instead relying on linguistic common sense to “fill in” an answer that sounds plausible but is not actually grounded in the image. This phenomenon recurs across multiple perception benchmarks (e.g., MMVP, MMHAL).
As shown in the figure, across a comprehensive evaluation of visual perception tasks, R1-class models generally performed worse than base models of the same scale, and the gap was especially pronounced on MMHAL and MMVP, which demand fine-grained image alignment.
This further confirms that strengthening the reasoning chain not only fails to improve perceptual quality but actually exacerbates the model’s tendency to hallucinate by “answering without looking at the image.”
In summary, the enhancement of the reasoning chain is not without cost; “smarter” reasoning models may paradoxically “see less” in perceptual tasks.
Smarter Yet More Prone to Error?
To understand why multimodal reasoning models are more prone to hallucinations, the research team systematically analyzed the models’ internal attention distributions and revealed a structural mechanism: reasoning enhancement is not a free lunch; the gain in linguistic reasoning is paid for by sacrificing visual attention.
Specifically, compared to non-reasoning models, R1-class reasoning models significantly reduce their attention to visual tokens during generation, instead allocating a large amount of attention to instruction tokens and linguistic context (Figure a).
More critically, this “attention shift” is not a fixed bias but intensifies progressively as the reasoning chain extends: the deeper the layer, the more the model tends to ignore the image input and rely entirely on linguistic signals for its reasoning.
As shown in Figure (b), on visual-focus tasks the non-reasoning model (Qwen2.5-VL) maintained stable attention to the key regions of the image (e.g., the cheese) across multiple layers, whereas the R1 model (R1-OneVision), on the same question, showed obvious visual degradation in its attention heatmaps, with deeper layers almost completely out of focus.
This structural shift causes the model, even when faced with questions that explicitly rely on images, to often “guess based on language,” ultimately generating hallucinatory answers severely detached from the image.
Furthermore, the study found that this phenomenon is particularly evident when the model enters the “Overthinking” stage.
As the reasoning chain lengthens, the model’s attention to visual tokens continuously weakens, while its attention to instruction-related language tokens significantly increases, leading the generation process to rely more and more on linguistic cues rather than image content.
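One way to make this attention shift measurable is to track, layer by layer, what fraction of the newly generated token’s attention mass lands on image tokens. Below is a minimal sketch assuming a Hugging Face-style multimodal model whose forward pass can return attention weights; how the image-token positions are located is model-specific and treated as given here, and this is not the paper’s exact instrumentation.

```python
# Minimal sketch (assumptions: HF-style model called with output_attentions=True,
# and known key positions of the image tokens).
import torch

def visual_attention_ratio(attentions, visual_positions):
    """attentions: tuple of per-layer tensors, each (batch, heads, q_len, k_len).
    visual_positions: 1-D LongTensor of key indices belonging to image tokens.
    Returns, per layer, the share of the final query token's attention mass
    that falls on the image tokens."""
    ratios = []
    for layer_attn in attentions:
        # Attention of the last query position (the token being generated),
        # averaged over heads -> shape (batch, k_len).
        last_q = layer_attn[:, :, -1, :].mean(dim=1)
        visual_mass = last_q[:, visual_positions].sum(dim=-1)
        total_mass = last_q.sum(dim=-1)  # ~1.0 after softmax
        ratios.append((visual_mass / total_mass).mean().item())
    return ratios

# Usage (shapes only): out = model(**inputs, output_attentions=True)
# visual_attention_ratio(out.attentions, torch.arange(img_start, img_end))
```

Recording this ratio at successive decoding steps and plotting it against the number of reasoning tokens generated so far reproduces, in spirit, the declining visual-attention profile described above.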
Reasoning Chain “Length Paradox”: More Thinking, Bigger Hallucinations?
Is a longer reasoning chain always better? Across multiple benchmarks, the research team compared three strategies for controlling reasoning length (token budget forcing, test-time scaling, and latent state steering), systematically revealing a key phenomenon for the first time: the relationship between reasoning-chain length and model performance follows a non-monotonic, inverted U-shape.
As shown in the figure, on reasoning-dominant tasks (left two panels), model accuracy first rises as the reasoning chain extends and then declines once the chain becomes too long, indicating that “overthinking” does not necessarily yield stronger reasoning.
Conversely, on perception-dominant tasks (right two panels), the hallucination rate rises steadily as reasoning length increases, suggesting that redundant language generation systematically interferes with visual alignment.
This trend underscores that appropriately controlling reasoning length is key to improving robustness and keeping perception and reasoning in balance.
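Of these strategies, token budget forcing is the simplest to sketch: the “thinking” segment is capped at a fixed token budget, after which the model is prompted to emit its final answer. The snippet below uses a text-only placeholder checkpoint and assumed <think>...</think> delimiters purely for illustration; the actual experiments control reasoning length on multimodal models, and the paper’s mechanism may differ in detail.

```python
# Sketch: cap the reasoning chain at `think_budget` tokens, then force an answer.
# Model name, delimiters, and budget values are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

def answer_with_budget(prompt: str, think_budget: int) -> str:
    # Phase 1: let the model "think", but stop after `think_budget` new tokens.
    think_prompt = prompt + "\n<think>\n"
    inputs = tokenizer(think_prompt, return_tensors="pt").to(model.device)
    thought_ids = model.generate(**inputs, max_new_tokens=think_budget, do_sample=False)
    thought = tokenizer.decode(thought_ids[0], skip_special_tokens=True)

    # Phase 2: close the reasoning block and ask for the final answer.
    answer_prompt = thought + "\n</think>\nFinal answer:"
    inputs = tokenizer(answer_prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```

Sweeping the budget (for example, 64, 256, and 1024 thinking tokens) and scoring both reasoning accuracy and hallucination rate at each setting traces out exactly the kind of curves discussed above, and feeds directly into the RH-AUC metric introduced next.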
The introduction of metrics like RH-AUC also provides a more interpretable quantitative description of this non-linear relationship.
RH-AUC: Dynamic Trade-off Evaluation of Reasoning and Hallucination
Facing the dilemma of reasoning enhancement and hallucination amplification in multimodal models, the research team proposed a new evaluation metric: RH-AUC (Reasoning-Hallucination Area Under Curve).
Unlike traditional metrics that evaluate accuracy or hallucination rate at a single reasoning length, RH-AUC takes a holistic view, measuring the model’s dynamic balance between “thinking power” and “seeing clarity” at different reasoning depths.
Concretely, on the newly constructed RH-Bench dataset (1,000 samples spanning perception and reasoning tasks), the model’s reasoning accuracy and hallucination risk are measured at different reasoning lengths, and the area under the curve formed by these two quantities is then computed.
A higher RH-AUC indicates that the model better preserves its visual alignment as its reasoning is strengthened: it can both “think deeply” and “see clearly.”
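As a rough illustration of how such a score can be computed, the sketch below pairs reasoning accuracy with a perception score (one minus the hallucination rate) measured at each reasoning-length setting and integrates the resulting curve with the trapezoidal rule. The pairing, normalization, and example numbers are assumptions for illustration; the paper’s exact RH-AUC formula may differ.

```python
# Sketch of an RH-AUC-style score; not the paper's exact formula.
import numpy as np

def rh_auc(reasoning_acc, hallucination_rate):
    """Both arguments are indexed by reasoning-length setting (short -> long)."""
    r = np.asarray(reasoning_acc, dtype=float)
    p = 1.0 - np.asarray(hallucination_rate, dtype=float)  # perception score

    # Sort the points by reasoning accuracy so they form a curve in (r, p) space.
    order = np.argsort(r)
    r, p = r[order], p[order]

    # Area under the perception-vs-reasoning curve, normalized to [0, 1].
    area = np.trapz(p, r)
    span = r[-1] - r[0]
    return area / span if span > 0 else float(p.mean())

# Example: one model evaluated at four thinking budgets.
acc = [0.42, 0.55, 0.61, 0.58]   # reasoning accuracy per budget
hall = [0.18, 0.22, 0.30, 0.41]  # hallucination rate per budget
print(f"RH-AUC (sketch): {rh_auc(acc, hall):.3f}")
```

Under this construction, a model whose perception score stays high across the range where its reasoning accuracy improves accumulates more area, which is exactly the balance the metric is meant to reward.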
Experimental results reveal three key trends:
1. Larger models are more robust: As shown in Figure (a), 7B models exhibit a smoother RH-AUC curve across thinking depths and reach higher peak scores, indicating stronger integration of reasoning and perception.
2. RL-only training outperforms SFT+RL: As shown in Figure (b), across training strategies, models trained with pure RL achieve a higher average RH-AUC than the hybrid paradigm, especially at longer reasoning chains (0.57 vs. 0.50).
This suggests that RL-only training tends to adaptively generate high-quality reasoning paths, whereas SFT+RL is more prone to redundant imitation that interferes with perceptual judgment.
3. Data “type” matters more than scale: Experiments found that rather than blindly enlarging the training set, introducing a small number of samples with domain-specific perceptual characteristics (such as mathematical reasoning or image-perception tasks) does more to help the model balance “seeing the image” and “understanding the problem.”
RH-AUC not only fills a gap in the evaluation landscape but also offers a clearer reference for future multimodal training objectives: more reasoning is not always better, and maintaining the balance between “seeing the image” and “solving the problem” is the superior paradigm.