OPA-DPO: An Efficient Solution for the Hallucination Problem in Multimodal Large Models

Editor's Note: Amid the rapid development of visual multimodal large language models, the hallucination problem has remained a focus for researchers. Models generating content that is inconsistent with, or even contradicts, the input image not only degrades the user experience but also hinders the practical application of multimodal technology. To address this, a joint research team from Microsoft Research Asia and The Chinese University of Hong Kong proposed the On-Policy Alignment (OPA)-DPO algorithm, building on Direct Preference Optimization (DPO). The algorithm effectively mitigates hallucinations by ensuring that the training data is consistent (on-policy) with the reference policy. This work has been accepted as an Oral paper at CVPR 2025, a top conference in computer vision.

In the field of visual multimodal large language models, the phenomenon of “hallucination”, where models generate content that is inconsistent with or even contradicts the input image, is a core challenge that urgently needs to be overcome. As a simple and effective solution, Direct Preference Optimization (DPO) [1] is attracting increasing attention. Researchers compare different responses from the model to the same prompt and image, construct preference pairs directly based on their degree of hallucination, and use these pairs for DPO training.
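For concreteness, a single preference sample for DPO training on a vision-language model can be represented roughly as follows. This is only an illustrative sketch: the field names and contents are hypothetical, not taken from any specific dataset in the paper.

```python
# Illustrative DPO preference sample for a vision-language model.
# All field names and contents are hypothetical, not from any specific dataset.
preference_sample = {
    "image": "images/example_0001.jpg",           # input image (path or identifier)
    "prompt": "Describe this image in detail.",   # shared prompt
    # Preferred response: judged to contain fewer (or no) hallucinations.
    "chosen": "A brown dog is lying on a wooden bench in a park.",
    # Rejected response: hallucinates an object that is not in the image.
    "rejected": "A brown dog is lying on a wooden bench next to a red ball.",
}
```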

However, researchers at Microsoft Research Asia noticed that the different data-construction methods used in existing work lead to significant performance differences. They therefore conducted a comprehensive analysis of algorithms that use DPO to address hallucination in multimodal large models, summarized their performance and limitations, and theoretically revealed the fundamental reasons behind the performance gaps among these algorithms. They point out that the most critical factor determining model performance is whether the data used to construct preference pairs is on-policy with respect to the policy before DPO training (the reference policy).

DPO: A Ray of Hope for Hallucinations, or a New Challenge?

The researchers divided previous research into three categories:

The first category is Hallucination Injection, such as POVID [2] and HALVA [3], which construct preference pairs by manually injecting hallucinated fragments into ground-truth responses for existing images and prompts;

The second category is Hallucination Identification, such as RLHF-V [4], HA-DPO [5], and HSA-DPO [6], where the model first generates responses based on the images and prompts, and expert feedback (human annotators or GPT-4/GPT-4V) is then used to identify and revise the hallucinations, thereby constructing preference pairs;

The third category is Self-Evolution, such as RLAIF-V [7], in which the model generates multiple responses for the same image and prompt, and a teacher model with stronger hallucination-identification capability judges and ranks the severity of hallucinations in these responses to construct preference pairs.

Figure 1: Three categories of previous research work

According to the experimental results, the performance of these three categories can be summarized as: Self-Evolution > Hallucination Identification > Hallucination Injection.

For hallucination injection, the injected hallucinations usually do not come from the model itself, so DPO training on such pairs often brings little gain. For self-evolution, the curse of dimensionality makes it very difficult in theory for the model to explore and find completely correct responses on its own, so stubborn hallucinations that appear across multiple responses usually cannot be eliminated by this method.

Intuitively, hallucination-identification methods should be the most effective way to eliminate hallucinations, so why do they fall short in practice? To understand the reasons behind this, the researchers started from the details of the DPO algorithm.

DPO starts from the same initial objective as PPO, the most commonly used RLHF algorithm (here π_θ is the model's current policy, π_ref is the initial/reference policy, x is the prompt, m is the image, y is the response, and r(x,y,m) is a reward function trained under the Bradley-Terry model):

max_{π_θ} E_{(x,m)∼D, y∼π_θ(⋅|x,m)} [ r(x,y,m) ] − β · D_KL[ π_θ(⋅|x,m) ‖ π_ref(⋅|x,m) ]        (1)

That is, while maximizing the reward, it constrains the KL divergence between the model's current policy and its initial policy. However, the researchers re-examined the definition of the KL divergence, D_KL[ π_θ ‖ π_ref ] = E_{y∼π_θ(⋅|x,m)} [ log(π_θ(y|x,m)/π_ref(y|x,m)) ], and found that for any given prompt and image (x,m), if there exists a response y such that π_θ(y|x,m) > 0 but π_ref(y|x,m) → 0, then the KL divergence tends to infinity. This property implies that, for any algorithm derived from objective (1), responses with extremely low sampling probability under the original policy π_ref (in reinforcement-learning terminology, such data is called off-policy data; data with non-negligible probability under π_ref is on-policy) have essentially no chance of being learned by the model.
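A toy numerical example makes this property concrete. The sketch below (plain Python with made-up probabilities, not code from the paper) evaluates D_KL(π_θ ‖ π_ref) over a small discrete set of candidate responses: as soon as the current policy puts noticeable mass on a response to which the reference policy assigns a vanishingly small probability, the KL term blows up.

```python
import math

def kl_divergence(p_theta, p_ref):
    """D_KL(pi_theta || pi_ref) over a small discrete set of candidate responses."""
    return sum(p * math.log(p / q) for p, q in zip(p_theta, p_ref) if p > 0)

# Three candidate responses; all probabilities are made up for illustration.
pi_theta   = [0.30, 0.20, 0.50]     # current policy puts half its mass on response #3
pi_ref_on  = [0.70, 0.25, 0.05]     # response #3 is plausible under the reference policy (on-policy)
pi_ref_off = [0.70, 0.25, 1e-12]    # response #3 is essentially unreachable for pi_ref (off-policy)

print(kl_divergence(pi_theta, pi_ref_on))   # ~0.85: small, finite penalty
print(kl_divergence(pi_theta, pi_ref_off))  # ~13.2: the penalty diverges as pi_ref -> 0
```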

If these off-policy preferred responses are forcibly used to construct DPO preference pairs, the gradient will almost disappear in the next update.

Revisiting the DPO training optimization objective:

L_DPO(π_θ; π_ref) = − E_{(x,m,y_w,y_l)∼D} [ log σ( β·log(π_θ(y_w|x,m)/π_ref(y_w|x,m)) − β·log(π_θ(y_l|x,m)/π_ref(y_l|x,m)) ) ]

where y_w is the preferred response and y_l is the rejected response. Its gradient can be expressed as (σ(⋅) denotes the sigmoid function):

∇_θ L_DPO = − β · E_{(x,m,y_w,y_l)∼D} [ σ(r_l − r_w) · ( ∇_θ log π_θ(y_w|x,m) − ∇_θ log π_θ(y_l|x,m) ) ],    where r_w = β·log(π_θ(y_w|x,m)/π_ref(y_w|x,m)) and r_l = β·log(π_θ(y_l|x,m)/π_ref(y_l|x,m)).

Before training begins, π_θ = π_ref, so the value inside the sigmoid is 0 and the current policy performs a maximum-log-likelihood update on y_w with an effective coefficient of 0.5β. However, after this update, log(π_θ(y_w|x,m)/π_ref(y_w|x,m)) becomes very large (the numerator π_θ(y_w|x,m) > 0 while the denominator π_ref(y_w|x,m) tends to 0), so r_w → ∞ and σ(r_l − r_w) → 0. As a result, the gradient almost disappears in the next update.
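The vanishing gradient can be reproduced with a few lines of arithmetic. The sketch below (plain Python with made-up log-probabilities, not code from the paper) evaluates the effective coefficient β·σ(r_l − r_w) that scales the DPO gradient: it equals 0.5β when π_θ = π_ref, but collapses toward zero once the preferred response y_w is off-policy and its log-ratio becomes large after the first update.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_grad_coeff(beta, logp_theta_w, logp_ref_w, logp_theta_l, logp_ref_l):
    """Effective coefficient beta * sigmoid(r_l - r_w) that scales the DPO gradient."""
    r_w = beta * (logp_theta_w - logp_ref_w)   # implicit reward of the preferred response
    r_l = beta * (logp_theta_l - logp_ref_l)   # implicit reward of the rejected response
    return beta * sigmoid(r_l - r_w)

beta = 0.1

# Before the first update: pi_theta == pi_ref, so r_w = r_l = 0 and the coefficient is 0.5 * beta.
print(dpo_grad_coeff(beta, -40.0, -40.0, -35.0, -35.0))    # 0.05 = 0.5 * beta

# After one update on an off-policy y_w: pi_theta(y_w) has grown while pi_ref(y_w) stays vanishingly
# small, so r_w is large and the coefficient nearly vanishes.
print(dpo_grad_coeff(beta, -40.0, -200.0, -35.0, -35.0))   # ~1e-8: the gradient almost disappears
```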

Looking back at the hallucination-identification methods, most expert-modified responses are off-policy for the original model; even a minor edit is enough to make a response off-policy, so the model cannot be expected to learn from this expert feedback. In contrast, even though self-evolution methods have potential learning-efficiency problems, the preference pairs they construct all come from the model itself, meaning they are all on-policy, and they therefore yield the best results.

OPA-DPO: Breaking Conventions, Reshaping Alignment Strategy

Is there a method that can both utilize precise expert feedback and completely avoid the KL divergence constraint problem caused by off-policy data?

Addressing the limitations of existing methods, Microsoft Research Asia, in collaboration with The Chinese University of Hong Kong, proposed a simple and efficient algorithm, On-Policy Alignment (OPA)-DPO, which aligns precise expert-feedback data with the model's policy before DPO training. Using only 4.8k training samples, OPA-DPO achieves SOTA performance, whereas previous SOTA algorithms required 16k samples. This work has been accepted as an Oral paper at CVPR 2025, a top conference in computer vision.

Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key

Paper link:

https://arxiv.org/abs/2501.09695

Figure 2: Specific implementation method of OPA-DPO

The specific implementation of OPA-DPO is as follows (a minimal sketch of the pipeline is given after Figure 3). First, given an image and a prompt, the model generates a corresponding response. Then, expert feedback (e.g., GPT-4V) is used to make fine-grained edits to the generated content, retaining the correct parts while correcting any hallucinations. Next, the ground-truth responses from the dataset and the expert-revised responses are used for LoRA-based SFT to obtain a new model, which the researchers call the OPA model. Finally, DPO training is performed starting from the OPA model; following the mDPO [8] setup, the researchers construct language preference pairs, image preference pairs, and anchor pairs. Although all of these components matter, the OPA step has the greatest impact on the final result.

Figure 3: OPA-DPO achieves alignment in four steps
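The following is a minimal sketch of the data-construction part of this pipeline, under the assumption of placeholder helpers (generate and revise are hypothetical stand-ins for the base model and the expert, e.g., GPT-4V); it is meant to illustrate the four steps rather than reproduce the authors' implementation.

```python
from typing import Callable, Dict, List, Tuple

def build_opa_dpo_data(
    samples: List[Tuple[str, str, str]],      # (image, prompt, ground_truth) triplets
    generate: Callable[[str, str], str],      # base model: (image, prompt) -> response
    revise: Callable[[str, str, str], str],   # expert feedback: fine-grained correction of the response
) -> Tuple[List[Dict], List[Dict]]:
    """Illustrative outline of OPA-DPO data construction; helpers are hypothetical placeholders."""
    sft_data, dpo_pairs = [], []
    for image, prompt, ground_truth in samples:
        response = generate(image, prompt)          # Step 1: on-policy generation by the base model
        revised = revise(image, prompt, response)   # Step 2: expert keeps correct parts, fixes hallucinations
        # Step 3 uses ground-truth and expert-revised responses for LoRA-SFT -> the OPA model.
        sft_data += [
            {"image": image, "prompt": prompt, "target": ground_truth},
            {"image": image, "prompt": prompt, "target": revised},
        ]
        # Step 4 runs DPO from the OPA model; the revised response is preferred over the original one.
        dpo_pairs.append({"image": image, "prompt": prompt,
                          "chosen": revised, "rejected": response})
    return sft_data, dpo_pairs

if __name__ == "__main__":
    # Dry run with trivial stand-ins, just to show the data flow.
    samples = [("img_001.jpg", "Describe the image.", "A dog lies on a wooden bench.")]
    sft_data, dpo_pairs = build_opa_dpo_data(
        samples,
        generate=lambda img, p: "A dog lies on a wooden bench next to a red ball.",  # hallucinated detail
        revise=lambda img, p, r: "A dog lies on a wooden bench.",                    # hallucination removed
    )
    print(sft_data)
    print(dpo_pairs)
```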

The researchers comprehensively compared various DPO-based algorithms fine-tuned on the LLaVA-1.5-7B and 13B models. OPA-DPO achieves SOTA results across multiple metrics using only 4.8k training samples.

Table 1: Comparison of various RLHF/RLAIF-enhanced LVLM algorithms. For fairness, all methods are evaluated with greedy sampling across multiple benchmarks; the listed source distinguishes official reproductions from results reported in the original papers, and the best performance in each metric group is shown in bold.

The True Power of OPA-DPO

To verify the importance of the OPA operation and the impact of data volume on the final results, the researchers conducted detailed ablation experiments.

Figure 4: Impact of training data volume and OPA operation on OPA-DPO (Ablation Experiment)

Furthermore, the researchers also conducted experiments using the recently proposed LLaVA-OneVision as the base model. They observed that LLaVA-OneVision's outputs are detailed but somewhat redundant and often exhibit severe hallucinations. In such cases, the effect of OPA-DPO is even more pronounced, achieving a notable improvement in hallucination metrics with only 2.4k training samples.

Table 2: Experimental results of OPA-DPO on LLaVA-OneVision

The researchers found that models trained with OPA-DPO tend to adopt a slightly conservative strategy, especially in description tasks, where they typically output only significant and certain observations, ignoring some unimportant details.

Figure 5: Impact of OPA operation on DPO-trained model output in image description tasks

Moreover, the researchers observed an interesting phenomenon: the base model often defaults to assuming that the language in the query is accurate, and even if there are severe hallucinations in that text, the model will follow it to describe the image. This might be understood as a form of textual inertia. However, models trained with OPA-DPO exhibited the ability to discern hallucinations in the query text.

Figure 6: In erroneous premise query tasks, OPA-DPO trained models showed the ability to discern hallucinations within the query.

The introduction of OPA-DPO not only improved algorithm performance but also advanced the development of multimodal alignment methods. Its idea of "generating on-policy data with the help of expert feedback" represents a significant step forward for current multimodal alignment training.

References:

[1] Rafailov R, Sharma A, Mitchell E, et al. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 2023, 36: 53728-53741.

[2] Zhou Y, Cui C, Rafailov R, et al. Aligning Modalities in Vision Large Language Models via Preference Fine-tuning. ICLR 2024 Workshop on Reliable and Responsible Foundation Models.

[3] Sarkar P, Ebrahimi S, Etemad A, et al. Data-augmented phrase-level alignment for mitigating object hallucination. arXiv preprint arXiv:2405.18654, 2024.

[4] Yu T, Yao Y, Zhang H, et al. RLHF-V: Towards trustworthy MLLMs via behavior alignment from fine-grained correctional human feedback. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 13807-13816.

[5] Zhao Z, Wang B, Ouyang L, et al. Beyond hallucinations: Enhancing LVLMs through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839, 2023.

[6] Xiao W, Huang Z, Gan L, et al. Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback. arXiv preprint arXiv:2404.14233, 2024. (AAAI 2025)

[7] Yu T, Zhang H, Yao Y, et al. RLAIF-V: Aligning MLLMs through open-source AI feedback for super GPT-4V trustworthiness. arXiv preprint arXiv:2405.17220, 2024. (CVPR 2025)

[8] Wang F, Zhou W, Huang J Y, et al. mDPO: Conditional Preference Optimization for Multimodal Large Language Models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024: 8078-8088.

Microsoft Research Asia's AI and Machine Learning Group (Shanghai) is recruiting interns. Students interested in RL for Large Models are welcome to join us! Please send resumes to: xufluo@microsoft.com

Microsoft Research Asia's new book "Unbounded" is released

When facing collective challenges, the collision of ideas and the resonance of wisdom become particularly important. "Unbounded: Insights into Microsoft's Innovative Research Realm," a book meticulously crafted by Microsoft Research Asia over two years, is precisely a guide for exploration in this era.

More than ten top researchers, including Dean Lidong Zhou, participated in the writing of this book. They discussed the latest advancements in artificial intelligence, computer science, and their interdisciplinary fields from different perspectives, sharing cutting-edge insights, viewpoints, and valuable research experience.

This book has received recommendations from over ten top global scholars, including Turing Award winners, academicians, leaders of renowned universities, scholars renowned in their respective fields, and distinguished alumni of Microsoft Research Asia.

Now, "Unbounded: Insights into Microsoft's Innovative Research Realm" is officially available across all platforms! The first batch of readers will receive a limited edition Microsoft 50th-anniversary bookmark, with random author autographs for a blind box surprise!

Click the link below now to start your exclusive reading journey!



