In recent years, large language models such as GPT have excelled at tasks like question answering and search, and are increasingly applied in domains such as healthcare. However, a persistent problem remains: hallucination, where models confidently output information that deviates from the facts. To mitigate hallucination, the research community proposed the RAG (Retrieval-Augmented Generation) framework, which aims to reduce 'fabrication' by incorporating external data to ground generation.
But does reality live up to this hope? Research teams from The Hong Kong Polytechnic University and Sichuan University discovered that RAG itself can introduce new biases, even leading to 'Hallucination on Hallucination'.
To address this, the team proposes a new framework—DRAG (Debate-Augmented RAG)—which introduces a multi-agent debate mechanism to rigorously scrutinize every step, from 'finding information' to 'writing the answer,' thereby enhancing the factuality and reliability of the outputs.
Paper Title: Removal of Hallucination on Hallucination: Debate-Augmented RAG
Paper Link: https://arxiv.org/abs/2505.18581
Code Link: https://github.com/Huenao/Debate-Augmented-RAG
Conference: ACL 2025 Main
Research Background: Hallucination on Hallucination in RAG
To address the hallucination problem in generative AI, the RAG framework enhances the factual accuracy of generated results by 'looking up information before speaking.' However, reality is often harsh: what if the retrieved information itself is incorrect?
For example:
Question: 'Who is the female keyboardist of Guns N' Roses?' RAG retrieves the plot of the unrelated movie 'Guns and Roses' instead of information about the band, and the model then confidently answers: 'Gu Xixi'.
This is not putting out a fire, but adding fuel to it.
Furthermore, even if the retrieved information is correct, models might still misinterpret it, pick the wrong focus, or even fabricate content.
This paper attributes this 'Hallucination on Hallucination' phenomenon to issues in two stages of the RAG system:
Retrieval Stage: Insufficient or biased retrieval can lay a 'cognitive trap' for subsequent generation;
Generation Stage: Models may still produce factually incorrect answers due to noise interference or misunderstanding of context.
Solution: Let the Models "Debate"
The core idea of DRAG is to leverage a Multi-Agent Debate (MAD) mechanism. It introduces a 'proponent vs. challenger debate plus judge' setup in both the information retrieval and answer generation stages, simulating an AI debate court in which agents find facts, question each other, and evaluate collectively, leading to more accurate and better-reasoned outputs.
1. Phase 1: Retrieval Can Also "Argue Logically"
Traditional RAG retrieves with a 'one query, one search' pattern: the model issues a single query based on the question and then answers with whatever that search returns. If the query keywords are imprecise or only partial content is retrieved, the model naturally ends up 'looking for answers in the wrong information'.
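For contrast, here is a minimal sketch of that single-pass pipeline, assuming a generic `retrieve` search function and an `llm` completion callable (both hypothetical placeholders, not the paper's code):

```python
# Minimal single-pass RAG baseline (hypothetical `retrieve`/`llm` helpers,
# not the paper's implementation).
def naive_rag(question: str, retrieve, llm, k: int = 5) -> str:
    # "One query, one search": the question itself is used verbatim as the query.
    docs = retrieve(question, top_k=k)
    context = "\n".join(docs)
    prompt = (
        "Answer the question using the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    # Whatever was retrieved, relevant or not, is trusted unconditionally.
    return llm(prompt)
```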
To address this, DRAG introduces a 'multi-agent debate mechanism' in the retrieval stage, akin to multiple agents holding a meeting to 'discuss the most reliable way to find information'.
Specifically, each retrieval round involves three types of agents (a minimal orchestration sketch follows the list):
Proponent Agent: argues that 'the current retrieval strategy is fine and doesn't need modification';
Challenger Agent: believes 'the retrieval isn't precise enough' and proposes optimization suggestions, such as changing keywords or expanding the query;
Judge Agent: compares the arguments of both sides and decides whether to adjust the query strategy for the next round.
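The sketch below shows how one such retrieval debate loop might be orchestrated, again assuming generic `retrieve` and `llm` callables; the prompts and stopping rule are illustrative placeholders rather than the authors' actual implementation (see the linked repository for that):

```python
# Sketch of the retrieval-stage debate loop (illustrative prompts/helpers only).
def retrieval_debate(question, retrieve, llm, max_rounds: int = 3, k: int = 5):
    query = question                       # initial query is the question itself
    docs = retrieve(query, top_k=k)
    for _ in range(max_rounds):
        evidence = "\n".join(docs)
        # Proponent: defends the current retrieval strategy.
        pro = llm(f"Question: {question}\nQuery: {query}\nEvidence:\n{evidence}\n"
                  "Argue that this retrieval is sufficient to answer the question.")
        # Challenger: criticizes the retrieval and proposes a refined query.
        con = llm(f"Question: {question}\nQuery: {query}\nEvidence:\n{evidence}\n"
                  "Argue that this retrieval is insufficient and propose a better "
                  "query. End with 'NEW QUERY: <query>'.")
        # Judge: decides whether to keep the current query or adopt the revision.
        verdict = llm(f"Proponent: {pro}\nChallenger: {con}\n"
                      "Reply 'KEEP' to keep the current query, "
                      "or 'REVISE: <new query>' to change it.")
        if verdict.strip().upper().startswith("KEEP"):
            break
        query = verdict.split("REVISE:", 1)[-1].strip()
        docs = retrieve(query, top_k=k)
    return docs
```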
2. Phase 2: "Contention-Based Reasoning" in the Generation Stage
Even with good retrieved evidence, models may still give off-target answers, especially when the information is conflicting or the reasoning chain is long.
DRAG introduces a second major mechanism: letting the agents perform 'contention-based reasoning' during the generation stage. In addition, to prevent the model from uncritically trusting biased retrieved information, the paper designs an information asymmetry mechanism, in which two agents with unequal information sources engage in a 'debate' (a minimal sketch follows the list):
Proponent Agent: relies on the retrieved information to answer;
Challenger Agent: answers solely based on its own knowledge, without consulting the retrieved information;
Judge Agent: synthesizes both answers and selects the version that is more factually accurate and logically rigorous.
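Below is a minimal sketch of this asymmetric generation debate, with an assumed `llm` callable and illustrative prompts rather than the paper's exact ones:

```python
# Sketch of the generation-stage debate with information asymmetry
# (illustrative prompts; not the paper's exact implementation).
def generation_debate(question: str, docs: list[str], llm) -> str:
    evidence = "\n".join(docs)
    # Proponent: answers WITH the retrieved evidence.
    pro_answer = llm(f"Context:\n{evidence}\n\nQuestion: {question}\n"
                     "Answer using the context above.")
    # Challenger: answers WITHOUT seeing the evidence, from its own knowledge only.
    con_answer = llm(f"Question: {question}\n"
                     "Answer using only your own knowledge; no external context is given.")
    # Judge: compares the two answers and selects/synthesizes the more factual one.
    return llm(f"Question: {question}\n"
               f"Answer A (evidence-based): {pro_answer}\n"
               f"Answer B (knowledge-only): {con_answer}\n"
               "Choose or synthesize the answer that is more factually accurate and "
               "logically consistent, and state the final answer concisely.")
```

Chaining the two sketches gives the overall flow, e.g. `generation_debate(q, retrieval_debate(q, retrieve, llm), llm)`.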
Experimental Performance: Stable Overall, Stronger on Multi-hop Tasks
This paper comprehensively evaluates DRAG on six QA datasets, covering open-domain QA (TriviaQA, NQ, PopQA), multi-hop QA (2WikiMultihopQA, HotpotQA), and commonsense reasoning (StrategyQA).
As shown in Table 1, DRAG achieved strong performance in multi-hop reasoning tasks and was also competitive in single-hop tasks.
Furthermore, this paper conducts a more detailed analysis of the retrieval debate and generation debate stages:
Table 3 summarizes the average number of debate rounds and query counts for DRAG across different tasks, indicating that DRAG can dynamically adjust its retrieval strategy to adapt to task complexity.
Table 4 investigates DRAG's performance with and without generation debate when correct information is not retrieved, demonstrating that generation debate enhances robustness to retrieval defects.
Table 5 presents ablation studies on DRAG's various modules, showing that both major stages of DRAG are indispensable. Additionally, the asymmetric information setting plays a crucial role in preventing agents from over-relying on retrieved content and promoting factual consistency.
Finally, the paper presents a case study of DRAG, showing that retrieval debate can effectively rule out incorrect retrieval targets and guide the system toward more accurate retrieval strategies.
Conclusion and Outlook
DRAG innovatively optimizes the RAG framework in both retrieval and generation stages through 'multi-agent debate,' effectively mitigating the hallucination on hallucination problem. It has achieved leading performance in multi-hop question answering and open-domain question answering tasks, verifying the generality and effectiveness of this method.
However, DRAG also has certain limitations. On simple single-hop tasks, 'over-debating' can cause 'problem drift.' Future work could therefore explore adaptive stopping strategies to improve the cost-effectiveness trade-off.