Google | Tracing RAG System Errors: Proposing a Selective Generation Framework to Boost RAG Accuracy by Up to 10%

Click "AINLPer" below to follow and get more useful content first-hand.

More exciting content -> Focusing on cutting-edge sharing in Large Models, Agents, RAG, and more!

Introduction

Current RAG systems are widely deployed, but because they involve many interacting components, tracing the source of errors is difficult. To address this, the authors of this paper analyze the errors that occur in RAG systems, introducing the concept of Sufficient Context and showing that hallucinations in RAG systems are often caused by insufficient context. They then propose a selective generation framework to improve RAG accuracy, and experiments show the method can increase accuracy by up to 10%.

Paper link: https://arxiv.org/pdf/2411.06037

Background

Retrieval-Augmented Generation (RAG) is one of the most significant technological advances in NLP today. By combining large language models (LLMs) with a dynamic information retrieval mechanism, it mitigates three core problems of traditional language models: static, quickly outdated knowledge; a tendency toward factual hallucination; and poor coverage of long-tail knowledge. The approach is widely used in both open-source and commercial applications, such as search-based question answering, customer service assistants, and medical diagnosis assistance.

Despite RAG's strong performance on many tasks, hallucinations still occur frequently: the model confidently generates incorrect answers when the retrieved documents are incomplete or irrelevant. This raises a natural question: are these errors caused by the retrieval system failing to provide enough information, or by the model failing to correctly use the context it was given?

To analyze this problem, Google researchers proposed the concept of "Sufficient Context" and studied it in depth, aiming to attribute RAG errors to either retrieval or generation and to offer several strategies for improving RAG output quality.

Context Sufficiency Evaluation Tool

What is Sufficient Context? The authors define it as: whether the retrieved content "contains all the information necessary to support the correct answer." This standard does not require the context to explicitly contain the answer, but it should enable an LLM familiar with the task to reasonably derive the correct answer based on its common sense and reasoning abilities.

To quantify this concept, the authors define a new evaluation task: given a question, an answer, and a context, determine whether the context is sufficient to support the answer. Simply put, if the context contains all the necessary information for a model to produce the correct answer, it is labeled "sufficient"; if it is missing necessary information, or is incomplete, ambiguous, or contradictory, it is labeled "insufficient."

Based on this definition, the authors built an LLM-based Context Sufficiency Evaluator (implemented by prompting Gemini 1.5 Pro) to automatically judge whether the context is sufficient, outputting "True" if it is and "False" otherwise. Experiments show the 1-shot Context Sufficiency Evaluator reaches up to 93% accuracy in judging context sufficiency.
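As a rough illustration, a prompt-based sufficiency check might look like the sketch below; the prompt wording and the generic `generate(prompt) -> str` wrapper are assumptions, not the authors' exact autorater (which prompts Gemini 1.5 Pro with a 1-shot example):

```python
# Illustrative sufficiency autorater; `generate` is any callable that sends a prompt
# to an LLM and returns its text reply.
SUFFICIENCY_PROMPT = """You are judging whether a retrieved context is sufficient to answer a question.
"Sufficient" means the context contains all the information needed to derive the correct answer,
even if the answer is not stated verbatim.

Question: {question}
Context: {context}

Is the context sufficient to answer the question? Reply with exactly one word: True or False."""


def is_context_sufficient(question: str, context: str, generate) -> bool:
    """Ask the LLM wrapper to judge sufficiency and parse its True/False reply."""
    reply = generate(SUFFICIENCY_PROMPT.format(question=question, context=context))
    return reply.strip().lower().startswith("true")
```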

RAG Error Traceability Analysis

The authors utilized the context sufficiency evaluation tool to analyze the performance of various large language models (LLMs) and datasets, drawing the following key findings:

State-of-the-art models (such as Gemini, GPT, and Claude) generally answer well when sufficient context is provided, but when the context is insufficient they tend to output incorrect answers rather than recognize the gap and abstain.

Smaller open-source models have a distinct issue: they often hallucinate even when the context is sufficient to answer the question correctly.

Sometimes, even when the context is judged insufficient, the model can still generate correct answers, indicating that insufficient context may still be useful, for example, by filling gaps in the model's knowledge or clarifying ambiguities in the query.

Based on these findings, the authors proposed recommendations for improving RAG systems: 1) add a sufficiency check before generation; 2) retrieve more context or reorder retrieved context; 3) adjust the abstention threshold based on confidence and context signals.
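For recommendation (1), a minimal way to wire a sufficiency check into a RAG pipeline could look like the sketch below; `retrieve`, `generate_answer`, and `is_sufficient` are assumed callables, and the retry-with-wider-retrieval loop is an illustrative choice, not the paper's implementation:

```python
def answer_with_sufficiency_gate(question, retrieve, generate_answer, is_sufficient,
                                 max_rounds=2, top_k=5):
    """Retrieve, check sufficiency, and only then answer; otherwise widen retrieval or abstain."""
    for _ in range(max_rounds):
        context = retrieve(question, top_k=top_k)   # recommendation (2): fetch (more) context
        if is_sufficient(question, context):        # recommendation (1): sufficiency check
            return generate_answer(question, context)
        top_k *= 2                                  # widen retrieval and try again
    return "I don't know"                           # abstain rather than risk a hallucination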

Context Sufficiency in Evaluation Benchmarks

The authors also examined how often benchmark contexts are actually sufficient. Their analysis revealed a significant number of insufficient-context instances in several standard benchmarks; they considered three datasets: FreshQA, HotPotQA, and MuSiQue. Datasets with a higher proportion of sufficient-context instances, such as FreshQA, tend to have contexts drawn from manually curated supporting documents.

Context Leading to Hallucinations

Surprisingly, although retrieval-augmented generation generally improves overall performance, it reduces the model's willingness to decline to answer when it should. Introducing additional context appears to increase the model's confidence, making it more prone to hallucinate.

To understand this, the authors used Gemini to grade each model's answers against the reference answers, classifying every answer as "correct," "hallucination" (an incorrect answer), or "abstention" (e.g., saying "I don't know"). Using this method, they found, for example, that Gemma gave incorrect answers to 10.2% of questions when given no context, a proportion that rose to 66.1% when the provided context was insufficient.
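A minimal sketch of this kind of LLM-based grading might look as follows; the prompt wording and the generic `generate(prompt) -> str` wrapper are assumptions rather than the authors' exact setup:

```python
GRADING_PROMPT = """Grade the model answer against the reference answers.

Question: {question}
Reference answers: {references}
Model answer: {answer}

Reply with exactly one label:
correct - the model answer matches a reference answer
hallucination - the model answer is wrong
abstention - the model declined to answer (e.g. "I don't know")"""


def grade_answer(question: str, answer: str, references: list, generate) -> str:
    """Classify one answer as correct / hallucination / abstention using an LLM grader."""
    reply = generate(GRADING_PROMPT.format(
        question=question, references="; ".join(references), answer=answer))
    label = reply.strip().lower().split()[0] if reply.strip() else ""
    # If the grader's reply cannot be parsed, conservatively count it as a hallucination.
    return label if label in {"correct", "hallucination", "abstention"} else "hallucination"
```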

Selective Generation Framework

Based on the above analysis, the authors proposed a "selective generation" framework that utilizes sufficient context information to guide abstention. The authors considered the following metrics: 1) "selective accuracy" measures the proportion of correct answers among the questions the model attempts to answer; 2) "coverage" is the proportion of questions answered.
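To make the two metrics concrete, here is a small helper (not from the paper) that computes them from per-question labels of the kind produced by the grading step above:

```python
def selective_metrics(labels):
    """Compute (selective accuracy, coverage) from per-question labels."""
    attempted = [l for l in labels if l != "abstention"]
    coverage = len(attempted) / len(labels) if labels else 0.0
    selective_accuracy = attempted.count("correct") / len(attempted) if attempted else 0.0
    return selective_accuracy, coverage


# e.g. ["correct", "abstention", "hallucination", "correct"] -> (0.667, 0.75)
```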

The authors' selective generation method "combines sufficient context signals with the model's self-assessed confidence scores to make informed decisions about when to abstain." This is more refined than simply abstaining when context is insufficient, as the model can sometimes give correct answers even with limited context. The authors used these signals to train a logistic regression model to predict hallucinations. Then, they set a coverage-accuracy trade-off threshold to determine when the model should decline to answer.
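A minimal sketch of that decision rule is below, assuming two features per example (a self-assessed confidence score and the binary sufficiency flag) and using scikit-learn's logistic regression; the exact features, training data, and threshold selection in the paper may differ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_abstention_model(confidences, sufficiency_flags, hallucinated):
    """Fit P(hallucination | confidence, sufficiency) on held-out, already-graded examples."""
    X = np.column_stack([confidences, sufficiency_flags])
    y = np.asarray(hallucinated, dtype=int)  # 1 = hallucination, 0 = correct
    return LogisticRegression().fit(X, y)


def should_abstain(model, confidence, sufficient, threshold=0.5):
    """Abstain when predicted hallucination risk exceeds the coverage/accuracy threshold."""
    risk = model.predict_proba([[confidence, float(sufficient)]])[0, 1]
    return risk > threshold
```

Raising the threshold answers more questions (higher coverage) at the cost of more hallucinations; lowering it trades coverage for selective accuracy.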


The authors used two main signals to decide whether to abstain:

"Self-assessed confidence" adopted two strategies: P(True) and P(Correct). P(True) involves sampling answers multiple times and prompting the model to label each sample as correct or incorrect. P(Correct) is used for models with high query costs, involving obtaining the model's answer and its estimated probability of correctness.

"Sufficient context signal" uses the binary label from the self-evaluation tool model (FLAMe) to indicate whether the context is sufficient. Crucially, the authors do not need the true answer to determine the sufficient context label, so this signal can be used when answering questions.imageThe authors' research results show that this method achieves a better selective accuracy-coverage trade-off compared to using only model confidence. By using the sufficient context label, the authors can improve accuracy on the questions the model answers, sometimes by as much as 10%.


Recommended reading

[1] Transformer | Feedforward Neural Network (FFN)

[2] Transformer | From MHA to DeepSeek MLA!

[3] Transformer | Attention Mechanism

[4] Transformer | MoE Architecture (including DeepSeek)

[5] Transformer | Normalization

[6] Transformer | Positional Encoding (DeepSeek Positional Encoding)

