Tsinghua University's New RAG Framework: DO-RAG Accuracy Soars by 33%!

Published on: May 17, 2025


1. Current Status of RAG Research

Question Answering (QA) systems enable users to accurately retrieve information from vast data using natural language, primarily categorized into two types:

Open-domain QA relies on general world knowledge to answer

Closed-domain QA requires support from specialized data

With breakthroughs in Large Language Models (LLMs) like DeepSeek-R1 and Grok-3, text fluency and semantic understanding have significantly improved. However, these models rely on parametric memory and may still "hallucinate" or provide irrelevant answers when encountering specialized terms or complex reasoning.

Retrieval-Augmented Generation (RAG) improves accuracy by fetching relevant passages before answering, while Knowledge Graphs (KG) support multi-step reasoning with structured relational networks.

However, existing solutions have clear drawbacks:

Complex associations in technical documents are often fragmented during retrieval, leading to disjointed answers;

Building high-quality domain-specific graphs is time-consuming and labor-intensive, and integrating them with vector search imposes a huge engineering burden.

To address this, Tsinghua University's team introduced the DO-RAG framework, achieving three major innovations:

Building Dynamic Knowledge Graphs: Automatically extracts entity relationships from text, tables, and other multimodal data through a multi-level agent pipeline.

Dual-Track Retrieval Fusion: Combines graph reasoning with semantic search to generate information-rich prompt templates.

Hallucination Correction Mechanism: Verifies answers against the knowledge base and iteratively corrects logical flaws.

In tests within specialized fields like databases, DO-RAG achieved 94% accuracy, outperforming mainstream solutions by up to 33 percentage points. Its modular design supports plug-and-play functionality, allowing transfer to new domains without retraining.

2. What is DO-RAG?

2.1 System Architecture Overview

[Figure: DO-RAG system architecture with its four core modules]

As shown in the figure, the DO-RAG system consists of four core modules:

Multimodal document parsing and chunking

Multi-level entity relationship extraction in Knowledge Graph (KG) construction

Hybrid retrieval mechanism combining graph traversal and vector search

Multi-stage generation engine for precise answers

The system first intelligently chunks heterogeneous data such as logs, technical documents, and charts. Text segments and their vectorized representations are simultaneously stored in a PostgreSQL database enhanced with pgvector.
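Below is a minimal sketch of this storage step. The article only states that PostgreSQL with pgvector is used; the table layout, column names, and 1536-dimension embedding size are illustrative assumptions.

```python
# Sketch: store chunks and embeddings in PostgreSQL + pgvector, query by cosine distance.
# Schema and embedding dimension are assumptions, not the paper's actual setup.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=dorag user=postgres")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id        SERIAL PRIMARY KEY,
        content   TEXT,
        metadata  JSONB,           -- source file, section hierarchy, etc.
        embedding VECTOR(1536)     -- dimension depends on the embedding model
    );
""")
conn.commit()

def to_vector_literal(emb: list[float]) -> str:
    """pgvector accepts '[x1,x2,...]' text literals."""
    return "[" + ",".join(str(x) for x in emb) + "]"

def insert_chunk(text: str, meta: dict, emb: list[float]) -> None:
    """Store one chunk together with its vectorized representation."""
    cur.execute(
        "INSERT INTO chunks (content, metadata, embedding) VALUES (%s, %s, %s::vector)",
        (text, Json(meta), to_vector_literal(emb)),
    )
    conn.commit()

def top_k(query_emb: list[float], k: int = 5):
    """Nearest chunks by cosine distance (pgvector's `<=>` operator)."""
    cur.execute(
        "SELECT content, metadata FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
        (to_vector_literal(query_emb), k),
    )
    return cur.fetchall()
```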

Through a chain-of-thought driven agent process, document content is transformed into a structured multimodal knowledge graph (MMKG), precisely capturing multi-dimensional associations like system parameters and behavioral characteristics.

When a user initiates a query, the intent parsing module decomposes it into several sub-queries. The system first locates relevant entity nodes in the knowledge graph, extends retrieval boundaries through multi-hop reasoning, and obtains structured context rich in domain-specific features.

Subsequently, the system uses graph-aware prompt templates to semantically refine the original query, transforming it into an unambiguous and precise expression. The optimized query is then vectorized to retrieve the most relevant text segments from the database.

Finally, the system integrates the original query, optimized statements, graph context, retrieval results, and dialogue history to construct a unified prompt input for the generation engine.

Answer generation undergoes a three-stage refinement: initial generation, factual verification and semantic optimization, and final condensation. The system also intelligently predicts subsequent questions, providing a natural and fluid multi-turn dialogue experience.

2.2 Knowledge Base Construction

Document processing begins with multimodal input. Text, tables, and images are standardized and segmented into semantically coherent chunks, while retaining metadata like source file structure and section hierarchy to ensure traceability.
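A minimal sketch of such metadata-preserving chunking is shown below, assuming a markdown-like heading structure; the paper does not describe its exact splitting rules, so the heuristic (split by headings, then by size) is an assumption.

```python
# Sketch: split a document into chunks while keeping the section hierarchy as metadata.
# The splitting heuristic and the 800-character limit are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Chunk:
    content: str
    metadata: dict = field(default_factory=dict)   # source file, section path, etc.

def chunk_document(text: str, source: str, max_chars: int = 800) -> list[Chunk]:
    chunks: list[Chunk] = []
    section_path: list[str] = []
    buffer: list[str] = []

    def flush() -> None:
        if buffer:
            chunks.append(Chunk("\n".join(buffer),
                                {"source": source, "sections": list(section_path)}))
            buffer.clear()

    for line in text.splitlines():
        if line.startswith("#"):                   # new section: flush and update hierarchy
            flush()
            level = len(line) - len(line.lstrip("#"))
            section_path[:] = section_path[:level - 1] + [line.lstrip("# ").strip()]
        else:
            buffer.append(line)
            if sum(len(part) for part in buffer) > max_chars:
                flush()
    flush()
    return chunks
```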

[Figure: multi-agent pipeline for knowledge graph construction]

A multi-agent hierarchical pipeline extracts structured knowledge in parallel. As shown in the figure, four specialized agents perform their respective duties:

High-level agent: Parses document skeleton (sections/paragraphs)

Mid-level agent: Extracts domain entities (system components/APIs/parameters)

Low-level agent: Mines fine-grained operational logic (thread behavior/error paths)

Covariate agent: Annotates node attributes (default values/performance impact)

Finally, a dynamic knowledge graph is generated, where nodes represent entities, edges represent associations, and weights represent confidence. Duplicate removal is achieved by comparing entity embedding vectors using cosine similarity, and similar entities are aggregated into summary nodes to simplify the graph.
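A minimal sketch of this deduplication step follows, assuming each extracted entity already carries an embedding vector; the 0.9 threshold and the merge policy are assumptions, not values given in the paper.

```python
# Sketch: merge near-duplicate KG entities by cosine similarity of their embeddings.
# Threshold and "keep the first, absorb the rest" policy are illustrative assumptions.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def deduplicate(entities: list[dict], threshold: float = 0.9) -> list[dict]:
    """entities: [{'name': str, 'embedding': np.ndarray}, ...]"""
    merged: list[dict] = []
    for ent in entities:
        for kept in merged:
            if cosine(ent["embedding"], kept["embedding"]) >= threshold:
                # aggregate the duplicate into a summary node instead of adding a new one
                kept.setdefault("aliases", []).append(ent["name"])
                break
        else:
            merged.append(ent)
    return merged
```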

2.3 Hybrid Retrieval and Query Decomposition

[Figure: hybrid retrieval and query decomposition workflow]

As shown in the figure, when a user asks a question, DO-RAG uses a Large Language Model-based intent analyzer to structurally decompose the question, generating sub-queries to guide Knowledge Graph (KG) and vector library retrieval.

The system first extracts relevant nodes from the KG based on semantic similarity and constructs a context-rich subgraph through multi-hop traversal. Guided by graph-aware prompts, this graph evidence is used to refine the query formulation and eliminate ambiguity. After the refined query is vectorized, semantically similar content fragments are retrieved from the vector store.

Ultimately, DO-RAG integrates all information, including the original query, optimized statements, graph context, retrieval results, and user dialogue history, into a unified prompting framework.
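A minimal sketch of this retrieval-and-fusion flow is shown below. Every injected callable (decompose, kg_entry_nodes, kg_subgraph, rewrite_query, embed, vector_search, llm) is a hypothetical stand-in for a component the paper does not expose; only the overall control flow follows the description above.

```python
# Sketch: DO-RAG-style hybrid retrieval and prompt fusion.
# All injected callables are hypothetical; only the control flow mirrors the article.
from typing import Callable, List

def answer(query: str,
           history: List[str],
           *,
           decompose: Callable[[str], List[str]],
           kg_entry_nodes: Callable[[str], list],
           kg_subgraph: Callable[[list], str],
           rewrite_query: Callable[[str, List[str]], str],
           embed: Callable[[str], List[float]],
           vector_search: Callable[[List[float]], List[str]],
           llm: Callable[[str], str]) -> str:
    sub_queries = decompose(query)                      # LLM-based intent analysis

    graph_context = []
    for sq in sub_queries:
        nodes = kg_entry_nodes(sq)                      # entry points found by semantic similarity
        graph_context.append(kg_subgraph(nodes))        # multi-hop traversal around those nodes

    refined = rewrite_query(query, graph_context)       # graph-aware disambiguation
    passages = vector_search(embed(refined))            # semantic retrieval from the vector store

    prompt = "\n\n".join([
        f"Original question: {query}",
        f"Refined question: {refined}",
        "Graph context:\n" + "\n".join(graph_context),
        "Retrieved passages:\n" + "\n".join(passages),
        "Dialogue history:\n" + "\n".join(history),
        "Answer strictly from the evidence above; reply 'I don't know' if it is insufficient.",
    ])
    return llm(prompt)
```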

2.4 Answer Generation and Delivery

[Figure: staged answer generation and delivery]

As shown in the figure, the final answer is generated through a staged prompting strategy.

First, a basic prompt requires the LLM to answer only based on retrieved evidence, avoiding unsubstantiated content.

Then, an optimized prompt is used to restructure and validate the answer.

The final condensation stage ensures the tone, language, and style of the answer are consistent with the question.

To enhance the interactive experience, DO-RAG also generates follow-up questions based on the optimized answer. The final delivered content includes:

(1) A refined, verifiable answer,

(2) Citations indicating sources,

(3) Targeted follow-up questions.

If insufficient evidence exists, the system will truthfully return "I don't know," ensuring reliability and accuracy.
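A minimal sketch of this staged prompting is given below, with a generic llm callable; the prompt wording and the evidence check are assumptions rather than the paper's actual prompts.

```python
# Sketch: three-stage answer refinement with an "I don't know" fallback.
# Prompt wording and the sufficiency check are illustrative assumptions.
from typing import Callable

def generate_answer(question: str, evidence: str, llm: Callable[[str], str]) -> str:
    if not evidence.strip():                     # insufficient evidence: refuse honestly
        return "I don't know"

    draft = llm(
        "Answer ONLY from the evidence below; do not add unsupported claims.\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}"
    )
    verified = llm(
        "Check the draft against the evidence, remove unsupported statements, "
        f"and restructure it.\nEvidence:\n{evidence}\n\nDraft:\n{draft}"
    )
    final = llm(
        "Condense the answer so its tone, language, and style match the question.\n"
        f"Question: {question}\n\nAnswer:\n{verified}"
    )
    follow_ups = llm(f"Suggest two or three follow-up questions for: {final}")
    return f"{final}\n\nFollow-up questions:\n{follow_ups}"
```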

3. Performance Comparison

Client Service International (CSII)'s SunDB distributed relational database was chosen as the test platform. Its heterogeneous dataset, composed of technical manuals, system logs, and specification documents, provided an ideal scenario for verifying DO-RAG's multimodal processing, entity relationship mining, and hybrid retrieval capabilities.

3.1 Experiment Configuration

3.1.1 Hardware Environment

Ubuntu workstation with 64GB RAM + NVIDIA A100 GPU

3.1.2 Software Stack

Tracking System: LangFuse (v3.29.0)

Cache Management: Redis (v7.2.5)

Document Storage: MinIO (latest version)

Analysis Engine: ClickHouse (stable version)

Vector Database: PostgreSQL + pgvector combination

3.1.3 Test Data

SunDB core dataset: Technical documents containing embedded code.

Electrical engineering auxiliary set: Technical manuals with circuit diagrams.

Each dataset contains 245 professional questions annotated with reference answers and precise sources.

3.1.4 Evaluation System

Four core metrics (passing threshold: 0.7):

Answer Relevancy (AR) - Semantic matching degree

Contextual Recall (CR) - Information completeness

Contextual Precision (CP) - Result purity

Faithfulness (F) - Answer trustworthiness

3.1.5 Evaluation Toolchain

RAGAS for metric calculation.

DeepEval for end-to-end verification.

LangFuse for full-pipeline tracing.
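A minimal sketch of how the 0.7 passing threshold could be applied over the four metrics above; the dict layout is an assumption, and the per-question scores themselves would come from the RAGAS/DeepEval runs rather than this code.

```python
# Sketch: apply the 0.7 passing threshold to per-question metric scores.
# Score layout is an assumption; scores are produced by RAGAS / DeepEval, not here.
METRICS = ("answer_relevancy", "contextual_recall", "contextual_precision", "faithfulness")
THRESHOLD = 0.7

def passes(scores: dict[str, float]) -> bool:
    """A question passes only if every metric clears the threshold."""
    return all(scores[m] >= THRESHOLD for m in METRICS)

def summarize(results: list[dict[str, float]]) -> dict[str, float]:
    """Average each metric across all evaluated questions."""
    return {m: sum(r[m] for r in results) / len(results) for m in METRICS}
```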

3.1.6 Comparison Schemes

Horizontal comparison: three mainstream frameworks (FastGPT, TiDB.AI, and Dify.AI).

Vertical comparison: Knowledge Graph enhanced version vs. pure vector retrieval version.

3.2 External Benchmark Test

[Table: cross-model benchmark scores for SunDB.AI vs. FastGPT, TiDB.AI, and Dify.AI]

As shown in the table above, in cross-model tests, SunDB.AI's overall score comprehensively surpassed FastGPT, TiDB.AI, and Dify.AI, the three baseline systems.

[Figure: comparative visualization of benchmark results]

The figure above presents SunDB.AI's consistent lead through a comparative visualization.

3.3 Internal Optimization Validation

[Table: ablation results with and without the knowledge graph]

The table above indicates that after integrating the knowledge graph, DeepSeek-V3's answer relevancy increased by 5.7%, contextual precision by 2.6%, and both models achieved 100% contextual recall.

When the graph was not enabled, recall rate dropped to 96.4%-97.7%, and trustworthiness decreased due to reliance on unstructured search.

DeepSeek-R1 showed a slight 5.6% drop in trustworthiness after the graph was enabled, presumably due to its more creative output style.

3.4 Domain-Specific Performance

[Tables III/IV: SunDB and electrical engineering domain results]

SunDB and electrical domain test data (Tables III/IV) show that the contextual recall rate for all models approached full marks. The differentiated performance in answer relevancy, precision, and trustworthiness reflects the strengths of different models.

Xiaoxiannv's Review:

It feels a bit like a gimmick; the benchmarks did not include classic Graph+RAG frameworks such as GraphRAG or LightRAG. However, the multi-agent design for graph construction is worth learning from. It's a pity the project is not open source.

Paper Original: https://arxiv.org/abs/2505.17058

Get the latest arXiv paper updates: https://github.com/HuggingAGI/HuggingArxiv


Main Tag: Retrieval-Augmented Generation

Sub Tags: Large Language Models, Natural Language Processing, Artificial Intelligence, Knowledge Graphs

