LLMs in Document Intelligence: Survey, Progress, and Future Trends

With the advent of the digital age, the volume of documents has increased dramatically, encompassing text files, web pages, slides, posters, spreadsheets, and even scene-text images. These documents not only encapsulate the details of internal and external business processes and the accumulated knowledge of different industries, but also contain a vast amount of industry-specific instances and data, holding immeasurable value. In recent years, Large Language Models (LLMs), represented by the GPT series, have greatly advanced the field of document intelligence, leading us to believe that tasks like contract review or financial-report Q&A could be handed over directly to AI. However, when we feed a model a year's worth of invoices, contracts, and annual reports and ask, "What is this quarter's year-over-year net profit?", the model often struggles: the text is extracted correctly, but the structure is lost; the answer is present, but the source tracing is gone; the context window is lengthened, but hallucinations explode.

Therefore, how to efficiently and automatically analyze, classify, extract from, and query these documents in order to unlock their value at scale becomes crucial. This is the core problem addressed by this paper from researchers at Southeast University and the Beijing Institute of Computer Technology and Application. They analyze the trade-offs between pipeline-based and end-to-end approaches, the synergy between RAG and long-context processing, and the "hard nuts to crack" such as tables, layouts, and formulas, providing a practical engineering blueprint. The survey has been accepted by ACM.

Main Contributions Include:

  • Comprehensive Literature Review: Reviews a total of 322 papers, focusing on 265 published between 2021 and 2025, providing a deep perspective on the evolution of the field.

  • In-depth Analysis of Current Development Paradigms: Systematically compares Pipeline Parsing vs. End-to-End Parsing, covers document parsing, summarizes Document and Table-specific LLMs, refines the full RAG (Retrieval-Augmented Generation) pipeline, and organizes Long Context methods.

  • Summary of Practical Applications, Datasets, and Evaluation Standards: Compiles 20 real-world application tasks, 30 commonly used datasets, 6 benchmark suites, and 16 evaluation metrics.

  • Discussion of Challenges and Future Directions: Discusses the main challenges currently facing Document LLMs and future development directions.

Essential Summary: Cognitive Navigation Map for Document Intelligence

Facing the gap between what AI promises for complex documents and what it delivers in practice, this major review provides us with a clear "battle map." This summary systematically breaks down the core content of that map, focusing on three levels:

  • Two Major Paradigms: Deep comparison of the modular combination of the Pipeline approach versus the single-step processing of the End-to-End approach, analyzing their pros, cons, and trade-offs in engineering practice.

  • Four Core Technologies: Detailed analysis of the most crucial technological pathways, including Document Parsing, Dedicated LLM Fine-tuning, the popular RAG (Retrieval-Augmented Generation), and bottleneck-breaking Long Context Processing, examining how they cooperate to solve challenges like tables, layouts, and multi-page documents.

  • A Complete Ecosystem: Comprehensive organization of the complete ecosystem from datasets, open-source tools to industry benchmarks and evaluation metrics, providing a basis for technology assessment and deployment.

We believe that after reading this, you will have a comprehensive and systematic understanding of the current field of document intelligence.

8 Core Challenges in Document Intelligence

The researchers first summarized eight pervasive challenges (CH1-CH8) in document processing, which serve as the starting point for understanding subsequent technical solutions:

  1. Document Parsing: How to accurately extract text, layout, tables, and other information from diverse formats (PDF, images), and handle scanning noise.

  2. Complex Layouts: Documents often contain complex formatting like headers, footers, multiple columns, and charts; models need to understand this visual layout to correctly interpret the content.

  3. Rich-detail Images: Images in documents (like charts and diagrams) have higher resolution and richer detail than natural scene images, placing high demands on visual encoders.

  4. Multi-page Documents: How to maintain context continuity and associate cross-page information when processing multi-page documents.

  5. Tabular Recognition: Accurately identifying table rows, columns, and cell boundaries, especially for complex merged cells.

  6. Table Inference: Not only recognizing the table but also performing logical and mathematical reasoning on the data within (e.g., calculating financial statements).

  7. Multimodal Information Utilization: How to effectively integrate various modalities of information, such as text, images, tables, and layout.

  8. Long Context: Documents are typically long, often exceeding the context window limits of existing LLMs, leading to incomplete information processing.

Two Major Technical Paradigms

The researchers categorize current technical solutions into two major paradigms, whose main difference lies in whether they rely on traditional Optical Character Recognition (OCR) tools.

  • Pipeline-based Paradigm (OCR-based):

  • Process: This is a modular, multi-stage processing flow: Document Image -> Image Preprocessing -> Layout Analysis -> OCR Recognition -> Semantic Understanding. Each stage uses specialized tools or models, e.g., using an OCR tool to extract text, then feeding the text into an LLM for understanding.

  • Advantages: Clear structure, modules can be optimized independently, and high interpretability.

  • Disadvantages: Long process, prone to error accumulation (errors from earlier stages propagate and affect subsequent stages), and high engineering overhead.

  • End-to-End Paradigm (OCR-free):

  • Process: Directly takes the document image and task instruction as input, generating the final result (e.g., structured data in JSON format) through a unified multimodal large model (MLLM). Representative models include Donut, Nougat.

  • Advantages: Avoids information loss in intermediate steps, stronger adaptability to complex layouts and non-standard documents.

  • Disadvantages: Requires extremely large models, massive training data, and huge computational resources, and is susceptible to the "hallucination" problem.

Key Technology One: Document Parsing

Document parsing is the entry point of the document intelligence workflow. Its core goal is to take various document formats (e.g., scanned copies, PDFs, web pages) as input and output a structured, machine-understandable representation or semantic information. This technology is mainly implemented through two different paradigms:

Pipeline-based Methods and End-to-End Methods.

1. Pipeline-based Methods

This approach inherits traditional document analysis concepts, breaking down complex parsing tasks into a series of independent, sequential, modular steps.

Core Workflow

A typical pipeline workflow includes the following key stages (a minimal code sketch after the list shows how they chain together):

  • Image Processing: This is the initial preprocessing stage aimed at improving the quality of the document image, laying the foundation for subsequent steps. Specific tasks include:

  • Preprocessing: Such as image denoising, contrast enhancement, binarization, etc.

  • Correction: Correcting image skew and distortion issues.

  • Interference Removal: Removing decorative elements like borders and watermarks.

  • Layout Analysis: This stage aims to identify and segment the physical structure of the document, understanding the position and relationship of various content elements (e.g., text blocks, titles, tables, images).

  • Technical Evolution: Early research used CNNs directly to detect layout units, while more recent multimodal Transformer-based methods achieve better results by combining image and text embeddings. Another line of work represents the document as a graph and applies Graph Neural Networks (GNNs) for segmentation and classification.

  • Content Recognition: After layout analysis, this stage focuses on recognizing specific content.

  • Text Recognition (OCR): This is the core part, including the recognition of printed, handwritten, and scene text. Researchers use the Transformer architecture to unify text detection and recognition tasks or leverage self-supervised learning to enhance model robustness.

  • Mathematical Formula Recognition: Due to the complex structure of formulas (e.g., superscripts, subscripts, special symbols), recognition is much harder than ordinary text. Related methods usually detect formula entities first, then use multimodal Transformers for grouping and parsing.

  • Entity Normalization: After OCR, text may contain errors. This step aims to eliminate ambiguity in entities (like names, organizations) and convert them into standardized identifiers.

  • Semantic Understanding: This is the final step of the pipeline, designed to extract valuable information from the recognized content and understand its meaning. Tasks include:

  • Information Extraction: Extracting key entities and relationships from the text.

  • Document Q&A: Answering user questions based on document content.

  • Summary Generation: Automatically generating a summary of the document's core content.
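
To make the chain of stages concrete, here is a minimal sketch of the pipeline idea in Python. It assumes Pillow and pytesseract for preprocessing and OCR, and uses a hypothetical `ask_llm` stub for the semantic-understanding stage; it illustrates the paradigm rather than reproducing any specific tool, and the file path is a placeholder.

```python
# Minimal pipeline sketch: image preprocessing -> OCR -> LLM-based understanding.
# Pillow and pytesseract are assumed to be installed; `ask_llm` is a hypothetical stub.
from PIL import Image, ImageOps
import pytesseract

def preprocess(path: str) -> Image.Image:
    """Image processing stage: grayscale + autocontrast as a stand-in for denoising/correction."""
    return ImageOps.autocontrast(Image.open(path).convert("L"))

def recognize(img: Image.Image) -> str:
    """Content recognition stage: plain OCR in reading order."""
    return pytesseract.image_to_string(img)

def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder -- replace with a real LLM client."""
    return "<LLM answer goes here>"

def understand(text: str, question: str) -> str:
    """Semantic understanding stage: hand the recognized text to an LLM."""
    return ask_llm(f"Document text:\n{text}\n\nQuestion: {question}\nAnswer:")

if __name__ == "__main__":
    ocr_text = recognize(preprocess("invoice_page1.png"))   # illustrative file path
    print(understand(ocr_text, "What is the invoice total?"))
```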

Advantages and Disadvantages

  • Advantages: Each module can be optimized and replaced independently, providing the system with strong interpretability and controllability.

  • Disadvantages: The process is longer, and errors from preceding stages are propagated and accumulated in subsequent stages, potentially leading to overall performance degradation.

Related Tools

Many open-source tools and frameworks adopt the pipeline approach, such as:

  • PP-Structure: Integrates image correction, layout analysis, and various recognition tools for document parsing.

  • Docling: A Python package that integrates features like layout analysis and table structure recognition.

  • MinerU: Integrates multiple open-source tools for OCR, table recognition, and formula recognition, along with extensive engineering post-processing.

  • RagFlow: A RAG framework focused on document parsing, applying OCR technology and parsers to support different document formats.

2. End-to-End Methods

In contrast to pipeline methods, the end-to-end paradigm uses a unified Multimodal Large Language Model (MLLM) to take the raw document image and task instructions (prompts) directly as input, generating the final parsing result in a single step. These methods are often referred to as "OCR-Free" methods because they do not rely on external OCR tools to extract text.

Core Idea

The core of the end-to-end approach is training a large Vision-Language Model (LVLM) to directly understand the text and layout information within the image.

  • Model Training: It typically requires constructing a large number of <prompt, doc_image, ocr_md> triplets for specific training and fine-tuning.

  • Representative Models:

  • Donut: The first proposed OCR-Free model, mapping the input image directly to structured output. It learns to "read" text during pre-training and to "understand" the whole document through downstream fine-tuning (see the usage sketch after this list).

  • Nougat: Uses a Swin Transformer encoder and an mBART decoder to directly convert academic documents in PDF format into machine-readable Markdown language.
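
As a usage illustration of such OCR-free models, the sketch below queries the publicly released Donut DocVQA checkpoint through the Hugging Face Transformers interface. The checkpoint name, prompt format, and post-processing follow the public model card and should be treated as assumptions here, not details drawn from the survey; the image path is a placeholder.

```python
# Minimal Donut (OCR-free) document-QA sketch via Hugging Face Transformers.
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

checkpoint = "naver-clova-ix/donut-base-finetuned-docvqa"   # public DocVQA checkpoint
processor = DonutProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

image = Image.open("invoice.png").convert("RGB")            # illustrative document image
question = "What is the invoice total?"
task_prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"

pixel_values = processor(image, return_tensors="pt").pixel_values
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=512,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
)

sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task-start token
print(processor.token2json(sequence))                        # e.g. {"question": ..., "answer": ...}
```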

Advantages and Disadvantages

  • Advantages:

  • Avoids the error accumulation issue caused by chaining multiple modules in the pipeline method.

  • Shows stronger adaptability when dealing with complex layouts and non-standard documents.

  • The workflow is unified and streamlined, handled by a single model.

  • Disadvantages:

  • Prone to Hallucination and insufficient generalization ability.

  • Requires extremely large model scale, massive training data, and huge computational resources.

  • Slow inference speed and high memory consumption limit its application in real-time scenarios.

Document parsing technology is evolving from traditional, modular pipeline methods towards more integrated and powerful End-to-End methods. Pipeline methods are mature and controllable, remaining a practical and necessary choice in many scenarios. End-to-end methods, however, represent the future direction, with enormous potential despite current performance and resource challenges.

Key Technology Two: Dedicated Document and Table LLMs

The second key technology is Fine-tuning Document LLMs.

The core idea of this technology is that although general pre-trained models (such as the BLIP vision-language model and the FlanT5 language model) possess fundamental capabilities for understanding images and text, they are not optimized for documents, which are specialized "images" rich in text, complex in layout, and diverse in structure. Therefore, through fine-tuning, the capabilities of these general models can be inherited and extended into specialized models for document tasks, known as Document LLMs.

The paper divides this technological area into two categories: general Document LLMs and specialized Table LLMs.

1. Document LLMs

Document LLMs aim to fully understand the entire document in an end-to-end manner, effectively preserving visual layout, structural information, and multimodal cues, making them particularly suitable for tasks requiring precise layout retention and comprehensive multimodal reasoning.

Typical Fine-tuning Frameworks

A typical fine-tuning process usually includes several key components; a minimal sketch of the frozen-backbone-plus-bridge idea follows the list:

  1. Frozen Backbones: Typically, two pre-trained, frozen-parameter models (not participating in training) are used, such as a visual encoder (like BLIP) to understand the image, and a large language model (like FlanT5) to process text and instructions.

  2. Trainable "Bridge" Structures: Trainable modules are introduced to align visual and language information. Examples mentioned in the paper include Document-former and Feed-Forward Networks (FFN). The Document-former's role is to map the visual information output by the visual encoder into the semantic space of the language model.

  3. Input and Output: Input typically includes the document image, the OCR text and coordinates extracted from it, and a task instruction (Prompt). From these inputs, the LLM ultimately generates the required output, such as a classification label or an answer to a question.
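
The sketch below illustrates the frozen-backbone-plus-trainable-bridge recipe in PyTorch. The dimensions, the query-token count, and the `DocBridge` module itself are illustrative stand-ins for a Document-former-style connector, not the architecture of any specific model from the paper.

```python
# Sketch of the "frozen backbones + trainable bridge" fine-tuning recipe.
import torch
import torch.nn as nn

class DocBridge(nn.Module):
    """Maps frozen visual features into the (frozen) LLM's embedding space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096, num_query_tokens: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_query_tokens, vision_dim))
        self.attn = nn.MultiheadAttention(vision_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from a frozen visual encoder
        q = self.queries.unsqueeze(0).expand(vision_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, vision_feats, vision_feats)   # cross-attention over patches
        return self.proj(pooled)                               # (batch, num_query_tokens, llm_dim)

bridge = DocBridge()
# vision_encoder.requires_grad_(False); llm.requires_grad_(False)   # backbones stay frozen
optimizer = torch.optim.AdamW(bridge.parameters(), lr=1e-4)          # only the bridge is trained

dummy_feats = torch.randn(2, 196, 1024)       # stand-in for visual-encoder output
visual_tokens = bridge(dummy_feats)           # tokens to prepend to the LLM's text embeddings
print(visual_tokens.shape)                    # torch.Size([2, 32, 4096])
```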

Key Challenges Solved and Corresponding Technologies

Fine-tuning Document LLMs primarily addresses the following core challenges:

  • Challenge 1: Complex Structure and Layout Understanding

  • Problem: Document semantics are determined not only by text but also closely related to layout (e.g., the positional relationship of headings, lists, and paragraphs).

  • Solution: Input layout information as an independent modality to the model.

  • DocLLM: Obtains the bounding box coordinates for each text token via OCR and inputs this spatial layout information alongside the text information as independent vectors to the model.

  • LayoutLLM: Uses an encoder like LayoutLMv3 to process the document image and explicitly represent its 2D positional features (e.g., top-left and bottom-right coordinates).

  • InstructDoc: Also uses OCR to extract text and bounding box coordinates, connecting the visual encoder, OCR coordinates, and LLMs via a Document-former.

  • Challenge 2: High-Resolution Image Processing

  • Problem: Document images have higher resolution and greater information density compared to natural images. Most visual encoders have limited input resolution, and direct scaling leads to the loss of key details.

  • Solution: Adopting special image processing strategies to handle high-resolution images in an OCR-Free manner.

  • mPLUG-DocOwl1.5: Uses a shape-adaptive slicing module to cut high-resolution images into multiple sub-images for processing.

  • TextMonkey: Uses a sliding window to partition high-resolution images and a token resampler to compress overly long token sequences, improving efficiency while preserving information.

  • Fox: Achieves high-efficiency fine-tuning for multi-page documents by compressing a 1024x1024 page into 256 image tokens using a high compression rate.

  • Challenge 3: Multi-Page Document Understanding

  • Problem: Real-world documents are mostly multi-page; the model needs to understand and associate information spanning different pages.

  • Solution:

  • Hierarchical Processing: Models like Hi-VT5 and InstructDoc first process each page independently, then aggregate the output of each page (e.g., via average pooling), and finally feed it into the LLM to generate the final answer.

  • Unified Embedding: Embed image blocks, OCR text, and coordinates from different pages into a unified space, allowing the model to better capture cross-page relationships.

  • Advanced Visual Modeling: Utilizing high-resolution document compression modules in models like DocOwl2 to efficiently handle multi-page documents by compressing image features while retaining key layout and textual information.

2. Table LLMs

Tables are a common and important form of structured data in documents, but their complex structures (like merged cells) pose a huge challenge to LLM understanding and reasoning. Table LLMs are specifically designed to address these challenges.

Main Technical Pathways

  • Pathway 1: Tabular Data Training

  • Core Idea: Specially train LLMs by constructing large-scale training data covering various table tasks, enhancing their ability to understand tables.

  • Representative Models:

  • Table-GPT: Integrates and constructs training data for different table tasks (e.g., column lookup, error detection, table summarization), followed by "table fine-tuning."

  • TableLLM: Not only uses existing benchmark training data but also automatically generates new Q&A pairs from available tabular data, ensuring the quality of generated data through cross-validation strategies.

  • TableLlama: Constructs training data from Wikipedia tables covering tasks such as table interpretation, enhancement, Q&A, and fact verification.

  • Pathway 2: Prompt-Based Table Reasoning

  • Core Idea: Applying techniques like Chain of Thought (CoT) and in-context learning to decompose complex table reasoning problems into multiple steps and solve them incrementally (a simple prompt sketch follows this list).

  • Representative Models and Methods:

  • TableCoT: Uses a few-shot prompting format containing multiple examples to guide the model through complex table reasoning.

  • DATER: First uses an LLM to decompose the complex question into sub-questions and extract relevant sub-tables; it then converts the sub-questions into executable queries (e.g., SQL) and finally performs reasoning to obtain the answer.

  • Chain-of-Table: Defines a series of table operations (e.g., adding columns, sorting). At each reasoning step, the model dynamically generates an operation to update the table, forming a clear chain of reasoning.
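
The sketch below shows the flavor of prompt-based table reasoning: a table is serialized (here to Markdown) and wrapped in a step-by-step prompt. It is a generic illustration of the idea rather than the actual TableCoT, DATER, or Chain-of-Table implementation, and the example table is invented.

```python
# Illustrative prompt-based table reasoning: serialize the table, then ask for stepwise reasoning.
def table_to_markdown(header, rows):
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)

def build_table_cot_prompt(header, rows, question):
    table_md = table_to_markdown(header, rows)
    return (
        "Answer the question about the table. Think step by step:\n"
        "1) identify the relevant columns, 2) extract the relevant rows,\n"
        "3) perform any calculation, 4) state the final answer.\n\n"
        f"Table:\n{table_md}\n\nQuestion: {question}\nReasoning:"
    )

prompt = build_table_cot_prompt(
    ["Quarter", "Revenue", "Net profit"],
    [["Q1", 120, 18], ["Q2", 150, 25]],
    "By how much did net profit grow from Q1 to Q2?",
)
print(prompt)   # feed this prompt to any LLM
```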

"Fine-tuning Document LLMs" is a key technology that, through specialized training built upon general large models, enables precise understanding of document-specific layouts, structures, and content. It shows stronger performance than general models, whether processing complex scanned documents or performing table-based logical reasoning.

Key Technology Three: Retrieval-Augmented Generation (RAG)

RAG, Retrieval-Augmented Generation, is a powerful framework designed to address the challenges faced by Large Language Models (LLMs) when dealing with information-dense, lengthy, or specialized domain documents. Its core idea is not to rely entirely on the knowledge stored inside the LLM, but to dynamically retrieve relevant information from an external knowledge base (in this case, the document being processed) using a Retriever. This information is then provided along with the original user question to the Generator (the LLM) to produce more accurate, factually grounded, and contextually relevant answers.

1. Preprocessing

Data Cleaning

Effective data cleaning must be performed before storing documents in the knowledge base, as a large amount of irrelevant information in the raw documents can interfere with subsequent retrieval effectiveness.

  • Basic Text Cleaning: Unifying document formats, removing special characters, irrelevant details, and redundant information. For example, HtmlRAG will automatically clean CSS styles, JavaScript code, and unnecessary tag attributes from HTML documents.

  • Data Augmentation: Expanding and enriching the knowledge base through methods like synonym replacement, paraphrasing, or multilingual translation, which is particularly effective in scenarios with limited data resources.

Chunking

Since LLMs have fixed context window limits and cannot process long documents all at once, chunking technology becomes a necessary solution. It divides long documents into multiple segments that fit within the model's window size.

  • Simple Chunking: Dividing text into fixed-size segments, a straightforward and common strategy; overlap between adjacent chunks can be set to mitigate cases where semantic units are cut off (see the sketch after this list).

  • Rule-based Chunking: Utilizing document structural features or special symbols (like newline characters) for segmentation. For example, recursive chunking applies an ordered list of delimiters (such as paragraph breaks, then single newlines, then spaces) to iteratively split the text.

  • Semantic-based Chunking: Identifying and combining semantically meaningful elements in the document, such as tables, multi-level headings, and their related content, to generate more contextually coherent blocks.
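
A minimal fixed-size chunker with overlap might look like the sketch below. It counts whitespace-separated words for simplicity, whereas production systems typically count model tokens; the sizes are arbitrary illustrative defaults.

```python
# Fixed-size chunking with overlap (word-based for simplicity).
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(words):
            break
    return chunks

sample = " ".join(f"word{i}" for i in range(500))      # stand-in for a parsed document
chunks = chunk_text(sample, chunk_size=200, overlap=40)
print(len(chunks))                                      # 3 chunks, each overlapping the last by 40 words
```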

2. Retrieval

Retrieval is the core of RAG, and its accuracy directly affects the quality of the final generated content. This process typically involves three phases.

Pre-retrieval

Optimizing the query before formal retrieval to improve retrieval efficiency and quality.

  • Query Rewriting: Improving the user query, addressing potential ambiguity, spelling errors, or lack of specificity to align it better with the knowledge base. For example, the HyDE method generates a "hypothetical" document from the user query, which is then used to guide retrieval.

  • Metadata Utilization: Using document metadata (such as author, document type, chapter title) to provide additional context or act as filters to narrow the retrieval scope and increase relevance.

Formal Retrieval

The goal of this stage is to find the document blocks that best match the user query; a minimal sparse-retrieval sketch follows the list below.

  • Retriever Types:

  • Sparse Retrievers: Primarily rely on lexical analysis, encoding text into high-dimensional sparse vectors. The classic BM25 algorithm is representative, evaluating similarity based on term frequency and inverse document frequency.

  • Dense Retrievers: Encode text into low-dimensional dense vectors, better capturing semantic information. DPR is a famous dense retriever that uses a dual-tower BERT encoder to encode queries and documents separately.

  • Retrieval Strategies:

  • Iteration-based Retrieval: Iterating multiple times on the generated result, performing retrieval and generation at each iteration to progressively optimize output quality.

  • Multipath-based Retrieval: Hierarchically decomposing the original query into multiple sub-queries, retrieving from different angles to enrich the retrieved content and broaden the context for the generation task.
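
As a minimal illustration of sparse retrieval, the sketch below ranks chunks with BM25 via the third-party rank_bm25 package; a dense retriever would instead embed queries and chunks and rank by vector similarity. The example chunks are invented for illustration.

```python
# Sparse retrieval sketch with BM25 (rank_bm25 package).
from rank_bm25 import BM25Okapi

chunks = [
    "Q3 net profit rose 12% year-over-year to 4.2M.",
    "The lease contract renews automatically every January.",
    "Invoice 1042 covers cloud services for March.",
]
bm25 = BM25Okapi([c.lower().split() for c in chunks])   # index the tokenized corpus

query = "what was the year-over-year net profit growth"
top_k = bm25.get_top_n(query.lower().split(), chunks, n=2)
print(top_k)   # the profit-related chunk should rank first
```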

Post-retrieval

Further filtering the results after initial retrieval (usually top-k selection) to ensure that only highly relevant content is provided to the LLM.

  • Reranking: Reordering the retrieved document blocks so that those most relevant to the query come first (see the sketch after this list). For example, the Reranking module in the TrustRAG framework merges results from multiple retrieval paths, conducting comprehensive evaluation and optimization.

  • Filtering: Removing document blocks that do not meet a specific relevance threshold.
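
A common way to implement reranking is with a cross-encoder that scores each (query, chunk) pair jointly. The sketch below uses the sentence-transformers library with a widely used public MS MARCO reranker checkpoint; both are choices made for this illustration, not components prescribed by the survey.

```python
# Cross-encoder reranking sketch (sentence-transformers).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # public reranker checkpoint
query = "year-over-year net profit growth"
candidates = [
    "Q3 net profit rose 12% year-over-year.",
    "The office lease renews in January.",
]

scores = reranker.predict([(query, c) for c in candidates])        # one relevance score per pair
reranked = [c for _, c in sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)]
print(reranked[0])   # the most relevant chunk is passed to the LLM first
```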

Multimodal Retrieval

For documents containing non-textual content such as images and tables, the retrieval strategy must be adjusted accordingly.

  • OCR-based Retrieval: This is the mainstream approach, where OCR tools first convert visual content in the document into machine-readable text, and then semantic retrieval is performed. However, this method often ignores images and graphic content, and table conversion can lead to the loss of spatial and structural information.

  • VLM-based Retrieval: Utilizing Vision-Language Models (VLM) to process multimodal information, encoding both text and images into a unified vector space. For instance, the M3DocRAG system uses a visual encoder to process document pages, then calculates the similarity between the query and the page to retrieve the most relevant page.

3. Retrieval-Augmented Prompting (RAP)

After retrieving relevant document blocks, they must be combined with the user's original query to form a new, information-rich input called "Retrieval-Augmented Prompting" (RAP).

  • Simple Concatenation: The most direct method is to concatenate the retrieved document content with the user query (a minimal sketch follows this list).

  • Structure Preservation: When retrieving structured documents like JSON files, tables, or knowledge graphs, preserving their original structure is crucial for enhancing semantic information.
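
A minimal retrieval-augmented prompt might be assembled as in the sketch below, which keeps a source identifier with each chunk so that answers remain traceable; the exact template is illustrative.

```python
# Minimal retrieval-augmented prompt (RAP) assembly with source identifiers.
def build_rap_prompt(query: str, retrieved: list[tuple[str, str]]) -> str:
    context = "\n\n".join(f"[{src}] {text}" for src, text in retrieved)
    return (
        "Answer the question using only the context below. "
        "Cite the bracketed source of each fact.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

prompt = build_rap_prompt(
    "What was the year-over-year net profit growth in Q3?",
    [("report_p12", "Q3 net profit rose 12% year-over-year to 4.2M."),
     ("report_p3", "Revenue grew 8% in Q3.")],
)
print(prompt)   # this augmented prompt is what the generator LLM receives
```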

4. Inference

Finally, the LLM performs inference based on the augmented prompt and generates the final answer. To handle the complex semantic and structural relationships in documents, the inference process also needs optimization.

  • Chain of Thought (CoT) Reasoning: Systems like EvidenceChat utilize CoT to guide the retrieval, extraction, and generation processes.

  • Multi-agent Framework: ViDoRAG introduces a framework containing multiple specialized agents (e.g., search agent, checking agent) to improve answering accuracy for rich-visual documents through iterative reasoning.

RAG is a highly modular and scalable technology. By combining the "instant" retrieval of external knowledge with the powerful generative capabilities of LLMs, it significantly improves performance in document intelligence tasks, showing great advantages, especially when dealing with long, complex, and multimodal documents.

Key Technology Four: Long Context Processing

In the field of document intelligence, many tasks (such as analyzing legal contracts or academic papers) require models to understand and process ultra-long texts spanning thousands or even tens of thousands of words. However, the Transformer architecture underlying modern LLMs faces inherent challenges when handling long contexts. Long context processing technology has been developed to break through these limitations.

Why is Long Context Processing So Difficult?

Researchers first pointed out the three core challenges the Transformer architecture encounters when processing long text:

  1. Text Length Encoding Limitation: Transformers use positional encoding to provide position information for each token in the sequence. The length of this encoding is fixed during training. If the input text exceeds the maximum length trained, the model cannot effectively locate and process the information in the exceeding portion.

  2. Attention Mechanism Resource Consumption: The standard self-attention mechanism requires calculating the relationship between every token and all other tokens in the sequence. This means that computational complexity and memory requirements grow quadratically with sequence length, leading to huge resource consumption and low efficiency when processing long text.

  3. Inadequate Handling of Long-Range Dependencies: Although the self-attention mechanism can theoretically capture any dependency in the sequence, it tends to focus more on local information, resulting in poor performance when capturing ultra-long-range semantic associations.

To tackle these challenges, researchers have proposed innovative solutions from multiple angles. These techniques are categorized as follows:

1. Optimization of Positional Encoding

These methods aim to modify or extend positional encoding to accommodate text sequences longer than those used during training.

  • Position Interpolation (PI): This technique "slows down" the rotation speed of the positional encoding, smoothly "stretching" an encoding originally designed for shorter text to cover a longer context (see the sketch after this list).

  • NTK-Aware Interpolation: This method considers the characteristics of different frequency components during interpolation, handling high-frequency and low-frequency parts differently to achieve better extrapolation results.

  • YaRN (Yet another RoPE extensioN method): This method introduces the concept of "temperature scaling," performing non-uniform interpolation across different dimensions of Rotary Position Embedding (RoPE) to find the optimal interpolation scheme that minimizes perplexity (a measure of language-model quality).

  • LongRoPE: This method uses a progressive expansion strategy, performing a second interpolation on already fine-tuned models to further extend the context window.
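
The sketch below illustrates the core of position interpolation: positions beyond the trained window are rescaled back into the trained range before the rotary angles are computed. The head dimension and base frequency are common RoPE defaults chosen for illustration, not values taken from the paper.

```python
# Position interpolation sketch: squeeze out-of-range positions into the trained window.
import numpy as np

def rope_angles(positions: np.ndarray, dim: int = 64, base: float = 10000.0) -> np.ndarray:
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))   # per-dimension rotation frequencies
    return np.outer(positions, inv_freq)                      # (seq_len, dim/2) rotation angles

trained_len, target_len = 2048, 8192
positions = np.arange(target_len)

plain = rope_angles(positions)                                       # extrapolation: unseen angles
interpolated = rope_angles(positions * (trained_len / target_len))   # PI: rescaled into [0, 2048)

print(plain[-1, 0], interpolated[-1, 0])   # the interpolated angle stays within the trained range
```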

2. Optimization of Attention Mechanism

The core of these methods is to reduce computation and memory costs by approximating or sparsifying the attention matrix while retaining critical information as much as possible.

  • Sliding Window Attention: A representative model is Longformer. Instead of computing global attention, it lets each token attend only to other tokens within a fixed-size neighboring window (a mask sketch combining this with attention sinks follows this list).

  • Retaining Initial Tokens (Attention Sinks): StreamingLLM found that during LLM inference, most attention scores concentrate on the first few tokens of the sequence. Therefore, based on the sliding window, this method additionally retains the Key-Value (KV) pairs of these initial tokens, allowing the model to remain stable while processing infinitely long text streams.

  • Combining Grouped Attention with Sliding Window: LongLoRA divides the long context into multiple groups during fine-tuning, performs complete self-attention calculation within the group, and uses a sliding window mechanism for information exchange between groups.

  • Other Sparse Attention Methods:

  • LongNet: Introduces "dilated attention," segmenting the input and allocating progressively sparser attention as the distance between tokens increases; the computation can be parallelized across segments.

  • Unlimiformer: Uses kNN search before each decoder layer to select the top-k most relevant hidden states from the entire input sequence for each attention head, thus focusing on global information without truncating the input.
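
The sketch below builds the kind of attention mask these methods rely on: each query position attends to a few initial "sink" tokens plus a local causal window, so memory grows with the window size rather than the full sequence length. The window and sink sizes are purely illustrative.

```python
# Sliding-window attention mask with "attention sinks" (StreamingLLM-style idea).
import numpy as np

def sink_window_mask(seq_len: int, window: int = 4, n_sink: int = 2) -> np.ndarray:
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for q in range(seq_len):
        mask[q, :min(n_sink, q + 1)] = True              # always keep the initial (sink) tokens
        mask[q, max(0, q - window + 1):q + 1] = True     # local causal window
    return mask

print(sink_window_mask(8).astype(int))
# Each row shows which keys that query position may attend to; the per-token cost
# depends on window + n_sink rather than on the sequence length.
```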

3. Memory Management

These techniques introduce external memory modules that allow the model to store and retrieve information beyond the current context window, simulating "long-term memory."

  • Landmark Attention: Sets "landmarks" in the input sequence, allowing the model to retrieve relevant memory blocks based on these landmarks.

  • KV Cache-Based Memory: LongMEM uses a memory cache library to maintain recently input attention key-value pairs. During inference, the model can simultaneously focus on local context and historical context retrieved from memory.

  • Hierarchical Memory System: MemGPT, inspired by operating system hierarchical memory systems, achieves the management and retrieval of massive information through a virtual context management system.

4. Prompt Compression

Unlike methods that change the model architecture, these techniques focus on compressing the long text before inputting it into the model, identifying and removing redundant content, and retaining only the most valuable parts.

  • Token Pruning/Merging:

  • PoWER-BERT reduces computation by progressively eliminating redundant word vectors during inference.

  • Token Merging (ToMe) does not delete tokens but merges similar redundant tokens in batches, shortening the sequence without losing too much information (a toy sketch follows this list).

  • Small Model-Based Compression:

  • LLMLingua: Trains a small language model specifically for prompt compression. It performs two passes of compression—coarse-grained and fine-grained—on the input to retain key information while significantly shortening the prompt length.

  • LongLLMLingua: Further optimizes LLMLingua, aiming to enhance the LLM's perception of critical information within the prompt.
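
As a toy illustration of the token-merging idea (a simplification, not the actual ToMe algorithm), the sketch below averages adjacent token embeddings whose cosine similarity exceeds a threshold, shortening the sequence before it reaches the model.

```python
# Toy token-merging sketch: fuse adjacent, highly similar token embeddings.
import numpy as np

def merge_similar_tokens(embs: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    merged = [embs[0]]
    for vec in embs[1:]:
        prev = merged[-1]
        cos = float(vec @ prev / (np.linalg.norm(vec) * np.linalg.norm(prev) + 1e-8))
        if cos > threshold:
            merged[-1] = (prev + vec) / 2      # fold this token into the previous one
        else:
            merged.append(vec)
    return np.stack(merged)

tokens = np.repeat(np.random.randn(16, 64), 8, axis=0)   # 128 tokens with heavy repetition
print(merge_similar_tokens(tokens).shape)                 # repeated runs collapse to ~(16, 64)
```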

5. Engineering Approaches

In addition to algorithmic optimizations, many industry-leading models combine hardware-level engineering optimizations to achieve ultra-long context.

  • Flash Attention: Utilizes the characteristics of GPU hardware to keep computations in the faster SRAM as much as possible, reducing read/write operations to GPU memory and significantly boosting the speed and efficiency of attention calculation.

  • Ring Attention: In multi-machine, multi-card scenarios, each hardware unit stores only a portion of the attention matrix, performs partial calculation, and then aggregates the results, thereby breaking through the memory limits of a single graphics card.

Long context processing is a multi-dimensional, cross-level technical field that combines various strategies, from low-level hardware optimization to high-level algorithmic design. Its ultimate goal is to break the length limitations of the Transformer architecture, enabling LLMs to truly handle deep understanding and analysis tasks on massive documents.

Datasets, Implementations, Benchmarks, and Metrics

The final part of the paper covers Datasets, Implementations, Benchmarks, and Metrics. These four aspects collectively form the foundation of document intelligence research, providing a complete framework for model training, deployment, evaluation, and comparison.

1. Datasets

Datasets are the basis for model training and validation; their quality and diversity directly affect the model's learning outcomes and generalization ability. Researchers highlight the following four key categories of datasets:

  • Document QA Datasets: These datasets support the understanding of visual content.

  • DocVQA: Contains over 50,000 questions derived from images of various documents (e.g., invoices, reports), particularly suitable for navigation and visual layout reasoning tasks.

  • QASPER: Focuses on the scientific literature domain, containing 1,585 papers and 5,049 related questions, useful for deep analysis of academic papers.

  • InfographicVQA: Focuses on basic reasoning over visual information, containing 5,485 documents and over 30,000 questions.

  • ChartQA and PlotQA: Extend Q&A capabilities to chart information, containing large amounts of questions and summaries related to charts, respectively.

  • Document Layout Analysis Datasets: These datasets focus on the structured analysis of documents.

  • PubLayNet: A large-scale document layout analysis dataset with over 330,000 training images, providing detailed annotations for elements like text, titles, and tables.

  • DocLayNet: Contains 80,863 annotated PDF pages, supporting precise training for various documents and their layouts.

  • DocBank: Derived from scientific papers, it captures fine-grained semantic categories, adding depth and breadth to document analysis.

  • Table Recognition Datasets: These datasets focus on extracting table information.

  • TableBank: Contains rich table images from Word and LaTeX, used to support and enhance LLM capabilities in table detection and recognition.

  • PubTabNet: A large-scale image-based table recognition resource, containing over 568,000 table images and their corresponding HTML representations.

  • XFUND: A multilingual form understanding dataset covering seven languages, crucial for key information extraction tasks.

  • Reasoning Datasets: These datasets focus on semantic understanding and logical inference within tables.

  • TabFact: Built on Wikipedia tables, it contains 118,000 statements and their truthfulness annotations, focusing on verifying logical consistency and factual reasoning within table content.

  • WikiTableQuestions: Provides 22,033 Q&A pairs requiring multi-step reasoning, covering core tasks such as numerical calculation, temporal reasoning, and entity relationship inference.

2. Implementation

The implementation section covers the practical strategies, tool selection, and system design principles required to build efficient document intelligence systems.

  • Tool Selection:

  • OCR-Free Models: Layout-aware vision-language models like mPLUG-DocOwl 1.5 and DocLLM can directly process document images, replacing traditional OCR pipelines and improving robustness.

  • Unified Prompt Frameworks: Tools like OmniParser v2 allow processing multiple tasks such as structured parsing, key-value extraction, and visual text understanding through a single generalized interface.

  • Integration Strategies:

  • Long Document Handling: The DocOwl2 model integrates visual token compression and sequence alignment techniques to efficiently process multi-page documents without compromising structural integrity.

  • Commercial Platforms: Azure Document Intelligence offers modular APIs for layout parsing, field extraction, and document classification, allowing flexible combination of traditional and modern components.

  • RAG Frameworks: RAG has become central to document Q&A, with related research emphasizing the importance of chunking strategies, evidence selection, and source tracing mechanisms.

  • Best Practices:

  • Interpretability: Tools like DLaVA enhance user trust by providing visual evidence (e.g., locating the source of the answer on the document image).

  • Modularity: Both commercial tools and academic research emphasize the importance of modular design, including fallback mechanisms and "human-in-the-loop" verification.

3. Benchmarks

Benchmarks are crucial tools for evaluating model performance and comparing different methods. Researchers highlighted six important benchmark studies:

  • UDA (Unstructured Document Analysis): Contains real-world documents and expert-annotated Q&A pairs from three domains—finance, academia, and world knowledge—aimed at reflecting genuine application scenarios.

  • OHRBench: The first benchmark used to understand the cascading impact of OCR noise on RAG systems, evaluating the influence of semantic and format noise produced by OCR on RAG performance.

  • OCRBench (v1/v2): Designed to evaluate the performance of multimodal large models in OCR tasks, covering aspects such as text recognition, document Q&A, and key information extraction.

  • OmniDocBench: Contains various document types (e.g., academic papers, textbooks) and rich annotations for layout, content, and attributes, used to evaluate model performance across multiple tasks involving text, tables, and formulas.

  • CC-OCR: A comprehensive and challenging OCR benchmark covering four major tasks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction.

4. Metrics

To comprehensively evaluate model performance across various document processing tasks, a diversity of evaluation metrics is required.

  • Localization and Recognition Metrics:

  • IoU (Intersection over Union): A core metric measuring the overlap between predicted and ground-truth bounding boxes, widely used for text and table detection (see the sketch after this list).

  • F1-score: Balances precision and recall, used to evaluate the overall accuracy of localization and recognition tasks.

  • CER (Character Error Rate): Measures character-level differences, used for high-precision OCR task evaluation.

  • Structure and Semantic Similarity Metrics:

  • SSIM (Structural Similarity Index): Measures image similarity by evaluating luminance, contrast, and structural information, often used for mathematical formula recognition and chart structure extraction.

  • TEDS (Tree-Edit-Distance-Based Similarity): Uses tree editing distance to measure the similarity of table structure, particularly suitable for evaluating complex tabular logical structures.

  • Table and Chart Specific Metrics:

  • Purity and Completeness: Used respectively to measure the noise level in table detection results and the coverage rate of the detected area.

  • CAR (Cell Adjacency Relations): Focuses on analyzing the accuracy of cell boundary detection and relative positioning within tables.

  • Mathematical Expression Recognition Specific Metrics:

  • CDM (Character Detection Matching): Provides a reliable evaluation method for the structured analysis of mathematical expressions by addressing issues that different LaTeX representations might cause.
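
As a small worked example of the most basic of these metrics, the sketch below computes IoU for two axis-aligned boxes in (x1, y1, x2, y2) format.

```python
# IoU for axis-aligned boxes, as used in text/table detection evaluation.
def iou(box_a, box_b) -> float:
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.143
```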

Conclusion: Challenges and Future Outlook

The researchers conclude by summarizing the challenges facing document intelligence and pointing out future research directions:

Major Challenges:

  • Noise in Retrieval Results: Document parsing can introduce errors, leading to noisy or contradictory retrieved information.

  • Integrity of Chunking Results: How to re-segment parsed documents into coherent semantic blocks remains a crucial problem.

  • Complexity of RAG Systems: Relying on multiple tools and API interfaces increases engineering overhead and system complexity.

  • Heterogeneity of Document Features: Significant structural and content differences exist between domain documents like academic papers and financial reports, limiting the widespread applicability of technology.

Future Work:

  • More Flexible RAG Architectures: Developing recursive or adaptive RAG architectures to accommodate diverse document structures and user requirements.

  • Advanced Error Correction Mechanisms: Implementing sophisticated error detection and correction mechanisms to address noise issues in retrieval results.

  • Expansion to More Application Domains: Applying document intelligence technology to more fields such as education, healthcare, law, and scientific research to unleash its enormous potential.

Reviewing the entire paper, its greatest value lies not only in its comprehensiveness but also in its strong orientation toward engineering practice. It moves beyond theoretical discussion and presents a clear path for implementing document intelligence. Whether it is the noise in RAG or the hallucinations of end-to-end models, these are not endpoints for the technology but starting points for innovation and opportunities for business value. The choice between pipeline and end-to-end is a balance of cost and accuracy; the interplay between RAG and long context is a contest between generality and specialization. For every developer, product manager, and researcher, this paper is a valuable "navigation manual": it tells us precisely which technical nodes must be refined and deepened to evolve document intelligence products from "usable" to "excellent." The future is here, and this blueprint is the starting point for building the next era of intelligent applications with our own hands.
