35% Accuracy Evaporates! ByteDance & HUST's WildDoc Reveals Robustness Shortcomings in Multimodal Document Understanding


In the field of document understanding, Multimodal Large Language Models (MLLMs) are evolving at an astonishing pace. From basic document image recognition to complex document understanding, they perform impressively on scanned or born-digital benchmarks such as DocVQA and ChartQA, which seems to suggest that MLLMs have largely solved document understanding. However, existing document understanding benchmarks suffer from two core shortcomings:

Lack of real-world scenarios: In practice, documents are usually paper copies or on-screen pages photographed with phones or cameras, subject to complex interference such as uneven lighting, physical distortion (creases, bends), varied shooting angles, blur, shadows, and inaccurate focus;

Inability to evaluate robustness: Existing benchmarks do not reproduce the complexity and diversity of real environments, leaving model performance in practical applications open to doubt.


These shortcomings raise a key question: How far are current MLLMs from achieving comprehensive and robust document understanding capabilities in natural environments?

To answer this question, the ByteDance OCR team collaborated with Huazhong University of Science and Technology to create WildDoc, the first benchmark for real-world document understanding.

WildDoc covers three common and representative document scenarios (document, chart, and table) and contains over 12,000 manually captured images spanning five factors that affect real-world document understanding: environment, lighting, viewing angle, physical distortion, and shooting effects. Because the images re-capture documents from existing electronic benchmarks, results on WildDoc can be compared directly with those benchmarks.

To rigorously evaluate model robustness, WildDoc introduces a Consistency Score metric. Experiments show that mainstream MLLMs suffer significant performance degradation on WildDoc, revealing the bottlenecks of existing models in real-world document understanding and providing verifiable directions for technical improvement.
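As a rough illustration (an assumption about the formulation, not necessarily the paper's exact definition), a consistency-style metric can count a question as consistent only when the model answers it correctly under every capture condition of the same document. The Python sketch below implements that idea; the record field names are hypothetical.

```python
from collections import defaultdict

def consistency_score(records):
    """Sketch of a consistency-style metric (assumed formulation, not
    necessarily the paper's exact definition).

    Each record has hypothetical fields:
      "question_id" -- identifies the same question across capture conditions
      "correct"     -- whether the model answered this capture correctly
    A question counts as consistent only if it is answered correctly under
    *all* capture conditions of the same source document.
    """
    per_question = defaultdict(list)
    for rec in records:
        per_question[rec["question_id"]].append(rec["correct"])

    consistent = sum(all(flags) for flags in per_question.values())
    return 100.0 * consistent / max(len(per_question), 1)


# One question captured under four conditions, answered correctly three times:
# a single failed condition makes the question inconsistent, so the score is 0.
records = [
    {"question_id": "q1", "correct": True},
    {"question_id": "q1", "correct": True},
    {"question_id": "q1", "correct": False},
    {"question_id": "q1", "correct": True},
]
print(consistency_score(records))  # 0.0
```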

This work not only fills the gap in real-world benchmarks but also pushes document understanding research a crucial step towards "practicality and generalization."


Paper Link:

https://arxiv.org/abs/2505.11015

Project Homepage:

https://bytedance.github.io/WildDoc/

GitHub:

https://github.com/bytedance/WildDoc


WildDoc Data Construction and Composition

WildDoc comprises over 12,000 manually collected real-world document images that reproduce the complex challenges of natural environments, together with a Consistency Score metric for quantitatively evaluating model robustness across scenarios. All 12K+ images and 48K+ Q&A pairs have been open-sourced. The construction process is as follows:

1. Data Collection:

Scenario diversification: Manually capture documents in natural environments (e.g., outdoor, indoor with different lighting conditions) to ensure coverage of multi-dimensional interference factors such as environment, lighting, and viewing angle.

Benchmark alignment: Reuse electronic documents from existing benchmarks, print them physically, and then photograph them to ensure comparability with traditional benchmarks.

2. Multi-condition Shooting:

Each document is photographed four times, varying environmental parameters (e.g., light intensity, shooting angle, paper distortion) between shots, to obtain comparable samples under different conditions.

3. Annotation and Verification:

Manually verify key information in the images, such as text, layout, and answerability of questions, to ensure accuracy.

Compute the Consistency Score to evaluate model stability under different conditions and to help filter for high-quality data.


Experimental Results

The research team evaluated a wide range of representative MLLMs, including general-purpose MLLMs (e.g., Qwen2.5-VL, InternVL2.5), document-specialized MLLMs (e.g., Monkey, TextHarmony), and leading closed-source MLLMs (e.g., GPT-4o, Doubao-1.5-pro). The results exposed numerous shortcomings of current multimodal large models in real-world scenarios.


First, the performance of existing MLLMs on WildDoc drops significantly compared with traditional document benchmarks (e.g., DocVQA). For instance, GPT-4o's average accuracy dropped by 35.3%, and by as much as 56.4% on the ChartQA subset; the best open-source model, Qwen2.5-VL-72B, achieved an average accuracy of 70.6%, still roughly 15% lower than on the original benchmarks.

The strongest closed-source model, Doubao-1.5-pro, performed best overall (average accuracy 73.7%), but its Consistency Score was only 55.0, meaning that in nearly half of the cases it could not maintain correct answers across different capture conditions. This indicates that current MLLMs still lack stability and adaptability when facing real-world variation.

A closer analysis of the results yields the following findings:

Physical distortion is most challenging: Physical deformations such as wrinkles, creases, and bends led to the most significant performance degradation (e.g., GPT-4o dropped by 34.1-34.7%), far exceeding the impact of lighting (-25.9%) or viewing angle (-26.2%) changes.

Non-frontal perspectives and image quality: Non-frontal captures (e.g., oblique angles) degrade performance because of text deformation and blur (Qwen2.5-VL-72B dropped by 17.6%), whereas photographs of screens caused a smaller drop (-8.3% to -9.1%), likely because such degradations are already well covered by mature data augmentation pipelines.

Limited impact of language model scale: Models with larger parameter counts (e.g., the 72B Qwen2.5-VL) performed slightly better on WildDoc but did not fully overcome real-world challenges, indicating that model architecture still requires targeted optimization.


Furthermore, some models that are nearly indistinguishable on the original benchmarks, which are approaching saturation, show large performance gaps on WildDoc. This suggests that traditional benchmarks can no longer effectively differentiate the true capabilities of models, while WildDoc more sharply exposes their weaknesses in real-world scenarios.


The Future Path: How can MLLMs better understand real-world documents?

Facing these challenges, the research team proposed several improvement strategies, pointing the way for future research.

First, data augmentation. By using more augmentation techniques to simulate real-world conditions, such as varying lighting and shadows, models can be exposed to more diverse scenarios during training, thereby improving their adaptability.
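As a concrete, purely illustrative example, a training pipeline could approximate several of these conditions with off-the-shelf image augmentations; the torchvision sketch below uses placeholder parameter values and is not a recipe tuned or endorsed by the authors.

```python
from torchvision import transforms

# Illustrative augmentations that loosely mimic real-world capture conditions.
# Parameter values are placeholders; note that physical paper distortion
# (creases, bends) is harder to simulate and usually needs dedicated warping.
wild_style_augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4),       # uneven lighting, shadows
    transforms.RandomPerspective(distortion_scale=0.3, p=0.7),  # oblique shooting angles
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),   # defocus and motion blur
    transforms.RandomRotation(degrees=5),                       # slight camera tilt
])

# Usage: apply to a PIL document image before the model's own preprocessing.
# augmented = wild_style_augment(document_image)
```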

Second, robust feature learning. Enable models to learn features that are insensitive to real-world variations, so that even if document images change, the model can still accurately understand their content.
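One generic way to encourage such invariance (a common technique, not necessarily the method the authors have in mind) is consistency regularization: penalize the divergence between the model's predictions on a clean document and on a real-world capture of the same document. A minimal PyTorch sketch:

```python
import torch.nn.functional as F

def capture_consistency_loss(logits_clean, logits_wild, temperature=1.0):
    """KL divergence between predictions on a clean scan and on a real-world
    capture of the same document; minimizing it nudges the model toward
    features that are insensitive to capture conditions.
    (Generic consistency-regularization sketch, not the paper's method.)
    """
    p_clean = F.softmax(logits_clean.detach() / temperature, dim=-1)
    log_p_wild = F.log_softmax(logits_wild / temperature, dim=-1)
    return F.kl_div(log_p_wild, p_clean, reduction="batchmean")

# Combined objective (lambda_c is a hypothetical weighting hyperparameter):
# loss = task_loss + lambda_c * capture_consistency_loss(logits_clean, logits_wild)
```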

Third, introduction of real data. Collect more real-world document images to enrich the training dataset, allowing models to gain experience in more "actual combat" and improve performance.

The WildDoc dataset effectively revealed the shortcomings of MLLMs in real-world document understanding, providing a key benchmark and optimization directions for subsequent research, and further pushing document understanding research a crucial step towards "practicality and generalization."

Appendix: More Visualized Data

