Remember DeepEyes launched by the Xiaohongshu team in the first half of this year?
Yes, that's the multimodal model that can "zoom in on image details to find clues" like humans, basically achieving o3-like "visual thinking".
Now, a more powerful version—DeepEyesV2—has been officially released.
The short version: DeepEyesV2 not only carries over DeepEyes' visual reasoning strengths but also achieves a breakthrough in full tool collaboration across "code execution + web search + image manipulation", evolving from "seeing details" into "an agent that proactively solves complex problems".
Detailed breakdown below—
Multimodal Reasoning with Multi-Tool Collaboration
Existing multimodal large models can understand text, images, and other information, but they are more like "information interpreters"—passively perceiving information without proactively calling external tools to solve problems.
As a result, they are limited by two major pain points:
Pain Point 1: Weak tool calling capability.
Suppose you show an AI a photo of an unfamiliar plant and ask, "What flower is this?"
Traditional multimodal models either lack tool calling entirely, relying only on internal knowledge for a basic reading of the image,
or can only call a single type of tool and cannot form a combined strategy.
For example, DeepEyes achieves fine-grained image perception through its cropping tool but has no information retrieval, so it cannot identify the flower species from internal knowledge alone;
MMSearchR1, in contrast, supports search but lacks fine-grained perception, so its retrieval often fails because the image details are unclear.
This "single-tool dependency" leaves models helpless with complex tasks.
Pain Point 2: Lack of multi-capability collaboration.
Humans solve problems by naturally chaining "observation (perception) → check data (search) → calculate results (reasoning)" steps, but traditional multimodal models struggle with such collaboration.
Their perception, search, and reasoning often operate independently, covering only one or two of these steps and rarely chaining into a complete, human-like solution.
How does DeepEyesV2 solve these pain points?
Compared to previous models, DeepEyesV2 solves complex real-world problems through multi-tool collaborative reasoning.
Take, for example, the question: "Based on the stock chart, calculate the company's drop between 9:30 and 16:00 on April 4, 2024, and compare it with the drop of Tootsie Roll Industries (TR) over the same period."
On a comparison like this, which ultimately asks which drop is larger, DeepEyesV2 demonstrates strong reasoning ability.
The overall process is divided into three steps:
Step 1: Image search to acquire more information.
DeepEyesV2 first calls image search to try to obtain more stock price information.
Step 2: Text search to look up the stock prices.
Since the image search returns no useful information, DeepEyesV2 switches to text search for the stock data.
Step 3: Code execution for API access and calculation.
When text search also fails to provide the data, DeepEyesV2 generates code that queries the Yahoo Finance API for the stock prices and computes the final result numerically.
Through multiple searches, code execution, and complex reasoning, DeepEyesV2 successfully solves this complex problem.
Notably, API access via code was not present in the training data; DeepEyesV2 acquired this behavior autonomously through reinforcement learning.
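For a sense of what the Step 3 code might look like, here is a minimal sketch that uses the open-source yfinance package as a stand-in for the Yahoo Finance API access described above; the helper name, the ticker handling, and the open-to-close interpretation of the 9:30-16:00 window are assumptions, not DeepEyesV2's actual generated code.

```python
# Illustrative sketch only: an open-to-close percentage change for one trading day,
# using the yfinance package ("pip install yfinance") as a stand-in for the
# Yahoo Finance API access described above. Tickers, dates, and the helper name
# are placeholders, not DeepEyesV2's actual output.
from datetime import datetime, timedelta

import yfinance as yf


def open_to_close_change(ticker: str, day: str) -> float:
    """Percent change from the day's opening price to its closing price."""
    start = datetime.strptime(day, "%Y-%m-%d")
    end = start + timedelta(days=1)
    hist = yf.Ticker(ticker).history(start=start, end=end)
    open_price = hist["Open"].iloc[0]
    close_price = hist["Close"].iloc[0]
    return (close_price - open_price) / open_price * 100.0


if __name__ == "__main__":
    # "TR" is Tootsie Roll Industries; the other company in the chart is not named here.
    change = open_to_close_change("TR", "2024-04-04")
    print(f"TR: {change:+.2f}% from open to close on 2024-04-04")
```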
DeepEyesV2
Model Details
Similar to DeepEyes, DeepEyesV2 is a multimodal model with agent characteristics, but its tool usage is greatly expanded beyond simple cropping.
In DeepEyesV2, code execution and web retrieval are available as external tools that can be called interactively during reasoning, with their results folded back into subsequent reasoning.
Given an image and a user query, DeepEyesV2 first generates an initial reasoning plan and judges whether the problem can be solved internally or requires tools.
If tools are needed, it generates executable Python code or web search queries.
The code runs in a sandbox, producing structured outputs such as processed images, measurements, arrays, charts, or logs.
Image queries go through SerpAPI and return the top five pages with visually matching images; text queries return the top five relevant pages with titles and snippets. All tool outputs are appended to the model's context.
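For concreteness, the text-search half of such a tool might look roughly like the sketch below, built on SerpAPI's Python client (the google-search-results package); the wrapper function and result formatting are assumptions, since the article does not publish DeepEyesV2's actual search integration.

```python
# Hedged sketch of a text-search tool on top of SerpAPI's Python client
# ("pip install google-search-results"). The wrapper itself is an assumption;
# only the use of SerpAPI and the top-5 results behavior come from the article.
import os

from serpapi import GoogleSearch


def text_search(query: str, k: int = 5) -> list[dict]:
    """Return the top-k organic results as {title, link, snippet} dicts."""
    params = {
        "engine": "google",
        "q": query,
        "num": k,
        "api_key": os.environ["SERPAPI_API_KEY"],  # requires a SerpAPI account
    }
    results = GoogleSearch(params).get_dict()
    return [
        {"title": r.get("title"), "link": r.get("link"), "snippet": r.get("snippet")}
        for r in results.get("organic_results", [])[:k]
    ]
```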
DeepEyesV2 then reasons further over these observations, possibly planning additional tool calls, and repeats this reasoning-tool-integration loop until it reaches an accurate answer.
In short, DeepEyesV2 dynamically selects, combines, and uses tools.
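Put together, the reason-act-observe loop can be pictured roughly as below. This is a mental model only: the Step structure, the model/sandbox/search interfaces, and the turn cap are assumptions, not the published DeepEyesV2 implementation.

```python
# Minimal, illustrative sketch of the interleaved reasoning/tool loop described above.
# The Step dataclass, the model/sandbox/search interfaces, and MAX_TURNS are assumptions.
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

MAX_TURNS = 8  # assumed cap on reasoning-tool iterations


@dataclass
class Step:
    text: str                        # the model's intermediate reasoning
    final_answer: Optional[str]      # set once the model decides it can answer
    tool: Optional[str] = None       # "code" or "search"
    code: str = ""                   # Python to run in the sandbox
    query: str = ""                  # text (or image) search query


def solve(image: Any, question: str, model, sandbox, search) -> str:
    """Repeat reason -> tool call -> observe until the model commits to an answer."""
    context: List[Dict[str, Any]] = [{"role": "user", "image": image, "text": question}]
    for _ in range(MAX_TURNS):
        step: Step = model.generate_step(context)             # plan + optional tool call
        context.append({"role": "assistant", "text": step.text})
        if step.final_answer is not None:                     # solvable without more tools
            return step.final_answer
        if step.tool == "code":                               # sandboxed execution
            observation = sandbox.run(step.code)              # images, arrays, charts, logs
        elif step.tool == "search":                           # e.g. SerpAPI, top-5 results
            observation = search.top_k(step.query, k=5)
        else:
            observation = "no tool call parsed"
        context.append({"role": "tool", "output": observation})  # appended to context
    return model.best_effort_answer(context)
```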
This integration brings three main advantages:
1. It extends analysis through executable code;
2. It proactively acquires real-time knowledge from multimodal web evidence;
3. Code and search combine dynamically within a single reasoning trajectory rather than acting as isolated modules, improving the flexibility of tool calls.
Together, these make DeepEyesV2 a more general, reliable, and scalable multimodal reasoning framework.
Exploration Experiments
Since DeepEyes elicited "thinking with images" purely through reinforcement learning, the team first ran exploratory experiments on Qwen2.5-VL-7B following the DeepEyes recipe.
In studying whether RL alone can directly teach complex tool use, the team observed two key issues.
Issue 1: In early tool exploration the model is "willing but unable", with a low code execution success rate.
Early in training, the model generates Python for cropping and numeric tools, but the code contains syntax and logic errors, so few calls succeed.
As training progresses, the model abandons code generation altogether, converging to short reasoning chains that bypass tools.
Issue 2: "Reward hacking": the model games the reward with invalid operations.
To encourage tool use, the team added the "tool use reward" that worked well for DeepEyes: an extra reward for generating code.
This is initially effective and the code success rate rises.
But late in training, the model starts emitting meaningless, comment-only code blocks just to collect the reward.
These explorations show that, given their current capability limits, existing multimodal models cannot reliably learn complex tool use through direct RL, which highlights the importance of a cold start.
Two-Stage Training
The team therefore adopts a two-stage "cold start + RL" pipeline, taking the model from "able to use tools" to "good at using tools".
Stage 1: Cold-start—build foundation
High-quality datasets teach the basic logic of tool use. The team curates four types:
- Perception data: problems that require image cropping and marking tools.
- Reasoning data: math problems that require code-based calculation tools.
- Search data: problems that require web search tools.
- CoT data: pure-text chain-of-thought reasoning.
The data then passes a double filter (sketched after this list):
1. Difficulty filter: keep only problems the base model cannot solve on its own;
2. Tool-benefit filter: keep only problems where tool use significantly boosts accuracy.
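A minimal sketch of what that double filter could look like; the callable hooks, trial count, and accuracy margin are assumptions, since the article only states the two criteria.

```python
# Illustrative sketch of the cold-start data filter. The two hooks passed in
# (answer_without_tools, answer_with_tools) are hypothetical; the article only
# specifies the two criteria, not thresholds or trial counts.
from typing import Callable


def keep_sample(
    sample,
    answer_without_tools: Callable,   # base model with tools disabled (hypothetical hook)
    answer_with_tools: Callable,      # base model with tools enabled (hypothetical hook)
    n_trials: int = 8,
    margin: float = 0.2,
) -> bool:
    """Keep a sample only if it is hard without tools and clearly easier with them."""
    # 1) Difficulty filter: discard anything the base model already solves on its own.
    no_tool_hits = sum(answer_without_tools(sample) == sample.answer for _ in range(n_trials))
    if no_tool_hits > 0:
        return False
    # 2) Tool-benefit filter: require a clear accuracy gain when tools are available.
    tool_hits = sum(answer_with_tools(sample) == sample.answer for _ in range(n_trials))
    return tool_hits / n_trials >= margin
```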
Stage 2: RL—fine-tune
Starting from the cold-start checkpoint, the model is optimized with a dual "accuracy + format" reward.
Unlike approaches with elaborate reward shaping, DeepEyesV2 uses just two simple rewards (sketched after this list):
1. Accuracy reward: scores whether the final answer matches the ground truth;
2. Format reward: penalizes code errors, invalid search keywords, and the like.
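A minimal sketch of how such a dual reward might be computed; the penalty values and the comment-only-code check (motivated by the reward hacking seen in the exploration experiments) are assumptions.

```python
# Illustrative sketch of an "accuracy + format" reward. Penalty values and the
# trajectory/tool-call fields are assumptions; only the two reward types and the
# kinds of violations penalized come from the article.
import re


def compute_reward(trajectory, prediction: str, ground_truth: str) -> float:
    # 1) Accuracy reward: does the final answer match the ground truth?
    reward = 1.0 if prediction.strip() == ground_truth.strip() else 0.0
    # 2) Format reward: penalize malformed tool use rather than rewarding tool calls,
    #    which the exploration experiments showed invites reward hacking.
    for call in trajectory.tool_calls:
        if call.kind == "code":
            if call.execution_error:                      # syntax or runtime failure in the sandbox
                reward -= 0.1
            elif not re.sub(r"#.*", "", call.code).strip():
                reward -= 0.1                             # comment-only code block (hacking pattern)
        elif call.kind == "search" and not call.query.strip():
            reward -= 0.1                                 # empty or invalid search keywords
    return reward
```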
RealX-Bench
Existing benchmarks test single abilities (e.g., image recognition or mathematical calculation), but real-world problems demand multi-ability collaboration.
So the team built a new benchmark, RealX-Bench: 300 real-scene problems spanning daily life, media, sports, knowledge, and games.
The problems are collected and rewritten from real scenarios, and many require a combination of abilities to solve.
Accuracy far exceeds open-source models
The team evaluated existing models and DeepEyesV2 on RealX-Bench.
The results: even state-of-the-art general models score below 50% accuracy, while DeepEyesV2 far surpasses open-source models thanks to tool collaboration, especially on tasks requiring multiple abilities.
The team also evaluated real-world understanding, mathematical reasoning, and search tasks.
Here, too, DeepEyesV2 posts large gains over existing models, underscoring the importance of tool calling.
Deep Analysis: Data Ablation & Tool Preferences
The team then ran a series of ablations to explore how different data types affect tool use.
First, the cold-start data, whose goal is to teach the basic logic of tool use.
It is divided into perception, reasoning, and CoT data, and ablations validate the role of each.
Perception data only: real-world perception accuracy improves, but math shows no gain.
Perception data teaches visual tools but does not transfer to code-based reasoning, like learning to use a magnifier but never a calculator.
Reasoning data only: math accuracy rises, but perception drops.
Reasoning data requires complex code generation and verification; without the link to perception tools, perception ability degrades.
Perception + reasoning + CoT: large gains in both understanding and reasoning.
The CoT data strengthens reasoning, which in turn helps with complex tool use.
The optimal mix is "perception + reasoning + CoT".
This combination performs best on both kinds of tests: a diverse, complex-reasoning cold start lays the foundation for multi-tool collaboration.
Further ablations show that only diverse RL data effectively boosts tool calling.
The cold start teaches the model which tool to use; RL teaches it when to use it.
Comparing tool use before and after RL shows that RL not only optimizes accuracy but also shapes task-adaptive calling patterns.
This "on-demand" behavior is what sets DeepEyesV2 apart from traditional models.
Analysis shows that after the cold start the model already matches tasks to tools at a basic level; RL strengthens this matching and pushes the model toward cross-tool combinations.
DeepEyesV2 exhibits clear tool preferences per task type.
For real-world perception it favors cropping to inspect details; for OCR it leans on marking plus numeric tools; for charts it uses more arithmetic.
For math reasoning, calculation tools dominate; for search tasks, search tools do.
After RL, the model uses more numeric operations and more combinations of search with image processing, evidence that RL fosters cross-tool collaboration.
After the cold start alone, the model over-calls tools (on more than 90% of tasks), which is inefficient.
After RL, the call rate drops and becomes adaptive: the model calls tools only when they help, improving efficiency.
Tracking the RL training dynamics shows that output length and the average number of tool calls both decrease, while the variance in call counts stays high.
In other words, the model does not settle on a fixed number of calls (say, one tool call per problem).
Instead it thinks adaptively, calling tools selectively when they are needed.
On complex problems the call count stays high: the model adjusts dynamically to difficulty, which is genuinely adaptive reasoning.
Conclusion
In summary, the team explores how to build agentic multimodal models that proactively call and integrate tools during reasoning, approaching it from the angles of training, data design, and evaluation.
The analysis reveals task-dependent tool-use behaviors and shows that RL can teach complex, context-aware tool combinations.
Extensive experiments on perception, reasoning, and search tasks demonstrate DeepEyesV2's strong reasoning ability and highlight the synergy between tools and reasoning.
Paper: https://arxiv.org/pdf/2511.05271
Project homepage: https://visual-agent.github.io/