Think While Drawing! Multimodal Reasoning Achieves Significant Improvement!

Why is spatial reasoning a weakness for visual-language models?

Imagine trying to find a shelf in IKEA's labyrinthine warehouse. A human would look at a map and sketch a route, but current large vision-language models (LVLMs) can only describe the path in words: "turn left, turn right..." – and end up going around in circles! The paper makes a sharp observation: text cannot precisely express spatial relationships. An object's trajectory, for example, can only be vaguely described in language as "from A to B then to C," while the task actually requires tracking pixel-level coordinate changes.

GPT-4o getting lost vs. ViLaSR's precisely drawn route

Even more frustrating, existing methods rely on external perception tools (such as object detectors), which is like handing someone glasses with a fixed, narrow field of view. When the tool misidentifies something, the model has no way to correct it, so errors accumulate. "This is like teaching AI calculus with an abacus," the authors quip in the introduction.


Paper: Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

Address: https://arxiv.org/pdf/2506.09965

Method: "Think While Drawing" Like Humans

The core of ViLaSR is to let the model draw directly on the image as it reasons, just like a human working out a problem on scratch paper. This is achieved through two main operations (see the sketch after this list):

Bounding Box Localization: Using bounding boxes to pinpoint object locations (e.g., "sofa is in the bottom left")

Line Analysis: Using auxiliary lines to measure distances and angles (e.g., "air conditioner is 1.5 meters from the window")
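A rough sketch of what these two operations could look like in code, using Pillow for the drawing; the function names, coordinates, and image paths below are illustrative assumptions rather than the paper's implementation:

```python
from PIL import Image, ImageDraw

def draw_bbox(img, box, label, color="red"):
    """Mark an object with a labeled bounding box, e.g. the sofa in the bottom left."""
    draw = ImageDraw.Draw(img)
    draw.rectangle(box, outline=color, width=3)
    draw.text((box[0], box[1] - 12), label, fill=color)
    return img

def draw_line(img, start, end, label, color="blue"):
    """Draw an auxiliary line between two points and label it (e.g. a distance)."""
    draw = ImageDraw.Draw(img)
    draw.line([start, end], fill=color, width=2)
    mid = ((start[0] + end[0]) // 2, (start[1] + end[1]) // 2)
    draw.text(mid, label, fill=color)
    return img

# Example: annotate one reasoning step on a scratch copy of a frame.
frame = Image.open("room.jpg").convert("RGB")          # hypothetical input image
frame = draw_bbox(frame, (40, 300, 220, 460), "sofa")  # localize an object
frame = draw_line(frame, (500, 80), (620, 80), "1.5 m")  # measure with an auxiliary line
frame.save("room_step1.jpg")
```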

Three-Stage Training works like teaching a child to draw:

1. Cold Start: Basic drawing instruction on synthetic data (like copying from calligraphy practice sheets)

2. Reflection Training: Keeping only the traces where the model corrects its own drawing steps and still answers correctly (like a teacher grading homework; see the sketch after this list)

3. Reinforcement Learning: Optimizing the drawing strategy with a reward signal (like bonus points on an exam)
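To make the second stage concrete, here is a runnable toy version of reflection filtering: keep only rollouts that both revise an earlier drawing step and still reach the correct answer. The `Rollout` structure and the keyword heuristic are assumptions for illustration; the paper filters real model traces:

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    steps: list   # textual trace of reasoning / drawing actions
    answer: str
    gold: str

def shows_reflection(rollout: Rollout) -> bool:
    # Crude textual proxy: the trace revisits or corrects an earlier drawing step.
    markers = ("redraw", "re-locate", "correct the box", "backtrack")
    return any(m in step.lower() for step in rollout.steps for m in markers)

def filter_for_reflection(rollouts):
    # Keep only traces that both self-correct and end with the right answer.
    return [r for r in rollouts if shows_reflection(r) and r.answer == r.gold]

demo = [
    Rollout(["draw box on phone", "size looks wrong, redraw box in a later frame",
             "measure distance"], answer="1.2 m", gold="1.2 m"),
    Rollout(["draw box on remote", "measure distance"], answer="4.0 m", gold="1.2 m"),
]
print(len(filter_for_reflection(demo)))  # -> 1: only the self-correcting trace survives
```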

Key formula: Reward function design

Reward = answer correctness + drawing-format score

(the drawing-format score is only added when the answer is sufficiently correct, preventing the model from "drawing beautifully but answering everything wrong")
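A minimal runnable sketch of this gated reward, assuming a simple threshold and two pre-computed sub-scores (neither is the paper's exact formulation):

```python
def vilasr_reward(answer_score: float, format_score: float, gate: float = 1.0) -> float:
    """Answer correctness always counts; the drawing-format bonus is added only
    when the answer is good enough, so the model cannot farm reward by drawing
    neat boxes while answering wrong."""
    reward = answer_score
    if answer_score >= gate:      # gate the format bonus on correctness
        reward += format_score
    return reward

print(vilasr_reward(1.0, 0.5))   # correct answer -> 1.5
print(vilasr_reward(0.0, 0.5))   # wrong answer   -> 0.0 (no format bonus)
```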

Three-stage training flowchart

Experimental Results

Across five major spatial reasoning benchmarks, ViLaSR came out ahead of the compared models:

Maze navigation accuracy 98.2% (49.4% higher than GPT-4o)

Video object tracking accuracy improved by 12.7%

Multi-view reasoning win rate exceeded open-source models by 30%


The most striking result is from the ablation study: reflection training increased the model's self-correction behavior by 96.5%! When the model learned to question its own drawing results, the error rate dropped dramatically. For example, when measuring room dimensions, a model without reflection training would draw arbitrary lines leading to a 20% error, while ViLaSR would repeatedly calibrate bounding box positions.


Case Studies: How the Model "Solves Cases by Drawing"

Case 1: Ultimate Maze Challenge

GPT-4o: purely textual reasoning ran into contradictions such as "should turn right after turning left"

ViLaSR:

1. Draws a red line to mark the starting point

2. Gradually extends a blue line according to instructions

3. Discovers a dead end, then backtracks and reroutes, eventually drawing a complete green path (a toy sketch of this behavior follows below)
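As a toy analogy for this backtrack-and-reroute behavior, the depth-first search below extends a path, hits a dead end, pops back, and continues; this is only an illustration of the drawn trace, not how ViLaSR itself navigates:

```python
def solve_maze(maze, start, goal):
    """maze: list of equal-length strings, '#' = wall. Returns a path or None."""
    rows, cols = len(maze), len(maze[0])
    path, seen = [], set()

    def dfs(cell):
        r, c = cell
        if cell in seen or not (0 <= r < rows and 0 <= c < cols) or maze[r][c] == "#":
            return False
        seen.add(cell)
        path.append(cell)              # extend the drawn path
        if cell == goal:
            return True
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if dfs(nxt):
                return True
        path.pop()                     # dead end: backtrack and reroute
        return False

    return path if dfs(start) else None

maze = ["S..#",
        ".#.#",
        "...G"]
print(solve_maze(maze, (0, 0), (2, 3)))
# -> [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]
```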


Case 2: Finding a Phone in Video

Requirement: Calculate the phone's movement distance in surveillance video

Traditional model: boxed the wrong object (mistook a remote control for the phone)

ViLaSR:

1. Frame 5: Draws a box marking a suspected phone → discovers the size is incorrect

2. Frame 12: Re-locates the real phone

3. Uses the known size of an earphone as a reference scale to convert the pixel distance into real-world distance (see the worked example below)
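A worked example of that reference-scale step; the earphone width and pixel measurements below are made-up numbers, not values from the paper:

```python
def pixels_to_meters(pixel_dist, ref_pixels, ref_meters):
    """Scale a pixel distance by the meters-per-pixel ratio of a reference object."""
    return pixel_dist * (ref_meters / ref_pixels)

earphone_width_px = 45    # measured width of the earphone case in the frame (assumed)
earphone_width_m = 0.06   # assumed real-world width, about 6 cm
phone_travel_px = 900     # pixel distance between the phone's two bounding boxes

print(pixels_to_meters(phone_travel_px, earphone_width_px, earphone_width_m))  # -> 1.2 (meters)
```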


Industry Significance: Disruptive Breakthrough for Robotics and AR

This research solves a major pain point for AI deployment – the lack of spatial common sense. Previously, robots often failed to grasp items because they couldn't understand "the cup is 5 centimeters in front of the plate." ViLaSR's drawing-based reasoning grants machines the ability to internalize spatial thinking, and experimental results have already shown its potential in robotic arm operations.

Even more exciting, the team has open-sourced all resources:

Code: https://github.com/AntResearchNLP/ViLaSR

Model: https://huggingface.co/AntResearchNLP/ViLaSR

Developers can quickly deploy it to scenarios like robot vacuum cleaners and AR navigation.

"When AI learns to draw, the singularity of machine cognition elevation is at hand."

Main Tag: Artificial Intelligence

Sub Tags: Visual Language Models · Multimodal AI · Machine Learning · Spatial Reasoning

