Why is spatial reasoning a weakness for vision-language models?
Imagine trying to find a shelf in IKEA's labyrinthine warehouse. A human would look at a map and sketch a route, but current large vision-language models (LVLMs) can only describe the route in words ("turn left, turn right...") and end up going around in circles. The paper's central point is that text cannot precisely express spatial relationships. For example, text might vaguely describe an object's trajectory as "from A to B, then to C", while the task actually calls for tracking pixel-level coordinate changes.
Even more frustrating, existing methods rely on external perception tools (like object detectors), which is akin to handing someone a pair of glasses that only partly corrects their vision. When the tool misidentifies something, the model has no way to correct the mistake, so errors accumulate. "This is like teaching AI calculus with an abacus," the authors metaphorically state in the introduction.
Paper: Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
Address: https://arxiv.org/pdf/2506.09965
Method: "Think While Drawing" Like Humans
The core of ViLaSR is to let the model draw as part of its reasoning, just like humans working out problems on scratch paper. This is achieved through two main operations:
Bounding Box Localization: Using bounding boxes to pinpoint object locations (e.g., "sofa is in the bottom left")
Line Analysis: Using auxiliary lines to measure distances and angles (e.g., "air conditioner is 1.5 meters from the window")
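As a rough illustration of what these two operations compute, here is a minimal Python sketch. It is not ViLaSR's actual interface: the Box type, the line_between helper, and all coordinates are hypothetical, chosen only to show how a bounding box yields a location and how an auxiliary line between two boxes yields a pixel length and an angle.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def center(self) -> Tuple[float, float]:
        """Midpoint of the box, used as the object's location."""
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def line_between(a: Box, b: Box) -> Tuple[float, float]:
    """Return (length in pixels, angle in degrees) of the line joining two box centers.

    The angle is measured from the positive x-axis. Image y grows downward,
    so a positive angle points toward the bottom of the image.
    """
    (ax, ay), (bx, by) = a.center(), b.center()
    dx, dy = bx - ax, by - ay
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))


# Made-up coordinates for two objects in one image.
air_conditioner = Box(0, 0, 20, 20)    # center at (10, 10)
window = Box(30, 40, 50, 60)           # center at (40, 50)
length_px, angle_deg = line_between(air_conditioner, window)
```

The length here is in pixels. Turning it into a figure like "1.5 meters" needs a scale reference, which this sketch deliberately leaves out.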
Three-Stage Training is like teaching a child to draw:
1. Cold Start: Basic drawing instruction using synthetic data (copying calligraphy practice sheets)
2. Reflection Training: Filtering training answers to keep those that show self-correction (teacher grading homework)
3. Reinforcement Learning: Optimizing drawing strategy with reward mechanisms (bonus points for exams)
Key formula: Reward function design
Model Score = Answer Correctness + Drawing Standardization
(The drawing standardization score is only counted when answer correctness passes a threshold, preventing the model from "drawing beautifully but answering everything wrong")
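The gating rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: the function name total_reward, the [0, 1] score range, and the default threshold of 0.5 are all hypothetical.

```python
def total_reward(correctness: float, drawing_format: float, threshold: float = 0.5) -> float:
    """Answer correctness plus a gated drawing-standardization bonus.

    Both scores are assumed to lie in [0, 1]. The threshold value is an
    assumption made for this sketch.
    """
    if not (0.0 <= correctness <= 1.0 and 0.0 <= drawing_format <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    # The drawing bonus only counts once the answer itself is good enough.
    bonus = drawing_format if correctness >= threshold else 0.0
    return correctness + bonus


good_answer = total_reward(correctness=1.0, drawing_format=1.0)
pretty_but_wrong = total_reward(correctness=0.0, drawing_format=1.0)
```

With this gating, a response with perfect drawing format but a wrong answer earns nothing, so the model cannot raise its score through formatting alone.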
Experimental Results
In five major spatial reasoning tests, ViLaSR outperformed all competitors:
Maze navigation accuracy 98.2% (49.4 percentage points higher than GPT-4o)
Video object tracking accuracy improved by 12.7%
Multi-view reasoning win rate exceeded open-source models by 30%
The most striking result is from the ablation study: reflection training increased the model's self-correction behavior by 96.5%! When the model learned to question its own drawing results, the error rate dropped dramatically. For example, when measuring room dimensions, a model without reflection training would draw arbitrary lines leading to a 20% error, while ViLaSR would repeatedly calibrate bounding box positions.
Case Studies: How the Model "Solves Cases by Drawing"
Case 1: Ultimate Maze Challenge
GPT-4o: Its textual reasoning contradicted itself, e.g. "should turn right after turning left"
ViLaSR:
1. Draws a red line to mark the starting point
2. Gradually extends a blue line according to instructions
3. Discovers a dead end, then backtracks and reroutes, eventually drawing a complete green path
Case 2: Finding a Phone in Video
Task: Calculate how far the phone moves in surveillance video
Traditional Model: Boxed the wrong object (mistook a remote control for the phone)
ViLaSR:
1. Frame 5: Draws a box marking a suspected phone → discovers the size is incorrect
2. Frame 12: Re-locates the real phone
3. Uses the size of an earphone as a scale reference to convert the measured distance
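The scale conversion in the last step is simple proportional arithmetic, sketched below in Python. All names and numbers are hypothetical (the article does not give the real measurements), and the sketch assumes the reference object and the phone sit at a similar distance from the camera, so that perspective distortion can be ignored.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def path_length_px(points: Sequence[Point]) -> float:
    """Total length in pixels of the polyline through the tracked positions."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def pixels_to_real(distance_px: float, reference_px: float, reference_real: float) -> float:
    """Convert a pixel distance to real-world units using a reference object.

    reference_px is the reference object's length in pixels and
    reference_real is its known real-world length (any unit).
    """
    if reference_px <= 0:
        raise ValueError("reference_px must be positive")
    return distance_px * reference_real / reference_px


# Made-up example: phone centers in two frames, and an earphone that
# spans 20 px and is assumed to be 4 cm long.
phone_track = [(100.0, 200.0), (160.0, 280.0)]
moved_px = path_length_px(phone_track)
moved_cm = pixels_to_real(moved_px, reference_px=20.0, reference_real=4.0)
```

In this example the phone moves 100 px, and with 20 px corresponding to 4 cm that works out to 20 cm.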
Industry Significance: Disruptive Breakthrough for Robotics and AR
This research addresses a major pain point for AI deployment: the lack of spatial common sense. Previously, robots often failed to grasp items because they could not interpret a relation like "the cup is 5 centimeters in front of the plate." ViLaSR's drawing-based reasoning lets the model internalize spatial thinking, and experimental results have already shown its potential in robotic arm operations.
Even more exciting, the team has open-sourced all resources:
Code: https://github.com/AntResearchNLP/ViLaSR
Model: https://huggingface.co/AntResearchNLP/ViLaSR
Developers can quickly deploy it to scenarios like robot vacuum cleaners and AR navigation.
"When AI learns to draw, the singularity of machine cognition elevation is at hand."