Thinking with Images Only: Reinforcement Learning Forges a New Reasoning Paradigm That Maximizes Planning in Complex Scenes!

Synced Review

Edited by: Panda, +0

In recent years, LLMs and their multimodal extensions (MLLMs) have continuously improved their reasoning capabilities across various tasks. However, existing MLLMs primarily rely on text as a medium for expressing and constructing reasoning processes, even when dealing with visual information.

Figure: Common MLLM architecture.

This paradigm requires the model to first "translate" or "map" visual information into textual descriptions or internal text-based tokens, and then leverage the text reasoning capabilities of large language models for processing.

This conversion process inevitably leads to the loss or weakening of the rich details, spatial relationships, and dynamic features inherent in visual information, creating what is known as the "modality gap." This gap not only limits the model's fine-grained perception of the visual world but also hampers its ability to plan effectively in complex visual scenarios.

For example, while models can identify objects in images and describe relatively simple spatial relationships between them, their performance remains limited by the loss of visual detail during text conversion whenever a task demands precise localization or a deep understanding of highly complex, dynamic, or implicit interactions between objects (rather than mere recognition of surface phenomena).


A research team from Cambridge, University College London, and Google believes that language is not necessarily always the most natural or effective modality for reasoning, especially in task scenarios involving spatial and geometric information.


Based on this motivation, the research team proposed a new reasoning and planning paradigm—Visual Planning. This paradigm performs planning entirely based on visual representations, completely independent of the text modality.


Paper Title: Visual Planning: Let’s Think Only with Images

Paper Address: https://arxiv.org/pdf/2505.11409

Code Repository: https://github.com/yix8/VisualPlanning

Under this framework, planning encodes the reasoning process within the visual domain step-by-step through a series of images, similar to how humans plan future actions by sketching or imagining visual scenes.

Figure: Comparison of reasoning paradigms. Traditional methods (top and middle rows) tend to generate verbose and inaccurate text plans, while the visual planning paradigm (bottom row) directly predicts the next visual state, forming a purely image-based state trajectory without language mediation.

To support this method, the research team proposed an innovative reinforcement learning framework: Visual Planning via Reinforcement Learning (VPRL). The framework uses Group Relative Policy Optimization (GRPO) as its core optimization method to strengthen the planning capabilities of large vision models in a post-training stage.

The method achieved significant performance improvements on several typical visual navigation tasks, including FROZENLAKE, MAZE, and MINIBEHAVIOR. Experimental results show that the purely visual planning paradigm proposed by the research team is more effective than all other planning variants that reason purely in the text space.

The three task environments are as follows:

FrozenLake: This is a stochastic gridworld environment where the agent must start from a specified point and safely reach the target location, avoiding falling into "ice holes" along the way.


Maze: The agent receives an initial image showing the maze layout. Its task is to navigate the maze from the starting point (green marker) to the destination (red flag).


MiniBehavior: The agent first needs to move from the starting point to the printer's location and "pick it up," then transport the printer to the table and "put it down."


This research not only proves that visual planning is a feasible alternative but also reveals its immense potential in tasks requiring intuitive image reasoning, opening up a new direction for the field of image perception and reasoning.

Reinforcement Learning-Driven Visual Planning

Visual Planning Paradigm

Most previous visual reasoning benchmark tasks are usually solved by mapping visual information to the text domain, for example, by converting it into object names, attributes, or relational annotation labels, and then performing a few steps of language reasoning based on this.

However, once visual content is converted into a textual representation, the task degenerates into a pure language reasoning problem, at which point the language model can complete the reasoning without reintroducing visual modality information during the process.

The visual planning paradigm proposed by the research team is fundamentally different from the methods mentioned above. It performs planning purely in the visual modality. The research team formally defines visual planning as: given an initial image v₀, generate an intermediate image sequence T = (ˆv₁, ..., ˆvₙ), where each ˆvᵢ represents a visual state, together forming a visual planning trajectory. Specifically, let π_θ be a parameterized generative visual model. This visual planning trajectory is generated autoregressively, with each intermediate visual state ˆvᵢ sampled given the initial state and previously generated states:

ˆvᵢ ∼ π_θ(ˆvᵢ | v₀, ˆv₁, ..., ˆvᵢ₋₁),  i = 1, ..., n    (1)
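To make the formulation concrete, below is a minimal sketch of how such a trajectory could be rolled out with an autoregressive image generator. The VisualPolicy interface and its sample_next_state method are hypothetical placeholders for illustration, not the authors' actual API.

```python
from typing import List

class VisualPolicy:
    """Hypothetical interface for an autoregressive visual generator pi_theta.

    In practice this would wrap a visual tokenizer/detokenizer (e.g., a VQ
    codebook) and a transformer that models sequences of visual tokens.
    """

    def sample_next_state(self, prefix_states: List["Image"]) -> "Image":
        """Sample v_hat_{i+1} ~ pi_theta( . | v_0, v_hat_1, ..., v_hat_i)."""
        raise NotImplementedError


def rollout_visual_plan(policy: VisualPolicy, v0: "Image", n_steps: int) -> List["Image"]:
    """Generate a purely visual planning trajectory (v_hat_1, ..., v_hat_n)."""
    states = [v0]
    for _ in range(n_steps):
        # Each intermediate state is conditioned on the full visual prefix.
        states.append(policy.sample_next_state(states))
    return states[1:]  # the generated intermediate visual states
```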

Reinforcement Learning in Large-Scale Visual Models

Reinforcement Learning (RL) has shown significant advantages in optimizing autoregressive models. It trains using sequence-level reward signals, breaking through the limitations of traditional token-level supervision signals. In autoregressive image generation tasks, images are represented as sequences of visual tokens.

Inspired by the successful application of RL in language reasoning tasks, the research team introduced an RL-based training framework, built on GRPO, to support visual planning with large vision models. The method computes rewards from the transitions between visual states and verifies whether the generated plan satisfies the environment's constraints.

To train a policy model capable of generating effective actions and maintaining exploration diversity during the RL phase, the research team proposed an innovative two-stage reinforcement learning framework:

Stage 1: Policy Initialization. In this stage, the research team uses supervised learning to initialize the visual generative model π_θ with trajectories generated by random walks in the environment. The goal is for the model to produce valid sequences of visual states while retaining sufficient exploratory behavior in the "simulated" environment.

During training, each trajectory consists of a sequence of visual states (v₀, ..., vₙ). From each trajectory, the research team extracts n−1 image-pair samples (v≤ᵢ, vᵢ₊₁), where v≤ᵢ denotes the prefix sequence (v₀, ..., vᵢ). Given the input prefix, the model is then exposed to a candidate set of next states {vᵢ₊₁^(j)}_{j=1}^K drawn from K valid trajectories that share the same prefix. To prevent the model from overfitting to any single transition and to preserve randomness in generation, the research team randomly samples one candidate state vᵢ₊₁^(ℓ) as the supervision target at each training step and optimizes the model by minimizing the visual planning via fine-tuning (VPFT) loss:

ℒ_VPFT(θ) = −E_{(v≤ᵢ, vᵢ₊₁^(ℓ))} [ log π_θ(vᵢ₊₁^(ℓ) | v≤ᵢ) ]    (2)

Figure: Overview of the proposed VPRL framework. The figure illustrates the application of this framework in visual navigation tasks, utilizing an autoregressive large vision model for image generation. GRPO is used to train the visual policy model, and a progress reward function is introduced to encourage progressive actions and penalize illegal behaviors, thereby achieving goal-aligned visual planning.

Overall, this stage primarily serves as a warm-start process for the subsequent reinforcement learning stage, aiming to improve the coherence of generated images and the overall planning quality.
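As a rough illustration of this warm-start objective, the sketch below performs one supervised update in which a single candidate next state is drawn from the K valid continuations sharing the same prefix and used as the cross-entropy target over its visual tokens. The token-level interface (a model that maps a token sequence to per-position next-token logits) is an assumption for illustration, not the paper's released training code.

```python
import random

import torch
import torch.nn.functional as F


def vpft_stage1_step(model, optimizer, prefix_tokens, candidate_next_tokens):
    """One Stage-1 (policy initialization) update on random-walk data.

    prefix_tokens:         LongTensor [T_prefix], visual tokens of (v_0, ..., v_i)
    candidate_next_tokens: list of K LongTensors, visual tokens of valid v_{i+1}
    """
    # Randomly pick one of the K valid next states as the supervision target,
    # so the initialized policy stays stochastic instead of collapsing onto
    # a single transition.
    target = random.choice(candidate_next_tokens)

    # Teacher forcing: predict the target tokens conditioned on the prefix.
    inputs = torch.cat([prefix_tokens, target[:-1]])
    logits = model(inputs.unsqueeze(0))          # [1, T, vocab_size]
    preds = logits[0, len(prefix_tokens) - 1:]   # positions that predict the target tokens
    loss = F.cross_entropy(preds, target)        # -log pi_theta(v_{i+1} | v_<=i)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```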

Stage 2: Reinforcement Learning for Visual Planning. After initialization in the first stage, the model possesses strong exploratory capabilities, which are crucial for reinforcement learning to ensure the model covers various state transition paths and avoids falling into suboptimal policies. In the second stage, the model gradually learns an effective visual planning policy by simulating future states (i.e., consequences of potential actions) and receiving reward feedback based on the generation results.

Specifically, given the current input prefix v≤ᵢ, the old policy π_θ^old samples G candidate intermediate states {ˆvᵢ₊₁^(1), ..., ˆvᵢ₊₁^(G)}. Each candidate represents the next visual state that would result from the agent taking some action a^(k) at time step i. The research team uses a rule-based parsing function to map the state pair (vᵢ, ˆvᵢ₊₁^(k)) to a discrete action for structured interpretation.

Subsequently, the research team designed a composite reward function r (vᵢ, ˆvᵢ₊₁^(k)) to score each candidate state, measuring whether the candidate state represents an effective progression towards the target state (i.e., its utility).

Unlike traditional reinforcement learning which relies on learning a value function estimator (critic), GRPO calculates advantage values through relative comparisons within candidate groups, thereby providing training signals that are easier to interpret and more computationally efficient. The relative advantage A^(k) for each candidate is calculated as:

A^(k) = ( r(vᵢ, ˆvᵢ₊₁^(k)) − mean({r(vᵢ, ˆvᵢ₊₁^(j))}_{j=1}^G) ) / std({r(vᵢ, ˆvᵢ₊₁^(j))}_{j=1}^G)
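In code, the group-relative advantage amounts to standardizing the rewards within each candidate group; a minimal sketch (with a small epsilon added for numerical stability, which the formula itself omits):

```python
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """A^(k) = (r^(k) - mean(r)) / std(r), computed within one group of G candidates.

    rewards: tensor of shape [G] holding r(v_i, v_hat_{i+1}^(k)) for each candidate.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Example: one optimal candidate (+1), two non-progressive (0), one invalid (-5).
advantages = group_relative_advantages(torch.tensor([1.0, 0.0, 0.0, -5.0]))
```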

To guide the model to produce better candidate responses and reinforce the tendency towards high-advantage behaviors, the research team updates the policy according to the following objective function:

J(θ) = E_{v≤ᵢ ∼ D, {ˆvᵢ₊₁^(k)}_{k=1}^G ∼ π_θ^old} [ (1/G) Σ_{k=1}^G min( ρ^(k) A^(k), clip(ρ^(k), 1−ε, 1+ε) A^(k) ) ]

Here D refers to the prefix distribution, ρ^(k) = π_θ(ˆvᵢ₊₁^(k) | v≤ᵢ) / π_θ^old (ˆvᵢ₊₁^(k) | v≤ᵢ) is the importance sampling ratio, and ε is the clipping threshold that bounds each policy update.
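A schematic PyTorch version of this clipped, group-relative objective is sketched below; the per-candidate log-probabilities are assumed to be summed over that candidate's visual tokens, and the clipping threshold value is illustrative rather than taken from the paper.

```python
import torch


def grpo_policy_loss(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """Clipped surrogate loss averaged over a group of G candidate next states.

    logp_new:   [G] log pi_theta(v_hat^(k) | v_<=i) under the current policy
    logp_old:   [G] log-probabilities under the frozen behavior policy pi_theta_old
    advantages: [G] group-relative advantages A^(k)
    """
    ratio = torch.exp(logp_new - logp_old)                  # importance ratio rho^(k)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    return -surrogate.mean()  # maximizing the objective = minimizing its negative
```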

Reward Design. Unlike discrete operations or text tokens, visual outputs are often high-dimensional sparse information that is difficult to directly decompose into interpretable units. Under the research team's visual planning framework, the core challenge lies in how to determine whether a generated visual state can accurately express the corresponding planning action. Therefore, reward design focuses on evaluating progression towards the target state, considering environmental constraints.

To interpret the action plan implied by the transition from state vᵢ to candidate state ˆvᵢ₊₁^(k), the research team defines a state-action parsing function P: V × V → A ∪ E, where A is the set of valid actions and E is the set of illegal state transitions (e.g., actions violating physical constraints).


This process can be completed with the help of independent image segmentation components or rule-based scripts to parse interpretable action units from pixel-level data.
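As an illustration of what such a rule-based parser could look like in a gridworld setting, the sketch below compares the agent's cell position extracted from two rendered states and maps the change to a discrete action. The locate_agent helper, the walls set, and the action names are hypothetical stand-ins, not the paper's actual implementation.

```python
from typing import Optional, Tuple

# Hypothetical helper: find the agent's (row, col) cell in a rendered grid image,
# e.g. via color segmentation or template matching of the agent sprite.
def locate_agent(state_image) -> Tuple[int, int]:
    raise NotImplementedError


VALID_MOVES = {(-1, 0): "up", (1, 0): "down", (0, -1): "left", (0, 1): "right"}


def parse_action(state_image, next_state_image, walls) -> Optional[str]:
    """Map a pair of visual states (v_i, v_hat_{i+1}) to a discrete action,
    or return None for an illegal transition (an element of E), e.g. a jump
    of more than one cell or a move into a wall."""
    r0, c0 = locate_agent(state_image)
    r1, c1 = locate_agent(next_state_image)
    delta = (r1 - r0, c1 - c0)
    if delta not in VALID_MOVES or (r1, c1) in walls:
        return None                # invalid state transition
    return VALID_MOVES[delta]      # a valid action from A
```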

Once actions are identified, the research team introduces a "progress map" D (v) ∈ ℕ, which indicates the remaining number of steps or effort required to reach the target state from a given visual state v. By comparing the relative change in the progress map between the current state and the generated state, the research team divides the action set A ∪ E into three categories:

Aₒₚₜ = {a ∈ A : D(ˆvᵢ₊₁^(k)) < D(vᵢ)} (optimal, progressive actions), Aₙₒₚₜ = {a ∈ A : D(ˆvᵢ₊₁^(k)) ≥ D(vᵢ)} (valid but non-progressive actions), and E (invalid transitions).

Accordingly, the research team proposes the progress reward function r (vᵢ, ˆvᵢ₊₁^(k)):

r(vᵢ, ˆvᵢ₊₁^(k)) = αₒₚₜ if it is an optimal, progressive action; αₙₒₚₜ if it is a valid but non-progressive action; αᵢₙᵥ if it is an invalid action.

In the experiments, the research team set αₒₚₜ = 1, αₙₒₚₜ = 0, and αᵢₙᵥ = −5, thereby encouraging progressive behavior and penalizing infeasible state transitions.
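Putting these pieces together, a sketch of the resulting progress reward under the coefficients above might look as follows, where progress_map plays the role of D(v) (remaining steps to the goal from each cell) and parse_action / locate_agent are the hypothetical helpers sketched earlier:

```python
ALPHA_OPT, ALPHA_NOPT, ALPHA_INV = 1.0, 0.0, -5.0


def progress_reward(state_image, next_state_image, progress_map, walls) -> float:
    """r(v_i, v_hat_{i+1}^(k)): reward the generated state by its progress toward the goal."""
    action = parse_action(state_image, next_state_image, walls)
    if action is None:
        return ALPHA_INV           # invalid transition (element of E)

    d_before = progress_map[locate_agent(state_image)]
    d_after = progress_map[locate_agent(next_state_image)]
    if d_after < d_before:
        return ALPHA_OPT           # optimal, progressive action
    return ALPHA_NOPT              # valid but non-progressive action
```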

System Variants

In addition to the proposed VPRL backbone framework, to comprehensively evaluate the impact of supervision methods (language vs. image) and optimization methods (supervised fine-tuning vs. reinforcement learning) on performance, the research team proposed several system variants as baseline comparisons:

Visual Planning via Fine-Tuning (VPFT). The research team proposed Visual Planning via Fine-Tuning (VPFT) as a simplified version of the framework. Its training procedure mirrors Stage 1 above, but it uses optimal planning trajectories instead of random-walk ones. For each environment, the research team sampled a minimum-step optimal trajectory (v₀^opt, v₁^opt, ..., vₙ^opt) that leads from the initial state v₀^opt = v₀ to the target state. At each step, the model learns to predict the next state vᵢ₊₁^opt based on the current prefix v≤ᵢ^opt. The training objective is the same as Equation (2), with the optimal trajectory as the supervision signal.

Language-based Supervised Fine-Tuning (SFT). In this comparative method, the planning task is constructed in the language modality. Unlike generating intermediate states in image form, the model needs to generate text descriptions of action sequences. Formally, given an input visual state v and a task description text prompt p, the model is trained to output an action sequence t = (t₁, ..., t_L), where each token tᵢ ∈ V_text represents an action. The model's input is a concatenation of prompt tokens and visual tokens, and the target is the corresponding text action sequence. The research team adopted the supervised fine-tuning method commonly used in autoregressive models to learn action prediction by minimizing the cross-entropy loss:

ℒ_SFT(θ) = −E_{(v, p, t)} [ Σ_{l=1}^L log π_θ(tₗ | t_{<l}, v, p) ]
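For comparison, this language-based baseline reduces to ordinary next-token cross-entropy over the textual action sequence. A schematic version, assuming a model that consumes the concatenated prompt and visual tokens and returns per-position next-token logits:

```python
import torch
import torch.nn.functional as F


def language_sft_loss(model, prompt_tokens, visual_tokens, action_tokens):
    """Cross-entropy loss on the text action sequence t = (t_1, ..., t_L)."""
    context = torch.cat([prompt_tokens, visual_tokens])   # model input prefix
    inputs = torch.cat([context, action_tokens[:-1]])     # teacher forcing
    logits = model(inputs.unsqueeze(0))                   # [1, T, vocab_size]
    preds = logits[0, len(context) - 1:]                  # positions predicting each t_l
    return F.cross_entropy(preds, action_tokens)
```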

How Does Visual Planning Perform in Experiments?

The team examined the practical performance of this new visual planning paradigm on several representative tasks.

Specifically, to compare visual planning with language-based planning, the team experimented with three visual navigation environments: FROZENLAKE, MAZE, and MINIBEHAVIOR. All these environments can be solved in both modalities, making it easier to compare the two strategies.

In terms of models, the team chose models trained entirely on visual data—these models had not been exposed to any text data during pre-training.

Specifically, they selected the large vision model LVM-3B as the backbone network and trained it with the VPFT and VPRL methods. The language-based baselines included Qwen 2.5-VL-Instruct under different settings, as well as Gemini 2.0 Flash (gemini-2.0-flash-002) and the advanced reasoning model Gemini 2.5 Pro (gemini-2.5-pro-preview-03-25).

The evaluation metrics used were Exact Match (EM) and Progress Rate (PR).
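The article does not spell out how EM and PR are computed. Under one plausible reading (EM counts a plan as correct only if every generated step parses to a valid action and the final state reaches the goal; PR measures the fraction of the initially required progress that the plan actually achieves), the metrics could be sketched as follows; this is an assumed reading, not necessarily the paper's exact protocol:

```python
from typing import List


def progress_rate(remaining: List[int]) -> float:
    """remaining[i] = D(v_i), the remaining steps to the goal after step i
    (remaining[0] is the distance from the initial state)."""
    required = remaining[0]
    achieved = remaining[0] - min(remaining)
    return achieved / required if required > 0 else 1.0


def exact_match(remaining: List[int], all_steps_valid: bool) -> bool:
    """A plan counts as an exact match only if every step was a valid action
    and the trajectory ends exactly at the goal."""
    return all_steps_valid and remaining[-1] == 0
```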

So, how does visual planning perform?

Visual Planning Outperforms Text Planning


As Table 1 of the paper shows, the visual planners (VPFT and VPRL) achieved the highest scores on all tasks, outperforming all baselines that rely on language reasoning.

Under the same supervised fine-tuning regime, VPFT exceeded language-based SFT by more than 22% on average in Exact Match (EM), and VPRL's advantage was even greater. A similar trend was observed for Progress Rate (PR).

These results indicate that the visual planning paradigm holds a significant advantage on vision-centric tasks, where language-driven methods may be poorly matched to the task structure. Inference-only models, whether large closed-source systems or small open-source MLLMs, struggle with these planning tasks when they are not fine-tuned for them. Even the advanced reasoning model Gemini 2.5 Pro achieved EM and PR scores of only around 50% or lower on the more complex MAZE and MINIBEHAVIOR tasks, suggesting that current state-of-the-art language models still struggle with these challenges, even though the tasks are intuitive for humans.

Reinforcement Learning Delivers Gains

The two-stage reinforcement learning method VPRL delivered the highest overall performance, surpassing other variants. After the second stage, the model achieved nearly perfect planning on the simpler FROZENLAKE task (91.6% EM, 93.2% PR) and maintained strong performance on the MAZE and MINIBEHAVIOR tasks. Performance on all tasks was over 20% higher than VPFT.

As expected, the first stage of the team's reinforcement learning training (forcing output format but not teaching planning behavior) yielded near-random performance (e.g., 11% EM on the FROZENLAKE dataset). After comprehensive optimization in the second stage using the newly proposed reward scheme, the planner achieved optimal performance. This improvement highlights a key advantage of reinforcement learning over SFT: VPRL allows the model to freely explore various actions and learn from their results, while VPFT relies on imitation and tends to fit the training distribution. By encouraging exploitation through reward-driven updates, VPRL learned to capture underlying rules and patterns, leading to more robust planning performance.

The paper also presents a visualized comparison of the different planning approaches.

Robustness with Increasing Complexity

When studying how the different methods perform as task difficulty varies (larger grids are generally harder), the team found that reinforcement learning retained its advantage.


As shown in Figure 5 of the paper, in the FROZENLAKE environment, as the grid size increased from 3×3 to 6×6, Gemini 2.5 Pro's EM score plummeted from 98.0% to 38.8%. In contrast, the proposed visual planner not only maintained higher accuracy across all grid sizes but also exhibited a flatter performance curve. Similarly, VPRL was more stable than VPFT, maintaining an EM score of 97.6% on the 3×3 grid and still achieving 82.4% on the 6×6 grid, indicating good robustness.


© THE END

For reprinting, please contact this official account for authorization.

For submissions or coverage inquiries: liyazhou@jiqizhixin.com
