Recently, Google's anonymously launched "nano-banana" (which is actually Gemini 2.5 Flash Image) topped the image-editing leaderboard on LMArena with a dominant score of 1362 points. Netizens marveled: "With a single natural-language prompt, you can put a model in a banana suit."
The head of Google AI Studio recently interviewed the team behind this project, revealing a key technical detail: the core of this model lies in its native multimodal capability.
So, a more fundamental question arises: When an MLLM hears "change the background to a blue sky with white clouds," at what layer does it truly "understand" the image, and at what layer does it decide "how to change it"? The answer is hidden in the newly released paper, "How Multimodal LLMs Solve Image Tasks."
The paper proposes a lightweight linear-probe framework that uses three carefully designed prompt variants to dissect the internal processing of LLaVA-1.5, LLaVA-Next-LLaMA-3, and Qwen2-VL layer by layer. It uncovers a surprisingly universal "four-stage" structure and shows that swapping the tokenizer, adding instruction data, or changing the pre-training corpus barely shifts this structure; what truly determines "which layer does what" is the underlying LLM architecture itself.
2. Method: Probing Each Layer with Three Prompt Variants
Lexical: "Does this image" → "Does this picture". Purpose: locate the layer where visual-text alignment occurs.
Semantic negation: "animal" → "plane" (the correct answer flips from yes to no). Purpose: locate the layer where the semantic decision begins to solidify.
Output format: answers "yes/no" → "1/0" (the answer's meaning is unchanged). Purpose: decouple the decision from the output format.
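As a concrete illustration, the three variants can be written as prompt strings built around the anchor question used in the experiments ("Does this image show an animal?", introduced below). The exact wording here is an assumption based on the examples above, not text taken from the paper.

```python
# Minimal sketch of the three probe prompt variants (illustrative wording).
PROMPT_VARIANTS = {
    # Baseline anchor question.
    "anchor":        "Does this image show an animal? Answer yes or no.",
    # Lexical: swap one surface word; the meaning is unchanged.
    "lexical":       "Does this picture show an animal? Answer yes or no.",
    # Semantic negation: flip the queried concept, so the correct answer flips.
    "semantic":      "Does this image show a plane? Answer yes or no.",
    # Output format: same decision, different answer tokens.
    "output_format": "Does this image show an animal? Answer 1 for yes or 0 for no.",
}
```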
Figure 2: A linear probe is trained for each layer; at inference time the probe is frozen and only the prompt is changed, so any accuracy change reflects that layer's sensitivity to the perturbation.
2.1 Experimental Setup
Data: the 120 fine-grained dog breeds from ImageNet (so the task is not trivially easy).
Anchor question: "Does this image show an animal?" The answer is always yes or no.
A linear classifier is independently trained for each layer to predict dog breed labels; the drop in accuracy is used to measure the layer's sensitivity to prompt perturbations.
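Below is a minimal sketch of this per-layer probing protocol, assuming a Hugging Face style model/processor pair that can return hidden states and using a plain scikit-learn logistic regression as the probe. Model-specific prompt templating (image placeholder tokens, chat formatting) is omitted; this illustrates the setup described above, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def layer_features(model, processor, images, prompt, layer):
    """Hidden state of one decoder layer at the last token position, per image."""
    feats = []
    for img in images:
        inputs = processor(images=img, text=prompt, return_tensors="pt")
        out = model(**inputs, output_hidden_states=True)
        # out.hidden_states[layer] has shape (batch, seq_len, hidden_dim).
        feats.append(out.hidden_states[layer][0, -1].detach().float().cpu().numpy())
    return np.stack(feats)

def probe_sensitivity(model, processor, train_imgs, train_y, test_imgs, test_y,
                      anchor_prompt, variant_prompt, layer):
    """Train a linear probe on anchor-prompt features for one layer, then
    freeze it and measure the accuracy drop when only the prompt changes."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(layer_features(model, processor, train_imgs, anchor_prompt, layer), train_y)
    acc_anchor  = probe.score(layer_features(model, processor, test_imgs, anchor_prompt,  layer), test_y)
    acc_variant = probe.score(layer_features(model, processor, test_imgs, variant_prompt, layer), test_y)
    return acc_anchor - acc_variant  # this layer's sensitivity to the perturbation
```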
3. The Four-Stage Pipeline at a Glance
3.1 The Typical Four Stages of LLaVA-1.5
Layers 1-4: Visual Grounding. Accuracy barely drops under any prompt change → pure visual encoding.
Layers 5-13: Lexical Integration. Swapping "image" → "picture" causes an immediate drop of roughly 40% → image-text fusion begins.
Layers 12-15: Semantic Reasoning. Semantic negation causes a sharp drop while the output-format variant stays high → the decision is being solidified.
Layers 16+: Answer Decoding. Changing the output format now hurts accuracy → the model is preparing the output tokens.
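Continuing the probe sketch above, these stage boundaries can be read off by sweeping probe_sensitivity over all layers for each variant and noting where each curve first rises; the 0.2 threshold below is purely illustrative.

```python
def first_sensitive_layer(drop_curve, threshold=0.2):
    """Index of the first layer whose accuracy drop exceeds the threshold."""
    for layer, drop in enumerate(drop_curve):
        if drop > threshold:
            return layer
    return None

# Reading the four stages off the three per-variant curves:
#   lexical curve rises   -> end of visual grounding, start of lexical integration
#   semantic curve rises  -> start of semantic reasoning
#   format curve rises    -> start of answer decoding
```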
3.2 Decoupling "Decision" and "Format"
Layers 12-15: High probe accuracy under both answer formats → the semantic answer itself is stored here.
From layer 16 onward: Switching the answer format causes an accuracy drop → the focus shifts from "what to say" to "how to say it."
4. What Determines the Pipeline? Architecture > Data > Tokenizer
The paper compares LLaVA-1.5, LLaVA-Next-LLaMA-3, and Qwen2-VL, controlling one variable at a time to see what actually shifts the pipeline.
4.1 Tokenizer, Instruction Data, Pre-training Corpora: Minimal Impact
Across these variations, the per-layer sensitivity curves almost overlap, indicating a stable four-stage structure.
4.2 Changing the Underlying LLM: Stages Remain, Layer Counts Shift
Changing to Qwen → fewer layers for visual grounding, more layers for semantic reasoning.
Grounding: LLaVA-1.5 layers 1-4; Qwen2-VL layer 1 (shorter).
Reasoning: LLaVA-1.5 layers 12-15; Qwen2-VL layers 10-20 (longer).
Decoding: LLaVA-1.5 layers 16+; Qwen2-VL layers 21-28.
Conclusion: the underlying LLM's architecture determines how many layers each stage occupies, but the four-stage logic itself remains unchanged.
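Because the probe lives outside the model, running the same sweep on a different MLLM only means swapping the checkpoint. A sketch, assuming the public Hugging Face checkpoints and model classes named below:

```python
from transformers import (AutoProcessor, LlavaForConditionalGeneration,
                          Qwen2VLForConditionalGeneration)

MODELS = {
    "LLaVA-1.5": ("llava-hf/llava-1.5-7b-hf", LlavaForConditionalGeneration),
    "Qwen2-VL":  ("Qwen/Qwen2-VL-7B-Instruct", Qwen2VLForConditionalGeneration),
}

for name, (ckpt, cls) in MODELS.items():
    processor = AutoProcessor.from_pretrained(ckpt)
    model = cls.from_pretrained(ckpt)
    # Re-run the per-layer probe sweep described above on this model. The four
    # stages should reappear in the same order, only stretched or compressed
    # in depth depending on the underlying LLM.
```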
5. Conclusion
Universal Four Stages: Grounding → Integration → Reasoning → Decoding.
Architecture Determines Depth: Changing LLaMA→Qwen is like "stretching" or "compressing" the same pipeline.
Lightweight Probe: Comparing different MLLMs requires no backpropagation through the model and no modification of the model itself.
Future work will extend this probe set to non-LLaVA architectures such as BLIP-2 and Chameleon, to verify whether the four stages are a "universal law."
Want to tune an MLLM? First, understand at what layer your underlying LLM "starts thinking," then discuss data and tokenizers!
Paper: How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding. https://arxiv.org/pdf/2508.20279