Gemini Diffusion: 1500 tokens/sec, Lightning Fast!

Google unveils revolutionary text diffusion technology!

You might have missed it, but Google DeepMind announced a significant experimental model at I/O 2025 – Gemini Diffusion!

A brand new attempt to apply diffusion technology to text generation!

This could be a major technological breakthrough.

Diffusion models have already proven their powerful capabilities in image generation (e.g., Stable Diffusion, DALL-E), but applying them to pure text generation is a significant challenge to the traditional language model paradigm.

Why is it so fast?

Traditional autoregressive language models (like GPT-4, Claude) generate text by sequentially generating each token from left to right, similar to the human writing process.

In other words, each time the model produces one more token, it must take every token already generated to its left, feed the entire current sequence through the network, and only then predict the next token.
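
As a rough illustration, here is a minimal sketch of that left-to-right loop in Python. The `model` and `tokenizer` objects are hypothetical stand-ins, not any real Gemini or GPT API:

```python
# Minimal sketch of autoregressive decoding as described above.
# `model` and `tokenizer` are hypothetical stand-ins, not a real API.

def generate_autoregressive(model, tokenizer, prompt, max_new_tokens=64):
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        logits = model(tokens)             # re-run the network over all tokens so far
        next_token = logits[-1].argmax()   # predict only the single next token
        tokens.append(next_token)
        if next_token == tokenizer.eos_id:
            break
    return tokenizer.decode(tokens)
```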

Gemini Diffusion, on the other hand, adopts a completely different approach: instead of generating tokens one by one, it first initializes the entire text as "noise" and then, through multiple iterations, gradually "purifies" this noise, ultimately forming meaningful complete text.
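
A minimal sketch of this idea, assuming a hypothetical `denoiser` network (this is not Gemini Diffusion's actual implementation):

```python
# Sketch of diffusion-style generation: start from pure noise over the whole
# sequence and refine every position in parallel on each iteration.
# `denoiser` is a hypothetical model, not Gemini Diffusion's real code.

import numpy as np

def generate_diffusion(denoiser, seq_len=128, num_steps=12, vocab_size=32000):
    rng = np.random.default_rng(0)
    # Initialize the entire sequence as random "noise" tokens.
    tokens = rng.integers(0, vocab_size, size=seq_len)
    for step in range(num_steps):
        # Each iteration refines *all* positions at once,
        # conditioned on the current (still noisy) draft.
        logits = denoiser(tokens, step)    # shape: (seq_len, vocab_size)
        tokens = logits.argmax(axis=-1)    # replace the draft with a cleaner one
    return tokens
```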

This method brings significant performance improvements: official test data shows that Gemini Diffusion can generate approximately 1500 tokens per second!

That's a full 5 times faster than the existing Gemini 2.0 Flash-Lite model!

Core Capabilities

According to Google DeepMind's technical introduction, Gemini Diffusion possesses three key advantages:

Ultra-high response speed: significantly faster than Google's existing fastest models

Higher text coherence: capable of generating entire blocks of tokens at once, rather than one by one

Iterative self-correction: corrects errors during the generation process, ensuring output consistency

Particularly for tasks requiring high logical consistency and multiple verifications, such as programming and mathematics, diffusion models show clear advantages.

@amirkdev raised an interesting question:

"For coding, will it argue with itself about which bracket style is best?"

It's a humorous but insightful question: because generation happens in parallel, a diffusion model can globally optimize the entire code snippet over multiple refinement steps, which includes keeping the bracket and coding style consistent.

Comparable Performance, but Lightning Fast

It's worth noting that although Gemini Diffusion employs a completely new generation mechanism, its performance on standard benchmarks is quite close to that of Gemini 2.0 Flash-Lite:

| Benchmark | Gemini Diffusion | Gemini 2.0 Flash-Lite |
| --- | --- | --- |
| LiveCodeBench (v6) | 30.9% | 28.5% |
| BigCodeBench | 45.4% | 45.8% |
| HumanEval | 89.6% | 90.2% |
| AIME 2025 | 23.3% | 20.0% |

Note: Both perform similarly, but Gemini Diffusion has a speed advantage of up to 5 times!

Google has also published more detailed official benchmark results.

The data shows that Gemini Diffusion performs comparably to Gemini 2.0 Flash-Lite on most metrics, with a slight advantage in the AIME 2025 (mathematics) test.

Technical Principles Behind the Speed Breakthrough

@karthik_dulam also asked:

"Who can explain why diffusion language models can be an order of magnitude faster?"

So, why can diffusion models achieve an order-of-magnitude speed increase in text generation?

Based on available analysis, the speedup comes down to four core technical "acceleration mechanisms":

1. Parallel Decoding Architecture

Autoregressive models: Must generate tokens sequentially, with each subsequent token depending on the completion of the previous one.

Diffusion models: The entire sentence is processed simultaneously, with noise removal performed in parallel at all positions.

@itsArmanj wanted to understand the mechanics:

"Help me understand: if you ask a Transformer to calculate two times three, it will reason out 2*3=, and then the next token is 6. How does a diffusion model get 6 before forming 2*3?"

In fact, diffusion models do not rely on sequential reasoning but optimize the entire sequence over multiple iterations.

It first generates "candidate answers" containing noise, then, through a multi-step denoising process, ensures mathematical consistency across the entire expression and answer.
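
To make that intuition concrete, here is a deliberately simplified toy of my own (not Gemini internals): every refinement pass may rewrite any position, so the answer slot becomes consistent with the expression without strict left-to-right ordering.

```python
# Toy illustration only: iterative refinement over a whole draft,
# where the answer position is fixed up on a later pass.

target_expr = ["2", "*", "3", "="]
draft = ["?", "?", "?", "?", "?"]      # noisy initial draft of "2*3=6"

def refine(draft):
    new = list(draft)
    if "?" in new[:4]:
        new[:4] = target_expr                    # one pass settles the expression
    elif new[4] == "?":
        new[4] = str(int(new[0]) * int(new[2]))  # a later pass makes the answer consistent
    return new

for step in range(3):
    draft = refine(draft)
    print(f"step {step}: {''.join(draft)}")
# step 0: 2*3=?
# step 1: 2*3=6
# step 2: 2*3=6
```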

2. Adjustable Iteration Steps

Gemini Diffusion needs only about 12 iteration steps to produce high-quality text, whereas an autoregressive model must run 1,000 sequential decoding steps to generate a paragraph of 1,000 tokens.
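
A quick back-of-the-envelope comparison using those numbers (counting sequential steps only, not total compute):

```python
# Back-of-the-envelope comparison using the numbers quoted above
# (~12 denoising steps vs. one forward pass per generated token).

seq_len = 1000            # tokens in the paragraph
ar_steps = seq_len        # autoregressive: one sequential step per new token
diffusion_steps = 12      # diffusion: ~12 full-sequence denoising passes

print(f"autoregressive sequential steps: {ar_steps}")
print(f"diffusion sequential steps:      {diffusion_steps}")
print(f"ratio: ~{ar_steps / diffusion_steps:.0f}x fewer sequential steps")
# Caveat: each denoising pass covers the whole sequence, so total FLOPs are not
# reduced by the same factor; the win comes from replacing a long serial chain
# with a short one whose work parallelizes well on accelerators.
```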

3. Efficient Operator Fusion

Diffusion models use bidirectional rather than unidirectional (causal) attention, so there is no KV-cache to maintain, which makes them better suited to fully exploiting GPU/TPU parallel computing architectures.
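
The difference is easiest to see in the attention masks. A rough NumPy sketch (my own illustration, not Gemini code):

```python
import numpy as np

seq_len = 6

# Causal mask: position i attends only to positions <= i (lower-triangular),
# which is what makes incremental decoding and a KV-cache necessary.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Bidirectional mask: every position sees every other position, so each
# denoising pass is one big, cache-free batch of matrix multiplications.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)

print(causal_mask.astype(int))
print(bidirectional_mask.astype(int))
```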

@LeeLeepenkman observed:

"We're back to the diffuser and DIT block route. Previously, everyone was trying autoregressive image generation because 4oimage used that approach, but when you think deeply or actually try it, you find it quite slow. By scaling diffusion models extensively, we might be able to achieve this level of logic and text accuracy, just like achieving realistic lighting."

Through large-scale expansion, diffusion models may be able to achieve the same logical reasoning capabilities and accuracy as autoregressive models, while maintaining their significant speed advantage.

4. Computational Resource Optimization

Diffusion models only map the output to the vocabulary in the final step, significantly reducing computational overhead.
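
One plausible reading, in the spirit of continuous-space designs such as Diffusion-LM, is that denoising happens on embeddings and the vocabulary-sized projection is applied only once at the end. A hypothetical sketch with assumed shapes (not confirmed Gemini internals):

```python
# Hypothetical sketch: denoise in embedding space, project to the vocabulary
# only once at the end. Shapes and `denoise_step` are assumptions.

import numpy as np

seq_len, d_model, vocab_size, num_steps = 128, 512, 32000, 12
rng = np.random.default_rng(0)

x = rng.standard_normal((seq_len, d_model))           # noisy latent sequence
W_vocab = rng.standard_normal((d_model, vocab_size))  # output embedding matrix

def denoise_step(x, step):
    # Stand-in for a real denoising network; a toy shrink-toward-zero update.
    return 0.9 * x

for step in range(num_steps):
    x = denoise_step(x, step)         # no vocab-sized matmul inside the loop

logits = x @ W_vocab                  # single projection to the vocabulary
tokens = logits.argmax(axis=-1)       # final token ids, shape (seq_len,)
```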

Technological Route Comparison: The Paradigm Battle Between Diffusion and Autoregressive Models

| Dimension | Diffusion Language Model | Autoregressive Transformer |
| --- | --- | --- |
| Generation process | Parallel: initializes the entire sentence as noise, then iteratively denoises | Sequential: generates tokens one by one, in order |
| Latency | ~12 iteration steps, largely independent of sequence length | Grows linearly with sequence length |
| Controllability | Gradient-based optimization makes precise control easier | Relies mainly on RLHF and prompt engineering |
| Maturity | Experimental, still to be validated | Mature and widely deployed |

@TendiesOfWisdom offered an inspiring analogy:

"Alien writing in the sci-fi movie 'Arrival' = new diffusion language models? Their circular script conveys complete concepts at once; these models iterate in parallel to achieve coherence, abandoning the step-by-step token generation. Non-linear thinking meets the next wave of AI."

This analogy is quite apt: the circular alien script in 'Arrival' expresses a complete concept all at once, and diffusion language models likewise take a "non-linear" approach, generating whole passages in parallel.

Trend of Cross-Modal Unification

It's worth noting that Google is unifying diffusion technology across three major domains: text (Gemini Diffusion), image (Imagen 4), and video (Veo 3). This is clearly building a full-modal AI ecosystem based on diffusion technology.

Google has not yet released a detailed technical paper for Gemini Diffusion, only a simple product introduction link:

https://deepmind.google/models/gemini-diffusion/

That said, there is earlier work along this technical route, such as Diffusion-LM (Stanford, 2022) and d1 (UCLA & Meta, 2025).

Currently, Gemini Diffusion is only open for testing to a limited number of partners, but Google has opened a waitlist for researchers and developers to register.

I'm already on the waitlist, here's the link:

https://docs.google.com/forms/u/0/d/e/1FAIpQLSdsxa-YU25JIPJGmu-pySJEYeTy6lwbdZAzxlZ11x3GPj6DhA/formResponse

This time, Gemini Diffusion demonstrates not just a speed increase, but potentially a fundamental shift in the generation paradigm.

This will likely be an interesting experimental subject.

And with the application of diffusion models in text generation, we might be witnessing another revolutionary shift in AI generation technology.

On a related note, I use AI to collect AI news from across the web, then use AI to select, review, translate, and summarize it before publishing it on the "AGI Hunt" Knowledge Planet.

It's an AI news feed with only information and no emotion (not a recommendation feed, no course-selling, no lecturing, no telling you how to live your life, just information).

You're welcome to join, and also to join the group chat and connect with 2,000+ members.
