In recent years, large language models (LLMs) have demonstrated strong potential in multimodal tasks, but existing models still face significant challenges in architectural unification and post-training methods.
Traditional multimodal large models are typically built on autoregressive architectures, where the separation of the text and image generation pipelines limits cross-modal synergy and makes complex reasoning tasks difficult to optimize effectively during the post-training phase.
DeepMind's recently released Gemini Diffusion was the first to use a diffusion model as the backbone for text modeling, achieving breakthrough performance on general reasoning and generation tasks and validating the potential of diffusion models for text modeling.
Against this backdrop, research teams from Princeton University, ByteDance Seed, Peking University, Tsinghua University, and others collaborated to propose MMaDA (Multimodal Large Diffusion Language Models). As the first systematic exploration of diffusion architecture for multimodal foundation models, MMaDA has successfully achieved unified modeling of text reasoning, multimodal understanding, and image generation through three core technological breakthroughs.
Paper Title: MMaDA: Multimodal Large Diffusion Language Models
Paper Link: https://arxiv.org/abs/2505.15809
Code Repository: https://github.com/Gen-Verse/MMaDA
Model Address: https://huggingface.co/Gen-Verse/MMaDA-8B-Base
Demo Address: https://huggingface.co/spaces/Gen-Verse/MMaDA
The team has open-sourced the training and inference code, the MMaDA-8B-Base weights, and an online demo, and will subsequently release the MMaDA-8B-MixCoT and MMaDA-8B-Max weights.
Performance and Cross-Task Synergy
MMaDA achieves SOTA performance in three major tasks:
Text Reasoning: MMLU accuracy of 68.4%, surpassing LLaMA-3-8B, Qwen2-7B, and LLaDA-8B. No existing unified understanding-and-generation model supports strong text reasoning; MMaDA is the first to preserve text modeling capability within multimodal tasks, realizing a truly unified foundation model.
Multimodal Understanding: On par with specialized models like LLaVA and Qwen-VL on benchmarks such as POPE (86.1 vs 85.9) and VQAv2 (76.7 vs 78.5);
Image Generation: a CLIP Score of 32.46, a significant improvement over models such as SDXL and Janus, and a 56% accuracy gain on the cultural-knowledge generation benchmark WISE. This is also the first comparison of unified multimodal large models on text-to-image tasks that require world knowledge, as shown below:
Cross-Task Synergy Effects
As shown in the figure below, during the mixed training phase (130K-200K steps), both text reasoning and image generation metrics simultaneously improved. For example, the model significantly improved its ability to solve complex geometric problems and the semantic accuracy of generated images, demonstrating the multi-task synergy achieved by using diffusion models as a unified architecture.
Task Generalization
A significant advantage of diffusion models is their ability to generalize to inpainting and extrapolation tasks without additional fine-tuning. MMaDA supports three types of cross-modal completion tasks:
Text Completion: Predicting missing segments in a text sequence.
Visual Question Answering Completion: Generating complete answers based on incomplete image-text inputs.
Image Completion: Reconstructing a complete image based on local visual cues.
These cases fully demonstrate the flexibility and generalization capabilities of the unified diffusion architecture in complex generation and reasoning tasks.
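All three completion settings reduce to the same masked-token denoising interface: the known text or image tokens are kept, the positions to be filled are replaced with a mask token, and the model iteratively unmasks them. Below is a minimal sketch of how such an input could be assembled; the mask-token ID and the helper function are hypothetical placeholders, not MMaDA's actual API.

```python
# Hypothetical sketch: assembling a unified token sequence for a completion task.
# MASK_ID and build_completion_input are placeholders, not MMaDA's API.
MASK_ID = 126336  # stand-in ID for the [MASK] token

def build_completion_input(known_text_ids, num_image_tokens=1024, known_image_ids=None):
    """Concatenate known text tokens with image-token slots.

    Positions to be generated are filled with MASK_ID, so the same diffusion
    model can perform text completion, visual question answering completion,
    or image completion by iteratively predicting the masked positions."""
    image_ids = [MASK_ID] * num_image_tokens
    if known_image_ids is not None:
        # Image completion: keep the visible patch tokens, mask the rest.
        for pos, tok in known_image_ids.items():
            image_ids[pos] = tok
    return list(known_text_ids) + image_ids

# Example: the text prompt is fully known and all 1024 image tokens are masked;
# plain text-to-image generation is just this fully masked special case.
seq = build_completion_input(known_text_ids=[101, 2054, 2003])
```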
Key Technical Analysis
The training and testing framework is as follows:
Unified Diffusion Architecture
MMaDA's core architectural breakthrough lies in unifying text and image generation processes within a diffusion framework:
Data Representation: Text is encoded with the LLaMA tokenizer, and images with the MAGVIT-v2 tokenizer, which converts a 512×512 image into 1024 discrete tokens;
Diffusion Objective: A unified masked-token prediction loss jointly optimizes semantic recovery for text and images through random masking. During pre-training, the model predicts the missing content from a partially masked token sequence, whether the input is a text passage or an image block (a minimal sketch of this objective appears below).
This design eliminates the complexity of traditional hybrid architectures (e.g., AR+Diffusion), enabling the model to achieve cross-modal information interaction at a fundamental level.
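Below is a minimal PyTorch-style sketch of such a unified masked-token prediction objective over a shared discrete vocabulary. The per-sequence masking ratio, the 1/t reweighting, and the function signature are illustrative assumptions in the spirit of masked diffusion language modeling, not the exact MMaDA implementation.

```python
import torch
import torch.nn.functional as F

def unified_mask_prediction_loss(model, tokens, mask_id):
    """Illustrative unified masked-prediction loss (not the exact MMaDA code).

    `tokens` holds batches of discrete sequences in which text tokens and image
    tokens share one vocabulary, so the same loss covers both modalities."""
    b, n = tokens.shape
    # Sample a masking ratio t per sequence and mask each token with probability t.
    t = torch.rand(b, 1, device=tokens.device).clamp(min=1e-3)
    is_masked = torch.rand(b, n, device=tokens.device) < t
    noised = torch.where(is_masked, torch.full_like(tokens, mask_id), tokens)

    logits = model(noised)                                   # (b, n, vocab_size)
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)

    # Only masked positions contribute, reweighted by 1/t as in masked
    # diffusion language modeling objectives.
    per_seq = (is_masked * token_logp / t).sum(dim=1) / n
    return -per_seq.mean()
```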
Mixed Long-CoT Finetuning
To address the cold-start problem in complex tasks, MMaDA proposes a cross-modal mixed CoT fine-tuning strategy:
Unified Reasoning Format: A special tag structure <think>reasoning process</think> forces the model to output cross-modal reasoning steps before generating an answer. For example, when handling a geometry problem, the model must first parse the geometric relationships before performing numerical calculations (an illustrative sample in this format follows this list);
Data Augmentation: LLMs/VLMs are used to generate high-quality reasoning trajectories, and a validator screens for logically rigorous samples. Gains in textual mathematical reasoning directly improve the factual consistency of image generation (e.g., correctly generating "the largest terrestrial carnivore in the Arctic", the polar bear).
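For illustration, mixed Long-CoT training samples might look like the following. The field names and wording are assumptions made for this sketch; only the <think>...</think> tag structure is taken from the description above.

```python
# Illustrative mixed Long-CoT samples sharing one <think>...</think> format.
# Field names and wording are assumptions; only the tag structure follows the text above.
text_reasoning_sample = {
    "prompt": "In triangle ABC, AB = AC and angle A = 40 degrees. Find angle B.",
    "response": "<think>AB = AC makes the triangle isosceles, so angle B = angle C. "
                "The angles sum to 180, so 2*B = 180 - 40 = 140.</think> Angle B = 70 degrees.",
}

text_to_image_sample = {
    "prompt": "Generate an image of the largest terrestrial carnivore in the Arctic.",
    "response": "<think>The largest land carnivore in the Arctic is the polar bear, "
                "so the image should show a polar bear on sea ice.</think> <image tokens>",
}
```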
Unified Policy Gradient Optimization (UniGRPO Algorithm)
To address three major challenges in reinforcement learning for diffusion models (local mask dependency, mask-ratio sensitivity, and non-autoregressive generation), MMaDA proposes the following solutions:
Structured Noise Policy: Randomly sampling mask ratios for the answer part (e.g., 30%-70%), while keeping the question part complete. This design simulates a multi-step denoising process, avoiding the single-step prediction bias caused by full masking in previous methods (e.g., d1);
Diversified Reward Modeling: Composite reward functions are designed for different tasks. In image generation, for example, a CLIP reward measures text-image alignment and an Image Reward term reflects human aesthetic preferences, each weighted by a coefficient of 0.1 (see the sketch after this list).
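A rough sketch of these two components follows: the question tokens stay clean while the answer tokens are masked at a randomly sampled ratio, and the image-generation reward combines the CLIP and aesthetic terms with the 0.1 weights mentioned above. The function names, mask-token ID, and exact reward composition are assumptions, not the released UniGRPO code.

```python
import torch

MASK_ID = 126336  # placeholder mask-token ID, not MMaDA's actual value

def structured_noising(question_ids, answer_ids, low=0.3, high=0.7):
    """Keep the question intact; mask a random fraction of the answer tokens.

    Sampling the mask ratio per rollout exposes the policy to many denoising
    steps instead of only the fully masked, single-step case."""
    ratio = torch.empty(1).uniform_(low, high).item()
    mask = torch.rand(answer_ids.shape) < ratio
    noised_answer = torch.where(mask, torch.full_like(answer_ids, MASK_ID), answer_ids)
    return torch.cat([question_ids, noised_answer]), mask

def image_generation_reward(clip_score, image_reward):
    # Composite reward: alignment and aesthetic terms both scaled by 0.1,
    # mirroring the weighting described above (exact formula assumed).
    return 0.1 * clip_score + 0.1 * image_reward
```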
As shown in the figure below, UniGRPO consistently increased reward values during GSM8K training, achieving a 40% faster convergence rate compared to baseline methods. This is attributed to UniGRPO's full adaptation to the multi-step generation characteristics of diffusion models.
About the Main Authors
Ling Yang: Research Fellow at Princeton University, Ph.D. from Peking University, whose research focuses on large language models, diffusion models, and reinforcement learning.
Ye Tian: Ph.D. student at Peking University's School of Artificial Intelligence, whose research focuses on diffusion models, unified models, and reinforcement learning.
Ke Shen: AI Researcher at ByteDance Seed's large model team, whose research focuses on large language model pre-training and unified learning paradigms.
Yunhai Tong: Professor at Peking University's School of Artificial Intelligence, whose research areas include multimodal large models, image/video generation, and editing.
Mengdi Wang: Endowed Professor in the Department of Electrical and Computer Engineering at Princeton University, where she founded and serves as the inaugural director of the Princeton University "AI for Accelerated Invention" center. Her research areas include reinforcement learning, controllable large models, optimization learning theory, and AI for Science, among others.