Kaiming He's New Work: Adding Regularization to Diffusion Models Boosts Performance, with No Pre-training or Data Augmentation and a Simple Implementation

Diffusion models are gaining significant traction, and Kaiming He's latest paper is related to this field.

The research explores how to connect diffusion models with representation learning:

Adding a "tidying up" term to a diffusion model makes its internal features more orderly, which in turn yields more natural and realistic generated images.


Specifically, the paper introduces Dispersive Loss—a plug-and-play regularization method.

The core idea is to introduce an objective function, in addition to the model's standard regression loss (e.g., denoising), to regularize the model's intermediate representations.

This is somewhat similar to the repulsion effect in contrastive learning. However, compared to contrastive learning, its unique advantages include:

No need for positive sample pairs, avoiding the complexity of contrastive learning;

Highly versatile, can be directly applied to existing diffusion models without modifying the model structure;

Low overhead, adding almost no extra computation;

Compatible with the original loss, does not interfere with the diffusion model's original regression training objective, and is easy to integrate into existing frameworks.

Dispersing Intermediate Representations in the Latent Space

Let's look at the paper details.

Kaiming He and collaborator Runqian Wang's starting points are threefold:

Limitations of Diffusion Models

Diffusion models excel at generating complex data distributions, but their training typically relies on regression-based objective functions, lacking explicit regularization of intermediate representations.

Inspiration from Representation Learning

Representation learning, and contrastive learning in particular, can learn effective general-purpose representations by encouraging similar samples to stay close together and dissimilar samples to spread apart.

Contrastive learning has achieved success in tasks like classification and detection, but its potential in generative tasks has not been fully explored.

Shortcomings of Existing Methods

Existing methods like REPA (Representation Alignment) attempt to improve generative performance by aligning the intermediate representations of generative models with pre-trained representations, but they suffer from dependencies on external data, additional model parameters, and pre-training processes, which are costly and complex.

They began to consider how to draw lessons from contrastive self-supervised learning to encourage the intermediate representations of generative models to disperse in the latent space, thereby improving the model's generalization ability and generation quality.


Based on this core idea, they designed Dispersive Loss: it regularizes the model's intermediate representations by increasing their dispersion, so that they are distributed more uniformly in the latent space.

The difference from contrastive learning is that contrastive learning must first define positive sample pairs, typically via data augmentation, and then use a loss that pulls positive pairs together while pushing negative pairs apart.

Dispersive Loss, on the other hand, does not require defining positive sample pairs; it achieves regularization simply by encouraging dispersion among negative sample pairs.

For a batch of input samples X = {x_i}, the training objective with Dispersive Loss can be written as:

L(X) = E_{x_i ∈ X}[ L_Diff(x_i) ] + λ · L_Disp(X)

where L_Diff(x_i) is the standard diffusion loss for a single sample, L_Disp(X) is the dispersive loss term (i.e., the regularization term), and λ is the regularization strength that balances the diffusion loss against the dispersive loss.
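
As a concrete illustration, one dispersion-only term of this kind (an ℓ2-based, InfoNCE-style form; the exact expression and the temperature τ here are stated as an assumption rather than quoted from the paper) over the intermediate representations z_1, …, z_B of a batch of B samples is:

L_Disp(X) = log ( (1/B²) · Σ_{i,j} exp( −‖z_i − z_j‖² / τ ) )

Minimizing this term rewards larger pairwise distances ‖z_i − z_j‖, i.e., it keeps only the "repulsion" part of a contrastive loss and drops the positive-pair "attraction" part.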

As can be seen, the implementation of Dispersive Loss is very concise, requiring no extra sample pairs or complex operations, and can be directly applied to the model's intermediate layer representations.
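
To make this concrete, below is a minimal PyTorch-style sketch, not the authors' reference implementation: it assumes the ℓ2-based dispersive term above, a hypothetical model interface that also returns one intermediate activation (return_intermediate=True), and illustrative hyperparameters tau and lam.

```python
import math
import torch
import torch.nn.functional as F

def dispersive_loss(z: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Dispersion-only regularizer over a batch of intermediate representations."""
    z = z.flatten(start_dim=1)                        # (B, D): flatten each sample's features
    sq_dist = torch.cdist(z, z).pow(2)                # (B, B) pairwise squared l2 distances
    # log of the mean pairwise similarity exp(-d^2 / tau); decreases as features spread out
    return torch.logsumexp(-sq_dist.flatten() / tau, dim=0) - math.log(sq_dist.numel())

def training_step(model, x_noisy, t, target, lam: float = 0.5, tau: float = 0.5):
    """Standard denoising regression loss plus the plug-in dispersive term."""
    # Hypothetical interface: the model also returns one intermediate activation.
    pred, z_mid = model(x_noisy, t, return_intermediate=True)
    diff_loss = F.mse_loss(pred, target)              # the usual regression (denoising) objective
    disp_loss = dispersive_loss(z_mid, tau=tau)       # acts only on intermediate features
    return diff_loss + lam * disp_loss                # lam weighs regularization vs. regression
```

Since the regularizer only reads activations that are computed anyway and needs a single B×B distance matrix per batch, it adds essentially no cost on top of the regression objective.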

Furthermore, it supports not only single-layer application but also multi-layer stacking—theoretically, Dispersive Loss can be applied simultaneously across multiple intermediate layers to further enhance the dispersion of features at different levels.
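
If one wanted to try the multi-layer variant without touching the architecture, a sketch like the following (assuming a DiT/SiT-style PyTorch module that exposes its transformer blocks as model.blocks, with purely illustrative block indices, and reusing dispersive_loss from the sketch above) collects several intermediate activations via forward hooks and averages their dispersive terms:

```python
def multi_layer_dispersive_loss(model, x_noisy, t, block_indices=(4, 8, 12), tau: float = 0.5):
    """Average the dispersive term over several intermediate blocks, collected with forward hooks."""
    feats = {}
    handles = [
        model.blocks[i].register_forward_hook(
            lambda _mod, _inp, out, i=i: feats.__setitem__(i, out)
        )
        for i in block_indices
    ]
    try:
        pred = model(x_noisy, t)          # unchanged forward pass; no architectural modification
    finally:
        for h in handles:                 # always detach the hooks
            h.remove()
    disp = sum(dispersive_loss(feats[i], tau=tau) for i in block_indices) / len(block_indices)
    return pred, disp
```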

Experimental Results

The authors conducted extensive tests on ImageNet, using DiT and SiT as baseline models, across models of different scales.

The results show that Dispersive Loss improved generation quality across all models and settings. For example, on the SiT-B/2 model, FID dropped from 36.49 to 32.45.


Compared to the REPA method, Dispersive Loss relies on no pre-trained models or external data, yet delivers comparable generation quality.

On the SiT-XL/2 model, Dispersive Loss achieved an FID of 1.97, while REPA's FID was 1.80.


Additionally, both multi-step diffusion models and one-step generative models improved significantly when trained with Dispersive Loss.


The authors believe that Dispersive Loss has potential not only in image generation tasks but also in other tasks like image recognition.

Paper address: https://arxiv.org/abs/2506.09027v1

— End —




