Diversity-Preserved Distribution Matching Distillation for Fast Visual Synthesis
Abstract
DP-DMD is a distillation framework that preserves sample diversity in text-to-image generation by separating the roles of the distilled steps: the first step uses a v-prediction objective to maintain diversity, while subsequent steps apply the standard DMD loss for quality refinement, at no additional computational overhead.
Distribution matching distillation (DMD) aligns a few-step generator with its multi-step teacher, enabling high-quality generation at low inference cost. However, DMD tends to suffer from mode collapse because its reverse-KL formulation inherently encourages mode-seeking behavior; existing remedies typically rely on perceptual or adversarial regularization, incurring substantial computational overhead and training instability. In this work, we propose a role-separated distillation framework that explicitly disentangles the roles of the distilled steps: the first step is dedicated to preserving sample diversity via a target-prediction (e.g., v-prediction) objective, while subsequent steps focus on quality refinement under the standard DMD loss, with gradients from the DMD objective blocked at the first step. We term this approach Diversity-Preserved DMD (DP-DMD). Despite its simplicity (no perceptual backbone, no discriminator, no auxiliary networks, and no additional ground-truth images), DP-DMD preserves sample diversity while maintaining visual quality on par with state-of-the-art methods in extensive text-to-image experiments.
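The role separation described above can be made concrete with a short training-loss sketch. The snippet below is a minimal PyTorch illustration, not the authors' implementation: `first_step`, `later_steps`, `dmd_loss`, and the cosine noise schedule are hypothetical stand-ins for the paper's actual components.

```python
# Minimal sketch of the role-separated DP-DMD objective (assumptions: PyTorch;
# first_step / later_steps / dmd_loss are hypothetical stand-ins).
import torch
import torch.nn.functional as F

def dp_dmd_loss(first_step, later_steps, dmd_loss, x0, t):
    """x0: clean images (B, C, H, W); t: diffusion timesteps in [0, 1]."""
    # Cosine-style schedule (assumption; the paper's schedule may differ).
    alpha = torch.cos(0.5 * torch.pi * t).view(-1, 1, 1, 1)
    sigma = torch.sin(0.5 * torch.pi * t).view(-1, 1, 1, 1)

    eps = torch.randn_like(x0)
    x_t = alpha * x0 + sigma * eps                 # noised input

    # Role 1: the first distilled step is trained only with a v-prediction
    # regression target, which preserves the diversity of the data mapping.
    v_pred = first_step(x_t, t)
    v_target = alpha * eps - sigma * x0            # standard v-prediction target
    loss_v = F.mse_loss(v_pred, v_target)

    # Recover the x0 estimate implied by the v-prediction.
    x_hat = alpha * x_t - sigma * v_pred

    # Role 2: subsequent steps refine quality under the standard DMD loss.
    # detach() blocks DMD gradients at the first step, so the mode-seeking
    # reverse-KL objective cannot collapse the diversity learned there.
    x_refined = later_steps(x_hat.detach())
    loss_dmd = dmd_loss(x_refined)

    return loss_v + loss_dmd
```

The key design choice is the `detach()` call: the first step sees only the regression gradient, so the reverse-KL term never pulls it toward dominant modes.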
Community
A simple yet effective approach to preserving sample diversity under DMD, with no perceptual backbone, no discriminator, no auxiliary networks, and no additional ground-truth images.
Reliable work.
This is an automated message from the Librarian Bot. I found the following similar papers, recommended by the Semantic Scholar API:
- GPD: Guided Progressive Distillation for Fast and High-Quality Video Generation (2026)
- Transition Matching Distillation for Fast Video Generation (2026)
- Fast, faithful and photorealistic diffusion-based image super-resolution with enhanced Flow Map models (2026)
- SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices (2026)
- DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation (2026)
- Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation (2025)
- Single-step Diffusion-based Video Coding with Semantic-Temporal Guidance (2025)