By Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, Baining Guo.
Abstract.
Denoising diffusion models have been a mainstream approach for image generation, however, training these models often suffers from slow convergence. In this paper, we discovered that the slow convergence is partly due to conflicting optimization directions between timesteps. To address this issue, we treat the diffusion training as a multi-task learning problem, and introduce a simple yet effective approach referred to as Min-SNR-$\gamma$. This method adapts loss weights of timesteps based on clamped signal-to-noise ratios, which effectively balances the conflicts among timesteps. Our results demonstrate a significant improvement in converging speed, 3.4x faster than previous weighting strategies. It is also more effective, achieving a new record FID score of 2.06 on the ImageNet 256x256 benchmark using smaller architectures than that employed in previous state-of-the-art.
-
12/2024 Adopted in PLAID (Protein Latent Induced Diffusion) for protein structure generation,
$\mathbf{v}$ -pred + Min-SNR. [Code] - 12/2024 Adopted in MuLan🌻 for multilingual diffusion models.
-
04/17/2024 Support
Limited Interval Guidance
for sampling on ImageNet-256 and improve the FID score from2.06
to1.57
. -
01/21/2024 A soft version
soft-min-snr
is proposed in HDiT - Adopted by DeciDiffusion-v1 and DeciDiffusion-v2
- The loss weight has been integrated into HuggingFace🤗 diffusers and k-diffusion!
For CelebA dataset, we follow ScoreSDE to process the data.
For ImageNet dataset, we download it from the official website. For ImageNet-64, we did not adopt offline pre-processing. For ImageNet-256, we cropped the images to 256x256 and compressed them using AutoencoderKL from Diffusers. The compressed latent codes are treated equally as images, except the file extension.
For training with ViT-B model, you should first put the downloaded/processed data above to some path, and set DATA_DIR
in the config file vit-b_layer12_lr1e-4_099_099_pred_x0__min_snr_5__fp16_bs8x32.sh
. Then you could run like
GPUS=8
BATCH_SIZE_PER_GPU=32
bash configs/in256/vit-b_layer12_lr1e-4_099_099_pred_x0__min_snr_5__fp16_bs8x32.sh $GPUS $BATCH_SIZE_PER_GPU
For sampling for ImageNet-256, you could directly run
bash configs/in256/inference.sh
Thanks to the sampling method from Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models, we achieve a new FID score of 1.57342
on the ImageNet 256x256 benchmark. You can run the following command
bash configs/in256/inference_limited_interval_guidance.sh
For sampling for ImageNet-64, you could directly run
bash configs/in64/inference.sh
Here we use 8 GPUs for sampling. You can change GPUS=8
to GPUS=1
for single GPU evaluation in configs/in256/inference.sh
The pre-trained models will be automatically downloaded and FID-50K will be calculated.
If you find our work useful for your research, please consider citing our paper. 😊
@InProceedings{Hang_2023_ICCV,
author = {Hang, Tiankai and Gu, Shuyang and Li, Chen and Bao, Jianmin and Chen, Dong and Hu, Han and Geng, Xin and Guo, Baining},
title = {Efficient Diffusion Training via Min-SNR Weighting Strategy},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2023},
pages = {7441-7451}
}
This repository is based on openai/guided-diffusion. We adopt the implementation for sampling and FID evaluation using NVlabs/edm.