# 10× inference speedup with high-compression autoencoder

The high computational cost of training video generation models arises from the large number of tokens and the dominance of attention computation. To further reduce training expenses, we explore training video generation models with high-compression autoencoders (Video DC-AEs). As shown in the comparison below, by switching to a Video DC-AE with a much higher downsampling ratio (4×32×32), we can afford to reduce the patch size to 1 and still achieve a 5.2× speedup in training throughput and a 10× speedup during inference:

*[Figure: opensorav2_speed — training and inference throughput comparison]*
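To see where the savings come from, here is a back-of-the-envelope token count in Python. The 4×32×32 ratio and patch size 1 come from the setup above; the baseline settings (a 4×8×8 autoencoder with patch size 2) and the 32-frame 256×256 input are illustrative assumptions, not measured repo configurations.

```python
# Back-of-the-envelope token count showing why a 4x32x32 Video DC-AE with
# patch size 1 cuts compute. Baseline settings (4x8x8 autoencoder, patch
# size 2) and the input shape are assumptions for illustration only.
import math

def num_tokens(frames, height, width, ds_t, ds_s, patch):
    """Latent tokens for a video after downsampling and patchification."""
    t = math.ceil(frames / ds_t)
    h = math.ceil(height / (ds_s * patch))
    w = math.ceil(width / (ds_s * patch))
    return t * h * w

frames, height, width = 32, 256, 256
baseline = num_tokens(frames, height, width, ds_t=4, ds_s=8, patch=2)   # assumed baseline
dc_ae = num_tokens(frames, height, width, ds_t=4, ds_s=32, patch=1)     # Video DC-AE

print(f"baseline tokens: {baseline}")        # 2048
print(f"Video DC-AE tokens: {dc_ae}")        # 512
print(f"token reduction: {baseline / dc_ae:.0f}x")  # 4x
print(f"attention FLOPs reduction (~quadratic): {(baseline / dc_ae) ** 2:.0f}x")  # 16x
```

Since attention cost grows roughly quadratically with sequence length, even a 4× token reduction translates into a much larger drop in attention FLOPs, which is consistent with the speedups reported above.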

Nevertheless, despite the advantage of drastically lower computation costs, challenges remain. For instance, a larger latent channel count slows down convergence. Our generation model, adapted with a 128-channel Video DC-AE for 25K iterations, reaches a loss of 0.5, compared to 0.1 for the model it was initialized from. While the fast video generation model underperforms the original, it still captures spatial-temporal relationships. We release this model to the research community for further exploration.

Check out more details in our report.

## Model Download

Download from 🤗 Hugging Face:

```bash
pip install "huggingface_hub[cli]"
huggingface-cli download hpcai-tech/Open-Sora-v2-Video-DC-AE --local-dir ./ckpts
```
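Alternatively, the same checkpoint can be fetched from Python with `huggingface_hub`'s `snapshot_download`; this sketch mirrors the CLI command above:

```python
# Equivalent to the huggingface-cli command above, via the Python API.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="hpcai-tech/Open-Sora-v2-Video-DC-AE",
    local_dir="./ckpts",  # same target directory as the CLI example
)
```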

## Inference

To run inference with our fast video generation model:

```bash
torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py configs/diffusion/inference/high_compression.py --prompt "The story of a robot's life in a cyberpunk setting."
```
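To generate several videos in one go, a simple (hypothetical) wrapper is to invoke the same command once per prompt; `prompts.txt`, with one prompt per line, is an assumed input file:

```python
# Sketch: batch inference by invoking the same CLI once per prompt.
# The torchrun command and flags mirror the single-prompt example above;
# prompts.txt (one prompt per line) is a hypothetical input file.
import subprocess

with open("prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

for prompt in prompts:
    subprocess.run(
        [
            "torchrun", "--nproc_per_node", "1", "--standalone",
            "scripts/diffusion/inference.py",
            "configs/diffusion/inference/high_compression.py",
            "--prompt", prompt,
        ],
        check=True,  # stop if any single generation fails
    )
```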

## Training

Follow this guide to prepare the dataset for training. Then you can train your own fast generation model with the following command:

```bash
torchrun --nproc_per_node 8 scripts/diffusion/train.py configs/diffusion/train/high_compression.py --dataset.data-path datasets/pexels_45k_necessary.csv
```
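Before launching the 8-GPU run above, a quick sanity check of the dataset CSV can save time. This is a minimal sketch; the `path` column name is a hypothetical placeholder, so consult the data-preparation guide for the actual schema:

```python
# Minimal dataset sanity check before training (sketch).
# Assumption: the CSV has a column of video file paths; the column name
# "path" is hypothetical -- check the data-preparation guide for the
# real schema.
import os
import pandas as pd

df = pd.read_csv("datasets/pexels_45k_necessary.csv")
print(f"{len(df)} rows, columns: {list(df.columns)}")

if "path" in df.columns:
    missing = [p for p in df["path"] if not os.path.exists(p)]
    print(f"{len(missing)} referenced files are missing")
```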