This repository contains the code and supplementary materials for REGIS: Refining Generated Videos via Iterative Stylistic Redesigning. REGIS is a novel approach to enhancing the quality of videos generated from textual descriptions: it refines generated videos through an iterative process, improving video fidelity and reducing spatio-temporal noise.
In recent years, generative models have made impressive advancements towards realistic output; in particular, models working in the modalities of text and audio have reached a level of quality at which generated samples cannot be easily distinguished from real ones. Despite these revolutionary advancements, the synthesis of realistic and temporally consistent videos is still in its infancy. In this paper, we introduce a novel approach to the creation of realistic videos that focuses on improving the generated video in the latter steps of a video generation process. Specifically, we propose a framework for the iterative refinement of generated videos through repeated passes through a neural network trained to model the spatio-temporal dependencies found in real videos. Through our experiments, we demonstrate that our proposed approach significantly improves upon the generations of text-to-video models and achieves state-of-the-art results on the UCF-101 benchmark, removing the spatio-temporal artifacts and noise that make synthetic videos distinguishable from real videos. In addition, we discuss ways in which this framework might be augmented to achieve better performance.
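At its core, the framework amounts to feeding a generated clip back through a trained refinement network for a fixed number of passes. The following is a minimal sketch of that loop, assuming a trained refinement module and a clip tensor; `refiner` and `iterative_refine` are illustrative names, not the repository's actual API:

```python
import torch

@torch.no_grad()
def iterative_refine(video: torch.Tensor, refiner: torch.nn.Module, iterations: int = 2) -> torch.Tensor:
    """Repeatedly pass a generated clip through a refinement network.

    video: a (batch, frames, channels, height, width) tensor produced by a
    text-to-video model. `refiner` is a network trained to model the
    spatio-temporal dependencies of real videos; both names are
    illustrative placeholders, not this repository's actual API.
    """
    for _ in range(iterations):  # e.g. I=2 passes, as in the results below
        video = refiner(video)  # each pass removes residual artifacts/noise
    return video
```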
Before running the code, make sure you have the following prerequisites (a sample install command follows the list):
- Python 3.x
- PyTorch
- torchsummary
- OpenCV (opencv-python)
- tqdm
- imageio
- NumPy
- PEFT
- Matplotlib
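Assuming pip and a working Python 3 environment, the dependencies above can be installed with something like the following (package names are the common PyPI ones; pin versions as needed):

```bash
pip install torch torchsummary opencv-python tqdm imageio numpy peft matplotlib
```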
- Clone this repository to your local machine:

```bash
git clone https://github.com/Jaso1024/Refining-Generated-Videos.git
```

- Download the UCF101 dataset.
All training files are located in the Training folder, and models in the Models folder. To train one of the models, select a model and run the corresponding training file in the Training folder, for example as shown below.
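A hypothetical invocation is shown here; the script name is a placeholder, so check the Training folder for the actual file names:

```bash
# placeholder script name; substitute the actual file from Training/
python Training/train_regis_u.py
```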
The results obtained from the experiments are as follows, where I denotes the number of refinement iterations and lower FVD is better:

Method | FVD ↓ |
---|---|
VideoFusion (baseline) | 130.19 ± 6 |
REGIS-VQ, I=2 | 129.39 ± 6 |
REGIS-U, I=2 | 115.36 ± 6 |
REGIS-Fuse, I=8 | 125.32 ± 6 |
REGIS-U achieved the best performance, reducing the FVD to 115.36 ± 6, a noticeable improvement over the baseline VideoFusion model.
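For context, FVD (Fréchet Video Distance) is the Fréchet distance between Gaussians fit to features of real and generated clips, conventionally extracted with an I3D network. Below is a minimal sketch of the distance computation, assuming the feature matrices have already been extracted (the feature extractor itself is omitted):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two feature sets.

    feats_real, feats_fake: (num_clips, feature_dim) arrays of video
    features; FVD conventionally uses an I3D network as the extractor.
    """
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2.0 * covmean))
```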
REGIS-VQ demonstrates significant improvements in classes with large amounts of movement, such as "Pole Vault" and "High Jump." For example:
Class | VideoFusion FVD ↓ | REGIS-VQ FVD ↓ |
---|---|---|
Pole Vault | 370.06 | 76.684 |
High Jump | 259.67 | 114.53 |
Hammer Throw | 193.21 | 120.92 |
... | ... | ... |
On the "Pole Vault" class, REGIS-VQ reduced the FVD from 370.06 to 76.684, a reduction of roughly 79% relative to VideoFusion.
REGIS outperforms other text-to-video generation methods on the UCF101 dataset:
Method | Resolution (frames × height × width) | FVD ↓ |
---|---|---|
TGANv2 | 16 × 128 × 128 | 1209 |
TATS | 16 × 128 × 128 | 332 |
VideoFusion | 16 × 128 × 128 | 173 |
REGIS-Fuse | 16 × 128 × 128 | 141 |