Official PyTorch implementation of "Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound". Keywords: Video-to-Audio Generation, Controllable Audio Generation, Multimodal Deep Learning.
To run end-to-end inference on your own videos without preprocessing the whole dataset, run `infer_by_video.py`.
This script executes preprocessing, Video2RMS inference, and RMS2Sound inference.
Each video should be at least 10 seconds long.
(text prompt)
```bash
python infer_by_video.py --video_dir ./dir/to/your/videos --prompt_type "text" \
    --prompt "A person hit metal with a wooden stick." "A person hit cup with a wooden stick." ... \
    --epoch 500 --video2rms_ckpt_dir ./dir/to/video2rms/ckpt --rms2sound_ckpt_dir ./dir/to/rms2sound/ckpt \
    --output_dir ./output_dir
```
(audio prompt)
```bash
python infer_by_video.py --video_dir ./dir_to_your_videos --prompt_type "audio" \
    --prompt ./path/to/audio_1.wav ./path/to/audio_2.wav ... \
    --epoch 500 --video2rms_ckpt_dir ./dir/to/video2rms/ckpt --rms2sound_ckpt_dir ./dir/to/rms2sound/ckpt \
    --output_dir ./output_dir
```
`video_dir` should look like this:
```
dir/to/your/videos/
├── your_video_1.mp4
├── your_video_2.mp4
└── ...
```
- Clone this repository.

  ```bash
  git clone --recurse-submodules https://github.com/jnwnlee/video-foley
  cd video-foley
  ```
- Create a new Conda environment.

  ```bash
  conda create -n v2s python=3.9.18
  conda activate v2s
  ```
- Install PyTorch and other dependencies.

  ```bash
  conda install pytorch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 pytorch-cuda=11.8 -c pytorch -c nvidia
  conda install ffmpeg=6.1.0 x264 -c conda-forge  # this may take long
  # sudo apt-get update -y && sudo apt-get install -y libgl1-mesa-glx  # if you don't have libGL
  pip install -r requirements.txt
  ```
- Initialize the RMS2Sound submodule (RMS-ControlNet).
  - If you didn't clone this repo with the `--recurse-submodules` option, run the following commands to initialize and update the submodule:

    ```bash
    # after executing:
    # git clone https://github.com/jnwnlee/video-foley
    # cd video-foley
    git submodule init
    git submodule update
    ```
  - Install requirements as follows:

    ```bash
    conda install lightning -c conda-forge
    pip install -e ./RMS_ControlNet_Inference
    pip install -e ./RMS_ControlNet_Inference/AudioLDMControlNetInfer/Model/AudioLdm
    pip install -e ./RMS_ControlNet_Inference/TorchJAEKWON
    ```
Refer to this repo for further details.
For the GreatestHits dataset, download the 'full-res videos and labels' from this website: Visually Indicated Sounds.
```bash
unzip ./data/vis-data.zip -d ./data/GreatestHits
```
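After unzipping, the videos and annotation files should sit under `./data/GreatestHits/vis-data/` (the directory later passed as `--annotation_dir`). A rough sketch of the expected layout; the file names below are assumptions and depend on the download:
```
./data/GreatestHits/
└── vis-data/
    ├── <recording_id>_denoised.mp4   # full-res video (assumed naming)
    ├── <recording_id>_times.txt      # onset/material annotations (assumed naming)
    └── ...
```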
Download the weights by cloning the repository on Hugging Face to the target path (here, `./ckpt`).
```bash
conda install git-lfs -c conda-forge
# (or) sudo apt-get install git-lfs
git clone https://huggingface.co/jnwnlee/video-foley ./ckpt
```
Or, manually download each file through the links provided by Hugging Face:
```bash
wget url/for/download/
```
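After downloading, `./ckpt` should contain the checkpoints referenced by the commands below. A rough sketch (file names are taken from this README; the exact directory layout of the Hugging Face repo may differ, e.g., separate subdirectories per model):
```
./ckpt/
├── opts.yml                                          # Video2RMS config
├── checkpoint_000500_Video2RMS.pt                    # Video2RMS weights
├── ControlNetstep300000.pth                          # RMS2Sound (RMS-ControlNet) weights
└── clap_music_speech_audioset_epoch_15_esc_89.98.pt  # CLAP weights (for evaluation)
```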
Run `data_preprocess.sh` to preprocess the data and extract RGB and optical flow features.
Notice: the script we provide to calculate optical flow is easy to run but resource-consuming and will take a long time. We strongly recommend referring to the [TSN repository][TSN] and their pre-built [docker image][TSN_docker] (our paper also uses this solution) to speed up optical flow extraction and to strictly reproduce the results.
(GreatestHits dataset)
```bash
source data_preprocess_GreatestHits.sh
```
Train Video2RMS from scratch. The results will be saved to `{save_dir}`.
(Change `config.py` before executing the script: e.g., `save_dir`, `rgb_feature_dir`, `flow_feature_dir`, `mel_dir`, etc.)
```bash
CUDA_VISIBLE_DEVICES=0 python train.py
```
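For reference, the values to edit might look roughly like the sketch below. The key names are the ones mentioned above; the nesting and placeholder paths are illustrative only, so check `config.py` and the generated `opts.yml` for the actual structure:
```yaml
# illustrative sketch only; see config.py / opts.yml in this repo for the real structure
save_dir: ./runs/video2rms_greatesthits   # where checkpoints and logs are written
data:
  rgb_feature_dir: ./path/to/rgb_features
  flow_feature_dir: ./path/to/flow_features
  mel_dir: ./path/to/mel_features
```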
In case the program stops unexpectedly, you can resume training from a checkpoint.
```bash
CUDA_VISIBLE_DEVICES=0 python train.py \
    -c path/to/.yml/file \
    train.checkpoint_path path/to/checkpoint/file
```
Evaluate Video2RMS on the GreatestHits test split.
The metrics mentioned in the paper will be saved as a CSV file (`./evaluate.csv`). (Check `config.log.loss.types` in `opts.yml`.)
The checkpoint directory (`-c`, `--ckpt_dir`) should contain both the `opts.yml` file (configuration) and the model weights.
```bash
CUDA_VISIBLE_DEVICES=0 python evaluate.py -c path/to/video2rms/ckpt -e 500 -a micro -o ./evaluate.csv
```
Generate audio from video, using Video2RMS and RMS2Sound.
You can choose either 'audio' or 'text' as the type of semantic prompt:
- audio: use the ground-truth audio in `config.data.audio_src_dir` as a prompt
- text: use text built from the annotations in `config.data.annotation_dir` as a prompt

Each checkpoint directory should contain the following:
- Video2RMS: config (`opts.yml`) and model weights (e.g., `checkpoint_000500_Video2RMS.pt`)
- RMS2Sound: model weights (`ControlNetstep300000.pth`)

Outputs (generated audio, and video paired with the generated audio) will be saved to `output_dir/audio` and `output_dir/video`, respectively.
```bash
CUDA_VISIBLE_DEVICES=0 python infer.py -v path/to/video2rms/ckpt -r path/to/rms2sound/ckpt \
    -e 500 -o path/to/output_dir -p "audio/text"
```
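The output directory should then look roughly like this (file names are illustrative, assuming outputs keep the source clip names):
```
path/to/output_dir/
├── audio/
│   ├── <clip_name>.wav
│   └── ...
└── video/
    ├── <clip_name>.mp4
    └── ...
```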
Evaluate the generated audio against the ground-truth audio.
This code computes only E-L1 and the CLAP score.
(For the CLAP score, please download the model weight here.)
(For FAD calculation, please refer to fadtk.)
Gather only audio files in `--generated_dir` and `--ground_truth_dir`, respectively.
The audio files in `--generated_dir` and `--ground_truth_dir` should have the same names (not necessary for FAD).
The calculated scores will be saved as a CSV file (`./eval_v2s_audio.csv`).
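For example, a layout that satisfies the matching-name requirement (file names are illustrative):
```
path/to/ground_truth_audio/
├── clip_0001.wav
├── clip_0002.wav
└── ...
path/to/generated_audio/
├── clip_0001.wav   # same file names as in ground_truth_dir
├── clip_0002.wav
└── ...
```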
```bash
CUDA_VISIBLE_DEVICES=0 python evaluate_v2s.py --el1 --clap \
    --clap_pretrained_path ./ckpt/clap_music_speech_audioset_epoch_15_esc_89.98.pt \
    --ground_truth_dir path/to/ground_truth_audio \
    --generated_dir path/to/generated_audio \
    --csv_path eval_v2s_audio.csv
# (for text prompt using annotations) --annotation_dir /mnt/GreatestHits/vis-data/
```
```bash
# If you don't use pytorch~=2.1.x, we recommend creating a new conda environment.
# conda create -n fadtk python=3.10 && conda activate fadtk
# conda install pytorch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 pytorch-cuda=11.8 -c pytorch -c nvidia  # example
# pip install chardet scipy==1.11.2
pip install git+https://github.com/DCASE2024-Task7-Sound-Scene-Synthesis/fadtk.git
fadtk panns-wavegram-logmel path/to/ground_truth_audio path/to/generated_audio fad_result.csv --inf
```
```bibtex
@article{video-foley,
  title={Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound},
  author={Lee, Junwon and Im, Jaekwon and Kim, Dabin and Nam, Juhan},
  journal={arXiv preprint arXiv:2408.11915},
  year={2024}
}
```