Official implementation for CVPR 2024 paper: Diff-BGM: A Diffusion Model for Video Background Music Generation
By Sizhe Li, Yiming Qin, Minghang Zheng, Xin Jin, Yang Liu.
- Thanks for the code structure from Polyffusion
pip install -r requirements.txt
pip install -e diffbgm
pip install -e diffbgm/mir_eval
- The extracted features of the POP909 dataset can be accessed here. Please put them under /data/ after extraction.
- The extracted features of the BGM909 dataset can be accessed here. Please put them under /data/bgm909/ after extraction. We use VideoCLIP to extract the video features, BLIP to generate the video captions, Bert-base-uncased as the language encoder, and TransNetV2 to detect the shots. We also provide the original captions here.
- The pre-trained models needed for training can be accessed here. Please put them under /pretrained/ after extraction. The split of the dataset can be found here.
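Before training, it may help to confirm that the extracted archives landed where the steps above expect them. The following is a minimal sketch, not part of this repository: it assumes the paths /data/, /data/bgm909/ and /pretrained/ are relative to the repository root (the text above does not state this explicitly), and `missing_dirs` is a hypothetical helper name.

```python
from pathlib import Path

# Directories named in the setup steps above. Treating them as relative
# to the repository root is an assumption.
EXPECTED_DIRS = ("data", "data/bgm909", "pretrained")


def missing_dirs(repo_root="."):
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All expected directories are present.")
```

Running the script from the repository root lists any directory that still needs to be created or populated.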
python diffbgm/main.py --model ldm_chd8bar --output_dir [output_dir]
Please use the following command to generate music for videos in BGM909.
python diffbgm/inference_sdf.py --model_dir=[model_dir] --uncond_scale=5.
To reproduce the metrics in our original paper, please refer to /diffbgm/test.ipynb.
Backbone | PCHE | GPS | SI | P@20 | Weights |
---|---|---|---|---|---|
Diff-BGM (original) | 2.840 | 0.601 | 0.521 | 44.10 | weights |
Diff-BGM (only visual) | 2.835 | 0.514 | 0.396 | 43.20 | weights |
Diff-BGM (w/o SAC-Att) | 2.721 | 0.789 | 0.523 | 38.47 | weights |
We provide our generation results here.
After generating a piece of music, you can use the following commands to generate a video.
sudo apt-get install ffmpeg fluidsynth
fluidsynth -i <SoundFont file> <midi file> -F <wav file>
ffmpeg -i <wav file> -b:a <bit rate> <mp3 file>
ffmpeg -i <video file> -i <mp3 file> -c:a aac -map 0:v:0 -map 1:a:0 <output file>
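The three commands above can also be chained from Python. This is a minimal sketch that assumes fluidsynth and ffmpeg are already installed and on PATH. The helper names `build_commands` and `render_video`, the intermediate file names, and the default bit rate of 192k are illustrative assumptions, not part of this repository.

```python
import subprocess


def build_commands(soundfont, midi, video, output,
                   wav="bgm.wav", mp3="bgm.mp3", bit_rate="192k"):
    """Build argument lists mirroring the three shell commands above.

    The intermediate file names and the bit rate are example defaults.
    """
    return [
        # Render the MIDI file to WAV using the given SoundFont.
        ["fluidsynth", "-i", soundfont, midi, "-F", wav],
        # Encode the WAV file as MP3 at the requested audio bit rate.
        ["ffmpeg", "-i", wav, "-b:a", bit_rate, mp3],
        # Take the video stream from the video file and the audio stream
        # from the MP3 file, encoding the audio as AAC.
        ["ffmpeg", "-i", video, "-i", mp3, "-c:a", "aac",
         "-map", "0:v:0", "-map", "1:a:0", output],
    ]


def render_video(soundfont, midi, video, output, **kwargs):
    """Run the commands in order; raises CalledProcessError on failure."""
    for cmd in build_commands(soundfont, midi, video, output, **kwargs):
        subprocess.run(cmd, check=True)
```

Passing argument lists rather than a shell string avoids quoting problems when file names contain spaces.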
See our demo!