Official codebase and pre-trained models for our DeepAVFusion framework, as described in the paper:
Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling
Shentong Mo, Pedro Morgado
IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
Our environment was created as follows:

```bash
conda create -n deepavfusion python=3.10
conda activate deepavfusion
conda install pytorch=2.0 torchvision=0.15 torchaudio=2.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install submitit hydra-core av wandb tqdm scipy scikit-image scikit-learn timm mir_eval jupyter matplotlib
```

Alternatively, simply run `conda env create -f requirements.yml` to replicate it.
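As an optional sanity check that the install worked (nothing here is specific to this repo):

```python
# Verify the pinned packages are importable and that CUDA is visible.
import torch, torchvision, torchaudio

print(torch.__version__, torchvision.__version__, torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
```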
In this work, we used a variety of datasets, including VGGSound, AudioSet, MUSIC and AVSBench. We assume that you have already downloaded all datasets; the expected data format is briefly described in DATASETS.md. The commands below refer to the dataset locations through the following variables:

```bash
PATH2VGGSOUND="/path/to/vggsound"
PATH2AUDIOSET="/path/to/audioset"
PATH2MUSIC="/path/to/music"
PATH2AVSBENCH="/path/to/avsbench"
```
We release two models based on the ViT-Base architecture, trained on the VGGSound and AudioSet datasets, respectively. The models were trained with the following commands.

```bash
# Pre-training on VGGSound
PYTHONPATH=. python launcher.py --config-name=deepavfusion job_name=deepavfusion_vitb_vggsound_ep\${opt.epochs} \
  data.dataset=vggsound data.data_path=${PATH2VGGSOUND} \
  model.fusion.layers=all model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 \
  opt.epochs=200 opt.warmup_epochs=40 opt.batch_size=64 opt.accum_iter=1 opt.blr=1.5e-4 \
  env.ngpu=8 env.world_size=1 env.seed=0

# Pre-training on AudioSet
PYTHONPATH=. python launcher.py --config-name=deepavfusion job_name=deepavfusion_vitb_as2m_ep\${opt.epochs} \
  data.dataset=audioset data.data_path=${PATH2AUDIOSET} \
  model.fusion.layers=all model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 \
  opt.epochs=200 opt.warmup_epochs=40 opt.batch_size=64 opt.accum_iter=4 opt.blr=1.5e-4 \
  env.ngpu=8 env.world_size=1 env.seed=0
```
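For reference, the effective batch size of a run is `opt.batch_size * opt.accum_iter * env.ngpu * env.world_size` (assuming `opt.batch_size` is per GPU). If `launcher.py` follows the common MAE-style rule of scaling the base learning rate by the effective batch size over 256 (an assumption, not a confirmed detail of this codebase), the actual rates work out as in this small sketch:

```python
# Hypothetical illustration of MAE-style learning-rate scaling; check launcher.py / the
# hydra config to confirm the exact rule used in this repo.
blr = 1.5e-4          # opt.blr
batch_size = 64       # opt.batch_size (per GPU)
accum_iter = 1        # opt.accum_iter
ngpu = 8              # env.ngpu
world_size = 1        # env.world_size

eff_batch_size = batch_size * accum_iter * ngpu * world_size   # 512 for the VGGSound run
lr = blr * eff_batch_size / 256                                 # 3.0e-4 under this convention
print(eff_batch_size, lr)
```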
The nearest neighbor training curve of the model trained on VGGSound can be seen below. The retrieval performance of fusion tokens is substantially better than uni-modal representations, suggesting that fusion tokens aggregate high-level semantics, while uni-modal representations encode the low-level details required for masked reconstruction.
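For context, the retrieval metric behind such a curve is a simple cosine nearest-neighbor lookup. The following is a minimal sketch, assuming fusion (or uni-modal) tokens are mean-pooled into one embedding per clip; it is not the exact evaluation code of this repo:

```python
# Minimal sketch of cosine nearest-neighbor retrieval accuracy between a query set
# and a gallery set of clip-level embeddings.
import torch
import torch.nn.functional as F

def knn_top1_accuracy(query_feats, query_labels, gallery_feats, gallery_labels):
    """Top-1 accuracy of cosine nearest-neighbor retrieval.

    query_feats/gallery_feats: (N, D) tensors; query_labels/gallery_labels: (N,) tensors.
    """
    q = F.normalize(query_feats, dim=-1)
    g = F.normalize(gallery_feats, dim=-1)
    sim = q @ g.t()                      # (num_query, num_gallery) cosine similarities
    nn_idx = sim.argmax(dim=1)           # index of the closest gallery clip
    return (gallery_labels[nn_idx] == query_labels).float().mean().item()
```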
The pre-trained models are available in the `checkpoints/` directory.
We evaluate our model on a variety of downstream tasks. In each case, the pre-trained model is used for feature extraction (with or without fine-tuning, depending on the evaluation protocol), and a task-specific decoder is trained from scratch to carry out the task.
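The snippet below sketches the difference between the two protocols for a classification task: the backbone is frozen for linear probing and left trainable for fine-tuning, and a head is trained from scratch either way. Names, paths, and checkpoint keys are illustrative assumptions, not the repo's actual API:

```python
# Illustrative sketch of the two evaluation protocols; `encoder` stands in for the
# pre-trained DeepAVFusion backbone and is assumed to output one feature vector per clip.
import torch
import torch.nn as nn

def build_eval_model(encoder, feat_dim, num_classes, linear_probe=True):
    if linear_probe:
        for p in encoder.parameters():
            p.requires_grad = False        # freeze the backbone for linear probing
        encoder.eval()
    head = nn.Linear(feat_dim, num_classes)   # task-specific head trained from scratch
    return nn.Sequential(encoder, head)

# Loading a released checkpoint (the file path and the "model" storage key are assumptions):
# ckpt = torch.load("checkpoints/deepavfusion_vitb_vggsound_ep200/checkpoint.pth", map_location="cpu")
# encoder.load_state_dict(ckpt.get("model", ckpt), strict=False)
```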
Dataset | Eval Protocol | Pre-trained Model | Top1 Acc | Command
---|---|---|---|---
VGGSound | Linear Probe | VGGSound-200ep | 53.08 | `PYTHONPATH=. python launcher.py --config-name=linprobe job_name=eval_linprobe_vggsound pretrain_job_name=deepavfusion_vitb_vggsound_ep200 model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 data.dataset=vggsound data.data_path=${PATH2VGGSOUND} opt.epochs=60 opt.warmup_epochs=10 opt.batch_size=64 opt.accum_iter=4 opt.blr=0.3 env.ngpu=4 env.world_size=1`
VGGSound | Linear Probe | AudioSet2M-200ep | 53.08 | `PYTHONPATH=. python launcher.py --config-name=linprobe job_name=eval_linprobe_vggsound pretrain_job_name=deepavfusion_vitb_as2m_ep200 model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 data.dataset=vggsound data.data_path=${PATH2VGGSOUND} opt.epochs=60 opt.warmup_epochs=10 opt.batch_size=64 opt.accum_iter=4 opt.blr=0.3 env.ngpu=4 env.world_size=1`
VGGSound | Fine-tuning | VGGSound-200ep | 58.19 | `PYTHONPATH=. python launcher.py --config-name=finetune job_name=eval_finetune_vggsound pretrain_job_name=deepavfusion_vitb_vggsound_ep200 model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 data.dataset=vggsound data.data_path=${PATH2VGGSOUND} opt.epochs=100 opt.warmup_epochs=20 opt.batch_size=32 opt.accum_iter=4 opt.blr=3e-4 env.ngpu=4 env.world_size=1`
VGGSound | Fine-tuning | AudioSet2M-200ep | 57.91 | `PYTHONPATH=. python launcher.py --config-name=finetune job_name=finetune_vggsound pretrain_job_name=deepavfusion_vitb_as2m_ep200 model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 data.dataset=vggsound data.data_path=${PATH2VGGSOUND} opt.epochs=100 opt.warmup_epochs=20 opt.batch_size=32 opt.accum_iter=4 opt.blr=3e-4 env.ngpu=4 env.world_size=1`
Dataset | Eval Protocol | Pre-trained Model | mAP | Command
---|---|---|---|---
AudioSet-Bal | Linear Probe | VGGSound-200ep | 53.08 | `PYTHONPATH=. python launcher.py --config-name=linprobe job_name=eval_linprobe_as2mbal pretrain_job_name=deepavfusion_vitb_vggsound_ep200 model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 data.dataset=audioset-bal-orig data.data_path=${PATH2AUDIOSET} opt.epochs=300 opt.warmup_epochs=20 opt.batch_size=256 opt.accum_iter=1 opt.blr=0.3 env.ngpu=2 env.world_size=1`
AudioSet-Bal | Linear Probe | AudioSet2M-200ep | 53.08 | `PYTHONPATH=. python launcher.py --config-name=linprobe job_name=eval_linprobe_as2mbal pretrain_job_name=deepavfusion_vitb_as2m_ep200 model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 data.dataset=audioset-bal-orig data.data_path=${PATH2AUDIOSET} opt.epochs=300 opt.warmup_epochs=20 opt.batch_size=256 opt.accum_iter=1 opt.blr=0.3 env.ngpu=2 env.world_size=1`
AudioSet-Bal | Fine-tuning | VGGSound-200ep | 58.19 | `PYTHONPATH=. python launcher.py --config-name=finetune job_name=eval_finetune_as2mbal pretrain_job_name=deepavfusion_vitb_vggsound_ep200 model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 data.dataset=audioset-bal-orig data.data_path=${PATH2AUDIOSET} opt.epochs=200 opt.warmup_epochs=20 opt.batch_size=32 opt.accum_iter=4 opt.blr=3e-4 env.ngpu=4 env.world_size=1`
AudioSet-Bal | Fine-tuning | AudioSet2M-200ep | 57.91 | `PYTHONPATH=. python launcher.py --config-name=finetune job_name=eval_finetune_as2mbal pretrain_job_name=deepavfusion_vitb_as2m_ep200 model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 data.dataset=audioset-bal-orig data.data_path=${PATH2AUDIOSET} opt.epochs=200 opt.warmup_epochs=20 opt.batch_size=32 opt.accum_iter=4 opt.blr=3e-4 env.ngpu=4 env.world_size=1`
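AudioSet evaluation is multi-label, so the numbers above are mean average precision rather than top-1 accuracy. A rough way to compute mAP from clip-level scores with scikit-learn (already in the requirements) is sketched below; the exact protocol used by the repo's evaluation scripts may differ:

```python
# Sketch of multi-label mean average precision from per-clip class scores.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(scores, targets):
    """scores, targets: (num_clips, num_classes) arrays; targets are binary labels."""
    aps = [average_precision_score(targets[:, c], scores[:, c])
           for c in range(targets.shape[1]) if targets[:, c].any()]  # skip empty classes
    return float(np.mean(aps))
```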
Dataset | Pre-training | SDR | SIR | SAR | Command
---|---|---|---|---|---
VGGSound-Music | VGGSound-200ep | 5.79 | 8.24 | 13.82 | `PYTHONPATH=. python launcher.py --config-name=avsrcsep job_name=eval_avsrcsep_vggsound_music pretrain_job_name=deepavfusion_vitb_vggsound_ep200 model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 data.dataset=vggsound_music data.data_path=${PATH2VGGSOUND} opt.epochs=300 opt.warmup_epochs=40 opt.batch_size=16 opt.accum_iter=8 opt.blr=3e-4 avss.log_freq=True avss.weighted_loss=True avss.binary_mask=False avss.num_mixtures=2 env.ngpu=4 env.world_size=1`
VGGSound-Music | AudioSet2M-200ep | 6.93 | 9.93 | 13.49 | `PYTHONPATH=. python launcher.py --config-name=avsrcsep job_name=eval_avsrcsep_vggsound_music pretrain_job_name=deepavfusion_vitb_as2m_ep200 model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 data.dataset=vggsound_music data.data_path=${PATH2VGGSOUND} opt.epochs=300 opt.warmup_epochs=40 opt.batch_size=16 opt.accum_iter=8 opt.blr=3e-4 avss.log_freq=True avss.weighted_loss=True avss.binary_mask=False avss.num_mixtures=2 env.ngpu=4 env.world_size=1`
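SDR, SIR and SAR are standard blind source separation metrics. A rough way to compute them with `mir_eval` (listed in the pip requirements) is sketched below; the `avsrcsep` evaluation in this repo may reconstruct waveforms from predicted masks before this step:

```python
# Sketch of SDR / SIR / SAR computation with mir_eval's BSS-eval implementation.
import numpy as np
import mir_eval

def separation_metrics(reference_sources, estimated_sources):
    """Both arrays have shape (num_sources, num_samples); returns per-mixture averages."""
    sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(
        np.asarray(reference_sources), np.asarray(estimated_sources))
    return sdr.mean(), sir.mean(), sar.mean()
```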
Dataset | Pre-training | mIoU | FScore | Command
---|---|---|---|---
AVSBench-S4 | VGGSound-200ep | 89.94 | 92.34 | `PYTHONPATH=. python launcher.py --config-name=avsegm job_name=eval_avsbench_s4 pretrain_job_name=deepavfusion_vitb_vggsound_ep200 model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 data.dataset=avsbench_s4 data.data_path=${PATH2AVSBENCH} opt.epochs=100 opt.warmup_epochs=20 opt.batch_size=16 opt.accum_iter=8 opt.blr=2e-4 env.ngpu=4 env.world_size=1`
AVSBench-S4 | AudioSet2M-200ep | 90.27 | 92.49 | `PYTHONPATH=. python launcher.py --config-name=avsegm job_name=eval_avsbench_s4 pretrain_job_name=deepavfusion_vitb_as2m_ep200 model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 data.dataset=avsbench_s4 data.data_path=${PATH2AVSBENCH} opt.epochs=100 opt.warmup_epochs=20 opt.batch_size=16 opt.accum_iter=8 opt.blr=2e-4 env.ngpu=4 env.world_size=1`
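The mIoU and F-score above are computed over predicted segmentation masks. A rough sketch is given below; the binarization threshold and the F-score weighting (beta^2 = 0.3 is common for AVSBench-style evaluation) are assumptions, and the repo's evaluation code is authoritative:

```python
# Sketch of mIoU and weighted F-score over binary segmentation masks.
import torch

def miou_and_fscore(pred, target, eps=1e-6, beta2=0.3):
    """pred, target: binary masks of shape (N, H, W)."""
    pred, target = pred.bool(), target.bool()
    inter = (pred & target).float().sum(dim=(1, 2))
    union = (pred | target).float().sum(dim=(1, 2))
    miou = (inter / (union + eps)).mean()
    precision = inter / (pred.float().sum(dim=(1, 2)) + eps)
    recall = inter / (target.float().sum(dim=(1, 2)) + eps)
    fscore = ((1 + beta2) * precision * recall / (beta2 * precision + recall + eps)).mean()
    return miou.item(), fscore.item()
```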
If you find this repository useful, please cite our paper:
```bibtex
@inproceedings{mo2024deepavfusion,
  title={Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling},
  author={Mo, Shentong and Morgado, Pedro},
  booktitle={Proceedings of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}
```