A PyTorch implementation of MuSE: Multi-modal target speaker extraction with visual cues
- A new version of this code is scheduled to be released in the ClearVoice repo.
- The dataset can be found here.
- `/data/voxceleb2-800`: scripts to preprocess the VoxCeleb2 dataset.
- `/pretrain_networks`: the visual front-end network.
- `/src`: the training scripts.
Download the pre-trained weights for the Visual Frontend and place the file in the `./pretrain_networks` folder using the following command:

```sh
wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1k0Zk90ASft89-xAEUbu5CmZWih_u_lRN' -O visual_frontend.pt
```
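After downloading, it can help to verify that the checkpoint loads cleanly. The snippet below is only a sanity-check sketch: it assumes the file stores a plain state dict of tensors (the usual `.pt` convention), and the actual keys depend on the visual front-end module in `/pretrain_networks`:

```python
import torch

# Load the downloaded weights on CPU and print a few parameter names.
# Assumes visual_frontend.pt holds a flat state dict of tensors.
state = torch.load("./pretrain_networks/visual_frontend.pt", map_location="cpu")
for name, tensor in list(state.items())[:5]:
    print(name, tuple(tensor.shape))
```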
- The pre-trained weights of the Visual Frontend were obtained from T. Afouras and J. Chung's Deep Audio-Visual Speech Recognition GitHub repository.
- The model is adapted from the Conv-TasNet GitHub repository; a minimal encoder/decoder sketch follows below.
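For orientation, the sketch below shows a Conv-TasNet-style encoder/masking/decoder skeleton in PyTorch. It is an illustrative assumption, not this repository's actual model: the class name and hyperparameters are made up here, and in MuSE the mask would be produced by a separator conditioned on the visual cues, which is omitted. See `/src` for the real implementation.

```python
import torch
import torch.nn as nn

class TasNetEncDec(nn.Module):
    """Illustrative Conv-TasNet-style skeleton: a learned 1-D conv encoder,
    elementwise masking, and a transposed-conv decoder. Hyperparameters are
    common defaults, not this repository's settings."""

    def __init__(self, n_filters=256, kernel_size=20, stride=10):
        super().__init__()
        self.encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride, bias=False)

    def forward(self, mixture, mask):
        # mixture: (batch, 1, samples); mask: (batch, n_filters, frames)
        feats = torch.relu(self.encoder(mixture))  # time-domain analysis
        return self.decoder(feats * mask)          # masked re-synthesis

# 1 s of 16 kHz audio -> (16000 - 20) // 10 + 1 = 1599 encoder frames.
model = TasNetEncDec()
mix = torch.randn(2, 1, 16000)
mask = torch.ones(2, 256, 1599)
print(model(mix, mask).shape)  # torch.Size([2, 1, 16000])
```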