Unofficial PyTorch implementation of Google AI's VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking.
- Training took about 20 hours on an AWS p3.2xlarge (NVIDIA V100).
- Listen to audio samples on the webpage: http://swpark.me/voicefilter/
| Median SDR (dB) | Paper | Ours |
|---|---|---|
| before VoiceFilter | 2.5 | 1.9 |
| after VoiceFilter | 12.6 | 10.2 |
- SDR converged at about 10 dB, which is slightly lower than the paper's result.
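For readers unfamiliar with the metric, SDR (signal-to-distortion ratio) compares the energy of the clean target with the energy of the residual error, in dB. The sketch below uses the simplest definition, 10 * log10(||s||^2 / ||s - s_hat||^2), in plain Python. It is an illustration only: the numbers reported here may have been computed with a different tool, and `simple_sdr` and `median_sdr` are hypothetical names that do not exist in this repository.

```python
import math
import statistics


def simple_sdr(target, estimate):
    """SDR in dB: 10 * log10(||target||^2 / ||target - estimate||^2)."""
    if len(target) != len(estimate):
        raise ValueError("target and estimate must have the same length")
    signal_energy = sum(t * t for t in target)
    error_energy = sum((t - e) ** 2 for t, e in zip(target, estimate))
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    if signal_energy == 0.0:
        return -math.inf  # silent target, non-zero error
    return 10.0 * math.log10(signal_energy / error_energy)


def median_sdr(pairs):
    """Median SDR over an iterable of (target, estimate) waveform pairs."""
    return statistics.median(simple_sdr(t, e) for t, e in pairs)
```

An estimate whose error has 1/100 of the target's energy scores 20 dB under this definition; each additional 10 dB corresponds to a tenfold reduction in error energy.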
- Python and packages
This code was tested on Python 3.6 with PyTorch 1.0.1. Other packages can be installed by:
pip install -r requirements.txt
- Miscellaneous
ffmpeg-normalize is used for resampling and normalizing wav files. See README.md of ffmpeg-normalize for installation.
- Download LibriSpeech dataset
To replicate the VoiceFilter paper, get the LibriSpeech dataset at http://www.openslr.org/12/. train-clean-100.tar.gz (6.3G) contains speech of 252 speakers, and train-clean-360.tar.gz (23G) contains 922 speakers. You may use either, but the more speakers the dataset has, the better VoiceFilter will perform.
- Resample & Normalize wav files
First, extract the tar.gz file to the desired folder:
tar -xvzf train-clean-360.tar.gz
Next, copy utils/normalize-resample.sh to the root directory of the extracted data folder. Then:
vim normalize-resample.sh # set "N" to your number of CPU cores
chmod a+x normalize-resample.sh
./normalize-resample.sh # this may take a long time
- Edit config.yaml
cd config
cp default.yaml config.yaml
vim config.yaml
- Preprocess wav files
To speed up training, perform the STFT for each file before training:
python generator.py -c [config yaml] -d [data directory] -o [output directory] -p [processes to run]
This will create 100,000 (train) + 1,000 (test) data samples (about 160G).
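If preprocessing is launched from a script rather than typed by hand, the command can be assembled programmatically. This is a minimal sketch that assumes only the four flags shown above and defaults to one process per CPU core; `build_generator_cmd` is a hypothetical helper, not part of this repository, and the paths in the usage note are placeholders.

```python
import os
import sys


def build_generator_cmd(config, data_dir, out_dir, processes=None):
    """Build the argument list for generator.py using the flags from this README."""
    if processes is None:
        # Assumption: one worker per CPU core is a reasonable default.
        processes = os.cpu_count() or 1
    return [
        sys.executable, "generator.py",
        "-c", config,          # config yaml
        "-d", data_dir,        # data directory
        "-o", out_dir,         # output directory
        "-p", str(processes),  # processes to run
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)` from the repository root, for example with `build_generator_cmd("config/config.yaml", "/path/to/LibriSpeech", "/path/to/output")`. Check that the output location has room for roughly 160G first.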
- Get pretrained model for speaker recognition system
VoiceFilter uses a speaker recognition system (d-vector embeddings). A pretrained model for obtaining d-vector embeddings is provided here.
This model was trained on the VoxCeleb2 dataset, with utterances randomly fit to a length in the range [70, 90] frames. Tests were done with window 80 / hop 40 and showed an equal error rate of about 1%. The test data were taken from the first 8 speakers of the VoxCeleb1 test set, with 10 randomly selected utterances per speaker.
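To make "window 80 / hop 40" concrete, the sketch below lists where each 80-frame window would start in an utterance and averages the per-window embeddings into one utterance-level vector. Averaging the windows and dropping a trailing partial window are assumptions based on common d-vector practice, not something stated here; `window_starts` and `average_embedding` are hypothetical names.

```python
def window_starts(num_frames, window=80, hop=40):
    """Start indices of full windows of `window` frames, spaced `hop` frames apart."""
    if num_frames < window:
        return []  # assumption: utterances shorter than one window yield no windows
    return list(range(0, num_frames - window + 1, hop))


def average_embedding(window_embeddings):
    """Element-wise mean of equal-length per-window embedding vectors."""
    if not window_embeddings:
        raise ValueError("need at least one window embedding")
    count = len(window_embeddings)
    dim = len(window_embeddings[0])
    return [sum(vec[i] for vec in window_embeddings) / count for i in range(dim)]
```

With these defaults a 200-frame utterance gives windows starting at frames 0, 40, 80 and 120, so consecutive windows overlap by half.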
Update: Evaluation on VoxCeleb1 selected pair showed 7.4% EER.
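The equal error rate (EER) is the operating point where the false accept rate equals the false reject rate. The evaluation script behind the figures above is not shown here, so the following is only a plain-Python sketch of the metric: it sweeps every observed score as a threshold and reports the point where the two error rates are closest. `equal_error_rate` is a hypothetical name, and labels are assumed to be 1 for same-speaker pairs and 0 for different-speaker pairs.

```python
def equal_error_rate(scores, labels):
    """Approximate EER from similarity scores and 0/1 same-speaker labels."""
    genuine = [s for s, y in zip(scores, labels) if y == 1]
    impostor = [s for s, y in zip(scores, labels) if y == 0]
    if not genuine or not impostor:
        raise ValueError("need at least one genuine and one impostor pair")
    best_gap, eer = None, None
    for threshold in sorted(set(scores)):
        # A pair is accepted when its score is at or above the threshold.
        far = sum(s >= threshold for s in impostor) / len(impostor)
        frr = sum(s < threshold for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer
```

With few trial pairs the estimate is coarse, because the error rates can only move in steps of one pair. That is one reason a small hand-picked test set and a larger trial list can give quite different EER values.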
The model can be downloaded at this GDrive link.
- Run
After specifying train_dir and test_dir in config.yaml, run:
python trainer.py -c [config yaml] -e [path of embedder pt file] -m [name]
This will create chkpt/name and logs/name in the base directory (-b option, . by default).
- View tensorboardX
tensorboard --logdir ./logs
- Resuming from checkpoint
python trainer.py -c [config yaml] --checkpoint_path [chkpt/name/chkpt_{step}.pt] -e [path of embedder pt file] -m [name]
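When a run has saved many checkpoints, picking the most recent one by hand is error-prone. The sketch below scans a checkpoint directory for files named chkpt_{step}.pt and returns the one with the highest step. It assumes {step} is a plain integer, as the path pattern above suggests; `latest_checkpoint` is a hypothetical helper, not part of this repository.

```python
import os
import re

# Matches names like chkpt_12000.pt and captures the step number.
_CHKPT_RE = re.compile(r"^chkpt_(\d+)\.pt$")


def latest_checkpoint(chkpt_dir):
    """Return the path of the highest-step chkpt_{step}.pt file, or None if there is none."""
    best_step, best_path = -1, None
    for name in os.listdir(chkpt_dir):
        match = _CHKPT_RE.match(name)
        if match is None:
            continue  # ignore unrelated files
        step = int(match.group(1))
        if step > best_step:
            best_step, best_path = step, os.path.join(chkpt_dir, name)
    return best_path
```

The returned path can then be passed as the `--checkpoint_path` argument. Comparing the steps as integers matters: a plain string sort would rank chkpt_3000.pt after chkpt_20000.pt.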
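To separate a mixed wav file using a trained checkpoint and a reference wav file of the target speaker, run:
python inference.py -c [config yaml] -e [path of embedder pt file] --checkpoint_path [path of chkpt pt file] -m [path of mixed wav file] -r [path of reference wav file] -o [output directory]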
These are some of my personal ideas for improvement. If you have other ideas, don't hesitate to open an issue.
- Masks performed poorly on high-frequency channels.
- Training embedder system with linear-scale spectrogram instead of mel might improve this.
- Replace zero-padding with partial convolution.
- Try power-law compressed reconstruction error as loss function, instead of MSE.
- Tried power=0.3, but failed.
- Tried
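The power-law compressed loss mentioned in the list above compares spectrogram magnitudes after raising them to a power below 1, which gives quiet time-frequency bins more weight than plain MSE does. The sketch below assumes the magnitude-only form mean((|S|^p - |S_hat|^p)^2); the exact variant that was tried is not specified here. It is written in plain Python over a flat list of magnitudes so that it runs without dependencies, whereas training code would use tensor operations. `power_law_mse` is a hypothetical name.

```python
def power_law_mse(target_mag, estimate_mag, power=0.3):
    """Mean squared error between power-law compressed magnitudes."""
    if len(target_mag) != len(estimate_mag):
        raise ValueError("inputs must have the same length")
    if not target_mag:
        raise ValueError("inputs must not be empty")
    if any(v < 0 for v in target_mag) or any(v < 0 for v in estimate_mag):
        raise ValueError("magnitudes must be non-negative")
    total = 0.0
    for t, e in zip(target_mag, estimate_mag):
        # Compress both magnitudes before taking the squared difference.
        total += (t ** power - e ** power) ** 2
    return total / len(target_mag)
```

Setting `power=1.0` reduces this to ordinary MSE on magnitudes, which makes it easy to compare the two losses on the same data.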
Seungwon Park at MINDsLab
Apache License 2.0
This repository contains code adapted or copied from the following:
- utils/adabound.py from https://github.com/Luolc/AdaBound (Apache License 2.0)
- utils/audio.py from https://github.com/keithito/tacotron (MIT License)
- utils/hparams.py from https://github.com/HarryVolek/PyTorch_Speaker_Verification (No License specified)
- utils/normalize-resample.sh from https://unix.stackexchange.com/a/216475