SKA-TDNN

This repository implements the paper "Frequency and Multi-Scale Selective Kernel Attention for Speaker Verification," to be published in Proc. IEEE SLT 2022. The code is based on voxceleb_trainer.

Dependencies

If you use an Anaconda virtual environment, create and activate it:

conda create -n ska-tdnn python=3.9 cudatoolkit=11.3
conda activate ska-tdnn

Install the dependencies:

pip3 install -r requirements.txt

Data Preparation

The VoxCeleb datasets are used for these experiments. Each line of the train list should contain a file path and its speaker identity, for instance:

id00012/21Uxsk56VDQ/00001.wav id00012
id00012/21Uxsk56VDQ/00002.wav id00012
...
id09272/u7VNkYraCw0/00026.wav id09272
id09272/u7VNkYraCw0/00027.wav id09272
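
As a rough illustration of this format, the sketch below parses train-list lines into (path, speaker) pairs and counts utterances per speaker. The function name `parse_train_list` is hypothetical and not part of this repository; the sample lines are taken from the listing above.

```python
from collections import Counter


def parse_train_list(lines):
    """Parse 'relative/path.wav speaker_id' lines into (path, speaker) pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        path, speaker = line.split()
        pairs.append((path, speaker))
    return pairs


sample = [
    "id00012/21Uxsk56VDQ/00001.wav id00012",
    "id00012/21Uxsk56VDQ/00002.wav id00012",
    "id09272/u7VNkYraCw0/00026.wav id09272",
]
pairs = parse_train_list(sample)
utts_per_speaker = Counter(speaker for _, speaker in pairs)
print(utts_per_speaker)
```

A check like this can catch malformed lines (for example, a missing speaker column) before a long training run starts.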

An example train list for VoxCeleb2 (train_vox2.txt) and the test lists for VoxCeleb1-O (veri_test2.txt), VoxCeleb1-E (list_test_all2), and VoxCeleb1-H (list_test_hard2) are available for download. You can also follow the instructions in the voxceleb_trainer repository to download and prepare the training, augmentation, and evaluation data.

For noise-addition augmentation, download the MUSAN noise corpus. After downloading and extracting the files, split the audio into short segments for faster random access with the following command:

python process_musan.py /path/to/dataset/MUSAN

where /path/to/dataset/MUSAN is your path to the MUSAN corpus.
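
To give an idea of what splitting into short segments involves, here is a minimal sketch that computes the sample ranges of fixed-length, overlapping windows. It is not taken from process_musan.py: the function name `segment_bounds`, the 16 kHz sampling rate, the 5-second segment length, and the 3-second hop are all assumptions chosen for illustration.

```python
def segment_bounds(num_samples, seg_len, stride):
    """Return (start, end) sample indices of fixed-length windows."""
    if num_samples < seg_len:
        return [(0, num_samples)]  # keep a short file as a single segment
    return [(s, s + seg_len)
            for s in range(0, num_samples - seg_len + 1, stride)]


SAMPLE_RATE = 16000        # assumed sampling rate
SEG_LEN = 5 * SAMPLE_RATE  # assumed 5-second segments
STRIDE = 3 * SAMPLE_RATE   # assumed 3-second hop between segment starts

# A 12-second recording yields three overlapping 5-second segments.
bounds = segment_bounds(12 * SAMPLE_RATE, SEG_LEN, STRIDE)
print(bounds)
```

Short segments mean that a random noise clip can be read from disk without loading a long recording first.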

For augmentation by convolution with simulated RIRs, download the Room Impulse Response and Noise Database.
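
The idea behind RIR augmentation can be sketched as follows: the clean waveform is convolved with a room impulse response so that it sounds as if it were recorded in that room. This is a plain-Python illustration under assumptions, not the repository's loader code; the function name `reverberate`, the unit-energy normalisation, and the truncation to the input length are choices made for this sketch. In practice a NumPy or SciPy convolution would be much faster.

```python
import math


def reverberate(audio, rir):
    """Convolve audio with a unit-energy RIR, keeping the input length."""
    energy = math.sqrt(sum(h * h for h in rir))
    rir = [h / energy for h in rir]  # normalise the impulse response energy
    out = [0.0] * len(audio)
    for n in range(len(audio)):
        acc = 0.0
        for k, h in enumerate(rir):
            if n - k < 0:
                break  # no samples before the start of the signal
            acc += h * audio[n - k]
        out[n] = acc
    return out


audio = [1.0, 0.0, 0.0, 0.0]  # toy signal: a unit impulse
rir = [3.0, 0.0, 4.0]         # toy impulse response with energy 5
reverbed = reverberate(audio, rir)
print(reverbed)
```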

Training

Distributed Data Parallel (DDP) training example: SKA_TDNN with vanilla cosine similarity (COS) evaluation every epoch,

CUDA_VISIBLE_DEVICES=0,1,2,3 python trainSpeakerNet.py \
        --max_frames 200 \
        --eval_frames 0 \
        --num_eval 1 \
        --num_spk 100 \
        --num_utt 2 \
        --augment True \
        --optimizer adamW \
        --scheduler cosine_annealing_warmup_restarts \
        --lr_t0 25 \
        --lr_tmul 1.0 \
        --lr_max 1e-3 \
        --lr_min 1e-8 \
        --lr_wstep 10 \
        --lr_gamma 0.5 \
        --margin 0.2 \
        --scale 30 \
        --num_class 5994 \
        --save_path ./save/ska_tdnn \
        --train_list ./list/train_vox2.txt \
        --test_list ./list/veri_test2.txt \
        --train_path /path/to/dataset/VoxCeleb2/dev/wav \
        --test_path /path/to/dataset/VoxCeleb1/test/wav \
        --musan_path /path/to/dataset/MUSAN/musan_split \
        --rir_path /path/to/dataset/RIRS_NOISES/simulated_rirs \
        --model SKA_TDNN \
        --port 8000 \
        --distributed
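
To visualise what a cosine-annealing schedule with warmup and restarts does with values like the ones in this command, here is a minimal sketch. The mapping of flags to parameters (`--lr_t0` as the first cycle length, `--lr_tmul` as the cycle-length multiplier, `--lr_wstep` as the number of warmup steps, `--lr_gamma` as the per-restart decay of the peak) is an assumption, as is the function name `lr_at`; the repository's scheduler may differ in detail, and whether a step is an epoch or an iteration is not specified here.

```python
import math


def lr_at(step, t0=25, t_mul=1.0, lr_max=1e-3, lr_min=1e-8,
          warmup=10, gamma=0.5):
    """Learning rate at `step` for cosine annealing with warmup restarts."""
    cycle, cycle_len, start = 0, t0, 0
    # Find the cycle that contains this step.
    while step >= start + cycle_len:
        start += cycle_len
        cycle_len = int(round(cycle_len * t_mul))
        cycle += 1
    peak = lr_max * gamma ** cycle  # peak shrinks by gamma after each restart
    pos = step - start              # position inside the current cycle
    if pos < warmup:
        # Linear warmup from lr_min up to the current peak.
        return lr_min + (peak - lr_min) * pos / warmup
    # Cosine decay from the peak back down to lr_min.
    progress = (pos - warmup) / (cycle_len - warmup)
    return lr_min + (peak - lr_min) * 0.5 * (1 + math.cos(math.pi * progress))


for step in (0, 10, 25, 35):
    print(step, lr_at(step))
```

Under these assumptions the rate climbs to `lr_max` over the first 10 steps, decays along a cosine curve until step 25, then restarts with half the previous peak.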

Evaluation

Evaluation example using vanilla cosine similarity (COS) on VoxCeleb1-O,

CUDA_VISIBLE_DEVICES=0,1,2,3 python trainSpeakerNet.py \
        --eval \
        --eval_frames 0 \
        --num_eval 1 \
        --initial_model ./save/ska_tdnn/model/your_model.model \
        --test_list ./list/veri_test2.txt \
        --test_path /path/to/dataset/VoxCeleb1/test/wav \
        --model SKA_TDNN \
        --port 8001 \
        --distributed
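
For reference, cosine-similarity scoring of a verification trial can be sketched as below. The embeddings are toy vectors and the function name `cosine_similarity` is hypothetical; in the actual pipeline the embeddings would come from the trained SKA_TDNN model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


enroll = [1.0, 2.0, 2.0]     # toy enrolment embedding
target = [2.0, 4.0, 4.0]     # same direction as the enrolment embedding
nontarget = [2.0, -1.0, 0.0]  # orthogonal to the enrolment embedding

print(cosine_similarity(enroll, target))
print(cosine_similarity(enroll, nontarget))
```

A higher score indicates that the two utterances are more likely to come from the same speaker; a threshold on this score gives the accept or reject decision.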

Evaluation example using Test Time Augmentation (TTA) on VoxCeleb1-E,

CUDA_VISIBLE_DEVICES=0,1,2,3 python trainSpeakerNet.py \
        --eval \
        --tta \
        --eval_frames 400 \
        --num_eval 10 \
        --initial_model ./save/ska_tdnn/model/your_model.model \
        --test_list ./list/list_test_all2 \
        --test_path /path/to/dataset/VoxCeleb1/all/wav \
        --model SKA_TDNN \
        --port 8002 \
        --distributed
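
One common way to realise test-time augmentation is to cut several fixed-length crops from each utterance, embed each crop, and average the scores over all crop pairs. The sketch below illustrates that idea only; whether this repository's `--tta` option averages scores or embeddings is not stated here, so score averaging is an assumption, and the function names `crop_starts` and `tta_score` are hypothetical.

```python
import math


def crop_starts(num_frames, crop_frames, num_crops):
    """Evenly spaced start offsets for fixed-length crops of an utterance."""
    if num_frames <= crop_frames or num_crops == 1:
        return [0]
    last = num_frames - crop_frames  # latest valid start offset
    return [round(i * last / (num_crops - 1)) for i in range(num_crops)]


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def tta_score(enroll_embs, test_embs):
    """Average cosine similarity over all enrolment/test crop pairs."""
    scores = [cosine(e, t) for e in enroll_embs for t in test_embs]
    return sum(scores) / len(scores)


starts = crop_starts(num_frames=1000, crop_frames=400, num_crops=4)
print(starts)

# Toy per-crop embeddings: two enrolment crops, one test crop.
score = tta_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
print(score)
```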

Evaluation example using Score Normalisation (SN) on VoxCeleb1-H,

CUDA_VISIBLE_DEVICES=0,1,2,3 python trainSpeakerNet.py \
        --eval \
        --score_norm \
        --type_coh utt \
        --top_coh_size 20000 \
        --eval_frames 0 \
        --num_eval 1 \
        --initial_model ./save/ska_tdnn/model/your_model.model \
        --train_list ./list/train_vox2.txt \
        --test_list ./list/list_test_hard2 \
        --train_path /path/to/dataset/VoxCeleb2/dev/wav \
        --test_path /path/to/dataset/VoxCeleb1/all/wav \
        --model SKA_TDNN \
        --port 8003 \
        --distributed
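
Score normalisation rescales a raw trial score using statistics of how each side of the trial scores against a cohort of other utterances. The sketch below shows adaptive symmetric normalisation with a top-N cohort as one plausible reading of the `--score_norm` and `--top_coh_size` options; the exact variant used by this repository is not stated here, so treat the formula, the function name `as_norm`, and the toy cohort scores as assumptions.

```python
import statistics


def as_norm(raw, enroll_cohort, test_cohort, top_n):
    """Adaptive symmetric score normalisation using the top-N cohort scores."""
    def top_stats(cohort_scores):
        # Keep only the N highest cohort scores (the closest cohort entries).
        top = sorted(cohort_scores, reverse=True)[:top_n]
        return statistics.mean(top), statistics.pstdev(top)

    mu_e, sd_e = top_stats(enroll_cohort)
    mu_t, sd_t = top_stats(test_cohort)
    # Average of the enrolment-side and test-side z-normalised scores.
    return 0.5 * ((raw - mu_e) / sd_e + (raw - mu_t) / sd_t)


# Toy scores of the enrolment and test embeddings against four cohort items.
enroll_cohort = [0.1, 0.3, 0.5, -0.2]
test_cohort = [0.2, 0.0, 0.4, -0.5]
normalised = as_norm(0.6, enroll_cohort, test_cohort, top_n=2)
print(normalised)
```

Here the cohort would be built from training utterances, which would explain why the command above also passes `--train_list` and `--train_path`.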

Citation

If you use this repository, please cite the following paper:

@inproceedings{mun2022frequency,
  title={Frequency and Multi-Scale Selective Kernel Attention for Speaker Verification},
  author={Mun, Sung Hwan and Jung, Jee-weon and Han, Min Hyun and Kim, Nam Soo},
  booktitle={Proc. IEEE SLT},
  year={2022}
}
