Skip to content

A Repository for Single- and Multi-modal Speaker Verification, Speaker Recognition and Speaker Diarization

License

Notifications You must be signed in to change notification settings

modelscope/3D-Speaker

Repository files navigation



license

3D-Speaker is an open-source toolkit for single- and multi-modal speaker verification, speaker recognition, and speaker diarization. All pretrained models are accessible on ModelScope. Furthermore, we present a large-scale speech corpus also called 3D-Speaker-Dataset to facilitate the research of speech representation disentanglement.

Benchmark

The EER results on VoxCeleb, CNCeleb and 3D-Speaker datasets for fully-supervised speaker verification.

Model Params VoxCeleb1-O CNCeleb 3D-Speaker
Res2Net 4.03 M 1.56% 7.96% 8.03%
ResNet34 6.34 M 1.05% 6.92% 7.29%
ECAPA-TDNN 20.8 M 0.86% 8.01% 8.87%
ERes2Net-base 6.61 M 0.84% 6.69% 7.21%
CAM++ 7.2 M 0.65% 6.78% 7.75%
ERes2NetV2 17.8M 0.61% 6.14% 6.52%
ERes2Net-large 22.46 M 0.52% 6.17% 6.34%

The DER results on public and internal multi-speaker datasets for speaker diarization.

Test DER pyannote.audio DiariZen_WavLM
Aishell-4 10.30% 12.2% 11.7%
Alimeeting 19.73% 24.4% 17.6%
AMI_SDM 21.76% 22.4% 15.4%
VoxConverse 11.75% 11.3% 28.39%
Meeting-CN_ZH-1 18.91% 22.37% 32.66%
Meeting-CN_ZH-2 12.78% 17.86% 18%

Quickstart

Install 3D-Speaker

git clone https://github.com/modelscope/3D-Speaker.git && cd 3D-Speaker
conda create -n 3D-Speaker python=3.8
conda activate 3D-Speaker
pip install -r requirements.txt

Running experiments

# Speaker verification: ERes2NetV2 on 3D-Speaker dataset
cd egs/3dspeaker/sv-eres2netv2/
bash run.sh
# Speaker verification: CAM++ on 3D-Speaker dataset
cd egs/3dspeaker/sv-cam++/
bash run.sh
# Speaker verification: ECAPA-TDNN on 3D-Speaker dataset
cd egs/3dspeaker/sv-ecapa/
bash run.sh
# Self-supervised speaker verification: SDPN on VoxCeleb dataset
cd egs/voxceleb/sv-sdpn/
bash run.sh
# Audio and multimodal Speaker diarization:
cd egs/3dspeaker/speaker-diarization/
bash run_audio.sh
bash run_video.sh
# Language identification
cd egs/3dspeaker/language-idenitfication
bash run.sh

Inference using pretrained models from Modelscope

All pretrained models are released on Modelscope.

# Install modelscope
pip install modelscope
# ERes2Net trained on 200k labeled speakers
model_id=iic/speech_eres2net_sv_zh-cn_16k-common
# ERes2NetV2 trained on 200k labeled speakers
model_id=iic/speech_eres2netv2_sv_zh-cn_16k-common
# CAM++ trained on 200k labeled speakers
model_id=iic/speech_campplus_sv_zh-cn_16k-common
# Run CAM++ or ERes2Net inference
python speakerlab/bin/infer_sv.py --model_id $model_id
# Run batch inference
python speakerlab/bin/infer_sv_batch.py --model_id $model_id --wavs $wav_list

# SDPN trained on VoxCeleb
model_id=iic/speech_sdpn_ecapa_tdnn_sv_en_voxceleb_16k
# Run SDPN inference
python speakerlab/bin/infer_sv_ssl.py --model_id $model_id

# Run diarization inference
python speakerlab/bin/infer_diarization.py --wav [wav_list OR wav_path] --out_dir $out_dir
# Enable overlap detection
python speakerlab/bin/infer_diarization.py --wav [wav_list OR wav_path] --out_dir $out_dir --include_overlap --hf_access_token $hf_access_token

Overview of Content

What‘s new 🔥

Contact

If you have any comment or question about 3D-Speaker, please contact us by

  • email: {yfchen97, wanghuii}@mail.ustc.edu.cn, {dengchong.d, zsq174630, shuli.cly}@alibaba-inc.com

License

3D-Speaker is released under the Apache License 2.0.

Acknowledge

3D-Speaker contains third-party components and code modified from some open-source repos, including:
Speechbrain, Wespeaker, D-TDNN, DINO, Vicreg, TalkNet-ASD , Ultra-Light-Fast-Generic-Face-Detector-1MB, pyannote.audio

Citations

If you find this repository useful, please consider giving a star ⭐ and citation 🦖:

@article{chen20243d,
  title={3D-Speaker-Toolkit: An Open Source Toolkit for Multi-modal Speaker Verification and Diarization},
  author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and others},
  booktitle={ICASSP},
  year={2025}
}