Robust and Accessible: The CU MultiLang Dataset and Continuing Open-Set Speech LID
- arXiv Publication URL
- Please read the paper for a summary of the new multi-language speech dataset, as well as the goals, architecture, and results of our language identification system.
Modernizing Open-Set Speech Language Identification
- Our original paper that inspired this continued work
- arXiv Publication URL
- Git Repo
Building an accessible, robust, and general solution for open-set speech language identification.
- Capable of identifying known languages with high accuracy, while also recognizing and learning unknown languages on the fly without retraining the foundation TDNN model.
- Highly portable: full-system inference runs on lightweight hardware, and full model training is feasible on reasonable developer hardware (a single-GPU system with 32GB of RAM can train in a matter of hours).
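The open-set behavior described above (accept a known language, flag an unknown one, then enroll it without retraining the TDNN) can be illustrated with a minimal sketch. This is not the system's actual back end, which scores TDNN outputs with LDA and pLDA layers. Cosine similarity against per-language mean embeddings stands in for that scoring here, and the names `identify` and `enroll`, as well as the `threshold` value, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(embedding, centroids, threshold=0.5):
    """Return the best-matching known label, or None when no score reaches the threshold.

    centroids maps a language label to its mean embedding (an assumption of this sketch).
    """
    best_label, best_score = None, -1.0
    for label, centroid in centroids.items():
        score = cosine(embedding, centroid)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


def enroll(centroids, label, embeddings):
    """Return a copy of centroids with a new language added as the mean of its embeddings.

    Only the back-end table changes; the embedding extractor is left untouched.
    """
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    updated = dict(centroids)
    updated[label] = mean
    return updated
```

In this sketch, an utterance that returns `None` from `identify` would be treated as an unknown language, and a handful of such utterances could then be passed to `enroll` so that later utterances of the same language are recognized.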
Also building a diverse, high-coverage, open-source speech dataset spanning over 50 languages.
- Used to make the system robust and generalized.
- Covers most language families, with targeted diversity in speakers and dialects within each language.
The full dataset can be accessed at the link below:
Run demo
$ python3 full_system.py
Train TDNN with 5hrs, 4sec, 15 epochs
$ python3 train_tdnn.py 5 4 15 0.8
Test the tdnn-final-submission TDNN model
$ python3 test_tdnn.py ./saved-models/tdnn-final-submission.pickle
Save the outputs of tdnn-final-submission TDNN model
$ python3 get_tdnn_outputs.py ./saved-models/tdnn-final-submission.pickle
Train LDA and pLDA layers using saved-tdnn-outputs
$ python3 train_lda_plda.py ./saved-tdnn-outputs
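The commands above form a sequence: train the TDNN, test it, save its outputs, then train the LDA and pLDA layers on those outputs. A small sketch of chaining them from Python follows. The helper names `pipeline_commands` and `run_pipeline` are hypothetical, and the sketch assumes that the model path and outputs directory match the ones shown in the commands above; this README does not state where `train_tdnn.py` writes its model, so both paths are passed in explicitly.

```python
import subprocess


def pipeline_commands(model_path, outputs_dir, train_args=("5", "4", "15", "0.8")):
    """Build the argv lists for each stage, in the order listed in this README.

    train_args are the positional arguments from the training example and are
    passed through unchanged.
    """
    return [
        ["python3", "train_tdnn.py", *train_args],
        ["python3", "test_tdnn.py", model_path],
        ["python3", "get_tdnn_outputs.py", model_path],
        ["python3", "train_lda_plda.py", outputs_dir],
    ]


def run_pipeline(commands):
    """Run each stage in order, stopping at the first non-zero exit code."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Keeping command construction separate from execution makes it possible to print or inspect the stages before running anything that takes hours.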
MFCC + pitch features were generated using Kaldi
- MFCC and pitch conf files can be found in the mfcc-confs subdir
  - Originally found in kaldi/egs/tedlium/s5_r3/scripts/conf/mfcc.conf and kaldi/egs/tedlium/s5_r3/scripts/conf/pitch.conf
- Also in that subdir is our modified version of the make_mfcc_pitch.sh script
  - Originally found in and runnable from kaldi/egs/tedlium/s5_r3/scripts/steps/make_mfcc_pitch.sh
  - Usage: make_mfcc_pitch.sh --nj 1 --cmd "$train_cmd" <language directory> <log directory> <mfcc_pitch output directory>
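Since the script takes one language directory per invocation, feature generation for a whole dataset means repeating the usage line for each language. The sketch below builds those invocations without running them. It assumes a layout with one subdirectory per language under a common data root, and per-language log and output directories, none of which is specified in this README. The helper name `mfcc_pitch_commands` is hypothetical, and the `run.pl` default for `$train_cmd` is an assumption based on Kaldi's usual local-machine setting.

```python
from pathlib import Path


def mfcc_pitch_commands(data_root, log_root, out_root,
                        train_cmd="run.pl", script="make_mfcc_pitch.sh"):
    """Build one make_mfcc_pitch.sh argv list per language subdirectory.

    Mirrors the usage line:
      make_mfcc_pitch.sh --nj 1 --cmd "$train_cmd" <language dir> <log dir> <output dir>
    """
    data_root, log_root, out_root = Path(data_root), Path(log_root), Path(out_root)
    commands = []
    # Sort for a stable order; plain files in data_root are skipped.
    for lang_dir in sorted(p for p in data_root.iterdir() if p.is_dir()):
        commands.append([
            script, "--nj", "1", "--cmd", train_cmd,
            str(lang_dir),
            str(log_root / lang_dir.name),
            str(out_root / lang_dir.name),
        ])
    return commands
```

Each returned list can be handed to `subprocess.run` from the directory where the script is runnable.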
Language Data Sources
- VoxForge
- VoxLingua107
- MediaSpeech
- BibleTTS
- African Accented French
- Free ST American English Corpus
- FHNW Swiss Parliament
- Samromur 21.05
- Russian LibriSpeech
- Iban
- THUYG-20
- Zeroth-Korean
- Kashmiri Data Corpus
- Large Bengali ASR training data set
- Crowdsourced high-quality Chilean Spanish speech data set
- Crowdsourced high-quality Colombian Spanish speech data set
- Crowdsourced high-quality Peruvian Spanish speech data set
- Crowdsourced high-quality Puerto Rico Spanish speech data set
- Crowdsourced high-quality Burmese speech data set
- Crowdsourced high-quality Telugu multi-speaker speech data set
- Crowdsourced high-quality Malayalam multi-speaker speech data set
- Crowdsourced high-quality Tamil multi-speaker speech data set
- Crowdsourced high-quality Catalan speech data set
Basic PyTorch TDNN Reference
Python pLDA Reference