Open-Set Speech Language Identification and the CU MultiLang Dataset

Mustafa Eyceoz, Justin Lee, Siddarth Pittie

Publication

Robust and Accessible: The CU MultiLang Dataset and Continuing Open-Set Speech LID

arXiv Publication URL
Please read the paper for a summary of the new multi-language speech dataset and of the goals, architecture, and results of our language identification system.

Previous Works

Modernizing Open-Set Speech Language Identification

Our original paper that inspired this continued work
arXiv Publication URL
Git Repo

Project Summary

Building an accessible, robust, and general solution for open-set speech language identification.

Identifies known languages with high accuracy, and also recognizes and learns unknown languages on-the-fly without retraining the foundation TDNN model.
Highly portable: full-system inference runs on very lightweight hardware, and full model training is feasible on reasonable developer hardware (a single-GPU system with 32GB RAM can train in a matter of hours).
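The open-set behavior described above (accept a known language, flag an unknown one, then enroll it without retraining the embedding model) can be illustrated with a minimal sketch. This is not the system's implementation: it swaps in plain cosine similarity against per-language centroids in place of the LDA and pLDA layers used in this repo, and the class name, threshold value, and toy 3-dimensional embeddings are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class OpenSetClassifier:
    """Toy open-set classifier over fixed embeddings (hypothetical helper).

    Assumption: embeddings come from a frozen model (for example a TDNN),
    so adding a language only adds a centroid and never retrains that model.
    """

    def __init__(self, threshold=0.7):
        self.threshold = threshold  # assumed value, would need tuning
        self.centroids = {}         # language -> (mean embedding, sample count)

    def enroll(self, language, embedding):
        """Add one embedding to a language, creating the language if new."""
        if language not in self.centroids:
            self.centroids[language] = (list(embedding), 1)
            return
        mean, n = self.centroids[language]
        new_mean = [(m * n + e) / (n + 1) for m, e in zip(mean, embedding)]
        self.centroids[language] = (new_mean, n + 1)

    def identify(self, embedding):
        """Return (label, score); label is "unknown" below the threshold."""
        best_lang, best_score = None, -1.0
        for lang, (mean, _) in self.centroids.items():
            score = cosine(embedding, mean)
            if score > best_score:
                best_lang, best_score = lang, score
        if best_lang is None or best_score < self.threshold:
            return "unknown", best_score
        return best_lang, best_score
```

In use, a sample labeled "unknown" would be enrolled under a fresh label with enroll(), after which similar samples resolve to that label while the embedding model stays untouched.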

Also building a diverse, high-coverage, open-source speech dataset spanning over 50 languages.

Used to make the system robust and generalized.
Covers most language families, with targeted diversity in speakers and dialects within each language as well.

CU MultiLang Dataset

The full dataset can be accessed at the link below:

Link to Dataset

LID System Code Guide

Run demo

$ python3 full_system.py

Train TDNN with 5hrs, 4sec, 15 epochs

$ python3 train_tdnn.py 5 4 15 0.8

Test the tdnn-final-submission TDNN model

$ python3 test_tdnn.py ./saved-models/tdnn-final-submission.pickle

Save the outputs of tdnn-final-submission TDNN model

$ python3 get_tdnn_outputs.py ./saved-models/tdnn-final-submission.pickle

Train LDA and pLDA layers using saved-tdnn-outputs

$ python3 train_lda_plda.py ./saved-tdnn-outputs
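The steps above can be chained from a single driver. The sketch below only assembles the same commands listed in this guide; the function names are hypothetical, the fourth positional value passed to train_tdnn.py (0.8) is forwarded unchanged because its meaning is not described here, and whether train_tdnn.py writes its model to the path used by the later steps is an assumption that may need adjusting.

```python
import subprocess


def build_pipeline_commands(
    hours=5,
    seconds=4,
    epochs=15,
    fourth_arg=0.8,  # forwarded as-is; not documented in this guide
    model_path="./saved-models/tdnn-final-submission.pickle",
    outputs_dir="./saved-tdnn-outputs",
):
    """Return the guide's training/evaluation commands as argv lists."""
    return [
        ["python3", "train_tdnn.py",
         str(hours), str(seconds), str(epochs), str(fourth_arg)],
        ["python3", "test_tdnn.py", model_path],
        ["python3", "get_tdnn_outputs.py", model_path],
        ["python3", "train_lda_plda.py", outputs_dir],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("$ " + " ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(build_pipeline_commands(), dry_run=True)
```

Keeping dry_run=True by default lets the commands be reviewed before any multi-hour training run starts.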

Feature Generation

MFCC + pitch features were generated using Kaldi

MFCC and pitch conf files can be found in the mfcc-confs subdir
- Originally found in kaldi/egs/tedlium/s5_r3/scripts/conf/mfcc.conf and kaldi/egs/tedlium/s5_r3/scripts/conf/pitch.conf
Also in that subdir is our modified version of the make_mfcc_pitch.sh script
- Originally found in and runnable from kaldi/egs/tedlium/s5_r3/scripts/steps/make_mfcc_pitch.sh
- Usage: make_mfcc_pitch.sh --nj 1 --cmd "$train_cmd" <language directory> <log directory> <mfcc_pitch output directory>
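To generate features for many languages, the usage line above can be expanded once per language directory. This is a minimal sketch under stated assumptions: the function name, the script location, and the per-language log and output directory layout are hypothetical, and "run.pl" (Kaldi's local job runner) is assumed as the value of $train_cmd.

```python
from pathlib import PurePosixPath


def mfcc_pitch_commands(
    language_dirs,
    log_root,
    feat_root,
    script="./make_mfcc_pitch.sh",  # hypothetical location of the script
    train_cmd="run.pl",             # assumed value of Kaldi's $train_cmd
    nj=1,
):
    """Build one make_mfcc_pitch.sh argv list per language directory."""
    commands = []
    for lang_dir in language_dirs:
        name = PurePosixPath(lang_dir).name
        commands.append([
            script,
            "--nj", str(nj),
            "--cmd", train_cmd,
            str(lang_dir),                       # <language directory>
            str(PurePosixPath(log_root) / name),   # <log directory>
            str(PurePosixPath(feat_root) / name),  # <mfcc_pitch output directory>
        ])
    return commands


if __name__ == "__main__":
    for cmd in mfcc_pitch_commands(["data/english", "data/turkish"],
                                   "exp/make_mfcc_pitch", "mfcc_pitch"):
        print(" ".join(cmd))
```

The commands are returned rather than executed, so they can be inspected or handed to subprocess.run from within a Kaldi recipe directory.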

Open-Source Citations

Language Data Sources

Basic PyTorch TDNN Reference

https://github.com/cvqluu/TDNN

Python pLDA Reference

https://github.com/RaviSoji/plda

