- http://www.cs.toronto.edu/~complingweb/data/TORGO/torgo.html
- Speakers have speech impairments due to Cerebral Palsy or Amyotrophic Lateral Sclerosis.
- Build a Kaldi-based GMM-HMM acoustic model for speech recognition.
- Improve the recognition accuracy for impaired speech (data augmentation, hyperparameter tuning, etc.)
- Train a DNN-HMM acoustic model using the alignments from the GMM-HMM model.
- Perform speaker identification/recognition via i-vectors and improve baseline results.
- Part 1: Installation & Data Preparation
- Part 2: Speech Recognition (acoustic and language model training)
- Part 3: DNN-HMM acoustic model
- Part 4: Speaker Recognition (using i-vectors)
Part 1.1 Installation
- Kaldi
- The SRI Language Modeling Toolkit
- Sequitur Grapheme-to-Phoneme converter
- Intel MKL (Math Kernel Library)
Part 1.2 Data Preparation
- Audio data download
- Files that need to be created by us
- Kaldi directory structure
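
As a rough illustration of the "files that need to be created by us" step, the sketch below writes the three per-utterance files a Kaldi data directory conventionally contains: `wav.scp` (utterance ID to audio path), `text` (utterance ID to transcript), and `utt2spk` (utterance ID to speaker ID). The function name, the input tuple layout, and the `<speaker>-<name>` utterance ID scheme are assumptions made for this example, not something prescribed by the TORGO data or by this project.

```python
from pathlib import Path


def write_kaldi_data_dir(utterances, out_dir):
    """Write wav.scp, text, and utt2spk into out_dir.

    utterances: iterable of (speaker_id, utt_name, wav_path, transcript)
    tuples. This tuple layout is a hypothetical convention for the sketch.
    Returns the sorted list of utterance IDs that were written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    rows = []
    for speaker, name, wav_path, transcript in utterances:
        # Using the speaker ID as a prefix keeps utterances grouped by speaker
        # once the files are sorted.
        utt_id = f"{speaker}-{name}"
        rows.append((utt_id, speaker, str(wav_path), transcript.strip()))

    # Python compares strings by code point, which matches byte-wise
    # (C locale) ordering for ASCII IDs.
    rows.sort(key=lambda row: row[0])

    def dump(filename, column):
        lines = [f"{row[0]} {row[column]}\n" for row in rows]
        (out / filename).write_text("".join(lines), encoding="utf-8")

    dump("utt2spk", 1)
    dump("wav.scp", 2)
    dump("text", 3)
    return [row[0] for row in rows]
```

Calling `write_kaldi_data_dir([("F01", "0001", "/data/F01/0001.wav", "hello world")], "data/train")` would produce one line per file, for example `F01-0001 F01` in `utt2spk`. The paths shown here are placeholders.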
Part 2 Speech Recognition
- N-gram language model building
- MFCC extraction + CMVN (cepstral mean and variance normalization)
- GMM-HMM training
- Monophone training
- Triphone training
- Delta + delta-delta training (adds first- and second-order dynamic coefficients to supplement the static MFCC features)
- Linear Discriminant Analysis – Maximum Likelihood Linear Transform (LDA-MLLT, to reduce the dimensionality of the feature space)
- Speaker Adaptive Training (SAT, which performs speaker and noise normalization)
- Alignment with feature-space Maximum Likelihood Linear Regression (fMLLR, which yields speaker-normalized features)
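
To make the CMVN and delta + delta-delta steps above more concrete, here is a minimal pure-Python sketch. It normalizes each feature dimension to zero mean and unit variance over an utterance, then appends dynamic coefficients computed with the standard regression formula d_t = Σ n (c_{t+n} − c_{t−n}) / (2 Σ n²). The function names, the window size of 2, and the edge handling (repeating the first and last frame) are assumptions for illustration; this is not Kaldi's own implementation.

```python
from statistics import fmean, pstdev


def apply_cmvn(frames, eps=1e-10):
    """Normalize every feature dimension to zero mean and unit variance."""
    columns = list(zip(*frames))
    means = [fmean(col) for col in columns]
    # eps guards against division by zero for constant dimensions.
    stds = [max(pstdev(col), eps) for col in columns]
    return [
        [(x - m) / s for x, m, s in zip(frame, means, stds)]
        for frame in frames
    ]


def deltas(frames, window=2):
    """Regression-based time derivative of a sequence of feature frames."""
    denom = 2 * sum(n * n for n in range(1, window + 1))
    last = len(frames) - 1
    result = []
    for t, frame in enumerate(frames):
        delta = [0.0] * len(frame)
        for n in range(1, window + 1):
            ahead = frames[min(t + n, last)]   # repeat the last frame at the end
            behind = frames[max(t - n, 0)]     # repeat the first frame at the start
            for d in range(len(frame)):
                delta[d] += n * (ahead[d] - behind[d]) / denom
        result.append(delta)
    return result


def add_delta_features(frames, window=2):
    """Return frames extended with delta and delta-delta coefficients."""
    first = deltas(frames, window)
    second = deltas(first, window)
    return [list(f) + d1 + d2 for f, d1, d2 in zip(frames, first, second)]
```

With 13-dimensional MFCC frames as input, `add_delta_features` would return 39-dimensional frames (static, delta, and delta-delta parts).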
Part 3 DNN-HMM acoustic model
- DNN-based acoustic model
- Use GMM-HMM generated alignments to train a deep neural network acoustic model
- Restricted Boltzmann Machine (RBM) pre-training
- Frame cross-entropy training
- Sequence-training optimizing state-level minimum Bayes risk (sMBR)
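
The frame cross-entropy objective listed above can be sketched as follows: the GMM-HMM alignment supplies one target state per frame, and the loss is the mean negative log-posterior that the network assigns to that state. This is a simplified illustration in plain Python, assuming the network outputs one vector of unnormalized scores (logits) per frame; the function names are hypothetical, and the sketch covers only the loss, not RBM pre-training or sMBR sequence training.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one frame's output scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def frame_cross_entropy(frame_logits, aligned_states):
    """Mean negative log-posterior of the aligned state, averaged over frames.

    frame_logits:   one list of per-state scores for every frame.
    aligned_states: one target state index per frame, taken from the
                    GMM-HMM forced alignment.
    """
    if len(frame_logits) != len(aligned_states):
        raise ValueError("need exactly one aligned state per frame")
    if not aligned_states:
        raise ValueError("need at least one frame")
    total = 0.0
    for logits, state in zip(frame_logits, aligned_states):
        total -= log_softmax(logits)[state]
    return total / len(aligned_states)
```

A network that scores all states equally gives a loss of log(K) for K states, which is a convenient sanity check before training starts.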
Part 4 Speaker Recognition (or identification)
- MFCC feature extraction
- Voice activity detection (energy-based VAD)
- Train Gaussian Mixture Model - Universal Background Model (GMM-UBM)
- Train the i-vector extractor
- Extract i-vectors from the audio files
- Train a Probabilistic Linear Discriminant Analysis (PLDA) model
- Compute PLDA scores and evaluate them with the Equal Error Rate (EER)
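
As an illustration of the final evaluation step, the sketch below estimates the Equal Error Rate from two lists of trial scores: scores for same-speaker (target) trials and scores for different-speaker (non-target) trials. It sweeps every observed score as a decision threshold and reports the point where the false-accept and false-reject rates are closest. The function name and the "accept if score >= threshold" convention are assumptions for this example; the scores themselves would come from the PLDA scoring step.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return (eer, threshold) at the point where the error rates are closest.

    target_scores:    scores of trials where both sides are the same speaker.
    nontarget_scores: scores of trials with different speakers.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("need at least one target and one non-target score")

    best = None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # A trial is accepted when its score is at or above the threshold.
        false_reject = sum(s < threshold for s in target_scores) / len(target_scores)
        false_accept = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(false_accept - false_reject)
        if best is None or gap < best[0]:
            # Report the average of the two rates at the closest crossing.
            best = (gap, (false_accept + false_reject) / 2, threshold)
    return best[1], best[2]
```

Because the threshold sweep is limited to observed scores, the result is an approximation on small trial lists; with many trials the two error rates cross more smoothly.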