The idea is to build a model that can recognize single words from short speech segments. I use a GMM-HMM model.
There is a Medium article that explains the what and the how.
Part of the code is from https://github.com/jayaram1125/Single-Word-Speech-Recognition-using-GMM-HMM- I've refactored the code and added some features:
- added MFCC delta and delta-delta features to increase model accuracy (see the feature-extraction sketch after this list)
- added a script to record test audio for testing your model(s)
- trained models on the data from the original repository, plus a batch of data from the Speech Commands Dataset
- just for testing, aligned the Speech Commands Dataset to gain higher accuracy
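For illustration, here is a minimal sketch of MFCC + delta + delta-delta extraction using librosa (the repository may use a different MFCC implementation; the function and parameter choices here are illustrative):

```python
# Hedged sketch: MFCC + delta + delta-delta features, stacked per frame.
# The repo may use another MFCC library; librosa is just one common option.
import librosa
import numpy as np

def extract_features(wav_path, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=None)        # keep the file's native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)            # first-order temporal differences
    delta2 = librosa.feature.delta(mfcc, order=2)  # second-order (delta-delta)
    # Transpose to (n_frames, 3 * n_mfcc): one observation vector per frame
    return np.vstack([mfcc, delta, delta2]).T
```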
Accuracy information for my trained models is in the models/accuracies directory. The original models are not included because they are too big; only an example fruit-names model is in the models [directory](https://github.com/RRisto/single_word_asr_gmm_hmm/tree/master/models). If you want to use them, see predict_google.py for an example (a prediction sketch follows below). You can record your own voice using record_test_audio.py.
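As a rough idea of what prediction looks like (not the exact code in predict_google.py): each word has its own GMM-HMM, and the word whose model gives the highest log-likelihood wins. The `model_paths` mapping below is hypothetical:

```python
# Hedged sketch: classify a clip by scoring it against per-word GMM-HMM models.
# Assumes pickled hmmlearn models, one per word; paths/names are hypothetical.
import pickle

def predict_word(features, model_paths):
    """features: (n_frames, n_features) array; model_paths: {word: pickle_path}."""
    scores = {}
    for word, path in model_paths.items():
        with open(path, "rb") as f:
            model = pickle.load(f)
        scores[word] = model.score(features)  # log-likelihood of the sequence
    return max(scores, key=scores.get)        # word whose model fits best
```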
The scripts are tested on Windows 10 using Python 3.7.
- Download speech data (such as the Speech Commands Dataset). The data should be in folders; each folder should be named after the label/command/word spoken in that directory
- Prepare the data for training and testing using the notebook. This should follow the original suggestions on how to split data for training and testing. Note that the testing and validation file lists are in the [data/](https://github.com/RRisto/single_word_asr_gmm_hmm/tree/master/data) folder
- Train a model using train_hmm_google_orig.py or another train script as a template (see the training sketch after this list)
- Predict on test data using the predict_google_orig.py script
- Test your model with a microphone by running listen_mic_predict.py
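For reference, training one GMM-HMM per word with hmmlearn typically looks something like the sketch below (train_hmm_google_orig.py may differ; the state and mixture counts are illustrative):

```python
# Hedged sketch: fit one GMMHMM per word on that word's feature sequences.
# n_states / n_mix values are illustrative, not the repo's actual settings.
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_word_model(feature_list, n_states=3, n_mix=2):
    """feature_list: list of (n_frames, n_features) arrays, one per clip."""
    X = np.concatenate(feature_list)          # hmmlearn takes one stacked array...
    lengths = [len(f) for f in feature_list]  # ...plus per-sequence lengths
    model = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type="diag", n_iter=20)
    model.fit(X, lengths)
    return model
```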
Another script uses data from the Google Speech Commands Dataset but includes only a few categories for quicker training (it doesn't have the unknown-word and noise categories).
The original data is good for debugging, but not very useful for real-life speech recognition.
- unzip the data file
- Train a model using train_hmm_fruits.py or other train scripts as a template
- Test your model with a microphone by running listen_mic_predict.py (a microphone-recording sketch follows below)
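listen_mic_predict.py handles the microphone part; as a rough sketch of the idea (using sounddevice, which may not be the library the script actually uses):

```python
# Hedged sketch: grab a short clip from the default microphone.
# sounddevice is one option; the actual script may use another audio library.
import sounddevice as sd

def record_clip(seconds=1.0, sr=16000):
    audio = sd.rec(int(seconds * sr), samplerate=sr, channels=1, dtype="float32")
    sd.wait()                       # block until the recording is done
    return audio.squeeze(), sr      # 1-D float array plus its sample rate
```

The returned array can then go through the same feature extraction and scoring steps as a file on disk.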
This is just an experiment I made. The original alignment was already quite good, but this might improve model performance.
If you want to align the data and use it for training:
- Download the Speech Commands Dataset
- Run 1.0_prep_data4aligning.ipynb
- Download/install Montreal Forced Aligner
- Download the LibriSpeech lexicon (you can also create your own)
- Run the aligner using the following template (on the command line): `bin/mfa_train_and_align /path/to/dataset_prepared_in_first_step /path/to/librispeech/lexicon.txt /path/to/aligned/dataset`. This part takes a few hours (on a typical Windows laptop)
- Run 1.1_generate_aligned_audio_files_risto.ipynb - this will create chunks from the original audio that contain only the part where the command was said (see the chunking sketch after this list)
- An example of training a new model on the aligned data is in train_hmm_google_aligned.py
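The chunking step in the notebook boils down to cutting each clip at the word boundaries found by the aligner. A minimal sketch, assuming you already have start/end times in seconds (TextGrid parsing not shown):

```python
# Hedged sketch: save only the aligned-word span of a clip to a new file.
# start_s / end_s would come from the aligner's TextGrid output.
import soundfile as sf

def save_word_chunk(wav_in, wav_out, start_s, end_s):
    audio, sr = sf.read(wav_in)
    chunk = audio[int(start_s * sr):int(end_s * sr)]  # keep the spoken-word span only
    sf.write(wav_out, chunk, sr)
```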
There is also a Docker image. To use it:
- build the image (run build_docker.bat)
- run the container (run run_docker.bat)
- if you want to use jupyter notebook:
  - go inside the docker container: `docker exec -it single_word_gmmhmm_run /bin/bash`
  - start the jupyter notebook server: `jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root`
  - go to your browser and open http://127.0.0.1:7006/
  - in the terminal you should see the notebook token; copy-paste it into the browser and you should be inside jupyter notebook