In Kaldi, all projects live inside the egs directory under the Kaldi root directory. An example is shown in the figure below.
Ref: https://www.eleanorchodroff.com/tutorial/kaldi/familiarization.html
The explanation in this section is mainly derived from the references given at the end. First, let's go through the wsj folder structure and then create an example directory of our own.
In the wsj directory we see the s5 directory (different numbers correspond to different versions of the recipe), where the actual files reside. A sketch of a typical s5 layout is given after the list below.
- The utils and steps directories contain the shared Kaldi scripts needed for further processing, while local holds scripts specific to this recipe.
- The exp directory contains all the model parameters, be it for a GMM or a TDNN model. This is where the acoustic models end up.
- The conf directory contains config files that set parameters such as the audio sampling frequency, beam and lattice-beam widths, etc.
- The data directory contains all the input data needed for training, validation and testing. For ASR the inputs are the audio, transcripts, words with their phonetic representations, non-silence phones, silence phones, etc. These files have to be created by the user.
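A rough sketch of the resulting s5 layout (directory names follow the standard wsj recipe; the contents listed are only illustrative):

```
egs/wsj/s5/
├── run.sh            # top-level recipe script
├── cmd.sh, path.sh   # job-submission and environment setup
├── conf/             # mfcc.conf, decode.config, ...
├── local/            # recipe-specific scripts
├── steps/, utils/    # shared Kaldi scripts (usually symlinks)
├── data/             # train/, test/, lang/, local/dict/, ...
└── exp/              # mono/, tri1/, ..., created during training
```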
Inside the data directory we have the train, lang and dict directories.
- In the train sub-directory, four files fundamentally need to be created: wav.scp, text, utt2spk and spk2utt. Further details on these files can be found here and here; example entries are sketched after this list.
- In the dict sub-directory (referred to as local/lang here) we need the lexicon and phone-list files, which are described in detail here and here.
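As an illustration, assuming a hypothetical corpus with speakers spk1 and spk2, the train files and the dictionary files look roughly like this (the words, phones and paths are made up):

```
# data/train/wav.scp    <utterance-id> <wav path or command producing a wav>
spk1_utt1  /corpus/audio/spk1_utt1.wav
spk2_utt1  /corpus/audio/spk2_utt1.wav

# data/train/text       <utterance-id> <transcript>
spk1_utt1  HELLO WORLD
spk2_utt1  GOOD MORNING

# data/train/utt2spk    <utterance-id> <speaker-id>
spk1_utt1  spk1
spk2_utt1  spk2

# data/train/spk2utt    <speaker-id> <utterance-ids ...>
# (can be generated: utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt)
spk1  spk1_utt1
spk2  spk2_utt1

# dict directory
# lexicon.txt           <word> <phone sequence>; must also contain the OOV word
<UNK>   SPN
HELLO   HH AH L OW
WORLD   W ER L D
# nonsilence_phones.txt one real phone per line (HH, AH, L, OW, ...)
# silence_phones.txt    e.g. SIL and SPN
# optional_silence.txt  usually just SIL
```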
As the utils and steps directories are common to many projects, we can simply create symbolic links to them as shown here.
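For example, from inside a new recipe directory (egs/mycorpus/s5 is a hypothetical name), the links could be created like this:

```bash
# Link the shared Kaldi scripts from the wsj recipe instead of copying them
cd egs/mycorpus/s5              # hypothetical recipe directory
ln -s ../../wsj/s5/utils .
ln -s ../../wsj/s5/steps .
```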
After creating the train (and correspondingly validation and test) and dictionary directories, we create L.fst. For that we need an OOV entry, which is used for any word not present in the lexicon; that OOV symbol must itself appear in the lexicon as a word. Follow the commands here to create the lang directory where L.fst is built; it will be used later. L.fst is nothing but the pronunciation model for the corpus. After we create a language model in the following steps, a G.fst file will also be created in this location.
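A minimal sketch of this step, assuming the dictionary files live in data/local/dict and the OOV word is written as &lt;UNK&gt;:

```bash
# Build data/lang (containing L.fst) from the dictionary directory.
# Arguments: <dict-dir> <oov-word> <tmp-dir> <lang-dir>
utils/prepare_lang.sh data/local/dict "<UNK>" data/local/lang_tmp data/lang

# Later, an ARPA language model can be turned into G.fst in the same place,
# for example with something like:
#   arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang/words.txt \
#       lm.arpa data/lang/G.fst
```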
After this we proceed to compute features from the audio. Config files are set as shown here. In the config.ini file there is a variable (mfcc_conf) that holds the path to the MFCC config file.
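Under the hood this corresponds to the standard Kaldi feature pipeline; a sketch, assuming a conf/mfcc.conf of the kind mfcc_conf points to:

```bash
# conf/mfcc.conf (illustrative):
#   --use-energy=false
#   --sample-frequency=16000

# Extract MFCCs and per-speaker CMVN statistics for each data set
for set in train test; do
  steps/make_mfcc.sh --nj 4 --cmd run.pl data/$set exp/make_mfcc/$set mfcc
  steps/compute_cmvn_stats.sh data/$set exp/make_mfcc/$set mfcc
  utils/validate_data_dir.sh data/$set     # sanity-check the data directory
done
```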
At this stage the train folder, the dictionary folder and the pronunciation model have been prepared. In a hybrid ASR system the HCLG decoding graph is obtained from four components: the acoustic/HMM model (H), the context transducer (C), the pronunciation model (L) and the language model (G). All of these components are obtained individually and then the decoding graph is constructed. The pronunciation model (L) is already in place, so we now look at acoustic model (H) training.
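For reference, once all four components exist the composition itself is a single script call; a sketch, assuming a lang directory that already contains G.fst (often data/lang_test) and a trained model directory such as exp/tri1:

```bash
# Compose HCLG.fst from H (model topology), C, L and G
# Arguments: <lang-dir-with-G.fst> <model-dir> <graph-output-dir>
utils/mkgraph.sh data/lang_test exp/tri1 exp/tri1/graph
```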
Starting with monophone training, the steps are mentioned here. Next we move to triphone training, using the alignments obtained from monophone training. There are three levels of triphone training, and the parameters for each are given in the config.ini file. After each level we run the decode script and obtain the decoded output of the trained triphone system.
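A rough sketch of this GMM-HMM pipeline using the standard Kaldi scripts; the three triphone levels are assumed to be deltas, LDA+MLLT and SAT, and the leaf/Gaussian counts are placeholders for the values in config.ini:

```bash
# Monophone training and alignment
steps/train_mono.sh --nj 4 --cmd run.pl data/train data/lang exp/mono
steps/align_si.sh   --nj 4 --cmd run.pl data/train data/lang exp/mono exp/mono_ali

# Three triphone passes: deltas, LDA+MLLT, then speaker-adaptive training (SAT)
steps/train_deltas.sh   --cmd run.pl 2000 10000 data/train data/lang exp/mono_ali exp/tri1
steps/align_si.sh --nj 4 --cmd run.pl data/train data/lang exp/tri1 exp/tri1_ali
steps/train_lda_mllt.sh --cmd run.pl 2500 15000 data/train data/lang exp/tri1_ali exp/tri2
steps/align_si.sh --nj 4 --cmd run.pl data/train data/lang exp/tri2 exp/tri2_ali
steps/train_sat.sh      --cmd run.pl 2500 15000 data/train data/lang exp/tri2_ali exp/tri3

# Decode a test set after each level, e.g. for tri3
# (build exp/tri3/graph with utils/mkgraph.sh as shown above, then:)
steps/decode_fmllr.sh --nj 4 --cmd run.pl exp/tri3/graph data/test exp/tri3/decode_test
```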
After GMM-HMM training we proceed to TDNN training. A brief overview is given here. Training parameters can be changed under stage 16, where train.py is called. At this point the baseline training has finished; decode any test set with this trained model.
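As an illustration, for a chain TDNN (the usual setup behind steps/nnet3/chain/train.py), decoding a test set looks roughly like this; exp/chain/tdnn1a, data/lang_test and the i-vector directory are assumed names that differ between recipes:

```bash
dir=exp/chain/tdnn1a             # assumed TDNN model directory

# Chain models use a self-loop scale of 1.0 when building the graph
utils/mkgraph.sh --self-loop-scale 1.0 data/lang_test $dir $dir/graph

# --online-ivector-dir is only needed if i-vectors were used during training
steps/nnet3/decode.sh --nj 4 --cmd run.pl \
  --acwt 1.0 --post-decode-acwt 10.0 \
  --online-ivector-dir exp/nnet3/ivectors_test \
  $dir/graph data/test $dir/decode_test
```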
- https://www.eleanorchodroff.com/tutorial/kaldi/training-acoustic-models.html
- http://jrmeyer.github.io/
- https://desh2608.github.io/blog/
- https://www.superlectures.com/icassp2011/category.php?lang=en&id=131
- https://medium.com/@qianhwan/understanding-kaldi-recipes-with-mini-librispeech-example-part-1-hmm-models-472a7f4a0488
- https://medium.com/@qianhwan/understanding-kaldi-recipes-with-mini-librispeech-example-part-2-dnn-models-d1b851a56c49