Wrong transcription on mini_librispeech nnet3 trained model #96

Open
cassiotbatista opened this issue Apr 15, 2020 · 0 comments
Hi there.

I've been training the standard recipe for the mini librispeech dataset using the tuning/run_tdnn_1k.sh script (which local/chain/run_tdnn.sh is linked to). I used a 32-core cluster running Ubuntu 18.04, since I didn't have a GPU available. The only major modification I made to the scripts was to comment out the background decoding processes for the intermediate models (such as the monophone ones) in run.sh, so the only decoding step left was for tri3b (SAT). The rest of the script was executed as is.

The problem is that when I try to decode the example file dr_strangelove.mp3, I get only short, seemingly random transcriptions that do not even reflect the length of the audio, as you can see in the screenshot below.

[screenshot: 2020-04-15-152937_1096x336_scrot]

My model and ivector files are linked to the recipe's exp/ folder as follows:

$ tree models-chaina/ ivector_extractor-chaina/
models-chaina/
├── final.mdl         -> /mnt/extra/git-all/kaldi/egs/mini_librispeech_b/s5/exp/chain_online_cmn/tdnn1k_sp/final.mdl
├── HCLG.fst          -> /mnt/extra/git-all/kaldi/egs/mini_librispeech_b/s5/exp/chain_online_cmn/tree_sp/graph_tgsmall/HCLG.fst
├── phones.txt        -> /mnt/extra/git-all/kaldi/egs/mini_librispeech_b/s5/exp/chain_online_cmn/tdnn1k_sp/phones.txt
├── word_boundary.int -> /mnt/extra/git-all/kaldi/egs/mini_librispeech_b/s5/exp/chain_online_cmn/tree_sp/graph_tgsmall/phones/word_boundary.int
└── words.txt         -> /mnt/extra/git-all/kaldi/egs/mini_librispeech_b/s5/exp/chain_online_cmn/tree_sp/graph_tgsmall/words.txt
ivector_extractor-chaina/
├── final.dubm        -> /mnt/extra/git-all/kaldi/egs/mini_librispeech_b/s5/exp/nnet3_online_cmn/extractor/final.dubm
├── final.ie          -> /mnt/extra/git-all/kaldi/egs/mini_librispeech_b/s5/exp/nnet3_online_cmn/extractor/final.ie
├── final.mat         -> /mnt/extra/git-all/kaldi/egs/mini_librispeech_b/s5/exp/nnet3_online_cmn/extractor/final.mat
└── global_cmvn.stats -> /mnt/extra/git-all/kaldi/egs/mini_librispeech_b/s5/exp/nnet3_online_cmn/extractor/global_cmvn.stats
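Since every file in these two directories is a symlink into the recipe's exp/ folder, one quick sanity check is to look for dangling links before decoding; a broken link can fail silently and produce garbage output. A minimal sketch (the directory names match the listing above; `find -L dir -type l` is the standard idiom for spotting symlinks that cannot be followed):

```shell
# Sketch: report any dangling symlink in the model directories.
# Adjust the directory names if yours differ.
set -u
for d in models-chaina ivector_extractor-chaina; do
  # With -L, a path only still tests as "type l" if the link target
  # cannot be resolved, i.e. the symlink is dangling.
  broken=$(find -L "$d" -type l 2>/dev/null)
  if [ -n "$broken" ]; then
    echo "Dangling symlinks in $d:"
    echo "$broken"
  fi
done
```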

Configuration files were kept the same as those downloaded by the prepare-models.sh script, except for the MFCC config file, which I modified to match the mfcc_hires.conf used during training.

$ cat conf/mfcc.conf
--use-energy=false   # only non-default option.
--sample-frequency=16000 #  Switchboard is sampled at 8kHz # changed for mini librispeech

# config for high-resolution MFCC features, intended for neural network
# training
# Note: we keep all cepstra, so it has the same info as filterbank features,
# but MFCC is more easily compressible (because less correlated) which is why
# we prefer this method.
--num-mel-bins=40     # similar to Google's setup.
--num-ceps=40     # there is no dimensionality reduction.
--low-freq=20     # low cutoff frequency for mel bins... this is high-bandwidth data, so
                  # there might be some information at the low end.
--high-freq=-400 # high cutoff frequency, relative to the Nyquist of 8000 (=7600)
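A feature-configuration mismatch between decoding and training is a classic cause of garbage output from chain models, so it may be worth mechanically diffing the two config files rather than eyeballing them. A minimal sketch in Python (the parsing assumes the simple `--key=value  # comment` line format that Kaldi configs use; the file paths in a real run would be conf/mfcc.conf and the recipe's mfcc_hires.conf):

```python
def parse_kaldi_conf(text):
    """Parse Kaldi-style '--key=value  # comment' lines into a dict."""
    opts = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.startswith("--") and "=" in line:
            key, value = line[2:].split("=", 1)
            opts[key.strip()] = value.strip()
    return opts

def conf_diff(a, b):
    """Return {key: (value_in_a, value_in_b)} for every mismatched option."""
    return {k: (a.get(k), b.get(k))
            for k in set(a) | set(b)
            if a.get(k) != b.get(k)}

if __name__ == "__main__":
    # Inline demo with placeholder contents; read the real files instead.
    decoder = parse_kaldi_conf("--use-energy=false\n--sample-frequency=16000\n--num-ceps=40\n")
    training = parse_kaldi_conf("--use-energy=false\n--sample-frequency=16000\n--num-ceps=40  # no dim reduction\n")
    print(conf_diff(decoder, training))  # an empty dict means the configs agree
```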

Regarding the command-line options for kaldinnet2onlinedecoder in transcribe-audio.sh, I only switched nnet-mode to 3 in order to enable nnet3 support, and set use-threaded-decoder=false (#45).
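For reference, the relevant element settings would look roughly like this inside the GStreamer pipeline. This is a sketch, not the exact transcribe-audio.sh contents: besides nnet-mode and use-threaded-decoder, the property names are assumptions based on the gst-kaldi-nnet2-online documentation and may differ in your version.

```
# Sketch of the kaldinnet2onlinedecoder element configuration
# (property names other than nnet-mode and use-threaded-decoder
# are assumptions; check against your transcribe-audio.sh):
kaldinnet2onlinedecoder \
    nnet-mode=3 \
    use-threaded-decoder=false \
    model=models-chaina/final.mdl \
    fst=models-chaina/HCLG.fst \
    word-syms=models-chaina/words.txt \
    mfcc-config=conf/mfcc.conf
```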

Any ideas on what I might be missing here?

Possibly related: #83
