WER evaluation conditions #1853
yassine-ajaaoun asked this question in Q&A (unanswered, 0 replies)
Hello,
I'm currently investigating open-source STT solutions for live transcription of company meetings. I'm comparing the 0.9.3 English DeepSpeech model with Vosk-Kaldi. The idea is to compare WERs on simple .wav files (meetings, read sentences, ...).
Before automating my tests on a larger database, I wanted to find the conditions under which DeepSpeech transcribes best.
What I've seen so far is that the best conditions are (besides using 16 kHz, 16-bit PCM WAVs): sufficient sound amplitude, clips not exceeding 15 s of audio, and not too much background noise.
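For completeness, here is a minimal sketch of how a file could be converted to that 16 kHz, 16-bit mono PCM format (an illustrative helper, not the exact script I use; it assumes the third-party soundfile and scipy packages):

```python
# Sketch: convert an arbitrary audio file to the 16 kHz, 16-bit mono PCM WAV
# expected by the 0.9.3 DeepSpeech model. The helper name and paths are
# illustrative placeholders.
import soundfile as sf
from scipy.signal import resample_poly

def to_deepspeech_wav(src_path, dst_path, target_rate=16000):
    audio, rate = sf.read(src_path, dtype="float32")
    if audio.ndim > 1:                      # down-mix multi-channel to mono
        audio = audio.mean(axis=1)
    if rate != target_rate:                 # resample to 16 kHz
        audio = resample_poly(audio, target_rate, rate)
    sf.write(dst_path, audio, target_rate, subtype="PCM_16")
```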
I've collected such WAVs and transcribed them with both DeepSpeech and Vosk.
For the DeepSpeech part, I used code built on the wavTranscriber functions available here: https://github.com/mozilla/DeepSpeech-examples/tree/r0.9/vad_transcriber
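A minimal sketch of such a transcription loop over those wavTranscriber helpers (model/scorer paths, the wavs/ directory, and the VAD aggressiveness value are placeholders, and the helper signatures are as they appear in the r0.9 example, not my exact code):

```python
# Sketch of a batch transcription loop using the vad_transcriber helpers.
import glob
import numpy as np
import wavTranscriber

# load_model returns [model, model_load_time, scorer_load_time]
model, _, _ = wavTranscriber.load_model("deepspeech-0.9.3-models.pbmm",
                                        "deepspeech-0.9.3-models.scorer")

transcripts = {}
for wav_path in sorted(glob.glob("wavs/*.wav")):
    # Split the file into voiced segments with webrtcvad (aggressiveness 0-3).
    segments, sample_rate, audio_length = wavTranscriber.vad_segment_generator(wav_path, 1)
    parts = []
    for segment in segments:
        audio = np.frombuffer(segment, dtype=np.int16)
        output = wavTranscriber.stt(model, audio, sample_rate)  # [text, inference_time]
        parts.append(output[0])
    transcripts[wav_path] = " ".join(parts)
```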
I'm not using evaluate.py for evaluation, but my own script that takes the transcriptions produced by the code above for all the WAVs and compares their WERs with the Vosk-Kaldi ones.
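For reference, the usual word-level Levenshtein formulation of WER can be computed as below (a sketch, not necessarily identical to my script):

```python
# Sketch: WER = (substitutions + deletions + insertions) / reference length,
# via a word-level Levenshtein distance computed row by row.
def wer(reference, hypothesis):
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    return prev[len(hyp)] / max(len(ref), 1)
```

For example, wer("the cat sat", "the cat sat down") gives 1/3 (one insertion against a three-word reference).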
In conclusion, Vosk-Kaldi produced noticeably better transcriptions in general, while DeepSpeech's mean WER often exceeds 60%, which isn't really good.
I would like to know whether I'm doing something wrong, whether you see anything that could dramatically degrade the transcription, or whether this is a common WER for an average speech WAV that has nothing to do with the training set, meaning that custom training is mandatory to obtain acceptable results for real meetings (with interruptions, noise, etc.).
Thank you for your time