Issue where silence gaps aren't recognized properly sometimes #194
Replies: 5 comments 4 replies
-
Also, I'm not exactly sure how prepend_punctuations and append_punctuations work, since they don't seem to add punctuation to the output file for me. (Should ! be included in the calls to those as well?)
-
If the gaps are detected (seen in the visualization) but the final timestamps ignore them, it is likely because the gap would make the duration of a word shorter than the minimum word duration, or because the end is being cut off too early. The end cutting off too early tends to occur more often when the audio has been preprocessed by a noise remover/voice isolator. Another cause is the silence detection (non-VAD/VAD) marking that part as silent before the word finished, because the end of the word is significantly quieter. The latter can be reduced by lowering the threshold of the silence detection. A punctuation mark is treated as its own word if it is not in prepend_punctuations or append_punctuations.
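To illustrate that last point, here is a minimal sketch in plain Python of how prepend/append punctuation sets can decide whether a punctuation token merges into a neighboring word or stays a standalone "word". This is not stable-ts's actual implementation, and the character sets below are only examples:

```python
# Illustrative only: not stable-ts's actual code, and these default
# sets are examples, not the library's real defaults.
PREPEND = set("\"'“¿([{-")
APPEND = set("\"'.。,，!！?？:：”)]}、")

def merge_punctuation(tokens):
    """Merge single-character punctuation tokens into neighboring words.

    A punctuation token in APPEND attaches to the previous word, one in
    PREPEND attaches to the next word, and anything else is kept as its
    own standalone "word" (with its own timestamps).
    """
    words = []
    pending_prefix = ""
    for tok in tokens:
        if len(tok) == 1 and tok in APPEND and words:
            words[-1] += tok          # attach to the previous word
        elif len(tok) == 1 and tok in PREPEND:
            pending_prefix += tok     # hold until the next word arrives
        else:
            words.append(pending_prefix + tok)
            pending_prefix = ""
    if pending_prefix:                # trailing prepend punctuation
        words.append(pending_prefix)
    return words
```

This is why a mark missing from both sets (here, the semicolon) ends up as its own word with its own timing, which can throw off segment boundaries.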
-
These functions, prepend_punctuations and append_punctuations, are they documented somewhere?
-
https://github.com/linto-ai/whisper-timestamped I've had more luck with this repository for my purposes, as the gaps seem to be consistently accounted for (any other problems can be handled manually with a bit of work in Python).
-
Late reply to an old issue, but I got too frustrated with this killing my workflow. I tried whisper-timestamped too, but that one seems to love hallucinating and just repeating random blocks of text in the middle of transcriptions, and that has yet to be solved by the author.

Workaround: preprocess the audio with ffmpeg (using the split_silence.py example from the ffmpeg-python repo) to split on extended silences, then feed the resulting chunks to Whisper.

For what it's worth, I spent more time than I care to admit testing prepends, appends, and different vad/ksize/qlevels/etc. settings; none of these adjustments made any meaningful difference to accuracy or mitigated the errors around silences. Using visualise_suppression() confirmed the silences were being marked/'suppressed' appropriately, but seemingly with no effect on the end result.

I would highly recommend introducing this old low-level bit of preprocessing for any audio with gaps larger than ~3 seconds. I can provide the code I ended up using if it's of help to anyone else.
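The splitting step boils down to something like the sketch below. This is only the core idea in plain Python; in practice I went through ffmpeg's silencedetect filter via the split_silence.py example, and the threshold/gap numbers here are placeholders:

```python
# Sketch of silence-based splitting, assuming a mono signal given as a
# list of amplitude samples. Placeholder values; the real workflow used
# ffmpeg's silencedetect filter (split_silence.py in ffmpeg-python).
def split_on_silence(samples, rate, min_gap=3.0, threshold=0.01):
    """Return (start, end) sample indices of chunks, splitting wherever
    a silent run (|amplitude| < threshold) lasts >= min_gap seconds."""
    min_run = int(min_gap * rate)
    chunks, chunk_start, silent_run = [], 0, 0
    for i, s in enumerate(samples):
        if abs(s) < threshold:
            silent_run += 1
        else:
            # a long enough silent run just ended: close the chunk there
            if silent_run >= min_run and i - silent_run > chunk_start:
                chunks.append((chunk_start, i - silent_run))
                chunk_start = i
            silent_run = 0
    if chunk_start < len(samples):
        chunks.append((chunk_start, len(samples)))
    return chunks
```

Each chunk can then be written out and transcribed independently, so Whisper never sees the long gaps that confuse its timestamps.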
-
I have an audio file that, when visualized, looks like this (already preprocessed with Ultimate Vocal Remover).
The splitting of groups works for the most part, but for some reason the following areas get linked together even though the gaps of silence are the largest in these spots.
I am running this in the command line: stable-ts "filename.wav" --language Japanese -o audio.srt --model large --segment_level true --word_level false --regroup "cm_sp=.* /。/?/?/,* /,_sg=.5_mg=.3+3_sp=.* /。/?/?" --prepend_punctuations ".* /。/?/?"
Apart from that, I also have a minor issue where the ends of words/phrases are sometimes cut off too early.
Also, rather than splitting at the end of a gap and using the start of the next phrase for the timestamps, it tends to just use the end of the previous phrase as the start of the next phrase even when there's a gap. This doesn't always happen, but it happens enough to be a hindrance (combined with the above issue where ends are sometimes cut off too early, the end of the previous phrase may be included at the start of the next, followed by a gap, and finally by the start of the current phrase).
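One post-processing workaround I'm considering (a rough sketch of my own, not anything stable-ts provides as far as I know; the segment/gap formats are assumptions) is to snap each phrase's start forward to the end of any detected silence gap that overlaps it:

```python
# Hypothetical post-processing step, not stable-ts internals: if a
# detected silence gap overlaps a segment's start, move the start
# forward to the end of that gap instead of inheriting the previous
# segment's end time.
def snap_starts_to_gaps(segments, gaps):
    """segments: list of [start, end] times in seconds;
    gaps: list of (gap_start, gap_end) silence intervals.
    Returns segments with starts snapped past overlapping gaps."""
    adjusted = []
    for start, end in segments:
        for g0, g1 in gaps:
            # gap covers the segment's start but ends before the segment does
            if g0 <= start < g1 <= end:
                start = g1
        adjusted.append([start, end])
    return adjusted
```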
My goal is to build an automated process that transcribes an audio file with accurate timestamps for every phrase, so that I can take the output subtitle file and write a script to cut each phrase into an individual audio file titled with its corresponding transcription. As such, I need the timings to be as accurate as possible for this to work.
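For what it's worth, the cutting side of that pipeline could start from something like this minimal sketch: parse the SRT output into (start, end, text) tuples, which could then be fed to something like ffmpeg's -ss/-to options to extract each clip. All names here are illustrative, and the parser is deliberately minimal:

```python
import re

# Minimal SRT cue parser for the clip-cutting step. Illustrative only;
# a real script would also sanitize the text for use as a filename and
# invoke ffmpeg (e.g. with -ss start -to end) per cue.
TS = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+) --> (\d+):(\d+):(\d+)[,.](\d+)")

def parse_srt(srt_text):
    """Return (start_seconds, end_seconds, text) for each subtitle cue."""
    cues = []
    for block in srt_text.strip().split("\n\n"):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            m = TS.match(line.strip())
            if m:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                # everything after the timestamp line is the cue text
                cues.append((start, end, " ".join(lines[i + 1:])))
                break
    return cues
```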
Thanks for any insights!