Issue where silence gaps aren't recognized properly sometimes #194

imnojabroni · 2023-08-29T07:23:43Z

imnojabroni
Aug 29, 2023

I have an audio file that when visualized looks like this (already preprocessed with ultimate vocal remover)

The splitting of groups works for the most part, but for some reason the following areas get linked together even though the gaps of silence are the largest in these spots.

i am running this in the command line: stable-ts "filename.wav" --language Japanese -o audio.srt --model large --segment_level true --word_level false --regroup "cm_sp=.* /。/?/？/,* /，_sg=.5_mg=.3+3_sp=.* /。/?/？" --prepend_punctuations ".* /。/?/？"

apart from that i also have a minor issue where the ends of the words/phrases are sometimes cut off too early.
also rather than splitting from the end of a gap and finding the start of the next phrase for the timestamps it tends to just use the end of the previous phrase as the start of the next phrase even if there's some gap. this doesn't always happen, but it happens enough to be a hinderance (combined with the above issue where sometimes ends are cut off too early, the end of the previous phrase may be included in the start of the next followed by a gap finally followed by the start of the current phrase).

my goal is to be able to make an automated process to transcribe an audio file with accurate timestamps for every phrase in the audio file so that i can take the output subtitle file and create a script to cut up each phrase into an individual audio file that is titled with the corresponding transcription. as such i need to find a way to get the timings as accurate as possible for it to be successful.

thanks for any insights!

imnojabroni · 2023-08-29T07:26:21Z

imnojabroni
Aug 29, 2023
Author

also i'm not exactly sure how prepend_puncuations and append_punctuations work since they don't seem to add punctuation in the output file for me. (should ! be included in the calls to those as well?)

1 reply

imnojabroni Aug 30, 2023
Author

after trying many things i don't understand what the issue is. it doesn't make any sense to me why the timestamps for the starts would just correspond exactly to the ends of the previous ones when there is a gap between the 2

these timestamps correspond to the region:

stable-ts "filename" --language Japanese -o audio.vtt --model large --segment_level true --word_level true --regroup "sg=.18"
no matter how low i drop sg it will never matter because the words themselves are supposedly right next to each other based on the output file so i don't know what to do. :/

jianfch · 2023-08-31T00:54:51Z

jianfch
Aug 31, 2023
Maintainer

If the gaps are detected (seen in the visualization) but there final timestamps ignores them, it is either because the gap will make the duration of a word lower than min_word_dur, or if the gap is between the predicted start and end of aword. The former can fixed by lowering min_word_dur (default is 0.1 seconds). The latter is a limitation of the model.

The end cutting off too early tends to occur more often if the audio is preprocessed by noise remover/voice isolator. Another cause is the silence detection (non-VAD/VAD) marked that part as silent before the word finished because end of the word is significantly quieter. The latter can be reduced by lower the threshold of the silence detection.

A punctuation is treated as its own word if it is not in prepend_puncuations and append_punctuations (the default covers most punctuations). Eg: if a period is in append_punctuations then duration of all periods that follows a word is ignored. Punctuations usually take up the duration of small gaps. Note that prepend_punctuations and append_punctuations do not follow the same syntax as regroup.

0 replies

mirix · 2023-08-31T06:54:19Z

mirix
Aug 31, 2023

These functions, prepend_punctuations and append_punctuations, are they documented somewhere?

0 replies

imnojabroni · 2023-08-31T14:34:53Z

imnojabroni
Aug 31, 2023
Author

https://github.com/linto-ai/whisper-timestamped

i've found more luck with this repository for my purposes as the gaps seem to be consistently accounted for (any other problem can be handled manually with a bit of work in python)

0 replies

CopaceticMeatbag · 2024-09-09T05:33:55Z

CopaceticMeatbag
Sep 9, 2024

Late reply to an old issue, but I got too frustrated with this killing my workflow. Tried Whisper-Timestamped too, but that one seems to love hallucinating and just repeating random blocks of big text in the middle of transcriptions, and has yet to be solved by the author.

Workaround

Preprocessing the audio data with ffmpeg (using the split_silence.py example from ffmpeg-python repo) to split based on extended silences, then feed the resulting chunks to Whisper.
My transcriptions were 1000% better without any of the weirdness that silences bring in.

For what it's worth, I spent more time than I care to say testing prepends and appends and different vad/ksize/qlevels/etc settings - all of these attempted adjustments failed to make any meaningful difference to the accuracy or in mitigating the errors with silences. Using visualise_suppression() confirmed the silences were being marked/'suppressed' appropriately, but seemingly with no effect on the end result.

I would highly recommend introducing an old low-level bit of preprocessing for any audio with gaps larger than ~3 seconds.
The only extra post-processing I had to do was merging the resulting transcriptions back together, padding each transcriptions timestamps with the offset of that chunk's start time. (e.g if audio has 5 seconds audio, 5 seconds silence, then 5 seconds audio; the 2nd transcription's timestamps will all be adjusted += 10 seconds to account for the chunk starttime of 10s into the original audio.

I can provide the code I ended up using if it's of help to anyone else.

3 replies

jianfch Sep 9, 2024
Maintainer

align() has a parameter this does exactly this but only for alignment.

stable-ts/stable_whisper/alignment.py

Lines 102 to 103 in 3bc76b9

    
               nonspeech_skip : float or None, default 5.0 
        
                   Skip non-speech sections that are equal or longer than this duration in seconds. Disable skipping if ``None``.

It was not added to transcribe() because it could dramatically slow down transcription. Since this fixes this issue, I'll add it on the next commit.

CopaceticMeatbag Sep 11, 2024

Thanks! I'm still trying a few different figures to get rid of all the early timestamp issues using the below as a baseline:

model = stable_whisper.load_model('large-v3') result = model.transcribe(input_file,regroup=False,language="en") aligned_result = model.align(input_file,result,nonspeech_skip=2.5,language='en')

Sometimes a word with a moderate 3-4 second silence preceeding it still timestamps too early - in my test audio it picks up a word "Hello" starting at 31,520. The ffmpeg chunked process with just the single transcribe() operation gets it ~exactly right @ 33,380.

The 31,520 result is still a ~1.5s improvement over the old result of 30,012 with neither chunking nor aligning applied. I have a feeling tuning the parameters will get me there, but for the moment the little preprocessing with ffmpeg takes ~1 second to execute and there's no tweaking of params necessary on the default transcribe(); instead I can move directly to my regrouping (which is awesome btw!).

jianfch Sep 11, 2024
Maintainer

Using nonspeech_skip directly in transcribe() should be more effective. Added in 888181f.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue where silence gaps aren't recognized properly sometimes #194

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 5 comments 4 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Issue where silence gaps aren't recognized properly sometimes #194

imnojabroni Aug 29, 2023

Replies: 5 comments · 4 replies

imnojabroni Aug 29, 2023 Author

imnojabroni Aug 30, 2023 Author

jianfch Aug 31, 2023 Maintainer

mirix Aug 31, 2023

imnojabroni Aug 31, 2023 Author

CopaceticMeatbag Sep 9, 2024

Workaround

jianfch Sep 9, 2024 Maintainer

CopaceticMeatbag Sep 11, 2024

jianfch Sep 11, 2024 Maintainer

imnojabroni
Aug 29, 2023

Replies: 5 comments 4 replies

imnojabroni
Aug 29, 2023
Author

imnojabroni Aug 30, 2023
Author

jianfch
Aug 31, 2023
Maintainer

mirix
Aug 31, 2023

imnojabroni
Aug 31, 2023
Author

CopaceticMeatbag
Sep 9, 2024

jianfch Sep 9, 2024
Maintainer

jianfch Sep 11, 2024
Maintainer