I do see ending effects with OpenAI whisper, which uses PyTorch as a back-end. E.g. running:

```python
import whisper

model = whisper.load_model("turbo")

# load the audio and keep only the last 1,000,000 samples
# (~62.5 s at Whisper's 16 kHz sample rate) to focus on the ending
data = whisper.load_audio("conal.mp3")
data = data[-1_000_000:]

# transcribe; padding/trimming and the log-Mel spectrogram are handled internally
result = model.transcribe(data)
print(result["text"])
```

gives the following, where the last few sentences are just gibberish:
One thing that helps is setting `condition_on_previous_text=False`:

```python
import mlx_whisper

result = mlx_whisper.transcribe(
    "conal.mp3",
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
    condition_on_previous_text=False,
)
```

With that, the ending is much better. This is anecdotally in line with what I've heard before: conditioning on previous text can sometimes make the output more accurate, but it can also cause repetitions in the output.
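If you want to check this on your own files, here is a minimal A/B sketch under the same setup; the loop and the tail-slicing are my additions, and the filename and repo are just the ones from the snippet above:

```python
import mlx_whisper

# run the same file with and without conditioning on previously decoded text
for condition in (True, False):
    result = mlx_whisper.transcribe(
        "conal.mp3",
        path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
        condition_on_previous_text=condition,
    )
    tail = result["text"][-300:]  # last few sentences, where repetition tends to show up
    print(f"condition_on_previous_text={condition}:\n{tail}\n")
```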
---
My transcription of this audio file with this MLX conversion of Whisper-v3-turbo, run via mlx-whisper, ended inaccurately with the word "Yeah" repeated 27 times.
I wondered where this token repetition came from, and @awni suggested I start a discussion to "verify the result by running the same audio through the transformers PyTorch implementation as a test".
Here is the code I used:
Here is the tail end of the transcription, which concludes with "oh yeah":
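In case it's useful to others, here is a minimal sketch of the cross-check @awni suggested, using the Hugging Face transformers pipeline; the model id and the filename are assumptions, not taken from the original post:

```python
from transformers import pipeline

# run the same audio through the PyTorch implementation as a sanity check
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",  # assumed HF id for the turbo checkpoint
    chunk_length_s=30,  # chunked long-form decoding
)
# return_timestamps is needed for audio longer than 30 s
result = pipe("audio.mp3", return_timestamps=True)  # placeholder filename
print(result["text"][-300:])  # compare the tail against the mlx-whisper output
```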