Don't understand the warning when inputting an np array #28

Ca-ressemble-a-du-fake · 2022-11-04T06:03:25Z

Hi,

First, thanks for this piece of code! Every time I call transcribe on a modified model with an np.ndarray, I get this warning, which I don't understand:
A resampled input causes an unexplained temporal shift in waveform image that will skew the timestamp suppression and may result in inaccurate timestamps. Use audio_for_mask for transcribe() to provide the original audio track as the path or bytes of the audio file.
I tried passing bytes as the input instead, but then I get TypeError: expected np.ndarray (got bytes) from the original whisper.audio.

I don't understand because _load_audio_waveform accepts bytes.

Could you explain what I should do?
Thank you

Answered by jianfch

jianfch · 2022-11-04T17:07:36Z

Maintainer

Most audio isn't sampled at 16 kHz, so if you're passing a numpy array into transcribe, it has most likely been resampled to 16 kHz. However, a minor shift appears when _load_audio_waveform generates the waveform image from the resampled audio, which may affect the accuracy of the timestamp suppression.

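import whisper
from stable_whisper import _load_audio_waveform

audio_path = 'audio.mp3'
# whisper.load_audio decodes the file and resamples it to 16 kHz
audio_array = whisper.load_audio(audio_path)
# waveform image generated from the original file
original_waveform_img = _load_audio_waveform(audio_path, 100, 10000)
# waveform image generated from the already-resampled array
resampled_waveform_img = _load_audio_waveform(audio_array, 100, 10000)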

Portion of the images (image not included; original on top, resampled on the bottom):

The bottom (resampled) is slightly shifted to the right. I haven't extensively tested whether it affects the accuracy much.
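One way to put a number on such a shift is cross-correlation. The following is only a minimal sketch: it assumes each waveform image has already been reduced to a 1-D profile of equal length (for example, one value per column), which this thread does not describe, and estimate_shift is a hypothetical helper, not part of stable_whisper.

```python
import numpy as np

def estimate_shift(reference, shifted):
    """Return the lag (in samples) at which `shifted` best aligns with `reference`.

    A positive result means `shifted` is moved to the right relative to `reference`.
    Hypothetical helper for illustration; not part of stable_whisper.
    """
    reference = np.asarray(reference, dtype=float)
    shifted = np.asarray(shifted, dtype=float)
    # Full cross-correlation covers lags from -(len(reference) - 1) to len(shifted) - 1
    corr = np.correlate(shifted, reference, mode="full")
    # Index len(reference) - 1 of the output corresponds to a lag of zero
    return int(np.argmax(corr)) - (len(reference) - 1)
```

A result of 0 would mean the two profiles line up; a small positive value would match the rightward shift described above.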

So to provide transcribe with the original (non-resampled) audio:

model.transcribe(audio_array, audio_for_mask=audio_path)
# or
with open(audio_path, 'rb') as f:
    audio_bytes = f.read()
model.transcribe(audio_array, audio_for_mask=audio_bytes)

Since the shift is minor, it likely won't make much of a difference, so you can also just ignore warnings:

import warnings
# Note: this silences all warnings, not only the resampling one
warnings.filterwarnings('ignore')
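A narrower alternative, sketched here with only the standard library, is to filter on the start of the message text so that other warnings still show up. The prefix is copied from the warning quoted in the question; the warning category that stable_whisper uses isn't stated in this thread, so the sketch matches any category, and ignore_resample_warning is a made-up helper name.

```python
import re
import warnings

# Prefix copied from the warning text quoted in the question.
RESAMPLE_WARNING_PREFIX = "A resampled input causes an unexplained temporal shift"

def ignore_resample_warning():
    """Hypothetical helper: silence only warnings whose message starts with the prefix."""
    # `message` is a regular expression matched against the start of the warning text,
    # so the literal prefix is escaped before being passed in.
    warnings.filterwarnings("ignore", message=re.escape(RESAMPLE_WARNING_PREFIX))
```

Call ignore_resample_warning() once before model.transcribe(...); the filter stays in place for the rest of the process unless the warning filters are reset.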

1 reply

Ca-ressemble-a-du-fake Nov 4, 2022
Author

What a clear explanation! Thank you so much, I wasn't aware of this at all. Now I understand what to do 👍
