Because Whisper expects the mel input features to be of length 3000, the model throws exceptions on shorter inputs. This was a quick workaround I implemented, but there's probably a better way of doing things...
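For illustration, a minimal sketch of this kind of padding workaround, assuming log-mel features shaped `(batch, n_mels, frames)`; the helper name is hypothetical and not this repo's actual code:

```python
import torch
import torch.nn.functional as F

def pad_mel_features(mel: torch.Tensor, target_frames: int = 3000) -> torch.Tensor:
    """Right-pad (or truncate) mel features to the frame count Whisper expects.

    3000 frames corresponds to Whisper's fixed 30 s input window.
    """
    frames = mel.shape[-1]
    if frames < target_frames:
        # Zero-pad the time axis. Note this is only an approximation:
        # Whisper's own feature extractor pads the raw audio to 30 s
        # *before* computing log-mels rather than padding the mels.
        mel = F.pad(mel, (0, target_frames - frames))
    elif frames > target_frames:
        mel = mel[..., :target_frames]
    return mel
```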
Thank you for raising the issue.
Did you try transformers==4.30.2 (as in requirements.txt)?
For the quick start with 10 s input audio, we noticed the issue when using a more recent transformers version, but it should work on the older one.
However, for anyone wishing to train the model with variable input lengths larger than 30 s, this workaround of padding to 30 s can work, but I believe the positional embedding replacement code must then be commented out.
I will keep this issue open for people to reference. Thank you for pointing this out.
Ah, I was using Python 3.12, so I had to use a more recent transformers version. Thanks for the help.
Edit:
To support smaller audio files, the code here is not enough: the variable tmp_length (line 212 of the file) also needs to be set to 1500 (instead of self.get_feat_extract_output_lengths(len(x[0]))) to avoid a size mismatch between tensors in the encoder and decoder.
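For reference, the change described above amounts to something like the following; the surrounding context is assumed, and only the names `tmp_length` and `get_feat_extract_output_lengths` come from this thread:

```python
# Before (line 212 of the file in question):
#   tmp_length = self.get_feat_extract_output_lengths(len(x[0]))
#
# After: hard-code the encoder output length corresponding to 3000 padded
# mel frames. Whisper's encoder conv stack downsamples time by 2, so
# 3000 mel frames yield 1500 encoder states.
tmp_length = 1500
```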