Because Whisper expects the mel input features to be of length 3000, the model throws exceptions on shorter inputs. This was a quick workaround I implemented, but there's probably a better way of doing things...
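For illustration, a minimal sketch of this kind of padding workaround, assuming log-mel features shaped `(batch, n_mels, frames)`; the helper name is hypothetical and not this repo's actual code:

```python
import torch
import torch.nn.functional as F

def pad_mel_features(mel: torch.Tensor, target_frames: int = 3000) -> torch.Tensor:
    """Right-pad (or truncate) mel features to the frame count Whisper expects.

    3000 frames corresponds to Whisper's fixed 30 s input window.
    """
    frames = mel.shape[-1]
    if frames < target_frames:
        # Zero-pad the time axis. Note this is only an approximation:
        # Whisper's own feature extractor pads the raw audio to 30 s
        # *before* computing log-mels rather than padding the mels.
        mel = F.pad(mel, (0, target_frames - frames))
    elif frames > target_frames:
        mel = mel[..., :target_frames]
    return mel
```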
Thank you for raising the issue.
Did you try transformers==4.30.2 (as in requirements.txt)?
For the quick start with 10 s input audio, we noticed the issue when using a more recent transformers version, but it should work on the older one.
However, for anyone wishing to train the model with variable input lengths larger than 30 s, this workaround of padding to 30 s can work, but I believe the positional embedding replacement code must then be commented out.
I will keep this issue open for people to reference. Thank you for pointing this out.
Ah, I was using Python 3.12, so I had to use a more recent transformers version. Thanks for the help.
Edit:
To support smaller audio files, the code here is not enough: the variable tmp_length (line 212 of the file) also needs to be set to 1500 (instead of self.get_feat_extract_output_lengths(len(x[0]))) to avoid a size mismatch between tensors in the encoder and decoder.
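For reference, the change described above amounts to something like the following; the surrounding context is assumed, and only the names `tmp_length` and `get_feat_extract_output_lengths` come from this thread:

```python
# Before (line 212 of the file in question):
#   tmp_length = self.get_feat_extract_output_lengths(len(x[0]))
#
# After: hard-code the encoder output length corresponding to 3000 padded
# mel frames. Whisper's encoder conv stack downsamples time by 2, so
# 3000 mel frames yield 1500 encoder states.
tmp_length = 1500
```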