Hello,
I am trying to understand how whisperX works. I read through the paper and looked at the code. As far as I can tell, the audio signal is first transcribed by Whisper and then run again(!) through wav2vec2 to get the word timestamps. There is code that runs the audio through wav2vec2 a second time, and then it uses the timestamps from wav2vec2; this page has a complete example.
This seems a bit odd to me, and I was hoping someone could explain it. Why are we using Whisper at all when we ultimately rely on wav2vec2 for the timestamps (and wav2vec2 can do transcription as well)?
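For reference, the two-stage pipeline I mean looks roughly like this (adapted from the README; exact function names and signatures may differ between versions):

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.mp3")

# Pass 1: Whisper produces the transcript (segment-level timestamps only)
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# Pass 2: a wav2vec2 alignment model is run over the same audio
# to attach word-level timestamps to Whisper's transcript
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)

print(result["segments"])  # segments now carry per-word start/end times
```

So the audio goes through two full forward passes, once per model, which is what prompted my question.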