Hello,
I am trying to understand how whisperX works. I read through the paper and looked at the code. As far as I can tell, the audio signal is first transcribed by Whisper and then run again(!) through wav2vec2 to get the word timestamps. There is code that runs the audio through wav2vec2 a second time, and then it uses the timestamps from wav2vec2; this page has a complete example.
This seems a bit odd to me, and I was hoping someone could explain it. Why are we using Whisper at all when we ultimately rely on wav2vec2 for the timestamps (and wav2vec2 can do transcription as well)?
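For reference, the two-stage pipeline I mean looks roughly like this (adapted from the README; exact function names and signatures may differ between versions):

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.mp3")

# Pass 1: Whisper produces the transcript (segment-level timestamps only)
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# Pass 2: a wav2vec2 alignment model is run over the same audio
# to attach word-level timestamps to Whisper's transcript
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)

print(result["segments"])  # segments now carry per-word start/end times
```

So the audio goes through two full forward passes, once per model, which is what prompted my question.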