Implement long audio inference for k2 #42

JER-ry · 2024-12-06T17:07:59Z

ReazonSpeech/pkg/k2-asr/src/transcribe.py

Lines 28 to 34 in b6916fd

    
    if duration > TOO_LONG_SECONDS:
        warnings.warn(
            f"Passing a long audio input ({duration:.1f}s) is not recommended, "
            "because K2 will require a large amount of memory. "
            "Read the upstream discussion for more details: "
            "https://github.com/k2-fsa/icefall/issues/1680"
        )

Since this is implemented for espnet, it's probably worth also copying that here.

ReazonSpeech/pkg/espnet-asr/src/transcribe.py

Lines 59 to 77 in b6916fd

    
    while pos < len(audio.waveform):
        samples = audio.waveform[pos:]
        # If the audio data is very long, find out the longest
        # non-speech region and perform decoding up to that point.
        if len(samples) > window:
            blank = find_blank(model, samples[:window])
            mid = int((blank.start + blank.end) / 2)
            samples = samples[:mid]
        asr = model(np.pad(samples, PADDING, mode="constant"))[0][0]
        fulltext += asr
        for start, end, text in split_text(model, samples, asr):
            segments.append(Segment(
                start_seconds=((pos + start) / audio.samplerate),
                end_seconds=((pos + end) / audio.samplerate),
                text=text,
            ))

fujimotos · 2024-12-08T06:00:46Z

> Since this is implemented for espnet, it's probably worth also copying that here

Unfortunately, it's not that easy.

The find_blank() function depends heavily on ESPnet's API.
For this reason, we cannot simply copy it to reazonspeech.k2.asr.

For now, we recommend using an external VAD model (such as Silero VAD)
to segment a long audio sequence before processing it with our K2 model.
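
The VAD-based workaround could look roughly like the sketch below. It is not part of reazonspeech.k2.asr. It assumes the VAD step has already produced speech regions as (start_sample, end_sample) pairs (Silero VAD reports timestamps in samples by default, to my knowledge) and that `transcribe_fn` is a hypothetical callable wrapping the K2 model call. The names `group_regions`, `transcribe_long`, and the `max_seconds` default are made up for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# (start_sample, end_sample), end exclusive. Assumed to come from a VAD model.
Region = Tuple[int, int]


def group_regions(regions: Sequence[Region], max_samples: int) -> List[Region]:
    """Merge consecutive speech regions into chunks spanning at most max_samples.

    A single region longer than max_samples is kept as its own chunk,
    so the caller may still want to warn about or split such regions.
    """
    chunks: List[Region] = []
    for start, end in regions:
        if chunks and end - chunks[-1][0] <= max_samples:
            # Extend the current chunk; the silence in between is kept.
            chunks[-1] = (chunks[-1][0], end)
        else:
            chunks.append((start, end))
    return chunks


def transcribe_long(
    waveform: Sequence[float],
    samplerate: int,
    regions: Sequence[Region],
    transcribe_fn: Callable[[Sequence[float]], str],  # hypothetical K2 wrapper
    max_seconds: float = 30.0,  # illustrative limit, not the library's value
) -> List[Tuple[float, float, str]]:
    """Transcribe each chunk and return (start_seconds, end_seconds, text)."""
    results = []
    for start, end in group_regions(regions, int(max_seconds * samplerate)):
        text = transcribe_fn(waveform[start:end])
        results.append((start / samplerate, end / samplerate, text))
    return results
```

Grouping neighbouring regions keeps each chunk as long as the limit allows, which gives the model more context than decoding every short speech burst separately, while the memory use stays bounded by `max_seconds`.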
