Add skip for short audio chunks in OpenAIWhisperParser
Leonardo Diegues committed Feb 17, 2024
1 parent d7c26c8 commit 26cb778
Showing 1 changed file with 4 additions and 0 deletions.
@@ -52,11 +52,15 @@ def lazy_parse(self, blob: Blob) -> Iterator[Document]:
         # Need to meet 25MB size limit for Whisper API
         chunk_duration = 20
         chunk_duration_ms = chunk_duration * 60 * 1000
+        chunk_duration_threshold = 0.1

         # Split the audio into chunk_duration_ms chunks
         for split_number, i in enumerate(range(0, len(audio), chunk_duration_ms)):
             # Audio chunk
             chunk = audio[i : i + chunk_duration_ms]
+            # Skip chunks that are too short to transcribe
+            if chunk.duration_seconds <= chunk_duration_threshold:
+                continue
             file_obj = io.BytesIO(chunk.export(format="mp3").read())
             if blob.source is not None:
                 file_obj.name = blob.source + f"_part_{split_number}.mp3"
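
The effect of the new check can be illustrated without audio files. The following is a minimal sketch, not the parser's code: `iter_chunk_ranges` is a hypothetical helper, and it assumes a chunk's duration in seconds is simply its length in milliseconds divided by 1000, which approximates `chunk.duration_seconds`.

```python
from typing import Iterator, Tuple


def iter_chunk_ranges(
    total_ms: int,
    chunk_ms: int = 20 * 60 * 1000,
    min_seconds: float = 0.1,
) -> Iterator[Tuple[int, int, int]]:
    """Yield (split_number, start_ms, end_ms) for chunks long enough to send.

    Hypothetical helper for illustration; durations are derived from
    millisecond offsets rather than from a real audio segment.
    """
    for split_number, start in enumerate(range(0, total_ms, chunk_ms)):
        end = min(start + chunk_ms, total_ms)
        # Same comparison as the commit: skip chunks at or below the threshold
        if (end - start) / 1000 <= min_seconds:
            continue
        yield split_number, start, end


# 40 minutes plus 50 ms of audio: the trailing 50 ms chunk is skipped
ranges = list(iter_chunk_ranges(40 * 60 * 1000 + 50))
```

As in the commit, `split_number` comes from `enumerate` over every chunk, so a skipped chunk still consumes its index and the remaining part names keep their original numbering.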