Why perform speaker diarization at the end #1055
bofenghuang started this conversation in General
Replies: 0 comments
Hello @m-bain,
Thank you for this excellent project!
From what I understand, the current pipeline runs speaker diarization and STT separately and merges their results at the end based on timestamps. I'm wondering why we don't just replace VAD with speaker diarization and pass the per-speaker segments directly to Whisper (while still ensuring each segment is under 30 s). Is it because we want to keep speaker diarization optional, or have benchmarks shown that the current approach performs better?
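To make the comparison concrete, here is a minimal sketch of the two approaches as I understand them. It is not the project's actual code: `Turn`, `Segment`, `assign_speakers`, and `split_turns` are hypothetical names, and picking the speaker by largest temporal overlap is my assumption about what "merging based on timestamps" means.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Turn:
    """One diarization turn (hypothetical structure)."""
    start: float
    end: float
    speaker: str


@dataclass
class Segment:
    """One transcribed segment with timestamps (hypothetical structure)."""
    start: float
    end: float
    text: str
    speaker: Optional[str] = None


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: List[Segment], turns: List[Turn]) -> List[Segment]:
    """Merge at the end: label each transcribed segment with the speaker
    whose turns overlap it the most (assumed merge rule)."""
    for seg in segments:
        totals: Dict[str, float] = {}
        for turn in turns:
            ov = overlap(seg.start, seg.end, turn.start, turn.end)
            if ov > 0:
                totals[turn.speaker] = totals.get(turn.speaker, 0.0) + ov
        seg.speaker = max(totals, key=totals.get) if totals else None
    return segments


def split_turns(turns: List[Turn], max_len: float = 30.0) -> List[Turn]:
    """Diarize first: cut each speaker turn into chunks no longer than
    max_len seconds, so each chunk could be sent to the STT model."""
    chunks: List[Turn] = []
    for turn in turns:
        start = turn.start
        while turn.end - start > max_len:
            chunks.append(Turn(start, start + max_len, turn.speaker))
            start += max_len
        chunks.append(Turn(start, turn.end, turn.speaker))
    return chunks
```

With the first function, diarization stays an optional post-processing step. With the second, the STT input itself depends on the diarization output, and a long turn is cut at a fixed 30 s boundary that may fall in the middle of a word.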