
Welcome to the speaker-diarization wiki!

The objective of this wiki is to gather information about Speaker Diarization.

What is Speaker Diarization?

Speaker diarization consists of segmenting and clustering a speech recording into speaker-homogeneous regions, so that, given an audio track of a meeting, the system automatically discriminates and labels the different speakers (“who spoke when?”). This entails speech/non-speech detection (“when is there speech?”) and overlap detection and resolution (“who is overlapping with whom?”).
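A minimal sketch of this segment-and-cluster pipeline is shown below, assuming librosa and scikit-learn are available and the number of speakers is known in advance. The `diarize` helper, the one-second windowing, and the energy-based speech detection are illustrative simplifications, not a reference implementation; real systems use richer features and model-based segmentation.

```python
import numpy as np
import librosa
from sklearn.cluster import AgglomerativeClustering

def diarize(path, n_speakers=2, win_s=1.0, sr=16000):
    y, sr = librosa.load(path, sr=sr)
    hop = int(win_s * sr)                       # one analysis window per second
    # Speech/non-speech detection via a crude energy threshold.
    rms = librosa.feature.rms(y=y, frame_length=hop, hop_length=hop)[0]
    speech = rms > 0.5 * rms.mean()
    # One MFCC vector per window serves as the segment representation.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    n = min(mfcc.shape[1], len(speech))
    feats = mfcc[:, :n].T[speech[:n]]
    # Cluster the speech windows into speaker-homogeneous groups.
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(feats)
    # Return (start_time_in_seconds, speaker_id) for each speech window.
    starts = np.arange(n)[speech[:n]] * win_s
    return list(zip(starts.tolist(), labels.tolist()))
```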

The goal

The goal of Speaker Diarization is to segment audio, without supervision, into speaker-homogeneous regions, thus answering the question “who spoke when?”. Knowing when each speaker is talking in a recording is a useful processing step for many tasks.

Usage

Speaker diarization has been used for copyright detection, video navigation and retrieval, and specific branches of automatic behavior analysis. In the field of rich transcription, speaker diarization is used both as a stand-alone application that attributes speaker regions to an audio or video file and as a preprocessing task for speech recognition. As a preprocessing step it enables speaker-attributed speech-to-text and allows for different modes of adaptation, e.g., vocal tract length normalization (VTLN) and speaker model adaptation.
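As a hedged illustration of the preprocessing role, the snippet below attaches a speaker label to each time-stamped word from a recognizer; the `segments`, `words`, and `attribute` names are hypothetical placeholders, standing in for real diarization output and ASR word timings.

```python
segments = [(0.0, 4.2, "spk1"), (4.2, 9.0, "spk2")]           # (start, end, speaker)
words = [(0.5, "hello"), (1.1, "everyone"), (5.0, "thanks")]  # (time, word) from ASR

def attribute(words, segments):
    # Label each word with the speaker whose segment contains its timestamp.
    out = []
    for t, w in words:
        spk = next((s for a, b, s in segments if a <= t < b), "unknown")
        out.append((spk, w))
    return out

print(attribute(words, segments))
# [('spk1', 'hello'), ('spk1', 'everyone'), ('spk2', 'thanks')]
```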
