The MIT License (MIT)
Copyright (c) 2019-2020 CNRS
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
AUTHORS Hervé Bredin - http://herve.niderb.fr
In this tutorial, you will learn how to set up your own audio dataset for use with pyannote.audio.
For illustration purposes, we will use the freely available AMI corpus:
@article{Carletta2007,
Title = {{Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus}},
Author = {Carletta, Jean},
Journal = {Language Resources and Evaluation},
Volume = {41},
Number = {2},
Year = {2007},
}
Start by cloning the pyannote.audio repository:
$ git clone https://github.com/pyannote/pyannote-audio.git
$ cd pyannote-audio
$ export TUTORIAL_DIR="$PWD/tutorials/data_preparation"
Download the Headset mix subset of AMI. For convenience, we provide a script that does it for you:
$ export DOWNLOAD_TO=/path/to/where/you/want/to/download/ami/database
$ mkdir -p ${DOWNLOAD_TO}
$ bash ${TUTORIAL_DIR}/download_ami.sh ${DOWNLOAD_TO}
This script also fixes some of the files from the dataset that are unreadable with scipy because of wrongly formatted wav chunks. The audio files are therefore not exactly the same as the original ones. You should end up with a collection of wav files in the ${DOWNLOAD_TO}/amicorpus directory:
$ cd ${DOWNLOAD_TO}/amicorpus
$ find . | grep wav | sort | head -n 5
./EN2001a/audio/EN2001a.Mix-Headset.wav
./EN2001b/audio/EN2001b.Mix-Headset.wav
./EN2001d/audio/EN2001d.Mix-Headset.wav
./EN2001e/audio/EN2001e.Mix-Headset.wav
./EN2002a/audio/EN2002a.Mix-Headset.wav
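Since the script re-encodes some files, you may want to double-check that they now load fine with scipy. Here is a minimal sanity check, assuming you are still in ${DOWNLOAD_TO}/amicorpus and have scipy installed:
from scipy.io import wavfile
# scipy.io.wavfile is the reader that chokes on wrongly formatted wav chunks
sample_rate, waveform = wavfile.read('EN2001a/audio/EN2001a.Mix-Headset.wav')
print(sample_rate, waveform.shape)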
Most pyannote.audio tutorials rely on noise extracted from the MUSAN corpus for data augmentation. For convenience, we also provide a script that downloads it for you:
$ bash ${TUTORIAL_DIR}/download_musan.sh ${DOWNLOAD_TO}
You should end up with a collection of wav files in the ${DOWNLOAD_TO}/musan directory:
$ cd ${DOWNLOAD_TO}/musan
$ find . | grep wav | sort | head -n 5
./music/fma/music-fma-0000.wav
./music/fma/music-fma-0001.wav
./music/fma/music-fma-0002.wav
./music/fma/music-fma-0003.wav
./music/fma/music-fma-0004.wav
pyannote.audio relies on pyannote.database, which itself relies on a configuration file indicating where files are located. For convenience, we also provide such a file; it needs to be copied to the root of the directory that contains the datasets.
$ cp ${TUTORIAL_DIR}/database.yml ${DOWNLOAD_TO}
See the pyannote.database documentation for other possible locations of database.yml.
To be of any use for training, your dataset needs to come with annotations, in the form of RTTM and UEM files.
"Who speaks when" annotations should be provided using the RTTM file format. Each line of this file must follow the convention below:
SPEAKER {uri} 1 {start} {duration} <NA> <NA> {identifier} <NA> <NA>
where
- {uri} stands for "unique resource identifier" (think of it as the filename),
- {start} is the start time of the speech turn (elapsed time since the beginning of the file, in seconds),
- {duration} is its duration (in seconds),
- {identifier} is the unique speaker identifier.
For convenience, we also provide RTTM reference files for the AMI dataset. Here is what it looks like for the training subset:
$ head -n 10 ${TUTORIAL_DIR}/AMI/MixHeadset.train.rttm
SPEAKER ES2002b.Mix-Headset 1 14.4270 1.3310 <NA> <NA> FEE005 <NA> <NA>
SPEAKER ES2002b.Mix-Headset 1 16.9420 1.0360 <NA> <NA> FEE005 <NA> <NA>
SPEAKER ES2002b.Mix-Headset 1 19.4630 1.0600 <NA> <NA> FEE005 <NA> <NA>
SPEAKER ES2002b.Mix-Headset 1 23.0420 41.1690 <NA> <NA> FEE005 <NA> <NA>
SPEAKER ES2002b.Mix-Headset 1 27.3530 1.2570 <NA> <NA> MEE008 <NA> <NA>
SPEAKER ES2002b.Mix-Headset 1 30.4780 0.4810 <NA> <NA> MEE007 <NA> <NA>
SPEAKER ES2002b.Mix-Headset 1 30.4970 3.6660 <NA> <NA> MEE008 <NA> <NA>
SPEAKER ES2002b.Mix-Headset 1 36.4310 2.5000 <NA> <NA> MEE008 <NA> <NA>
SPEAKER ES2002b.Mix-Headset 1 100.5660 49.9540 <NA> <NA> FEE005 <NA> <NA>
SPEAKER ES2002b.Mix-Headset 1 116.6600 0.4940 <NA> <NA> MEE008 <NA> <NA>
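Should you want to inspect these annotations programmatically, pyannote.database provides a load_rttm helper that parses an RTTM file into pyannote.core Annotation objects. A quick sketch, assuming you run it from ${TUTORIAL_DIR}:
from pyannote.database.util import load_rttm
# load_rttm returns a {uri: Annotation} dictionary, one entry per file
annotations = load_rttm('AMI/MixHeadset.train.rttm')
reference = annotations['ES2002b.Mix-Headset']
# iterate over speech turns and their speaker labels
for segment, _, speaker in reference.itertracks(yield_label=True):
    print(f'{speaker} speaks from {segment.start:.3f}s to {segment.end:.3f}s')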
Annotated regions should be provided using the UEM file format, which tells pyannote-audio which parts of each file were actually annotated.
- If you do not provide this file, pyannote-audio assumes that the whole file was annotated; everything outside of a speech turn is therefore considered non-speech.
- If you do provide this file, pyannote-audio will only consider as non-speech those regions that lie within the limits defined in the UEM file.
Each line of this file must follow the convention below:
{uri} 1 {start} {end}
Here is what it looks like for the AMI training subset:
$ head -n 10 ${TUTORIAL_DIR}/AMI/MixHeadset.train.uem
EN2001a.Mix-Headset 1 0.000 5250.240063
EN2001b.Mix-Headset 1 0.000 3451.882687
EN2001d.Mix-Headset 1 0.000 3546.378688
EN2001e.Mix-Headset 1 0.000 4069.738688
EN2003a.Mix-Headset 1 0.000 2240.277375
EN2004a.Mix-Headset 1 0.000 3445.674687
EN2005a.Mix-Headset 1 0.000 5415.234688
EN2006a.Mix-Headset 1 0.000 3525.493375
EN2006b.Mix-Headset 1 0.000 3052.258688
EN2009b.Mix-Headset 1 0.000 2474.282688
We therefore recommend that you always provide a UEM file (even if it covers the whole duration of each file).
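Similarly, pyannote.database provides a load_uem helper that parses a UEM file into pyannote.core Timeline objects, in case you want to check annotated regions programmatically. A quick sketch, again assuming you run it from ${TUTORIAL_DIR}:
from pyannote.database.util import load_uem
# load_uem returns a {uri: Timeline} dictionary of annotated regions
annotated = load_uem('AMI/MixHeadset.train.uem')
# total annotated duration of one file, in seconds
print(annotated['EN2001a.Mix-Headset'].duration())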
pyannote.database relies on the same configuration file to define the train/development/test splits. Once again, we have prepared this for you, following the official split defined here.
All you have to do is copy the annotation files to the root of the directory that contains the datasets.
# copy the AMI annotation files next to the datasets
$ cp -r ${TUTORIAL_DIR}/AMI ${DOWNLOAD_TO}
It should look like this:
# have a quick look at what it looks like
$ cat ${DOWNLOAD_TO}/database.yml
Databases:
  AMI: ./amicorpus/*/audio/{uri}.wav
  MUSAN: ./musan/{uri}.wav

Protocols:
  AMI:
    SpeakerDiarization:
      MixHeadset:
        train:
          uri: ./AMI/MixHeadset.train.lst
          annotation: ./AMI/MixHeadset.train.rttm
          annotated: ./AMI/MixHeadset.train.uem
        development:
          uri: ./AMI/MixHeadset.development.lst
          annotation: ./AMI/MixHeadset.development.rttm
          annotated: ./AMI/MixHeadset.development.uem
        test:
          uri: ./AMI/MixHeadset.test.lst
          annotation: ./AMI/MixHeadset.test.rttm
          annotated: ./AMI/MixHeadset.test.uem
  MUSAN:
    Collection:
      BackgroundNoise:
        uri: ./MUSAN/background_noise.txt
      Noise:
        uri: ./MUSAN/noise.txt
      Music:
        uri: ./MUSAN/music.txt
      Speech:
        uri: ./MUSAN/speech.txt
You might want to have a look at the pyannote.database documentation to learn more about other features provided by pyannote.database and the syntax of its configuration file.
The final step is to tell pyannote.database about the location of the configuration file:
# tell pyannote.database about the location of the configuration file
$ export PYANNOTE_DATABASE_CONFIG=${DOWNLOAD_TO}/database.yml
Congratulations: you have just defined a new experimental protocol called AMI.SpeakerDiarization.MixHeadset and a bunch of MUSAN collections that will be used in other tutorials.
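To make sure everything is wired correctly, you can load the new protocol from Python with pyannote.database's get_protocol, using a FileFinder preprocessor to resolve the {uri} placeholders of database.yml into actual audio paths. A quick sanity check:
from pyannote.database import get_protocol, FileFinder
# FileFinder resolves {uri} placeholders of database.yml into audio file paths
protocol = get_protocol('AMI.SpeakerDiarization.MixHeadset',
                        preprocessors={'audio': FileFinder()})
# each file comes with 'uri', 'audio', 'annotation' (RTTM) and 'annotated' (UEM) fields
first_file = next(iter(protocol.train()))
print(first_file['uri'], first_file['annotated'].duration())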
That's all folks!