Source code to create RixVox v2 based on the Swedish Parliament's (Riksdagen) media recordings and text protocols. The pipeline consists of modules to:
- locate speeches in audio based on text protocols.
- enhance accuracy of start/end times for the speeches with diarization.
- force-align the text protocol of each speech with its audio to get sentence- and word-level timestamps.
- create audio chunks ready for ASR training (chunks of up to 30 seconds).
- assess the quality of chunks/alignments via machine transcription of chunks and BLEU/WER scores.
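As a rough illustration of the quality check in the last step, word error rate (WER) can be computed as the word-level edit distance between the protocol text and the machine transcription, divided by the number of words in the protocol text. This is a minimal sketch: the function name `word_error_rate`, the lowercasing, and the whitespace tokenization are assumptions made for the example and are not taken from the repository.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: both texts are lowercased and split on whitespace.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # reference word missing from hypothesis
                    curr[j - 1] + 1,     # extra word in hypothesis
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

A low WER between the protocol text of a chunk and its machine transcription suggests the chunk boundaries and alignment are plausible; a high WER flags chunks worth inspecting or filtering.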
The dataset is derived from two primary sources.
- A hard drive of older recordings from 1966-2002 that KBLab received from Riksdagen. These recordings were previously digitized in collaboration with The National Library of Sweden. There are 6825 audio files, each about 3 to 5 hours in length. For this material we have no other metadata aside from the possible date(s) they were recorded.
- Riksdagen's Web TV with recordings from 2000-2024. The parliament has its own Web TV that uploads recordings of parliamentary sessions. The media files are accessible via an API at the endpoint `https://data.riksdagen.se/dokumentstatus/{dok_id}.json?utformat=json&utdata=debatt,media`, where `dok_id` is the document id of the debate. For reference, here is the debate with id `HA01KU20`.
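To make the endpoint pattern concrete, here is a minimal sketch that fills in `dok_id` and fetches the JSON with the Python standard library. The helper names `media_metadata_url` and `fetch_media_metadata` are hypothetical and not part of this repository, and the sketch makes no assumptions about the structure of the returned JSON.

```python
import json
from urllib.request import urlopen

# Endpoint pattern as given above; {dok_id} is the document id of the debate.
API_TEMPLATE = (
    "https://data.riksdagen.se/dokumentstatus/{dok_id}.json"
    "?utformat=json&utdata=debatt,media"
)


def media_metadata_url(dok_id: str) -> str:
    """Build the API URL for one debate (hypothetical helper)."""
    return API_TEMPLATE.format(dok_id=dok_id)


def fetch_media_metadata(dok_id: str, timeout: float = 30.0) -> dict:
    """Download and parse the JSON document for one debate.

    Requires network access; the response is returned as parsed JSON
    without further interpretation.
    """
    with urlopen(media_metadata_url(dok_id), timeout=timeout) as response:
        return json.load(response)
```

For example, `media_metadata_url("HA01KU20")` gives the URL for the debate referenced above.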
Download the text protocols and metadata about speakers/persons from SWERIK's repo. This project used v.1.0.0 of the Swedish Parliament Corpus.
scripts/riksdagen_old/riksdagen_corpus.py
- Download the text protocols of speeches from Riksdagens open data: `bash scripts/utils/download_modern_speeches.sh`.
- Preprocess the text protocols: `python scripts/riksdagen_web/preprocess_speech_metadata.py`.
- Download metadata about media recordings based on the document ids of text protocols: `python scripts/riksdagen_web/download_audio_metadata.py`, and join together this information with text protocol metadata.
- Download the media files: `python scripts/riksdagen_web/download_audio.py`.
- Perform fuzzy string matching between wav2vec2 machine transcription and text protocols to determine approximate start/end timestamp of each speech: `python scripts/riksdagen_web/fuzzy_matcher.py`.
- Perform diarization on audio files to obtain more accurate start/end timestamp of each speech via speaker segments: `python scripts/diarization_pyannote.py`.
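The idea behind the fuzzy string matching step above can be sketched as follows: slide a window over the timestamped machine transcription, score each window against the protocol text, and take the timestamps of the best-scoring window as the approximate start and end of the speech. This is only an illustration under stated assumptions. It uses `difflib.SequenceMatcher` from the standard library, which is not necessarily what `fuzzy_matcher.py` uses, and the function name `locate_speech` and the `(word, start, end)` input format are hypothetical.

```python
from difflib import SequenceMatcher


def locate_speech(asr_words, speech_text):
    """Return (start_sec, end_sec, score) of the best fuzzy match.

    asr_words: list of (word, start_sec, end_sec) tuples from a machine
        transcription of the full recording (assumed input format).
    speech_text: the protocol text of one speech.
    """
    target = speech_text.lower().split()
    tokens = [word.lower() for word, _, _ in asr_words]
    if not target or not tokens:
        raise ValueError("both the transcription and the speech must be non-empty")

    # Compare windows that contain as many words as the speech itself.
    n = min(len(target), len(tokens))
    target_str = " ".join(target)

    best_score, best_i = -1.0, 0
    for i in range(len(tokens) - n + 1):
        window = " ".join(tokens[i:i + n])
        score = SequenceMatcher(None, window, target_str).ratio()
        if score > best_score:
            best_score, best_i = score, i

    start = asr_words[best_i][1]
    end = asr_words[best_i + n - 1][2]
    return start, end, best_score
```

Scoring every window against the full speech is slow for recordings that are hours long; the sketch favors clarity over speed. The resulting timestamps are approximate, which is why the diarization step is used to refine them.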
scripts/diarization_preprocess.py
scripts/riksdagen_web/diarization_matcher.py
scripts/riksdagen_web/dataset_to_json.py
scripts/alignment_probs_writer.py
scripts/align_transcript_pytorch.py
scripts/create_chunks.py
scripts/lang_detect_whisper.py
scripts/transcribe_wav2vec2.py
scripts/transcribe_whisper.py
scripts/json_to_parquet.py
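
For the chunking step, one simple way to build chunks of up to 30 seconds from sentence-level alignments is to group consecutive sentences greedily until adding the next one would exceed the limit. The sketch below illustrates that idea only. It is not a description of what `scripts/create_chunks.py` does, and the function name `make_chunks` and the `start`/`end`/`text` dictionary keys are assumptions made for the example.

```python
def make_chunks(sentences, max_duration=30.0):
    """Greedily group aligned sentences into chunks of at most max_duration seconds.

    sentences: list of dicts with 'start', 'end' (seconds) and 'text',
        sorted by start time (assumed input format).
    Sentences that are longer than max_duration on their own are skipped.
    """
    groups = []
    current = []
    for sentence in sentences:
        if sentence["end"] - sentence["start"] > max_duration:
            # Too long to fit in any chunk: close the open chunk and skip it.
            if current:
                groups.append(current)
                current = []
            continue
        if current and sentence["end"] - current[0]["start"] > max_duration:
            # Adding this sentence would exceed the limit: start a new chunk.
            groups.append(current)
            current = []
        current.append(sentence)
    if current:
        groups.append(current)

    return [
        {
            "start": group[0]["start"],
            "end": group[-1]["end"],
            "text": " ".join(s["text"] for s in group),
        }
        for group in groups
    ]
```

Each returned chunk carries the start time of its first sentence and the end time of its last, so the corresponding audio segment can be cut from the recording and paired with the chunk text.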