GitHub - kb-labb/rixvox: Pipeline for creating an automatic speech recognition dataset from the Riksdag's recordings and transcripts

RixVox v2: an automatic speech recognition dataset for Swedish

Source code to create RixVox v2 based on the Swedish Parliament's (Riksdagen) media recordings and text protocols. The pipeline consists of modules to:

locate speeches in audio based on text protocols.
enhance accuracy of start/end times for the speeches with diarization.
force-align the text protocol of each speech with its audio to get sentence- and word-level timestamps.
create audio chunks ready for ASR training (chunks of up to 30 seconds).
assess the quality of chunks/alignments via machine transcription of chunks and BLEU/WER scores.
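
The last two steps above can be illustrated with a minimal sketch. It is not the repository's implementation: the `Segment` class, `make_chunks`, and `wer` are hypothetical names, the grouping strategy is a simple greedy merge, and the WER is a plain word-level edit distance without any text normalization.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio
    text: str


def make_chunks(sentences, max_len=30.0):
    """Greedily merge consecutive aligned sentences into chunks of at most max_len seconds.

    A single sentence longer than max_len is kept as its own (over-long) chunk.
    """
    groups, current = [], []
    for sent in sentences:
        # Flush the current group if adding this sentence would exceed the limit.
        if current and sent.end - current[0].start > max_len:
            groups.append(current)
            current = []
        current.append(sent)
    if current:
        groups.append(current)
    return [
        Segment(g[0].start, g[-1].end, " ".join(s.text for s in g))
        for g in groups
    ]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)
```

A chunk whose machine transcription has a high WER against the protocol text is a candidate for filtering; the threshold to use is a design decision this sketch does not make.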

Source data

The dataset is derived from two primary sources.

A hard drive of older recordings from 1966-2002 that KBLab received from Riksdagen. These recordings were previously digitized in collaboration with the National Library of Sweden. There are 6825 audio files, each about 3 to 5 hours in length. For this material we have no metadata other than the possible date(s) of recording.
Riksdagen's Web TV with recordings from 2000-2024. The parliament has its own Web TV that uploads recordings of parliamentary sessions. The media files are accessible via an API at the endpoint https://data.riksdagen.se/dokumentstatus/{dok_id}.json?utformat=json&utdata=debatt,media, where dok_id is the document id of the debate. For reference, here is the debate with id HA01KU20.
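
As a rough illustration of the endpoint above, the following sketch builds the URL for a given dok_id and fetches the JSON using only the Python standard library. The function names are hypothetical and are not taken from the repository's scripts, and nothing is assumed here about the structure of the returned JSON.

```python
import json
from urllib.request import urlopen

# Endpoint template as given above; {dok_id} is the document id of the debate.
API_TEMPLATE = (
    "https://data.riksdagen.se/dokumentstatus/{dok_id}.json"
    "?utformat=json&utdata=debatt,media"
)


def media_metadata_url(dok_id):
    """Return the API URL for the debate with the given document id."""
    return API_TEMPLATE.format(dok_id=dok_id)


def fetch_media_metadata(dok_id, timeout=30.0):
    """Download and parse the JSON metadata for one debate (requires network access)."""
    with urlopen(media_metadata_url(dok_id), timeout=timeout) as resp:
        return json.load(resp)


# Example usage (not executed here because it needs network access):
#   data = fetch_media_metadata("HA01KU20")
#   print(type(data))
```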

Instructions

Riksdagen old recordings 1966-2002

Download the text protocols and metadata about speakers/persons from SWERIK's repo. This project used v.1.0.0 of the Swedish Parliament Corpus.

scripts/riksdagen_old/riksdagen_corpus.py

Riksdagen web 2000-2024

Download the text protocols of speeches from Riksdagen's open data: bash scripts/utils/download_modern_speeches.sh.
Preprocess the text protocols: python scripts/riksdagen_web/preprocess_speech_metadata.py.
Download metadata about media recordings based on the document ids of the text protocols, and join this information with the text protocol metadata: python scripts/riksdagen_web/download_audio_metadata.py.
Download the media files: python scripts/riksdagen_web/download_audio.py.
Perform fuzzy string matching between the wav2vec2 machine transcriptions and the text protocols to determine the approximate start/end timestamps of each speech: python scripts/riksdagen_web/fuzzy_matcher.py.
Perform diarization on the audio files to obtain more accurate start/end timestamps of each speech via speaker segments: python scripts/diarization_pyannote.py.
scripts/diarization_preprocess.py
scripts/riksdagen_web/diarization_matcher.py
scripts/riksdagen_web/dataset_to_json.py
scripts/alignment_probs_writer.py
scripts/align_transcript_pytorch.py
scripts/create_chunks.py
scripts/lang_detect_whisper.py
scripts/transcribe_wav2vec2.py
scripts/transcribe_whisper.py
scripts/json_to_parquet.py
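
To make the fuzzy matching step in the list above more concrete, here is a minimal sketch of the general idea: slide a window over the timestamped machine transcription and pick the position that best matches the opening words of a speech from the protocol. It uses `difflib.SequenceMatcher` from the standard library as a stand-in; the matching library and scoring used by fuzzy_matcher.py are not shown in this README, and `find_speech_start` and its parameters are hypothetical.

```python
from difflib import SequenceMatcher


def find_speech_start(asr_words, speech_text, n_words=8):
    """Estimate where a speech begins in a machine transcription.

    asr_words:   list of (word, start_time_in_seconds) from the ASR output.
    speech_text: text of the speech from the protocol.
    n_words:     number of opening words of the speech to match against.

    Returns (start_time, similarity) for the best window, or None if the
    transcription is shorter than the query.
    """
    query_words = speech_text.lower().split()[:n_words]
    query = " ".join(query_words)
    size = len(query_words)

    best_score, best_idx = 0.0, None
    for i in range(len(asr_words) - size + 1):
        window = " ".join(w for w, _ in asr_words[i:i + size]).lower()
        # Character-level similarity tolerates small ASR spelling errors.
        score = SequenceMatcher(None, query, window).ratio()
        if score > best_score:
            best_score, best_idx = score, i

    if best_idx is None:
        return None
    return asr_words[best_idx][1], best_score
```

The same search run on the closing words of the speech would give an approximate end timestamp, which the diarization step can then refine.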

