Speech Command Recognition: Text-to-Speech and Speech Corpus Scraping Are All You Need
Presented at the 3rd International Conference on Robotics, Automation, and Artificial Intelligence (RAAI 2023)
In this project, we used Piper to generate synthetic speech commands. Piper is a fast, local neural text-to-speech system that provides five voices for the Kazakh language. A list of models available for other languages, along with the corresponding demos, is provided by the Piper project. To generate synthetic speech commands for Kazakh, download and unzip the model from Google Drive. Then open the synthetic_data_generation.ipynb notebook, update the path to the model, and run all cells.
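For readers who want to see the shape of the generation step outside the notebook, here is a minimal sketch that calls the Piper command-line tool once per command and per voice. It is not the notebook's code. It assumes the `piper` executable is on your PATH and uses Piper's `--model` and `--output_file` options with the text passed on stdin. The helper names (`build_piper_command`, `plan_outputs`, `synthesize_all`) and the one-folder-per-command output layout are hypothetical choices made for illustration.

```python
import subprocess
from pathlib import Path


def build_piper_command(model_path, output_path, piper_bin="piper"):
    # Piper reads the text from stdin and writes a WAV file to --output_file.
    return [piper_bin, "--model", str(model_path), "--output_file", str(output_path)]


def plan_outputs(commands, model_paths, out_dir):
    # Assumed layout: one folder per command, one WAV file per voice model.
    jobs = []
    for text in commands:
        for model_path in model_paths:
            voice = Path(model_path).stem
            wav_path = Path(out_dir) / text / f"{voice}.wav"
            jobs.append((text, str(model_path), wav_path))
    return jobs


def synthesize_all(commands, model_paths, out_dir, piper_bin="piper"):
    # Runs Piper once for every (command, voice) pair.
    for text, model_path, wav_path in plan_outputs(commands, model_paths, out_dir):
        wav_path.parent.mkdir(parents=True, exist_ok=True)
        cmd = build_piper_command(model_path, wav_path, piper_bin)
        subprocess.run(cmd, input=text.encode("utf-8"), check=True)
```

A call such as `synthesize_all(["алға"], ["/path/to/model.onnx"], "synthetic")` would write `synthetic/алға/model.wav`. The command word and the model path here are placeholders, not values taken from the project.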
To automatically extract speech commands from a large-scale speech corpus, we used the Vosk Speech Recognition Toolkit. Example code is given in the speech_corpus_scraping.ipynb notebook.
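The idea behind corpus scraping can be sketched as follows: run a recognizer with word-level timestamps over each recording, then keep the time spans where a target command word was recognized. The sketch below is an illustration under assumptions, not the notebook's implementation. It uses the Vosk Python API (`Model`, `KaldiRecognizer`, `SetWords`, `AcceptWaveform`, `Result`, `FinalResult`) and expects 16-bit mono WAV input. The function names and the `padding` and `min_conf` parameters are hypothetical.

```python
import json
import wave


def find_keyword_segments(words, keywords, padding=0.1, min_conf=0.8):
    # `words` is a list of dicts with "word", "start", "end", and "conf" keys,
    # the per-word format Vosk returns when word timestamps are enabled.
    targets = set(keywords)
    segments = []
    for w in words:
        if w["word"] in targets and w.get("conf", 1.0) >= min_conf:
            start = max(0.0, w["start"] - padding)  # pad, but never before 0 s
            segments.append((w["word"], start, w["end"] + padding))
    return segments


def recognize_words(wav_path, model_dir):
    # Imported lazily so the helper above works without Vosk installed.
    from vosk import Model, KaldiRecognizer

    words = []
    with wave.open(wav_path, "rb") as wf:
        rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
        rec.SetWords(True)  # request word-level timestamps
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            if rec.AcceptWaveform(data):
                words.extend(json.loads(rec.Result()).get("result", []))
        words.extend(json.loads(rec.FinalResult()).get("result", []))
    return words
```

The segments returned by `find_keyword_segments(recognize_words(path, model_dir), commands)` are start and end times in seconds, which can then be used to cut the matching audio out of the source recording.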
To increase the dataset size further, you can apply audio augmentation methods to both the synthetic dataset and the dataset scraped from the speech corpus. The details can be found in the data_augmentation.ipynb notebook.
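As a rough illustration of what audio augmentation can look like, the sketch below implements three common waveform-level transforms: time shifting, additive white noise at a target signal-to-noise ratio, and gain change. These are generic examples and are not necessarily the methods used in the notebook. To stay dependency-free, the code works on plain Python lists of float samples in the range [-1, 1]. In practice you would more likely use NumPy or a dedicated audio library. All function names are hypothetical.

```python
import math
import random


def time_shift(samples, shift):
    # Shift the clip right (shift > 0) or left (shift < 0), padding with
    # silence so the output has the same length as the input.
    n = len(samples)
    k = min(abs(shift), n)
    silence = [0.0] * k
    if shift >= 0:
        return silence + samples[: n - k]
    return samples[k:] + silence


def add_noise(samples, snr_db, rng=None):
    # Add white Gaussian noise scaled so the SNR is approximately snr_db.
    if not samples:
        return []
    rng = rng or random.Random()
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_std = math.sqrt(signal_power / (10 ** (snr_db / 10)))
    return [s + rng.gauss(0.0, noise_std) for s in samples]


def change_gain(samples, gain_db):
    # Scale the amplitude by gain_db decibels and clip to [-1, 1].
    factor = 10 ** (gain_db / 20)
    return [max(-1.0, min(1.0, s * factor)) for s in samples]
```

Each function returns a new list, so several augmented variants can be derived from one clip, for example `add_noise(time_shift(clip, 800), 15)`.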
The details of training, validation, and testing of the model can be found in the Keyword-MLP directory.
Video tutorials for each step of the project are available on our YouTube channel.
@article{Kuzdeuov2023,
author = "Askat Kuzdeuov and Shakhizat Nurgaliyev and Diana Turmakhan and Nurkhan Laiyk and Huseyin Atakan Varol",
title = "{Speech Command Recognition: Text-to-Speech and Speech Corpus Scraping Are All You Need}",
year = "2023",
month = "5",
url = "https://www.techrxiv.org/articles/preprint/Speech_Command_Recognition_Text-to-Speech_and_Speech_Corpus_Scraping_Are_All_You_Need/22717657",
doi = "10.36227/techrxiv.22717657.v1"
}