This repository provides a dataset and a text-to-speech (TTS) model for the paper *KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis*.
| Emotion | # recordings | F1 Total (h) | F1 Mean (s) | F1 Min (s) | F1 Max (s) | M1 Total (h) | M1 Mean (s) | M1 Min (s) | M1 Max (s) | M2 Total (h) | M2 Mean (s) | M2 Min (s) | M2 Max (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| neutral | 9,385 | 5.85 | 5.03 | 1.03 | 15.51 | 4.54 | 4.77 | 0.84 | 16.18 | 2.30 | 4.69 | 1.02 | 15.81 |
| angry | 9,059 | 5.44 | 4.78 | 1.11 | 14.09 | 4.27 | 4.75 | 0.93 | 17.03 | 2.31 | 4.81 | 1.02 | 15.67 |
| happy | 9,059 | 5.77 | 5.09 | 1.07 | 15.33 | 4.43 | 4.85 | 0.98 | 15.56 | 2.23 | 4.74 | 1.09 | 15.25 |
| sad | 8,980 | 5.60 | 5.04 | 1.11 | 15.21 | 4.62 | 5.13 | 0.72 | 18.00 | 2.65 | 5.52 | 1.16 | 18.16 |
| scared | 9,098 | 5.66 | 4.96 | 1.00 | 15.67 | 4.13 | 4.51 | 0.65 | 16.11 | 2.34 | 4.96 | 1.07 | 14.49 |
| surprised | 9,179 | 5.91 | 5.09 | 1.09 | 14.56 | 4.52 | 4.92 | 0.81 | 17.67 | 2.28 | 4.87 | 1.04 | 15.81 |

Columns F1, M1, and M2 refer to the three narrators (one female, two male); Total is per-emotion recorded hours, and Mean/Min/Max are per-recording durations in seconds.
| Narrator | # recordings | Duration (h) |
|---|---|---|
| F1 | 24,656 | 34.23 |
| M1 | 19,802 | 26.51 |
| M2 | 10,302 | 14.11 |
| Total | 54,760 | 74.85 |
First, build the `monotonic_align` code:

```bash
cd model/monotonic_align; python setup.py build_ext --inplace; cd ../..
```

Note: the Python version used is 3.9.13.
Next, download the KazEmoTTS dataset and convert it to the format used in `filelists/all_spk` by running `data_preparation.py`:

```bash
python data_preparation.py -d <path_to_KazEmoTTS_dataset>
```
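For instance, assuming the dataset has been extracted to `/data/KazEmoTTS` (the path is illustrative), the call would look like this:

```bash
# Prepare the filelists from a local copy of the dataset (path is illustrative)
python data_preparation.py -d /data/KazEmoTTS
```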
To start training, specify the path to the model configuration (`configs/train_grad.json`), a directory for checkpoints (typically under `logs/train_logs`), and the GPU you will be using:

```bash
CUDA_VISIBLE_DEVICES=YOUR_GPU_ID
python train_EMA.py -c <configs/train_grad.json> -m <checkpoint>
```
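As an example, a run on GPU 0 that reads the provided configuration and writes checkpoints under `logs/train_logs` might look as follows (the checkpoint directory name is illustrative):

```bash
# Train with EMA on GPU 0; checkpoints are written to the directory passed via -m
CUDA_VISIBLE_DEVICES=0 python train_EMA.py -c configs/train_grad.json -m logs/train_logs
```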
If you intend to use a pre-trained model, download the necessary checkpoints for both the TTS model (based on Grad-TTS) and the HiFi-GAN vocoder.
To conduct inference, follow these steps:
- Create a text file containing the sentences you wish to synthesize, such as `filelists/inference_generated.txt`.
- Format each line of the `txt` file as `text|emotion id|speaker id` (see the sketch after this list).
- Adjust the path to the HiFi-GAN checkpoint in `inference_EMA.py`.
- Set the classifier guidance level to 100 using the `-g` parameter.
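A minimal sketch of what an input file might contain, assuming integer emotion and speaker IDs (the sentences and the exact ID mapping shown here are illustrative; the real mapping is defined during data preparation):

```
Бүгін ауа райы өте жақсы.|3|1
Мен сені көргеніме қуаныштымын.|2|0
```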
```bash
python inference_EMA.py -c <config> -m <checkpoint> -t <number-of-timesteps> -g <guidance-level> -f <path-for-text> -r <path-to-save-audios>
```
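For illustration, the following command synthesizes the sentences in `filelists/inference_generated.txt` with a guidance level of 100 (the timestep count, checkpoint path, and output directory are illustrative):

```bash
# Synthesize audio for each line of the input file; outputs are saved to the -r directory
CUDA_VISIBLE_DEVICES=0 python inference_EMA.py -c configs/train_grad.json -m logs/train_logs/<checkpoint> -t 100 -g 100 -f filelists/inference_generated.txt -r synthesized_samples
```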
You can listen to some synthesized samples here.
If you use our dataset and/or model in your work, please cite our paper. Proper citation acknowledges the authors' efforts and upholds academic integrity.
```bibtex
@misc{abilbekov2024kazemotts,
  title={KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis},
  author={Adal Abilbekov and Saida Mussakhojayeva and Rustem Yeshpanov and Huseyin Atakan Varol},
  year={2024},
  eprint={2404.01033},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}
```