- Note: This repository has been updated with our latest paper, "MSpeC-Net: Multi-Domain Speech Conversion Network".
- Demo link: Click Here
Improving the intelligibility of Non-Audible Murmur (NAM) and whispered speech via the following types of conversion:
- NAM-to-WHiSPer (NAM2WHSP)
- WHiSPer-to-SPeeCH (WHSP2SPCH)
- NAM-to-SPeeCH (NAM2SPCH)
Patients suffering from vocal fold paralysis, neurological disorders, etc., may not be able to produce normal speech due to the partial or complete absence of vocal fold vibrations (i.e., voicing). Losing the natural ability to produce speech affects one's life profoundly, since speech is the most natural and powerful form of communication among humans. Furthermore, whispered speech communication has interesting applications, such as private conversation in public over a cell phone, or conversation in quiet environments like a library, a hospital, or a meeting room. Therefore, the current trend is toward developing silent speech interfaces, and to achieve this we focus on Non-Audible Murmur (NAM) and whispered speech.
Hence, the aim of the present work is to improve the intelligibility of NAM and whispered speech via conversion to normal speech using Machine Learning (ML)-based approaches, in order to improve the quality of communication.
For more details, please refer to the research papers listed below.
Index | Type of Conversion | Paper Title | Paper-Link | Demo-Link |
---|---|---|---|---|
(1) | NAM2WHSP, WHSP2SPCH, NAM2SPCH | MSpeC-Net: Multi-Domain Speech Conversion Network | [Link] | [Link] |
(2) | WHSP2SPCH | Novel Inception-GAN for Whisper-to-Normal Speech Conversion | [Link] | NA |
(3) | WHSP2SPCH | Effectiveness of Cross-Domain Architectures for Whisper-to-Normal Speech Conversion | [Link] | NA |
(4) | WHSP2SPCH | Novel MMSE DiscoGAN for Cross-Domain Whisper-to-Speech Conversion | [Link] | NA |
- MMSE-GAN
- DiscoGAN
- CycleGAN
- CNN-GAN
- Inception-GAN
- MSpeC-Net
- Linux, MacOS
- Python 3
- CPU or NVIDIA GPU + CUDA CuDNN
- SoX (a command-line utility that can convert various formats of computer audio files into other formats)
- MATLAB
dataset
└───features
    ├───US_102
    │   ├───batches
    │   │   ├───f0
    │   │   ├───mcc
    │   │   └───VUV
    │   ├───Normal
    │   │   ├───f0
    │   │   ├───mcc
    │   │   └───fv
    │   └───Whisper
    │       ├───f0
    │       ├───mcc
    │       └───fv
    └───MSpeC-Net
        ├───batches
        │   ├───f0
        │   ├───mcc
        │   └───VUV
        ├───Normal
        │   ├───f0
        │   ├───mcc
        │   └───fv
        ├───Whisper
        │   ├───f0
        │   ├───mcc
        │   └───fv
        └───NAM
            ├───f0
            ├───mcc
            └───fv
results
├───checkpoints
│   ├───mcc
│   ├───f0
│   └───vuv
├───mask
│   ├───mcc
│   ├───f0
│   └───vuv
├───converted_f0
├───converted_wav
├───mcc
└───tmp
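For convenience, a minimal Python sketch such as the one below can create both trees (it assumes the repository root as the working directory; the leaf names mirror the trees above):

```python
import os

# Feature sub-directories used throughout this repository.
FEATS = {"batches": ["f0", "mcc", "VUV"],
         "Normal": ["f0", "mcc", "fv"],
         "Whisper": ["f0", "mcc", "fv"]}

# dataset/features tree for the wTIMIT speaker and for MSpeC-Net (which adds NAM).
for group, extra in [("US_102", {}), ("MSpeC-Net", {"NAM": ["f0", "mcc", "fv"]})]:
    for sub, leaves in {**FEATS, **extra}.items():
        for leaf in leaves:
            os.makedirs(os.path.join("dataset", "features", group, sub, leaf), exist_ok=True)

# results tree.
RESULTS = {"checkpoints": ["mcc", "f0", "vuv"], "mask": ["mcc", "f0", "vuv"],
           "converted_f0": [], "converted_wav": [], "mcc": [], "tmp": []}
for sub, leaves in RESULTS.items():
    for leaf in leaves or [""]:
        os.makedirs(os.path.join("results", sub, leaf), exist_ok=True)
```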
torchvision==0.2.1
visdom==0.1.8.8
matplotlib==3.1.2
numpy==1.17.4
scipy==1.3.2
torch==1.3.1
These libraries can be installed via the following command:
pip install -r requirements.txt
For whisper-to-normal speech conversion, we rely on the wTIMIT dataset. Please download this dataset from here, and then follow the steps below:
- Download the dataset and put the .wav files in `dataset/data/{speaker_name}/{Normal/Whisper}`.
- In this GitHub repository, we use Ahocoder to extract MCEP/MCC features. However, Ahocoder accepts input .wav files sampled at 16,000 Hz only. Therefore, run the following two terminal commands on each .wav file:
sox {path + file1.wav} -t wav {path + file1.wav} (only needed if the downloaded files have a .WAV extension)
sox {path + file1.wav} -r 16000 -c 1 -b 16 {path + file1.wav}
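To apply this conversion to every file in a speaker directory at once, a small helper like the following can be used (a hedged sketch: the directory layout and the write-via-temp-file strategy are our assumptions; SoX cannot read and write the same file directly):

```python
import subprocess
from pathlib import Path

def resample_dir(wav_dir):
    """Convert every .wav/.WAV file under wav_dir to 16 kHz, mono, 16-bit via SoX."""
    files = {p for ext in ("*.wav", "*.WAV") for p in Path(wav_dir).glob(ext)}
    for wav in sorted(files):
        tmp = wav.with_suffix(".tmp.wav")
        # Write the converted audio to a temporary file first.
        subprocess.run(["sox", str(wav), "-t", "wav",
                        "-r", "16000", "-c", "1", "-b", "16", str(tmp)], check=True)
        wav.unlink()                          # remove the original
        tmp.rename(wav.with_suffix(".wav"))   # keep a lowercase .wav extension

resample_dir("dataset/data/US_102/Whisper")  # example path; adjust per speaker
```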
For the MSpeC-Net related experiments, we use the CSTR-NAM-TIMIT dataset, which is available here. Then follow the steps below:
- Download the dataset and put the .wav files in `dataset/data/MSpeC-Net/{Normal/Whisper/NAM}/wav`.
- Apply the same SoX conversion described in step 2 above.
Either run the following single shell script or follow the steps described below:
./run.sh
OR follow these steps:
Now, run the following commands in the terminal:
cd scripts
./feature_extraction.sh
Run the following MATLAB scripts in order (a command-line driver sketch follows the list):
- concatenation_VCC2018.m
- batch.m
- concatenation_f0.m
- batch_f0.m
- concatenation_vuv.m
- batch_vuv.m
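If you prefer to drive these scripts from the terminal rather than the MATLAB IDE, a sketch like the one below runs them in the required order (it assumes MATLAB R2019a or later with the `-batch` flag available on your PATH, and that the scripts are in the current directory):

```python
import subprocess

# Concatenation must run before batching, per feature stream (mcc, f0, vuv).
matlab_scripts = ["concatenation_VCC2018", "batch",
                  "concatenation_f0", "batch_f0",
                  "concatenation_vuv", "batch_vuv"]

for script in matlab_scripts:
    # `matlab -batch` runs a script non-interactively and exits with its status.
    subprocess.run(["matlab", "-batch", script], check=True)
```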
Yay! Finally, it's time to do the training. Now, go to the `py_src` directory and run the following commands in the terminal:
python -m visdom.server
python MMSE_GAN.py -tr -te
python MMSE_GAN_F0.py -tr -te
python DNN_vuv.py -tr -te
Note: These Python scripts are highly customizable according to the user's choices, such as training/testing, parallel/non-parallel training, paths to the features/checkpoints, learning rate, number of epochs, validation interval, checkpoint interval, etc.
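The exact argument names are defined inside each script; purely as an illustration, the `-tr`/`-te` switches seen above could be exposed through `argparse` roughly like this (every flag other than `-tr`/`-te` is hypothetical):

```python
import argparse

parser = argparse.ArgumentParser(description="MMSE-GAN training/testing (illustrative sketch)")
parser.add_argument("-tr", "--train", action="store_true", help="run training")
parser.add_argument("-te", "--test", action="store_true", help="run testing")
# Hypothetical customization knobs of the kind mentioned above:
parser.add_argument("--lr", type=float, default=1e-4, help="learning rate")
parser.add_argument("--epochs", type=int, default=100, help="number of epochs")
parser.add_argument("--checkpoint_dir", default="../results/checkpoints", help="checkpoint path")
args = parser.parse_args()
```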
As described in the research papers, we use MCD and F0-RMSE as objective evaluation metrics. Hence, run the following two MATLAB scripts (a NumPy reference sketch of both metrics follows the list):
- Testing_MCD.m
- f0obective_cal.m
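The evaluation itself is done by the MATLAB scripts above; for reference only, a NumPy sketch of the two metrics (the standard Mel-Cepstral Distortion over MCCs excluding the 0th coefficient, and RMSE over frames voiced in both contours) could look like this:

```python
import numpy as np

def mcd(mcc_ref, mcc_conv):
    """Mel-Cepstral Distortion in dB between two time-aligned MCC matrices (frames x order).
    The 0th (energy) coefficient is conventionally excluded."""
    diff = mcc_ref[:, 1:] - mcc_conv[:, 1:]
    return np.mean((10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))

def f0_rmse(f0_ref, f0_pred):
    """RMSE over frames that are voiced in both F0 contours (F0 > 0)."""
    voiced = (f0_ref > 0) & (f0_pred > 0)
    return np.sqrt(np.mean((f0_ref[voiced] - f0_pred[voiced]) ** 2))
```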
Finally, we are at the last stage. We use the converted MCC and predicted F0 features as input to Ahocoder, which generates speech from them. Hence, run the following command in the terminal (from the scripts directory):
./synthesis.sh
Currently, this repository provides the necessary code for MSpeC-Net training and testing. One can train the MSpeC-Net model by running the following Python script:
python MSpeC-Net.py -tr -te
However, for more details on how to use the different MATLAB scripts and feature-extraction shell scripts, please refer to the steps shown for the WHSP2SPCH conversions. Right now, we are working on making the final end-to-end system, and we will publish it soon.
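As a conceptual illustration only (not the repository's implementation), the multi-domain idea behind MSpeC-Net can be pictured as domain-specific encoders and decoders around a shared latent space; the layer sizes, the 40-dim MCC feature assumption, and the absence of losses here are all placeholders:

```python
import torch
import torch.nn as nn

class DomainAE(nn.Module):
    """One encoder/decoder pair per domain (NAM, Whisper, Speech); 40-dim MCC frames assumed."""
    def __init__(self, feat_dim=40, latent_dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, feat_dim))

domains = {d: DomainAE() for d in ["NAM", "Whisper", "Speech"]}

def convert(x, src, tgt):
    # Encode with the source-domain encoder, decode with the target-domain decoder:
    # the shared latent space is what enables NAM2WHSP, WHSP2SPCH, and NAM2SPCH.
    return domains[tgt].dec(domains[src].enc(x))

x = torch.randn(8, 40)           # a batch of 8 NAM MCC frames (dummy data)
y = convert(x, "NAM", "Speech")  # NAM2SPCH conversion path
```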
If you are interested in contributing, please get in touch with the author of the repository.
If you use this code for your research, please cite our papers.
@inproceedings{parmar2019effectiveness,
title={Effectiveness of Cross-Domain Architectures for Whisper-to-Normal Speech Conversion},
author={Parmar, Mihir and Doshi, Savan and Shah, Nirmesh J and Patel, Maitreya and Patil, Hemant A},
booktitle={2019 27th European Signal Processing Conference (EUSIPCO)},
pages={1--5},
year={2019},
organization={IEEE}
}
@inproceedings{Patel2019,
author={Maitreya Patel and Mihir Parmar and Savan Doshi and Nirmesh Shah and Hemant Patil},
title={{Novel Inception-GAN for Whispered-to-Normal Speech Conversion}},
year={2019},
booktitle={Proc. 10th ISCA Speech Synthesis Workshop},
pages={87--92},
doi={10.21437/SSW.2019-16},
url={http://dx.doi.org/10.21437/SSW.2019-16}
}