training_hifigan
[TOC]
We recommend using Anaconda to set up your own Python virtual environment.
# in case of a pip install error, changing the pip source may help
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
# build virtual environment
conda env create -f environment.yaml
# activate virtual environment
conda activate maas
Make sure your data is organized in one of the structures below.
.
├── interval
│ ├── 500001.interval
│ ├── 500002.interval
│ ├── 500003.interval
│ ├── ...
│ └── 600010.interval
├── prosody
│ └── prosody.txt
└── wav
├── 500001.wav
├── 500002.wav
├── ...
└── 600010.wav
.
├── txt
│ └── prosody.txt
└── wav
├── 1.wav
├── 2.wav
├── ...
└── 9000.wav
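Before running feature extraction, a quick structural check can catch missing files early. A minimal sketch for the first layout (YOUR_DATA_PATH is a placeholder):
# wav/ and interval/ should contain the same number of utterances in the first layout
ls YOUR_DATA_PATH/wav/*.wav | wc -l
ls YOUR_DATA_PATH/interval/*.interval | wc -l
# prosody annotations live in prosody/prosody.txt (txt/prosody.txt in the second layout)
head -n 2 YOUR_DATA_PATH/prosody/prosody.txt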
For a quick start, a demo dataset is available at DAMO.NLS.KAN-TTS.OpenDataset.
Modify the audio config to fit your data; the demo config file is kantts/configs/audio_config_24k.yaml.
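In particular, the sampling rate in the config should match your recordings. A quick check with sox (assuming sox is installed; 500001.wav is just an example file):
# prints the sample rate, which should agree with the audio config (24000 for the 24k demo config)
soxi -r YOUR_DATA_PATH/wav/500001.wav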
Run data_process. Since we are preparing features for vocoder training, --skip_script can be added to the command below, which can save you a lot of time :)
CAUTION: If OUTPUT_DATA_FEATURE_PATH already exists (Sambert data processing does the same thing), you can skip the following step.
python kantts/preprocess/data_process.py --voice_input_dir YOUR_DATA_PATH --voice_output_dir OUTPUT_DATA_FEATURE_PATH --audio_config AUDIO_CONFIG_PATH --speaker YOUR_SPEAKER_NAME --skip_script
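For example, with hypothetical paths (adjust to your own setup):
python kantts/preprocess/data_process.py --voice_input_dir ./data/my_voice --voice_output_dir ./features/my_voice --audio_config kantts/configs/audio_config_24k.yaml --speaker my_voice --skip_script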
Then you get the features for HifiGAN training:
.
├── badlist.txt
├── data_process_stdout.log
├── mel/
├── trim_mel/
├── trim_wav/
└── wav/
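A quick sanity check of the output (paths are placeholders; badlist.txt presumably lists utterances that failed processing):
# mel/ and trim_mel/ should each contain one feature file per successfully processed utterance
ls OUTPUT_DATA_FEATURE_PATH/mel | wc -l
ls OUTPUT_DATA_FEATURE_PATH/trim_mel | wc -l
# inspect any failures
cat OUTPUT_DATA_FEATURE_PATH/badlist.txt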
Our training recipe is config-driven. A default HifiGAN model config can be found at kantts/configs/hifigan_v1_24k.yaml; you can modify that config to create your own HifiGAN model :)
Now you have the sword and shield (data and model :-|), go give it a try.
# The --root_dir can be multiple args for universal vocoder training
CUDA_VISIBLE_DEVICES=0 python kantts/bin/train_hifigan.py --model_config YOUR_MODEL_CONFIG --root_dir OUTPUT_DATA_FEATURE_PATH --stage_dir TRAINING_STAGE_PATH
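For a universal vocoder, you can point training at several feature directories at once. A hedged sketch (the exact argument syntax, repeated flag vs. space-separated list, may differ; check python kantts/bin/train_hifigan.py --help):
# train on features from two speakers (hypothetical paths)
CUDA_VISIBLE_DEVICES=0 python kantts/bin/train_hifigan.py --model_config YOUR_MODEL_CONFIG --root_dir ./features/speaker_a ./features/speaker_b --stage_dir TRAINING_STAGE_PATH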
If you have enough GPU devices, you can use distributed training, which is much faster than single-GPU training. For example, assign GPU device indexes with the CUDA_VISIBLE_DEVICES environment variable; --nproc_per_node denotes the number of GPU devices.
CUDA_VISIBLE_DEVICES=0,1,2,4 python -m torch.distributed.launch --nproc_per_node=4 kantts/bin/train_hifigan.py --model_config YOUR_MODEL_CONFIG --root_dir OUTPUT_DATA_FEATURE_PATH --stage_dir TRAINING_STAGE_PATH
--resume_path can be used to resume training with a pre-trained model, or to continue training from a previous checkpoint.
# The --root_dir can be multiple args for universal vocoder training
CUDA_VISIBLE_DEVICES=0 python kantts/bin/train_hifigan.py --model_config YOUR_MODEL_CONFIG --root_dir OUTPUT_DATA_FEATURE_PATH --stage_dir TRAINING_STAGE_PATH --resume_path CHECKPOINT_PATH
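For example, to continue from the latest checkpoint in the stage directory (file name taken from the example tree below):
CUDA_VISIBLE_DEVICES=0 python kantts/bin/train_hifigan.py --model_config YOUR_MODEL_CONFIG --root_dir OUTPUT_DATA_FEATURE_PATH --stage_dir TRAINING_STAGE_PATH --resume_path TRAINING_STAGE_PATH/ckpt/checkpoint_200000.pth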
After training is done, your TRAINING_STAGE_PATH looks like this:
.
├── ckpt/
│ ├── checkpoint_120000.pth
│ ├── checkpoint_130000.pth
│ ├── ...
│ └── checkpoint_200000.pth <---- this is the latest checkpoint
├── config.yaml
├── log/
└── stdout.log
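To monitor training, the log/ directory can usually be inspected with TensorBoard, assuming the trainer writes TensorBoard event files there (check the contents of log/ first):
tensorboard --logdir TRAINING_STAGE_PATH/log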
Time to test your powerful model: prepare a validation mel file, then run the command below. --input_mel can be a mel from the validation dataset, or a mel file predicted by Sambert.
python kantts/bin/infer_hifigan.py --ckpt YOUR_CKPT_FILE --input_mel YOUR_TEST_MEL --output_dir OUTPUT_PATH_TO_STORE_WAV
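To vocode a whole directory of validation mels, a simple shell loop works. This sketch assumes the mel files are stored as .npy under trim_mel/ (an assumption; match the actual extension in your feature directory):
# synthesize every mel in the validation feature directory (hypothetical paths)
for mel in OUTPUT_DATA_FEATURE_PATH/trim_mel/*.npy; do
    python kantts/bin/infer_hifigan.py --ckpt YOUR_CKPT_FILE --input_mel "$mel" --output_dir OUTPUT_PATH_TO_STORE_WAV
done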
Enjoy the synthesized audio :)
Our implementation refers to the following repositories and papers.
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech