
HifiGAN training tutorial

[TOC]

Environment setup

We recommend using Anaconda to set up your own Python virtual environment.

# in case of a pip install error, changing the pip source may help
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

# build virtual environment
conda env create -f environment.yaml

# activate virtual environment
conda activate maas
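
To make sure the environment is usable before moving on, you can run a minimal check (it only assumes PyTorch was installed by the environment file):

# quick sanity check: PyTorch should import and see your GPU
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"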

Data processing

Make sure your data is organized according to the structure below.

mit-style data

.
├── interval
│   ├── 500001.interval
│   ├── 500002.interval
│   ├── 500003.interval
│   ├── ...
│   └── 600010.interval
├── prosody
│   └── prosody.txt
└── wav
    ├── 500001.wav
    ├── 500002.wav
    ├── ...
    └── 600010.wav

general data

.
├── txt
│   └── prosody.txt
└── wav
    ├── 1.wav
    ├── 2.wav
    ├── ...
    └── 9000.wav
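
Before feature extraction, a quick sanity check of the layout helps catch missing files early. A minimal sketch, run inside YOUR_DATA_PATH (the interval count only applies to mit-style data):

# for mit-style data the two counts should match
ls wav | wc -l
ls interval | wc -l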

For a quick start, a demo dataset is available at DAMO.NLS.KAN-TTS.OpenDataset.

Modify the audio config to fit your data; the demo config file is kantts/configs/audio_config_24k.yaml.
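
Before editing the config, it is worth confirming the sample rate of your recordings. A minimal check using Python's built-in wave module, run inside YOUR_DATA_PATH (500001.wav is taken from the mit-style example above; substitute one of your own files):

# print the sample rate of one utterance; it should match the audio config
python -c "import wave; print(wave.open('wav/500001.wav', 'rb').getframerate())"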

Run data_process. Since we are only preparing features for vocoder training, --skip_script can be added to the command below, which saves you a lot of time :)

CAUTION: If OUTPUT_DATA_FEATURE_PATH already exists (Sambert data processing produces the same features), you can skip the following step.

python kantts/preprocess/data_process.py --voice_input_dir YOUR_DATA_PATH --voice_output_dir OUTPUT_DATA_FEATURE_PATH --audio_config AUDIO_CONFIG_PATH --speaker YOUR_SPEAKER_NAME --skip_script

Then you get the features for HifiGAN training:

.
├── badlist.txt
├── data_process_stdout.log
├── mel/
├── trim_mel/
├── trim_wav/
└── wav/
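
A couple of quick checks on the result (run inside OUTPUT_DATA_FEATURE_PATH): badlist.txt records utterances flagged during processing, and the number of trimmed mel features should roughly match your wav count.

wc -l badlist.txt
ls trim_mel | wc -l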

Training

Our training recipe is config-driven; a default HifiGAN model config can be found at kantts/configs/hifigan_v1_24k.yaml. You can modify that config to create your own HifiGAN model :)
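
A simple workflow is to copy the default config and edit the copy, so the shipped default stays intact (my_hifigan_24k.yaml is just an illustrative name):

cp kantts/configs/hifigan_v1_24k.yaml my_hifigan_24k.yaml
# edit my_hifigan_24k.yaml, then pass it as YOUR_MODEL_CONFIG below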

Now that you have the sword and shield (data and model :-|), go give it a try.

# --root_dir can take multiple arguments for universal vocoder training
CUDA_VISIBLE_DEVICES=0 python kantts/bin/train_hifigan.py --model_config YOUR_MODEL_CONFIG --root_dir OUTPUT_DATA_FEATURE_PATH --stage_dir TRAINING_STAGE_PATH
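
For universal (multi-speaker) vocoder training, the idea is to point --root_dir at the feature directories of several speakers. A sketch, assuming the flag can simply be repeated (check the argparse definition in kantts/bin/train_hifigan.py for whether it expects repeated flags or space-separated paths):

# hypothetical multi-speaker example; SPEAKER_A/B_FEATURE_PATH are placeholders
CUDA_VISIBLE_DEVICES=0 python kantts/bin/train_hifigan.py --model_config YOUR_MODEL_CONFIG --root_dir SPEAKER_A_FEATURE_PATH --root_dir SPEAKER_B_FEATURE_PATH --stage_dir TRAINING_STAGE_PATH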

Distributed training

If you have enough GPU devices, you can use distributed training, which is much faster than single-GPU training. Assign the GPU device indexes with the CUDA_VISIBLE_DEVICES environment variable; --nproc_per_node denotes the number of GPU devices.

CUDA_VISIBLE_DEVICES=0,1,2,4 python -m torch.distributed.launch --nproc_per_node=4 kantts/bin/train_hifigan.py --model_config YOUR_MODEL_CONFIG --root_dir OUTPUT_DATA_FEATURE_PATH --stage_dir TRAINING_STAGE_PATH
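
On recent PyTorch versions, torchrun is the recommended replacement for torch.distributed.launch. The equivalent invocation would look like the following, assuming the training script reads its rank from the LOCAL_RANK environment variable (scripts that expect a --local_rank argument still need the launcher above):

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 kantts/bin/train_hifigan.py --model_config YOUR_MODEL_CONFIG --root_dir OUTPUT_DATA_FEATURE_PATH --stage_dir TRAINING_STAGE_PATH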

Resume training

--resume_path can be used to resume training with a pre-trained model, or continue training from a previous checkpoint.

# --root_dir can take multiple arguments for universal vocoder training
CUDA_VISIBLE_DEVICES=0 python kantts/bin/train_hifigan.py --model_config YOUR_MODEL_CONFIG --root_dir OUTPUT_DATA_FEATURE_PATH --stage_dir TRAINING_STAGE_PATH --resume_path CHECKPOINT_PATH

After training is done, your TRAINING_STAGE_PATH looks like this:

.
├── ckpt/
│   ├── checkpoint_120000.pth
│   ├── checkpoint_130000.pth
│   ├── ...
│   └── checkpoint_200000.pth      <---- this is the latest checkpoint
├── config.yaml
├── log/
└── stdout.log
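
The checkpoint with the largest step number is the latest one. On systems with GNU coreutils, a version-sorted listing picks it out directly:

# the last entry is the most recent checkpoint
ls -v TRAINING_STAGE_PATH/ckpt/checkpoint_*.pth | tail -n 1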

Inference

Time to test your powerful model: prepare a validation mel file, then run the command below.

--input_mel can be a mel file from the validation dataset, or a mel file predicted by Sambert.

python kantts/bin/infer_hifigan.py --ckpt YOUR_CKPT_FILE --input_mel YOUR_TEST_MEL --output_dir OUTPUT_PATH_TO_STORE_WAV
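
To synthesize a whole folder of validation mels, a simple shell loop works. A sketch, assuming the extracted mel files are .npy arrays in a folder VALID_MEL_DIR (adjust the extension to whatever data_process actually produced):

for mel in VALID_MEL_DIR/*.npy; do
    python kantts/bin/infer_hifigan.py --ckpt YOUR_CKPT_FILE --input_mel "$mel" --output_dir OUTPUT_PATH_TO_STORE_WAV
done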

Enjoy the synthesized audio :)

Pretrained Model (TODO)

XXXXXX XXXXXX

Plugins (TODO)

XXXXXX XXXXXX

References

Our implementation refers to the following repositories and papers.

jik876/hifi-gan

kan-bayashi/ParallelWaveGAN

mozilla/TTS

espnet/espnet

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech

UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

GAN Vocoder: Multi-Resolution Discriminator Is All You Need