Skip to content

training_sambert

Jin edited this page Aug 18, 2022 · 6 revisions

Sambert training tutorial

[TOC]

Environment setup

we recommend using Anaconda to set up your own python virtual environment.

# in case of pip install error, change the pip source may help
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

# build virtual environment
conda env create -f environment.yaml

# activate virtual environment
conda activate maas

Data processing

Currently, we support mit-style data only (general data is on the road). Make sure your data be organized by the structure below.

mit-style data

.
├── interval
│   ├── 500001.interval
│   ├── 500002.interval
│   ├── 500003.interval
│   ├── ...
│   └── 600010.interval
├── prosody
│   └── prosody.txt
└── wav
    ├── 500001.wav
    ├── 500002.wav
    ├── ...
    └── 600010.wav

general data (TODO)

.
├── txt
│   └── prosody.txt
└── wav
    ├── 1.wav
    ├── 2.wav
    ├── ...
    └── 9000.wav

For quick start: A demo dataset is available on opentts_tiny_dataset.

Modify audio config to fit your data, the demo config file: kantts/configs/audio_config_24k.yaml

Run data_process, you need to pass a speaker name through --speaker, since our demo dataset is unnamed, you can name it as you like (Paige, Lily... whatever).

python kantts/preprocess/data_process.py --voice_input_dir YOUR_DATA_PATH --voice_output_dir OUTPUT_DATA_FEATURE_PATH --audio_config AUDIO_CONFIG_PATH --speaker YOUR_SPEKER_NAME

Then you get the features for training.

.
├── Script.xml
├── am_train.lst
├── am_valid.lst
├── badlist.txt
├── duration/
├── energy/
├── f0/
├── mel/
├── raw_duration/
├── raw_metafile.txt
├── trim_mel/
├── trim_wav/
└── wav/

Training

Our training recipe is config driven, a default Sambert model config can be found kantts/configs/sambert_24k.yaml, you can do some modifications on that config and create your own Sambert model :)

One more thing you must notice is that in kantts/configs/sambert_24k.yaml , speaker_list field should be modified to the speaker name of your dataset, as below

linguistic_unit: 
  cleaners: english_cleaners
  lfeat_type_list: sy,tone,syllable_flag,word_segment,emo_category,speaker_category
  speaker_list: YOUR_SPEAKER_NAME

Now you have got the sword and shield(data and model :-|), go have a try.

# The --root_dir can be multiple args for universal vococer training
CUDA_VISIBLE_DEVICES=0 python kantts/bin/train_sambert.py --model_config YOUR_MODEL_CONFIG  --root_dir OUTPUT_DATA_FEATURE_PATH --stage_dir TRAINING_STAGE_PATH

Distributed training

If your GPU devices are enough, you can use distributed training, which is a lot faster than single GPU training. For example, assign GPU device indexes with CUDA_VISIBLE_DEVICES system variable, --nproc_per_node denotes the count of GPU devices.

CUDA_VISIBLE_DEVICES=0,1,2,4 python -m pytorch.distributed.launch --nproc_per_node=4 kantts/bin/train_sambert.py --audio_config YOUR_AUDIO_CONFIG --model_config YOUR_MODEL_CONFIG  --root_dir ~OUTPUT_DATA_FEATURE_PATH --stage_dir TRAINING_STAGE_PATH

Resume training

--resume_path can be used to resume training with a pre-trained model, or continue training from a previous checkpoint.

# The --root_dir can be multiple args for universal vococer training
CUDA_VISIBLE_DEVICES=0 python kantts/bin/train_sambert.py --audio_config YOUR_AUDIO_CONFIG --model_config YOUR_MODEL_CONFIG  --root_dir ~OUTPUT_DATA_FEATURE_PATH --stage_dir TRAINING_STAGE_PATH --resume_path CHECKPOINT_PATH

After training is done, your TRAIING_STAGE_PATH looks like below

.
├── ckpt/
├── config.yaml
├── log/
└── stdout.log

Model checkpoints are stored in ckpt directory,

./ckpt
├── checkpoint_10000.pth
├── checkpoint_12000.pth
├── checkpoint_14000.pth
├── checkpoint_16000.pth
├── checkpoint_18000.pth
├── checkpoint_2000.pth
├── checkpoint_4000.pth
├── checkpoint_6000.pth
└── checkpoint_8000.pth

Inference

Time to test your powerful model, prepare a validation sequence file.

Here we randomly pick 10 sequences from the validation dataset and use them to generate melspec.

cat OUTPUT_DATA_FEATURE_PATH/am_valid.lst | shuf -n 10 > OUTPUT_DATA_FEATURE_PATH/test_sequence.lst

Then run the command below to infer sambert.

CUDA_VISIBLE_DEVICES=0 python kantts/bin/infer_sambert.py --sentence OUTPUT_DATA_FEATURE_PATH/test_sequence.lst --ckpt YOUR_MODEL_CKPT --output_dir OUTPUT_PATH_TO_STORE_MEL

Then you'll get the predicted melspec files under OUTPUT_PATH_TO_STORE_MEL .

Pretrained Model (TODO)

XXXXXX XXXXXX

Plugins (TODO)

XXXXXX XXXXXX

References (TODO)

Our implementation refers to the following repositories and papers.

[ming024's FastSpeech2 Implementation] (https://github.com/ming024/FastSpeech2)

[FastPitch] (https://arxiv.org/abs/2006.06873)

[FastSpeech2] (https://arxiv.org/abs/2006.04558)

[Robutrans] (https://aaai.org/ojs/index.php/AAAI/article/view/6337/6193)

[Bert] (https://arxiv.org/abs/1810.04805)

Clone this wiki locally