| license | tags |
|---|---|
| mit | |
Changing the name because I found another unrelated model that has the same name, sorry!
HuggingFace 🤗 - Repository (please check HuggingFace; I often update there, along with the checkpoints I train)
DDP is very unstable; please use the single-GPU training script. If you still want to use DDP, I suggest uncommenting the gradient clipping lines; that should help a lot (see the sketch below).
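For reference, this is the kind of clipping meant above. A minimal PyTorch sketch with a placeholder model and optimizer, not the repo's actual training loop; the `max_norm` value is an assumption:

```python
import torch

# Placeholder model/optimizer standing in for the vocoder's generator;
# the real training step lives in the repo's training scripts.
model = torch.nn.Linear(128, 128)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

x = torch.randn(8, 128)
loss = model(x).pow(2).mean()
loss.backward()
# Clip the global gradient norm before stepping; this is what tends to
# keep unstable (e.g. DDP) runs from diverging. max_norm=1000 is illustrative.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1000.0)
optimizer.step()
optimizer.zero_grad()
```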
This vocoder is a combination of HiFTNet and RingFormer. It supports Ring Attention, Conformer, Neural Source Filtering, etc. This repository is experimental; expect some bugs and some hardcoded params.
The default setting is 44.1 kHz with 128 mel bins, but I have provided the necessary script for the 24 kHz version in the LibriTTS checkpoint's folder.
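For clarity, here is a minimal sketch of what that configuration means for the input features. Only the 44.1 kHz / 128-bin part comes from above; the `n_fft`, `hop_length`, and `win_length` values are assumptions, so check `config_v1.json` and the repo's mel-extraction code for the exact ones:

```python
import torch
import torchaudio

SR = 44100    # default sampling rate of this vocoder
N_MELS = 128  # default number of mel bins
# n_fft / hop_length / win_length below are illustrative guesses,
# not read from config_v1.json.
mel_fn = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR, n_fft=2048, hop_length=512,
    win_length=2048, n_mels=N_MELS, power=1.0,
)

wav, sr = torchaudio.load("sample.wav")
if sr != SR:
    wav = torchaudio.functional.resample(wav, sr, SR)
mel = torch.log(mel_fn(wav).clamp(min=1e-5))  # log-mel, HiFi-GAN-style
```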
Huge thanks to Johnathan Duering for his help. I mostly implemented this based on his STTS2 fork.
NOTE:
There are three checkpoints available so far; you can grab them from 🤗:
- RiFornet 24 kHz (trained for roughly 117K steps on LibriTTS (360 + 100) and 40 hours of other English datasets.)
- RiFornet 44.1 kHz (trained for roughly 280K steps on a large (more than 1,100 hours) private multilingual dataset covering Arabic, Persian, Japanese, English, and Russian, plus singing voice in Chinese and Japanese and Quranic recitations in Arabic. This is the best checkpoint available so far.)
- HiFTNet 44.1 kHz (trained for ~100K steps on a dataset similar to that of RiFornet 44.1 kHz, but slightly smaller and with no singing voice.)
Ideally I wanted to train them all up to 1M steps, but I don't think I can do that for a while. So, while the quality should be reasonably good, you may still want to fine-tune them on your downstream task.
- Python >= 3.10
- Clone this repository:
```bash
git clone https://github.com/Respaired/RiFornet_Vocoder
cd RiFornet_Vocoder/Ringformer
```
- Install the Python requirements:
```bash
pip install -r requirements.txt
```
```bash
CUDA_VISIBLE_DEVICES=0 python train_single_gpu.py --config config_v1.json --[args]
```
For F0 model training, please refer to yl4579/PitchExtractor. This repo includes an F0 model pre-trained on a mixture of multilingual data for the previously mentioned configuration. To quote HiFTNet's author: "Still, you may want to train your own F0 model for the best performance, particularly for noisy or non-speech data, as we found that F0 estimation accuracy is essential for the vocoder performance."
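As a rough illustration of how such a pre-trained F0 model is typically used: the import path, constructor arguments, checkpoint layout, and input shape below are all assumptions based on the PitchExtractor lineage, not this repo's exact API:

```python
import torch
from model import JDCNet  # assumed module layout, as in yl4579/PitchExtractor

# num_class/seq_len follow common downstream usage of JDCNet; both are assumptions.
f0_model = JDCNet(num_class=1, seq_len=192)
ckpt = torch.load("f0_model.pth", map_location="cpu")  # hypothetical filename
f0_model.load_state_dict(ckpt.get("net", ckpt))        # checkpoint layout is an assumption
f0_model.eval()

with torch.no_grad():
    mel = torch.randn(1, 1, 192, 80)  # dummy input; shape and bin count are assumptions
    f0, _, _ = f0_model(mel)          # frame-level F0 estimate
```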
Please refer to the notebook inference.ipynb for details.
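The notebook is the source of truth; as a very rough orientation, inference in HiFi-GAN-derived codebases like this one usually looks like the sketch below. The module names, helper class, checkpoint filename, and the generator's exact forward signature are all assumptions here:

```python
import json
import torch
from env import AttrDict      # assumed helper, as in HiFi-GAN-derived repos
from models import Generator  # assumed module layout

with open("config_v1.json") as f:
    h = AttrDict(json.load(f))

generator = Generator(h)  # constructor signature is an assumption
state = torch.load("g_latest", map_location="cpu")  # hypothetical checkpoint name
generator.load_state_dict(state["generator"])
generator.eval()
generator.remove_weight_norm()

with torch.no_grad():
    mel = torch.randn(1, 128, 200)  # dummy (batch, n_mels, frames) input
    # Whether F0 must be passed explicitly or is estimated internally
    # is repo-specific; check inference.ipynb for the real call.
    wav = generator(mel)
```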