Is it intended to be Zero-shot TTS? #2
Comments
This architecture supports zero-shot text-to-speech (TTS) capabilities. However, its primary design goal is to be lightweight and fast, so performance on unseen speakers cannot be guaranteed. If we were to scale the model up to 1 billion parameters and train it on a dataset exceeding ten million hours, as in NaturalSpeech 3, its zero-shot performance might improve.
Completely agree with you. Just one more thing: how are the samples coming out so far?
Thank you for sharing this project!
I will release the pretrained checkpoints within one to two weeks. Currently, I am fine-tuning the network structure in flow-matching. I've discovered that substituting some of the DiT blocks with convolutional layers yields better results under a smaller parameter count and significantly accelerates convergence.
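For anyone curious what such a substitution could look like, here is a minimal sketch, assuming a residual depthwise-separable conv block standing in for a DiT block (the class and layer names are hypothetical, not the actual StableTTS code):

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Hypothetical conv replacement for a DiT block: depthwise conv for
    local context + pointwise feed-forward, wrapped in a residual."""
    def __init__(self, dim: int, kernel_size: int = 5):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # depthwise conv mixes information along the time axis only
        self.dw = nn.Conv1d(dim, dim, kernel_size,
                            padding=kernel_size // 2, groups=dim)
        # pointwise 1x1 convs act as a per-frame feed-forward network
        self.pw = nn.Sequential(
            nn.Conv1d(dim, dim * 2, 1),
            nn.GELU(),
            nn.Conv1d(dim * 2, dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); Conv1d expects (batch, dim, time)
        h = self.norm(x).transpose(1, 2)
        h = self.pw(self.dw(h)).transpose(1, 2)
        return x + h  # residual keeps the block drop-in compatible
```

Because the block keeps the `(batch, time, dim)` interface and a residual connection, it can replace an attention block without touching the rest of the decoder.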
In addition to expanding the dataset, scaling up model parameters involves increasing both the width and depth of the model. This can be achieved by modifying the model parameters in config.py.
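As a rough illustration of what scaling width and depth means at the config level, here is a hypothetical sketch (the field names are assumptions; check StableTTS/config.py for the actual ones):

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # hypothetical field names, for illustration only
    hidden_channels: int = 256   # width: doubling this roughly quadruples param count
    n_layers: int = 6            # depth: number of DiT/conv blocks
    n_heads: int = 4             # attention heads (should divide hidden_channels)
    filter_channels: int = 1024  # feed-forward inner dimension

# example: scaling width and depth together for a larger model
big = ModelConfig(hidden_channels=512, n_layers=12,
                  n_heads=8, filter_channels=2048)
```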
@KdaiP I have also been training a slightly bigger model (72M params) on LibriTTS (English) + our own Hindi dataset (around 800 hours total) at a batch size of 8 without gradient accumulation. Up to 72k steps I get decent results, at least listenable and understandable, but my main interest is unseen zero-shot and emotion + prosody transfer from the prompt.
For me, the things that matter most are how well the model performs multi-lingually and how well it captures prosody from reference audio, especially cross-lingual prosody: how well it transfers one language's speaker prosody to another. The speaker component and timbre are not that important to me, as we can make any TTS zero-shot by applying a VC model.
@KdaiP The model seems powerful from my understanding. I have trained it on 1k hours of multi-lingual data with 78M params and it worked decently.
Thank you for your interest in StableTTS! DDSP6.0 and ReFlow-VAE-SVC have already used reflow (which is very similar to flow-matching) to do voice conversion and have achieved decent results. I recommend checking out these two repositories for more information (≧▽≦).
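For readers unfamiliar with the objective, here is a minimal sketch of one (rectified) flow-matching / reflow training step, under assumed tensor shapes and a hypothetical model signature (this is not the actual StableTTS, DDSP, or ReFlow-VAE-SVC code):

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, cond):
    """One training step of (rectified) flow matching.

    model: predicts a velocity field v(x_t, t, cond); the signature is an
           assumption for illustration.
    x1:    target features, e.g. mel spectrograms, shape (batch, time, dim)
    cond:  conditioning (text / speaker embedding), passed through as-is
    """
    x0 = torch.randn_like(x1)                     # noise endpoint of the path
    t = torch.rand(x1.size(0), device=x1.device)  # per-sample time in [0, 1]
    t_ = t.view(-1, 1, 1)                         # broadcast over (time, dim)
    xt = (1.0 - t_) * x0 + t_ * x1                # linear interpolation path
    v_target = x1 - x0                            # constant ground-truth velocity
    v_pred = model(xt, t, cond)
    return F.mse_loss(v_pred, v_target)
```

The straight-line path is what makes reflow-style sampling fast: at inference time the learned velocity field can be integrated in only a handful of Euler steps.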
Could you please share the detailed configuration of your 78M model, such as the parameters in StableTTS/config.py? Much appreciated.
Hi, we have released a new 31M model with bug fixes and audio quality improvements. It is much better than the 78M model I mentioned in the past.
Hi @KdaiP, nice work! I'd just like to know: is this architecture intended to support zero-shot TTS or a normal multi-speaker kind of TTS?