
Is it intended to be Zero-shot TTS #2

Open
rishikksh20 opened this issue Apr 2, 2024 · 11 comments

Comments

@rishikksh20

Hi @KdaiP, nice work! I'd just like to know: is this architecture intended to support zero-shot TTS, or a conventional multi-speaker kind of TTS?

@KdaiP
Owner

KdaiP commented Apr 2, 2024

This architecture supports zero-shot text-to-speech (TTS) capabilities. However, its primary design goal is to achieve a lightweight and fast system. Therefore, the performance for unseen speakers cannot be guaranteed.

If we were to scale the model up to 1 billion parameters and train it on a dataset exceeding ten million hours, just like NaturalSpeech 3, it might potentially enhance its zero-shot performance.

@rishikksh20
Author

Completely agree with you. Just one more thing: how are the samples coming out so far?

@eschmidbauer

Thank you for sharing this project!
I have a model training on a limited dataset and I'm getting decent results after a few hours of training. You mentioned "If we were to scale up the model to 1 billion parameters...".
Can you elaborate on how to scale up the model parameters? Does that just mean a larger dataset?

@KdaiP
Owner

KdaiP commented Apr 5, 2024

Completely agree with you. Just one more thing: how are the samples coming out so far?

I will release the pretrained checkpoints within one to two weeks.

Currently, I am fine-tuning the network structure in flow matching. I've discovered that substituting some of the DiT blocks with convolutional layers yields better results at a smaller parameter count and significantly accelerates convergence.
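A quick back-of-the-envelope comparison illustrates why swapping attention blocks for convolutions shrinks the parameter count. The block structures below (QKV + output projection plus a 4x feed-forward for the DiT-style block, a single channel-mixing 1D conv for the replacement) are illustrative assumptions, not StableTTS's exact architecture:

```python
# Rough parameter counts for one DiT-style block vs. one 1D conv block
# of the same width d. Structures are assumed for illustration only.

def attn_block_params(d: int, ffn_mult: int = 4) -> int:
    """Self-attention (W_q, W_k, W_v, W_o) plus a two-layer feed-forward."""
    attn = 4 * d * d
    ffn = 2 * d * (ffn_mult * d)
    return attn + ffn

def conv_block_params(d: int, kernel: int = 5) -> int:
    """A plain 1D convolution mixing d channels over a local window."""
    return d * d * kernel

d = 256
print(attn_block_params(d))  # 786432
print(conv_block_params(d))  # 327680
```

Under these assumptions the conv block uses well under half the parameters of the attention block, and its locality also tends to be a good inductive bias for mel-spectrogram-like features.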

@KdaiP
Owner

KdaiP commented Apr 5, 2024

Thank you for sharing this project! I have a model training on a limited dataset and I'm getting decent results after a few hours of training. You mentioned "If we were to scale up the model to 1 billion parameters...". Can you elaborate on how to scale up the model parameters? Does that just mean a larger dataset?

In addition to expanding the dataset, scaling up model parameters involves increasing both the width and depth of the model. This can be achieved by modifying the ModelConfig in config.py. For example, you could set hidden_channels to 1024, filter_channels to 2048, and n_layers to 12.
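A minimal sketch of that change, assuming `ModelConfig` is a dataclass in config.py; the default values below are hypothetical placeholders, while the scaled-up values are the ones suggested above:

```python
# Sketch of scaling up StableTTS via config.py's ModelConfig.
# Default field values here are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    hidden_channels: int = 256    # model width (assumed default)
    filter_channels: int = 1024   # feed-forward width (assumed default)
    n_layers: int = 6             # model depth (assumed default)

# Scaled-up variant: wider and deeper, as suggested in the comment above.
big = ModelConfig(hidden_channels=1024, filter_channels=2048, n_layers=12)
print(big)
```

Width and depth both scale the parameter count, but only a correspondingly larger dataset keeps the bigger model from overfitting.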

@rishikksh20
Author

rishikksh20 commented Apr 5, 2024

@KdaiP I have also been training a slightly bigger model (72M params) on LibriTTS (English) plus our own Hindi dataset (around 800 hours total) at a batch size of 8 without gradient accumulation. Up to 72k steps I get decent results, at least listenable and understandable, but my main interest is unseen zero-shot and emotion + prosody transfer from the prompt.

@rishikksh20
Author

For me, the things that matter most are how well the model performs multilingually and how well it captures prosody from reference audio, especially cross-lingual prosody: how well it transfers one language's speaker prosody to another. The speaker component and timbre are not that important to me, as we can make any TTS zero-shot by applying a voice conversion (VC) model.

@rishikksh20
Author

rishikksh20 commented Apr 17, 2024

@KdaiP The model seems powerful from my understanding; I have trained it on 1k hours of multilingual data with 78M params and it worked decently.
I have a thought about whether this model could be adapted to do speech-to-speech voice conversion, where we give input semantic tokens and a target speaker latent, and it converts the semantics to the target speaker's Vocos latent. We wouldn't require any duration modeling at all; it's a one-to-one mapping.

@KdaiP
Owner

KdaiP commented Apr 17, 2024

@KdaiP The model seems powerful from my understanding; I have trained it on 1k hours of multilingual data with 78M params and it worked decently. I have a thought about whether this model could be adapted to do speech-to-speech voice conversion, where we give input semantic tokens and a target speaker latent, and it converts the semantics to the target speaker's Vocos latent. We wouldn't require any duration modeling at all; it's a one-to-one mapping.

Thank you for your interest in StableTTS! DDSP 6.0 and ReFlow-VAE-SVC have already used reflow (which is very similar to flow matching) for voice conversion and have achieved decent results. I recommend checking out these two repositories for more information (≧▽≦).
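The reason reflow and flow matching are near-interchangeable here: both learn a velocity field on straight-line paths x_t = (1 - t)·x0 + t·x1 and integrate it at inference. A toy NumPy sketch, where the "network" is replaced by the exact straight-line velocity (a real VC model would instead predict it from the semantic tokens and speaker latent):

```python
# Toy Euler integration of a straight-line (rectified) flow.
# The oracle velocity stands in for a trained network; purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)           # noise sample
x1 = np.array([1.0, 2.0, 3.0, 4.0])   # target latent (e.g. Vocos features)

def velocity(x_t, t):
    # Oracle velocity along the straight path; a model approximates this.
    return x1 - x0

# Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps.
x, steps = x0.copy(), 10
for i in range(steps):
    t = i / steps
    x = x + velocity(x, t) / steps

print(np.allclose(x, x1))  # True: the straight path lands on the target
```

Because both source and target sequences share the same frame count in this setup, no duration model is needed, which matches the one-to-one mapping idea above.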

@r666ay

r666ay commented Aug 5, 2024

@KdaiP The model seems powerful from my understanding; I have trained it on 1k hours of multilingual data with 78M params and it worked decently. I have a thought about whether this model could be adapted to do speech-to-speech voice conversion, where we give input semantic tokens and a target speaker latent, and it converts the semantics to the target speaker's Vocos latent. We wouldn't require any duration modeling at all; it's a one-to-one mapping.

Could you please tell me the detailed configuration of your 78M model, such as the parameters in StableTTS/config.py? Much appreciated.

@KdaiP
Owner

KdaiP commented Sep 13, 2024

@KdaiP The model seems powerful from my understanding; I have trained it on 1k hours of multilingual data with 78M params and it worked decently. I have a thought about whether this model could be adapted to do speech-to-speech voice conversion, where we give input semantic tokens and a target speaker latent, and it converts the semantics to the target speaker's Vocos latent. We wouldn't require any duration modeling at all; it's a one-to-one mapping.

Could you please tell me the detailed configuration of your 78M model, such as the parameters in StableTTS/config.py? Much appreciated.

Hi, we have released a new 31M model with bug fixes and audio quality improvements. It is much better than the 78M model I mentioned previously.

4 participants