
Is it intended to be Zero-shot TTS #2

Open
rishikksh20 opened this issue Apr 2, 2024 · 11 comments

Comments

@rishikksh20

Hi @KdaiP, nice work! I'd just like to know: is this architecture intended to support zero-shot TTS, or a conventional multi-speaker kind of TTS?

@KdaiP
Owner

KdaiP commented Apr 2, 2024

This architecture supports zero-shot text-to-speech (TTS) capabilities. However, its primary design goal is to achieve a lightweight and fast system. Therefore, the performance for unseen speakers cannot be guaranteed.

If we were to scale the model up to 1 billion parameters and train it on a dataset exceeding ten million hours, just like NaturalSpeech 3, it might potentially enhance its zero-shot performance.

@rishikksh20
Author

Completely agree with you. Just one more thing: how are the samples coming out so far?

@eschmidbauer

Thank you for sharing this project!
I have a model training on a limited dataset and I'm getting decent results after a few hours of training. You mentioned "If we were to scale up the model to 1 billion parameters...".
Can you elaborate on how to scale up the model parameters? Does that just mean a larger dataset?

@KdaiP
Owner

KdaiP commented Apr 5, 2024

Completely agree with you. Just one more thing: how are the samples coming out so far?

I will release the pretrained checkpoints within one to two weeks.

Currently, I am fine-tuning the network structure in flow matching. I've discovered that substituting some of the DiT blocks with convolutional layers yields better results at a smaller parameter count and significantly accelerates convergence.
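A quick back-of-the-envelope comparison illustrates why swapping attention blocks for convolutions shrinks the parameter count. The block structures below (QKV + output projection plus a 4x feed-forward for the DiT-style block, a single channel-mixing 1D conv for the replacement) are illustrative assumptions, not StableTTS's exact architecture:

```python
# Rough parameter counts for one DiT-style block vs. one 1D conv block
# of the same width d. Structures are assumed for illustration only.

def attn_block_params(d: int, ffn_mult: int = 4) -> int:
    """Self-attention (W_q, W_k, W_v, W_o) plus a two-layer feed-forward."""
    attn = 4 * d * d
    ffn = 2 * d * (ffn_mult * d)
    return attn + ffn

def conv_block_params(d: int, kernel: int = 5) -> int:
    """A plain 1D convolution mixing d channels over a local window."""
    return d * d * kernel

d = 256
print(attn_block_params(d))  # 786432
print(conv_block_params(d))  # 327680
```

Under these assumptions the conv block uses well under half the parameters of the attention block, and its locality also tends to be a good inductive bias for mel-spectrogram-like features.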

@KdaiP
Owner

KdaiP commented Apr 5, 2024

Thank you for sharing this project! I have a model training on a limited dataset and I'm getting decent results after a few hours of training. You mentioned "If we were to scale up the model to 1 billion parameters...". Can you elaborate on how to scale up the model parameters? Does that just mean a larger dataset?

In addition to expanding the dataset, scaling up model parameters involves increasing both the width and depth of the model. This can be achieved by modifying the ModelConfig in config.py. For example, you could set hidden_channels to 1024, filter_channels to 2048, and n_layers to 12.
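A minimal sketch of that change, assuming `ModelConfig` is a dataclass in config.py; the default values below are hypothetical placeholders, while the scaled-up values are the ones suggested above:

```python
# Sketch of scaling up StableTTS via config.py's ModelConfig.
# Default field values here are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    hidden_channels: int = 256    # model width (assumed default)
    filter_channels: int = 1024   # feed-forward width (assumed default)
    n_layers: int = 6             # model depth (assumed default)

# Scaled-up variant: wider and deeper, as suggested in the comment above.
big = ModelConfig(hidden_channels=1024, filter_channels=2048, n_layers=12)
print(big)
```

Width and depth both scale the parameter count, but only a correspondingly larger dataset keeps the bigger model from overfitting.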

@rishikksh20
Author

rishikksh20 commented Apr 5, 2024

@KdaiP I have also been training a slightly bigger model (72M params) on LibriTTS (English) plus our own Hindi dataset (around 800 hours total) at a batch size of 8 without gradient accumulation. Up to 72k steps I get decent results, at least listenable and understandable, but my main interest is unseen zero-shot and emotion + prosody transfer from the prompt.

@rishikksh20
Author

For me, the things that matter most are how well the model performs multilingually and how well it captures prosody from reference audio, especially cross-lingual prosody: how well it transfers one language's speaker prosody to another. The speaker component and timbre are not that important to me, as we can make any TTS zero-shot by applying a voice conversion (VC) model.

@rishikksh20
Author

rishikksh20 commented Apr 17, 2024

@KdaiP The model seems powerful from my understanding; I have trained it on 1k hours of multilingual data with 78M params and it worked decently.
I have a thought about whether this model could be adapted to do speech-to-speech voice conversion, where we give input semantic tokens and a target speaker latent, and it converts the semantics to the target speaker's Vocos latent. We wouldn't require any duration modeling at all; it's a one-to-one mapping.

@KdaiP
Owner

KdaiP commented Apr 17, 2024

@KdaiP The model seems powerful from my understanding; I have trained it on 1k hours of multilingual data with 78M params and it worked decently. I have a thought about whether this model could be adapted to do speech-to-speech voice conversion, where we give input semantic tokens and a target speaker latent, and it converts the semantics to the target speaker's Vocos latent. We wouldn't require any duration modeling at all; it's a one-to-one mapping.

Thank you for your interest in StableTTS! DDSP 6.0 and ReFlow-VAE-SVC have already used reflow (which is very similar to flow matching) for voice conversion and have achieved decent results. I recommend checking out these two repositories for more information (≧▽≦).
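The reason reflow and flow matching are near-interchangeable here: both learn a velocity field on straight-line paths x_t = (1 - t)·x0 + t·x1 and integrate it at inference. A toy NumPy sketch, where the "network" is replaced by the exact straight-line velocity (a real VC model would instead predict it from the semantic tokens and speaker latent):

```python
# Toy Euler integration of a straight-line (rectified) flow.
# The oracle velocity stands in for a trained network; purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)           # noise sample
x1 = np.array([1.0, 2.0, 3.0, 4.0])   # target latent (e.g. Vocos features)

def velocity(x_t, t):
    # Oracle velocity along the straight path; a model approximates this.
    return x1 - x0

# Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps.
x, steps = x0.copy(), 10
for i in range(steps):
    t = i / steps
    x = x + velocity(x, t) / steps

print(np.allclose(x, x1))  # True: the straight path lands on the target
```

Because both source and target sequences share the same frame count in this setup, no duration model is needed, which matches the one-to-one mapping idea above.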

@r666ay

r666ay commented Aug 5, 2024

@KdaiP The model seems powerful from my understanding; I have trained it on 1k hours of multilingual data with 78M params and it worked decently. I have a thought about whether this model could be adapted to do speech-to-speech voice conversion, where we give input semantic tokens and a target speaker latent, and it converts the semantics to the target speaker's Vocos latent. We wouldn't require any duration modeling at all; it's a one-to-one mapping.

Could you please tell me the detailed configuration of your 78M model, such as the parameters in StableTTS/config.py? Much appreciated.

@KdaiP
Owner

KdaiP commented Sep 13, 2024

@KdaiP The model seems powerful from my understanding; I have trained it on 1k hours of multilingual data with 78M params and it worked decently. I have a thought about whether this model could be adapted to do speech-to-speech voice conversion, where we give input semantic tokens and a target speaker latent, and it converts the semantics to the target speaker's Vocos latent. We wouldn't require any duration modeling at all; it's a one-to-one mapping.

Could you please tell me the detailed configuration of your 78M model, such as the parameters in StableTTS/config.py? Much appreciated.

Hi, we have released a new 31M model with bug fixes and audio quality improvements. It is much better than the 78M model I mentioned previously.

4 participants