I am trying to train a TTS model and have a question about speaker style. My dataset contains multiple speakers with different speaking styles. Does the model retain the style of each voice, does it use only one style, or does the style depend on the reference audio? For example, my dataset contains an Indian speaker who pauses nervously in conversation. If I train on the whole dataset and then run inference with one audio clip from that speaker as the reference, will the output inherit the nervous speaking style? I look forward to your response, and thanks for this great repo.
A better way to put the question: can I train it with both a narrator voice and a conversational voice and get both speaking styles, or will I need to train separate models to achieve that?
Hi, in StableTTS, the Mel spectrogram (with a time length of t) is compressed into a global condition embedding with a time length of 1. By visualizing this embedding, you can observe that it clusters according to the speaker ID, meaning the model retains some speaker-specific characteristics. However, the embedding also contains other features, such as emotion, since these traits are not explicitly disentangled in the current setup.
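To make the "time length t to time length 1" idea concrete, here is a rough toy sketch. It is not the StableTTS reference encoder, which is a learned network; plain mean pooling over time is swapped in as an assumption, and `fake_mel`, `speaker_a`, and `speaker_b` are hypothetical toy data. It only shows that clips of different lengths collapse to a fixed-size vector, and that vectors from the same speaker land close together.

```python
import math
import random


def global_embedding(mel):
    """Collapse a (t, n_mels) spectrogram into one n_mels-sized vector.

    Assumption: simple averaging over the time axis stands in for the
    learned encoder used by the real model.
    """
    t = len(mel)
    n_mels = len(mel[0])
    return [sum(frame[i] for frame in mel) / t for i in range(n_mels)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fake_mel(speaker_profile, t, rng, noise=0.1):
    """Hypothetical toy data: t frames scattered around a per-speaker profile."""
    return [[p + rng.gauss(0.0, noise) for p in speaker_profile] for _ in range(t)]


rng = random.Random(0)
speaker_a = [1.0, 0.2, -0.5, 0.8]   # made-up 4-bin "mel" profiles
speaker_b = [-0.7, 0.9, 0.4, -0.3]

# Clips of different lengths all collapse to the same embedding size.
emb_a1 = global_embedding(fake_mel(speaker_a, 120, rng))
emb_a2 = global_embedding(fake_mel(speaker_a, 300, rng))
emb_b = global_embedding(fake_mel(speaker_b, 200, rng))

print(len(emb_a1), len(emb_a2), len(emb_b))
print("same speaker:", cosine(emb_a1, emb_a2))
print("different speaker:", cosine(emb_a1, emb_b))
```

In this toy setup the two clips from the same speaker end up with a much higher cosine similarity than clips from different speakers, which is the clustering behaviour described above. Because everything in the clip is averaged into one vector, anything else that shifts the frames (such as emotion or pacing) is mixed into the same embedding rather than separated out.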
If you're looking for better control over the emotional aspect of the generated speech, I would recommend checking out these papers for more advanced approaches: