Question about voice speaker Style #12

MavisHoot · 2024-04-27T04:23:13Z

I am trying to train a TTS model and have a question about speaker style. My dataset contains multiple speakers with different speaking styles. Does the model retain the style of each voice, does it use only one style, or does it depend on the reference audio? For example, my dataset contains an Indian speaker who pauses nervously in conversation. If I train on the whole dataset and then run inference with one audio clip from that speaker as the reference, will the output inherit the nervous speaking style? I eagerly await your response, and thanks for this great repo.

MavisHoot · 2024-04-27T04:26:05Z

A better way to ask the question: can you train it with both a narrator voice and a conversational voice and get both speaking styles, or will I need to train separate models to achieve that?

KdaiP · 2024-09-13T01:44:58Z

Hi, in StableTTS, the Mel spectrogram (with a time length of t) is compressed into a global condition embedding with a time length of 1. By visualizing this embedding, you can observe that it clusters according to the speaker ID, meaning the model retains some speaker-specific characteristics. However, the embedding also contains other features, such as emotion, since these traits are not explicitly disentangled in the current setup.
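As a rough illustration of what "compressed into a global condition embedding with a time length of 1" means, here is a minimal sketch. It assumes plain mean pooling over time as a stand-in for the reference encoder, which is not how StableTTS actually computes the embedding; the function names and the toy mel frames are hypothetical. It only shows that a length-t input collapses to one vector, and that the vector mixes speaker and style information.

```python
import math

def global_embedding(mel):
    """Collapse a mel spectrogram (t frames, each with n_mels values)
    into a single vector by averaging over the time axis.
    Assumption: mean pooling stands in for the real reference encoder."""
    t = len(mel)
    n_mels = len(mel[0])
    return [sum(frame[i] for frame in mel) / t for i in range(n_mels)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy data: 3 mel bins, different numbers of frames per clip.
speaker_a_calm = [[1.0, 0.2, 0.1], [0.9, 0.3, 0.1], [1.1, 0.1, 0.1]]
speaker_a_nervous = [[1.0, 0.2, 0.8], [0.9, 0.3, 0.9]]
speaker_b_calm = [[0.1, 1.0, 0.2], [0.2, 0.9, 0.1]]

emb_a_calm = global_embedding(speaker_a_calm)
emb_a_nervous = global_embedding(speaker_a_nervous)
emb_b_calm = global_embedding(speaker_b_calm)

# Same speaker, different style vs. different speaker.
same_speaker = cosine(emb_a_calm, emb_a_nervous)
different_speaker = cosine(emb_a_calm, emb_b_calm)
print(same_speaker, different_speaker)
```

In this toy setup the two clips from the same speaker end up closer to each other than to the other speaker, but they are not identical, because the style difference is folded into the same vector. That is the sense in which the embedding is not disentangled: the reference clip you pick at inference carries its style along with the speaker identity.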

If you're looking for better control over the emotional aspect of the generated speech, I would recommend checking out these papers for more advanced approaches:

DC CoMixTTS
DiCLET-TTS

I hope this helps! Let me know if you have any further questions!
