
We are trying to use sentiment text to make the generated audio have different emotions #606

Open
5 tasks done
xwan07017 opened this issue Dec 8, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@xwan07017

Checks

  • This template is only for feature request.
  • I have thoroughly reviewed the project documentation but couldn't find any relevant information that meets my needs.
  • I have searched for existing issues, including closed ones, and found no existing discussion.
  • I confirm that I am using English to submit this report in order to facilitate communication.

1. Is this request related to a challenge you're experiencing? Tell us your story.

We are trying to use sentiment text to make the generated audio have different emotions

2. What is your suggested solution?

We are trying to use sentiment text to make the generated audio have different emotions

3. Additional context or comments

We tried adding emotional features for training. The dataset was about 330 hours of English data, trained for 200k steps, but the emotional guidance was not very effective. If you are interested, we can discuss it together.

4. Can you help us with this feature?

  • I am interested in contributing to this feature.
@xwan07017 xwan07017 added the enhancement New feature or request label Dec 8, 2024
@xwan07017 xwan07017 reopened this Dec 8, 2024
@SWivid
Owner

SWivid commented Dec 8, 2024

Hi @xwan07017, do you mean finetuning with emotional data to enhance the ability and leveraging the reference speech for control, or using tokens that indicate each emotion for control?

I think both will work. If you would like to share some observations, you may drop a zip in this issue, or send it by email if that's not appropriate here.
We have tried some other kinds of control with the latter method. It works to some extent.

@xwan07017
Author

We input reference speech and an emotion feature (like happy or sad ...), and we get new audio with the target emotion (happy, sad, ...) and the same timbre as the reference speech.

@SWivid
Owner

SWivid commented Dec 8, 2024

So have you introduced a new concatenated embedding (model structure modified), or do the emotion features serve as new tokens?
If the former, it is like training from scratch since the input distribution changed (it will need to train longer).
If the latter, I think it will work (but it's hard to diagnose without samples).

@xwan07017
Copy link
Author

@SWivid Yes, we introduced a new concatenated embedding (model structure modified), so we collected a new dataset and trained from scratch.
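The conditioning described above can be sketched as follows. This is a minimal illustration, not the actual F5-TTS code: the emotion vocabulary, embedding dimension, and function names are all hypothetical, and the embedding table shown as random would be learned during training. numpy stands in for the framework tensors.

```python
import numpy as np

# Hypothetical emotion vocabulary and embedding size (illustrative only).
EMOTIONS = {"neutral": 0, "happy": 1, "sad": 2, "angry": 3}
EMB_DIM = 8

rng = np.random.default_rng(0)
# In a real model this table is a learned nn.Embedding; random here.
emotion_table = rng.normal(size=(len(EMOTIONS), EMB_DIM))

def condition_on_emotion(frame_features: np.ndarray, emotion: str) -> np.ndarray:
    """Concatenate a per-utterance emotion embedding onto every input frame.

    frame_features: (T, D) input features.
    Returns: (T, D + EMB_DIM) conditioned features.
    """
    emb = emotion_table[EMOTIONS[emotion]]                      # (EMB_DIM,)
    tiled = np.broadcast_to(emb, (frame_features.shape[0], EMB_DIM))
    return np.concatenate([frame_features, tiled], axis=-1)

feats = rng.normal(size=(100, 64))          # dummy (T=100, D=64) input
conditioned = condition_on_emotion(feats, "happy")
print(conditioned.shape)                    # (100, 72)
```

Because the concatenation widens the model's input, the first projection layer must also change, which is why the input distribution shift forces training from scratch, as noted above.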

@MithrilMan
Contributor

MithrilMan commented Dec 9, 2024

None of these solutions would handle the case where I want to mark up emotions with delimited text (e.g. XML-style) like in the example below, correct?

e.g.

<angry>How do you dare?</angry><sarcastic>Didn't your mother tell you not to play with fire?</sarcastic>
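Splitting such markup into per-emotion segments is straightforward on the text side; the hard part is the model-side conditioning. A minimal sketch of the parsing step (the function name and regex are my own, not part of any repo):

```python
import re

# Match simple paired tags like <angry>...</angry>; \1 backreference
# ensures the closing tag matches the opening one.
TAG_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

def parse_emotion_spans(marked_text: str) -> list[tuple[str, str]]:
    """Return (emotion, text) pairs from XML-style emotion markup."""
    return [(m.group(1), m.group(2)) for m in TAG_RE.finditer(marked_text)]

spans = parse_emotion_spans(
    "<angry>How do you dare?</angry>"
    "<sarcastic>Didn't your mother tell you not to play with fire?</sarcastic>"
)
print(spans)
# [('angry', 'How do you dare?'),
#  ('sarcastic', "Didn't your mother tell you not to play with fire?")]
```

Each segment could then be synthesized with its own emotion condition and the audio concatenated, though that loses cross-segment prosody continuity.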

@niknah

niknah commented Dec 9, 2024

An alternative...
https://github.com/niknah/ComfyUI-F5-TTS?tab=readme-ov-file#multi-voices
This is a custom node for ComfyUI.

Record yourself speaking in a sad voice and a happy voice into separate files.
Then use:

{happy} This is the narrator
{sad} Hello World this is the end
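The `{voice}` syntax above splits the script into per-voice chunks, each synthesized against its own reference recording. A rough sketch of that splitting step (regex and names are mine, not taken from the ComfyUI node):

```python
import re

# Match "{voice} text..." chunks; each chunk runs until the next "{".
CHUNK_RE = re.compile(r"\{(\w+)\}\s*([^{]*)")

def split_voices(script: str) -> list[tuple[str, str]]:
    """Return (voice, text) pairs from a {voice}-marked script."""
    return [(m.group(1), m.group(2).strip()) for m in CHUNK_RE.finditer(script)]

chunks = split_voices(
    "{happy} This is the narrator {sad} Hello World this is the end"
)
print(chunks)
# [('happy', 'This is the narrator'), ('sad', 'Hello World this is the end')]
```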

@MithrilMan
Contributor

Yes, but that's not a model feature; it's an app feature that uses multiple files (also available in the Gradio demo in this repository).
