@neonbjb Is there any possibility that this problem will be solved in the new version?
Please guide us if you can.
I was actually going to ask for this through a different mechanism.
You say elsewhere that you have a plan for moving between two voices based on their latents.
I had intended to just make a collection of voices by reading some source text in varying emotional cadences.
If you were to start by giving us the ability to smoothly shift between voices based on a token, then the question of figuring out when to do it could be pushed downstream.
Consider the case of using this to generate video game dialogue, based on the way one character or another moves through a dialogue tree, or repeated variations for a character in a game like Civ, based on the relationship between the two nations (awe, disgust, fear, colloquial, hatred, distrust, etc.). At that point, emotional inference can come from whatever is happening in the game, and need not - indeed, should not - come from the text at all.
I'm not saying it should never come from the text; I am saying that I think they're separate problems.
If you would help us create text that can, itself, indicate when it's time to shift from voice 1 to voice 2, I think that's a big step in the right direction. We don't need anything fancy like easing; a linear interpolation between located tags would be enough.
There's no need to figure out what the "right" emotional cues are; they'll vary from character to character. Just let us have string labels and we can figure it out.
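As a rough sketch of what that linear interpolation could look like over conditioning latents (the latent shapes here are made up for illustration; this is not Tortoise's actual API):

```python
import numpy as np

def lerp_voices(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Blend two voice latents linearly; t=0 gives a, t=1 gives b."""
    return (1.0 - t) * a + t * b

# Hypothetical latents for two voices (shapes are illustrative only).
neutral = np.zeros(256)
excited = np.ones(256)

# Ramp the voice across the span between two tags.
ramp = [lerp_voices(neutral, excited, t) for t in np.linspace(0.0, 1.0, 5)]
```

A zero-length span (a doubled tag) degenerates to an immediate switch, since the ramp collapses to its endpoint.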
Here's one quick example of how it could be done. It's maybe a little counter-intuitive, but I think it'd work really well.
Add some tag (here I'm using [voice: ... ], but whatever works) which effectively means "at this tag, start transitioning from the voice I'm in to the voice I'm naming here; the transition finishes at the next tag."

python do_tts.py --text "I'm going to speak this [voice:newscasterExcited] and
it's going to go great, [voice:newscasterExcited] like super great,
[voice:newscasterSad] but it will get replaced with better
[voice:newscasterAngry] and that person will feel my [voice:newscasterVeryAngry]
ultimate indignant wrath [voice:newscasterVeryAngry] and my vengeance will be
done!" --voice newscasterNeutral

Notice that I repeat newscasterExcited. The voice is neutral for the first five words; at the first newscasterExcited tag it starts tweening from the neutral it started in to the excited that's being requested, and between the first and second tags it "tweens" from excited to excited, meaning it's not actually tweening, just staying excited. We do the same thing at the end with newscasterVeryAngry. This notation also means you can double up a state for an immediate switch, since the tween phase then spans a zero-length band. (Alternately, you could have a more complex parser and start adding flags, but that's unnecessary, and I'd advise against it.)
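The scheme above can be parsed with a few lines of Python. This is a minimal sketch (the name parse_voice_tags is hypothetical, not part of do_tts.py): it splits the tagged text into spans, each carrying the voice it tweens from and the voice it tweens toward, so a repeated tag naturally produces a span that holds one voice steady.

```python
import re

# Hypothetical helper; not part of do_tts.py.
TAG = re.compile(r"\[voice:\s*(\w+)\s*\]")

def parse_voice_tags(text: str, start_voice: str):
    """Split tagged text into (words, from_voice, to_voice) spans.

    Each span between two tags tweens from the voice in effect at the
    opening tag toward that tag's target; a repeated tag yields a span
    that "tweens" from a voice to itself, i.e. holds steady.
    """
    tags = list(TAG.finditer(text))
    segments = []
    # Text before the first tag stays in the starting voice.
    head = (text[: tags[0].start()] if tags else text).strip()
    if head:
        segments.append((head, start_voice, start_voice))
    current = start_voice
    for i, tag in enumerate(tags):
        end = tags[i + 1].start() if i + 1 < len(tags) else len(text)
        chunk = text[tag.end():end].strip()
        if chunk:
            segments.append((chunk, current, tag.group(1)))
        current = tag.group(1)
    return segments
```

Running this over the example text with start voice newscasterNeutral yields seven spans, and the "like super great," span tweens from newscasterExcited to newscasterExcited, i.e. it stays put.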
Originally posted by @StoneCypher in #10 (comment)