Combine voices #44

Open
dobrosketchkun opened this issue May 14, 2022 · 12 comments

Comments

@dobrosketchkun

It's too small of an issue to create a pull request, I guess.

In your .ipynb file you have this cell:

# You can also combine conditioning voices. Combining voices produces a new voice
# with traits from all the parents.
#
# Lets see what it would sound like if Picard and Kirk had a kid with a penchant for philosophy:
voice_samples, conditioning_latents = load_voices(['pat', 'william'])

gen = tts.tts_with_preset("They used to say that if man was meant to fly, he’d have wings. But he did fly. He discovered he had to.", 
                          voice_samples=None, conditioning_latents=None, preset=preset)
torchaudio.save('captain_kirkard.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio('captain_kirkard.wav')

I think the voice_samples=None, conditioning_latents=None bit is supposed to be voice_samples=voice_samples, conditioning_latents=conditioning_latents, because otherwise the voices loaded on the previous line are never passed to the model.
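For reference, here is a minimal sketch of the corrected call pattern. The helper name generate_combined_voice and the default preset are my own assumptions for illustration. The tts object and the load_voices function are passed in as arguments rather than imported, so the sketch makes no claims about the package layout; in the notebook they would be the same objects the cell above already uses.

```python
def generate_combined_voice(tts, load_voices, text, voices, preset="fast"):
    """Load several voices and forward BOTH results to the model.

    `tts` and `load_voices` are supplied by the caller. The function name
    and the default preset value are illustrative assumptions.
    """
    voice_samples, conditioning_latents = load_voices(voices)
    # The fix from this issue: pass the loaded values instead of None.
    return tts.tts_with_preset(
        text,
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        preset=preset,
    )
```

In the notebook this would be called as generate_combined_voice(tts, load_voices, "some text", ['pat', 'william'], preset=preset).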

@neonbjb
Owner

neonbjb commented May 14, 2022

🤦 Thank you! I broke this in 2ca4ea9.

I've fixed the live colab. The repo fix will need to wait until I wrap up some local development.

@dobrosketchkun
Author

Nice!
And while we are on this matter, what do you think about giving the user the ability to save a random voice? Since it is pulled from a latent space there are no wavs, but couldn't you just save and then load a tensor or something?
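Something like the following is what I have in mind. It is only a sketch under assumptions: save_latents and load_latents are hypothetical helpers, and the latents are stood in for by plain Python lists so that the example runs with the standard library alone. For real tensors, torch.save and torch.load would play the same role.

```python
import pickle


def save_latents(latents, path):
    """Write a latents object to disk (hypothetical helper)."""
    with open(path, "wb") as f:
        pickle.dump(latents, f)


def load_latents(path):
    """Read back a latents object written by save_latents.

    Only load files you created yourself; unpickling untrusted
    data is unsafe.
    """
    with open(path, "rb") as f:
        return pickle.load(f)
```

The loaded object could then be handed back through the conditioning_latents argument instead of drawing a new random voice.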

@neonbjb
Owner

neonbjb commented May 14, 2022

So I originally intended to do this. However, I discovered that for some reason the random voice latents do not consistently produce the same voice. So if you feed the same random voice latent into the model for the same text, you will get two different voices.

I can't explain this. I need to do some further investigation, but haven't found the time.

@dobrosketchkun
Author

However, I discovered that for some reason the random voice latents do not consistently produce the same voice.

Same here. I thought it was on me since I'm not really a programmer, but that's how it is, I guess.

@davidhhh123

How can I clone the voice from one audio clip onto another, without text?

@neonbjb
Owner

neonbjb commented Jun 22, 2022

You can't.

@davidhhh123

What a pity. Can you explain the timestep_independent function? I didn't quite understand what it does with the data.

@davidhhh123

I want to understand the architecture better.

@neonbjb
Owner

neonbjb commented Jun 22, 2022

Diffusion models work by iteratively refining an input from pure Gaussian noise to a desired target space. Those iterations are referred to as "timesteps". In the case of Tortoise, there are some components of the network that produce the same output regardless of what timestep you are on. So for those computations, it is more efficient to do them once and re-use their outputs than to re-compute them for every timestep. This is the purpose of the "timestep_independent" function. It performs every computation that does not rely on the timestep signal.
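To make the caching idea concrete, here is a toy sketch. It is not Tortoise's actual code: ToyDenoiser, its arithmetic, and the sample function are all invented for illustration. The only point is the shape of the loop, where the timestep-independent part runs once and the timestep-dependent part runs on every step.

```python
import math


class ToyDenoiser:
    """Toy stand-in for a diffusion network (illustrative only)."""

    def __init__(self):
        self.independent_calls = 0

    def timestep_independent(self, conditioning):
        # Depends only on the conditioning input, never on the timestep,
        # so the result can be computed once and reused.
        self.independent_calls += 1
        return [math.tanh(c) for c in conditioning]

    def timestep_dependent(self, x, precomputed, t):
        # Depends on the timestep: pull the sample toward the precomputed
        # values, more strongly as t approaches 0.
        weight = 1.0 / (t + 1)
        return [(1 - weight) * xi + weight * pi
                for xi, pi in zip(x, precomputed)]


def sample(model, conditioning, noise, num_steps):
    # Computed a single time, outside the loop.
    precomputed = model.timestep_independent(conditioning)
    x = list(noise)
    for t in reversed(range(num_steps)):
        # Only the timestep-dependent work is repeated.
        x = model.timestep_dependent(x, precomputed, t)
    return x
```

After sample(model, ...) with any number of steps, model.independent_calls is 1, whereas calling timestep_independent inside the loop would repeat the same work on every step.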

@davidhhh123

Thanks for the reply. Can I ask a few more questions? Does get_conditioning_latents fetch the data to clone? And what does an autoregressive model do?

@neonbjb
Owner

neonbjb commented Jun 23, 2022

get_conditioning_latents transforms the voice sample that you provide the model into a vector representation that the AR and diffusion models can use.

For what an AR model does: no disrespect meant, but you should Google this. I am not nearly good enough with words to outcompete all the great content out there on this subject. I'd also watch a video on DALL-E (1) or read the paper.

@davidhhh123

Thank you so much.

zachwe pushed a commit to zachwe/tortoise-tts that referenced this issue Sep 12, 2023