YourTTS checkpoint: Dutch, French, German, Italian, Portuguese, Polish, Spanish, and English #2735

Closed
wants to merge 5 commits into from

freds0
Contributor

@freds0 freds0 commented Jul 2, 2023

This pull request adds a new checkpoint for the YourTTS model, trained on multiple languages: Dutch, French, German, Italian, Portuguese, Polish, Spanish, and English.
For more context, see the paper "CML-TTS A Multilingual Dataset for Speech Synthesis in Low-Resource Languages". The model was trained on the CML-TTS dataset plus the LibriTTS dataset for English. Samples generated with this checkpoint are available at: https://freds0.github.io/CML-TTS-Dataset/

@CLAassistant

CLAassistant commented Jul 2, 2023

CLA assistant check
All committers have signed the CLA.

@erogol
Member

erogol commented Jul 2, 2023

Hey @freds0 this is awesome. Would you mind if I move the model somewhere more convenient? It is not very reliable to keep it in gdrive.

@freds0
Contributor Author

freds0 commented Jul 2, 2023

@erogol That sounds like a great idea! It would be great to send it to a more reliable drive. Thanks for the suggestion.

@erogol
Member

erogol commented Jul 4, 2023

To use the training speakers, speakers.pth should have the speaker embeddings too. Or we can release it with only voice cloning.

@freds0
Contributor Author

freds0 commented Jul 4, 2023

I can share it, but I couldn't find it in my backups. I will likely need to generate it again!

@erogol
Member

erogol commented Jul 5, 2023

@freds0 your call, if it is too much work, we can release with voice cloning.

@freds0
Contributor Author

freds0 commented Jul 11, 2023

Hi @erogol, all embeddings have been extracted and are available on Google Drive:
https://drive.google.com/drive/folders/1bS_9-7QFmGWeAd6wtqnnjSS_-wV8FBNP

Or on OneDrive:

https://ufmtbr-my.sharepoint.com/:f:/g/personal/fredoliveira_ufmt_br/EnrCG5tSIiBDqfPlTfPjAGsBqjZWNkjBOd7-MCoxdJaeyQ?e=DFewo8

Is this really what you need?

@erogol
Member

erogol commented Jul 14, 2023

I'll give it a try next Monday. Thanks for sharing 👍

@erogol
Member

erogol commented Jul 24, 2023

@freds0 those files are crazy big. So I'll go with only voice cloning.

@itsjamie

I'm relatively new to this field. Is there documentation somewhere that explains how I would go about using these myself?

What I've tried:

  • renaming best_model.pth to model.pth.

I extracted the initially provided file into the user directory where the other models are downloaded, and hit the following:

Traceback (most recent call last):
  File "/Users/jstackhouse/anaconda3/envs/tts/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/Users/jstackhouse/TTS/TTS/bin/synthesize.py", line 385, in main
    synthesizer = Synthesizer(
  File "/Users/jstackhouse/TTS/TTS/utils/synthesizer.py", line 91, in __init__
    self._load_tts(tts_checkpoint, tts_config_path, use_cuda)
  File "/Users/jstackhouse/TTS/TTS/utils/synthesizer.py", line 185, in _load_tts
    self.tts_model = setup_tts_model(config=self.tts_config)
  File "/Users/jstackhouse/TTS/TTS/tts/models/__init__.py", line 13, in setup_model
    model = MyModel.init_from_config(config=config, samples=samples)
  File "/Users/jstackhouse/TTS/TTS/tts/models/vits.py", line 1797, in init_from_config
    speaker_manager = SpeakerManager.init_from_config(config, samples)
  File "/Users/jstackhouse/TTS/TTS/tts/utils/speakers.py", line 113, in init_from_config
    speaker_manager = SpeakerManager(
  File "/Users/jstackhouse/TTS/TTS/tts/utils/speakers.py", line 63, in __init__
    super().__init__(
  File "/Users/jstackhouse/TTS/TTS/tts/utils/managers.py", line 149, in __init__
    self.load_embeddings_from_list_of_files(embedding_file_path)
  File "/Users/jstackhouse/TTS/TTS/tts/utils/managers.py", line 227, in load_embeddings_from_list_of_files
    ids, clip_ids, embeddings, embeddings_by_names = self.read_embeddings_from_file(file_path)
  File "/Users/jstackhouse/TTS/TTS/tts/utils/managers.py", line 194, in read_embeddings_from_file
    speakers = sorted({x["name"] for x in embeddings.values()})
  File "/Users/jstackhouse/TTS/TTS/tts/utils/managers.py", line 194, in <setcomp>
    speakers = sorted({x["name"] for x in embeddings.values()})
TypeError: 'int' object is not subscriptable

I assume this is because of what @erogol initially said, where the speakers.pth file doesn't contain the embeddings.

With the provided JSON files, how would I go about recreating a working file containing these embeddings?
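
The `TypeError` above is consistent with a speakers file whose values are plain integers (speaker ids) rather than per-clip records. A minimal sketch of a check, assuming the embedding file is a mapping from clip name to a record with `name` and `embedding` keys. The `name` key comes from the traceback; the `embedding` key, the sample data, and the helper name `find_bad_entries` are assumptions for illustration only.

```python
from typing import Any, Dict, List


def find_bad_entries(embeddings: Dict[str, Any]) -> List[str]:
    """Return keys whose values are not {"name": ..., "embedding": [...]} records."""
    bad = []
    for key, value in embeddings.items():
        # An id-only file maps speaker name -> int, which is what triggers
        # "'int' object is not subscriptable" when the loader reads x["name"].
        if not isinstance(value, dict):
            bad.append(key)
        elif "name" not in value or "embedding" not in value:
            bad.append(key)
    return bad


# Id-only mapping (assumed shape of the file that failed): every entry is flagged.
ids_only = {"speaker_a": 0, "speaker_b": 1}
# Per-clip records (assumed shape the loader expects): nothing is flagged.
records = {"clip_001.wav": {"name": "speaker_a", "embedding": [0.1, 0.2]}}

print(find_bad_entries(ids_only))
print(find_bad_entries(records))
```

If every key is flagged, the file most likely holds only speaker ids and the embeddings have to come from a separate file.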

I tried removing speakers.pth and instead using a speakers.json in the format found in the original YourTTS model's folder, but with the model for Spanish.

But doing that I hit:

Traceback (most recent call last):
  File "/Users/jstackhouse/anaconda3/envs/tts/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/Users/jstackhouse/TTS/TTS/bin/synthesize.py", line 385, in main
    synthesizer = Synthesizer(
  File "/Users/jstackhouse/TTS/TTS/utils/synthesizer.py", line 91, in __init__
    self._load_tts(tts_checkpoint, tts_config_path, use_cuda)
  File "/Users/jstackhouse/TTS/TTS/utils/synthesizer.py", line 190, in _load_tts
    self.tts_model.load_checkpoint(self.tts_config, tts_checkpoint, eval=True)
  File "/Users/jstackhouse/TTS/TTS/tts/models/vits.py", line 1721, in load_checkpoint
    self.load_state_dict(state["model"], strict=strict)
  File "/Users/jstackhouse/anaconda3/envs/tts/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2150, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Vits:
	size mismatch for emb_l.weight: copying a param with shape torch.Size([8, 4]) from checkpoint, the shape in current model is torch.Size([0, 4]).

I figure this might be because I haven't loaded the embeddings for every language?
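
One possible reading of the size mismatch: `emb_l.weight` appears to hold one row per language, so the checkpoint expects 8 languages while the model was built with none. That would point to the language id mapping not being found when the config was loaded, rather than to missing per-language embeddings. A minimal sketch of that consistency check; the helper name `check_language_table` and the language codes are hypothetical.

```python
from typing import Dict, Tuple


def check_language_table(language_ids: Dict[str, int], checkpoint_shape: Tuple[int, int]) -> str:
    """Compare the number of configured languages with the rows of emb_l.weight."""
    expected_rows = checkpoint_shape[0]
    found = len(language_ids)
    if found == expected_rows:
        return "ok"
    # One embedding row is allocated per known language, so an empty mapping
    # yields a [0, dim] table that cannot receive an [8, dim] weight.
    return f"mismatch: config lists {found} languages, checkpoint expects {expected_rows}"


# Shape reported in the error message above.
checkpoint_shape = (8, 4)

# Hypothetical language codes; the real names must match the checkpoint's own mapping.
codes = ["nl", "fr", "de", "it", "pt", "pl", "es", "en"]
languages = {code: index for index, code in enumerate(codes)}

print(check_language_table({}, checkpoint_shape))
print(check_language_table(languages, checkpoint_shape))
```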

What should I go read? Or what am I missing?

@freds0
Contributor Author

freds0 commented Aug 3, 2023

@itsjamie there are two ways to run this model effectively. The first method involves using these speaker embeddings files. Alternatively, you can opt for the second method, which requires providing a reference audio that will be sent to the model. To get started, simply follow the step-by-step instructions provided in this link:

https://colab.research.google.com/drive/1nZuvfW-gjoKJgm_S5_f9ydi5W1xvesCK?usp=sharing
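
For the second method, a minimal sketch of how the `tts` command line seen in the tracebacks above might be invoked with a reference audio. All file paths and the language name are placeholders, and `build_tts_command` is a hypothetical helper; the notebook linked above remains the reference for the exact steps.

```python
import shlex
from typing import List


def build_tts_command(
    model_path: str,
    config_path: str,
    speaker_wav: str,
    language: str,
    text: str,
    out_path: str,
) -> List[str]:
    """Assemble the argument list for a voice-cloning call to the tts CLI."""
    return [
        "tts",
        "--model_path", model_path,
        "--config_path", config_path,
        # Reference audio whose voice should be cloned.
        "--speaker_wav", speaker_wav,
        # Language name as defined by the checkpoint (placeholder value below).
        "--language_idx", language,
        "--text", text,
        "--out_path", out_path,
    ]


cmd = build_tts_command(
    model_path="model.pth",          # placeholder path
    config_path="config.json",       # placeholder path
    speaker_wav="reference.wav",     # placeholder path
    language="LANGUAGE_NAME",        # placeholder, check the checkpoint's language ids
    text="Hello world",
    out_path="output.wav",
)

# Print a copy-pasteable shell command instead of running it.
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would execute it, provided the `tts` executable is installed and the paths exist.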

@acul3

acul3 commented Aug 4, 2023

Hey @freds0, thanks for sharing this, cool stuff.

Can you share the TensorBoard log for this model if possible, or at least how many steps the model was trained for?

I'm trying to reproduce the training with a new language, using guidance from your paper and the original YourTTS.

Thank you

@freds0
Contributor Author

freds0 commented Aug 4, 2023

@acul3 Unfortunately I didn't save the logs. But to fine-tune a new language, you should mainly look at the alignment chart. When you have something close to the image below, the training can be ended.
[image: alignment chart]

@freds0
Contributor Author

freds0 commented Aug 14, 2023

@erogol I created a version of the embeddings file with just 10 samples of each speaker (250MB). All speakers from the CML-TTS dataset are included, and also all the speakers from LibriTTS. Here is the download link
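
A reduced file like this could be produced by keeping only the first few clips per speaker. A minimal sketch, assuming the full file maps clip names to records with `name` and `embedding` keys (an assumed layout); `subsample_embeddings` is a hypothetical helper and not the script that was actually used.

```python
from collections import defaultdict
from typing import Any, Dict


def subsample_embeddings(
    embeddings: Dict[str, Dict[str, Any]], per_speaker: int = 10
) -> Dict[str, Dict[str, Any]]:
    """Keep at most `per_speaker` clips for each speaker, in original order."""
    kept: Dict[str, Dict[str, Any]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for clip, record in embeddings.items():
        speaker = record["name"]
        if counts[speaker] < per_speaker:
            kept[clip] = record
            counts[speaker] += 1
    return kept


# Toy data: 30 clips alternating between two made-up speakers (15 clips each).
full = {
    f"clip_{i}.wav": {"name": "spk_a" if i % 2 == 0 else "spk_b", "embedding": [float(i)]}
    for i in range(30)
}

small = subsample_embeddings(full, per_speaker=10)
print(len(small))
```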

@erogol
Member

erogol commented Aug 14, 2023

@freds0 thanks, I'll check. I'm trying to finish my backlog before merging this PR.

@Edresson
Contributor

Edresson commented Sep 7, 2023

@erogol @freds0 I have added a training recipe for the YourTTS model trained in the CML-TTS paper.

@Edresson Edresson requested a review from erogol September 7, 2023 15:46
@erogol
Member

erogol commented Sep 8, 2023

@Edresson you should make a separate PR. I can merge it before we merge this one. (I don't know when I can find time to merge this one. )

@Edresson
Contributor

Edresson commented Sep 9, 2023

> @Edresson you should make a separate PR. I can merge it before we merge this one. (I don't know when I can find time to merge this one. )

I removed the recipe from this PR and added it in #2934

@stale

stale bot commented Oct 14, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

@stale stale bot added the wontfix This will not be worked on but feel free to help. label Oct 14, 2023
@stale stale bot closed this Oct 22, 2023
@Edresson Edresson reopened this Oct 24, 2023
@stale stale bot closed this Nov 1, 2023
@Edresson Edresson reopened this Nov 7, 2023
@stale stale bot removed the wontfix This will not be worked on but feel free to help. label Nov 7, 2023
Correction in training the Fastspeech/Fastspeech2/FastPitch/SpeedySpeech model using external speaker embedding.
6 participants