Add VITS 2 model #123
Comments
--> TIME: 2024-11-01 07:59:12 -- STEP: 199/406 -- GLOBAL_STEP: 100200
Cool, would be happy to add VITS 2! Are you basing it on the initial work from @p0p4k in coqui-ai#3355? It's impossible to say why something isn't working without seeing any code, but I'm fine with merging something that just works with one GPU for now. It could be improved later. The original VITS had some issues with multi-GPU as well (#103). Are these the same issues?
I'll try to fix it before opening a PR. I can say that VITS 2 is a massive improvement over VITS; at least to my ears, the model seems far more robust. In my implementation, a VITS model trained with Coqui can be trained further as VITS 2 by reinitializing dp (the duration predictor) and the text encoder at the beginning of training, which allows me to compare the models.
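The reinit step described above (starting from a VITS checkpoint but letting dp and the text encoder start fresh) can be sketched roughly as filtering the checkpoint before loading it. This is only an illustration, not the code from the implementation discussed here: the prefixes `duration_predictor.` and `text_encoder.` and the toy key names are assumptions, so check them against the real checkpoint's keys.

```python
# Hypothetical parameter-name prefixes for the modules to reinitialize.
# These are assumptions for illustration; inspect the real checkpoint keys.
REINIT_PREFIXES = ("duration_predictor.", "text_encoder.")


def drop_reinit_keys(state_dict, prefixes=REINIT_PREFIXES):
    """Return a copy of state_dict without entries whose name starts with any prefix."""
    prefixes = tuple(prefixes)
    return {k: v for k, v in state_dict.items() if not k.startswith(prefixes)}


# Toy stand-in for a checkpoint; real values would be tensors.
toy_checkpoint = {
    "text_encoder.emb.weight": 1,
    "duration_predictor.pre.weight": 2,
    "waveform_decoder.conv_pre.weight": 3,
    "flow.flows.0.pre.weight": 4,
}

kept = drop_reinit_keys(toy_checkpoint)
```

With PyTorch, the filtered dict would then typically be loaded with `model.load_state_dict(kept, strict=False)`, so the dropped modules keep their fresh initialization.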
Thanks for doing this work, guys. If you need any other paper implementation or assistance with porting to Coqui, let me know.
@p0p4k How do we know when to freeze the duration discriminator in VITS 2, and when to remove the noise from MAS?
@Marioando For the duration discriminator, do you mean freezing it before we start training it, or freezing it after MAS has been trained for some time and gives accurate results?
@p0p4k I thought the VITS 2 paper said they trained the duration discriminator for 30k steps. I reread the paper and it was the duration predictor, so we don't need to freeze the duration discriminator, just freeze the duration predictor once we get good results. Right?
Right. I was thinking that initially MAS is still waiting for the text embeddings to get to a reasonable place before it can give the right ground-truth durations, so we can wait for it to stabilize first and then start training the duration discriminator.
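The idea in the last few comments (start the duration discriminator only once MAS has stabilized, and freeze the duration predictor after some number of steps) amounts to a step-based schedule. A minimal sketch, assuming plain step thresholds: the 30k figure is the one mentioned in this thread, the 10k start step is a placeholder, and none of the names below come from the Coqui trainer.

```python
# Hypothetical thresholds for illustration only.
DUR_DISC_START_STEP = 10_000   # placeholder: when MAS is assumed to have stabilized
DUR_PRED_FREEZE_STEP = 30_000  # the 30k steps discussed in this thread


def is_active(step, start_step=0, stop_step=None):
    """True while start_step <= step < stop_step (no upper bound if stop_step is None)."""
    if step < start_step:
        return False
    return stop_step is None or step < stop_step


def schedule(step):
    """Which duration modules should receive updates at this global step."""
    return {
        "train_duration_discriminator": is_active(step, start_step=DUR_DISC_START_STEP),
        "train_duration_predictor": is_active(step, stop_step=DUR_PRED_FREEZE_STEP),
    }
```

In a training loop, the returned flags could gate whether each module's loss is computed and whether its parameters have `requires_grad` enabled.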
@eginhard I have made a PR for VITS 2. Here is some audio from the model:
I trained the model using d-vectors.
Hi,
I'm working on adding the VITS 2 model to the Coqui framework. While testing the implementation, I found that the model trains well on a single GPU. In multi-GPU training, however, as soon as the second step, all losses are normal (i.e. loss0 and loss1) except loss2, the loss of the duration discriminator, which becomes NaN. So here is my question: do you think I need to modify the trainer, or the batch sampler in the model? I have also changed the trainer to filter out null gradients in multi-GPU training, but that doesn't work. Here is what I have already tried: decreasing the learning rate for the duration discriminator, adding gradient clipping, and decreasing the batch size to 1 for testing. None of these work on multi-GPU. The model seems to learn well in a single-GPU setup.
Thanks
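For narrowing down a problem like the one described above, a small check that reports which loss went non-finite at a given step can help. This is a generic sketch, not part of the Coqui trainer; the loss names follow the ones used in this post and the values are made up for illustration.

```python
import math


def non_finite_losses(losses):
    """Return the names of losses that are NaN or +/-inf, in input order."""
    return [name for name, value in losses.items() if not math.isfinite(value)]


# Made-up values; in practice these would be the per-step scalars,
# e.g. obtained with loss.item() on each rank.
step_losses = {"loss0": 2.31, "loss1": 0.87, "loss2": float("nan")}

bad = non_finite_losses(step_losses)
if bad:
    print(f"non-finite losses at this step: {bad}")
```

Logging this per rank shows whether the NaN appears on every GPU or only on some. If the result is used to skip an optimizer step under distributed training, all ranks would need to agree on the decision, since ranks that diverge on whether to run the backward pass can stall gradient synchronization.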