Add VITS 2 model #123

Open
Marioando opened this issue Nov 1, 2024 · 11 comments · May be fixed by #137
Labels: enhancement (New feature or request)

Comments

@Marioando

Hi,
I'm working on adding the VITS 2 model to the Coqui framework. While testing the implementation, I found that the model trains well on a single GPU, but as soon as multi-GPU training reaches its second step, all losses are normal (i.e. loss_0 and loss_1) except loss_2, the duration discriminator loss, which becomes NaN. So here is my question: do you think I need to modify the trainer, or modify the batch sampler in the model? I have also made some changes to the trainer to filter out null gradients in multi-GPU, but that doesn't work. Here is what I have already tried: decreasing the learning rate for the duration discriminator, adding gradient clipping, and decreasing the batch size to 1 for testing; none of these work on multi-GPU. The model seems to learn well in a single-GPU setup.
Thanks
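For context, an illustrative workaround (not from the thread): under DDP, a NaN on one rank has to be handled on every rank, otherwise a skipped backward pass desynchronizes the gradient all-reduce. A minimal PyTorch sketch, where dur_disc_step_is_finite is a hypothetical helper name:

    import torch
    import torch.distributed as dist

    def dur_disc_step_is_finite(loss: torch.Tensor) -> bool:
        # True only if the loss is finite on *every* rank; if any rank sees
        # NaN/Inf, all ranks agree to skip so DDP's all-reduce stays in sync.
        finite = torch.isfinite(loss).float()
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(finite, op=dist.ReduceOp.MIN)
        return bool(finite.item())

    # Usage in the duration-discriminator branch (sketch):
    #   if dur_disc_step_is_finite(loss_dict["loss_dur_disc"]):
    #       scaler.scale(loss_dict["loss_dur_disc"]).backward()
    #       scaler.step(dur_disc_optimizer)
    #   scaler.update()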

@Marioando
Author

--> TIME: 2024-11-01 07:59:12 -- STEP: 199/406 -- GLOBAL_STEP: 100200
| > loss_disc: 2.293909788131714 (2.353372510354123)
| > loss_disc_real_0: 0.050190214067697525 (0.09111052809573301)
| > loss_disc_real_1: 0.22900593280792236 (0.20297033208698484)
| > loss_disc_real_2: 0.2125558704137802 (0.220549658090625)
| > loss_disc_real_3: 0.2014939934015274 (0.22777624622960785)
| > loss_disc_real_4: 0.2580271363258362 (0.22694031534782008)
| > loss_disc_real_5: 0.23579958081245422 (0.23088655137836034)
| > loss_0: 2.293909788131714 (2.353372510354123)
| > grad_norm_0: tensor(38.8719, device='cuda:0') (tensor(168.8013, device='cuda:0'))
| > loss_gen: 2.4380598068237305 (2.5592831391185973)
| > loss_kl: 3.0022356510162354 (5.0805860691933145)
| > loss_feat: 5.34114408493042 (5.2965420885900745)
| > loss_mel: 20.770143508911133 (21.53628662722793)
| > loss_duration: 1.849429965019226 (1.862948954404898)
| > loss_1: 33.4010124206543 (36.335646701218515)
| > grad_norm_1: tensor(815.8643, device='cuda:0') (tensor(1645.5496, device='cuda:0'))
| > loss_dur_disc: nan
| > loss_dur_disc_real_0: nan
| > amp_scaler: 64.0 (227.05527638190944)
| > loss_2: nan
| > grad_norm_2: tensor(0) (tensor(0))
| > current_lr_0: 0.0002
| > current_lr_1: 0.0002
| > current_lr_2: 0.0002
| > step_time: 1.8149 (1.429261895280387)
| > loader_time: 0.0206 (0.015218985140623162)

@Marioando
Author

from torch.cuda.amp import autocast  # needed for the full-precision loss block below

# Duration-discriminator branch of the model's train_step.
if optimizer_idx == 2:
    output_prob_for_real, output_probs_for_pred = self.dur_disc(
        self.model_outputs_cache['hidden_encoded_text'],
        self.model_outputs_cache['hidden_encoded_text_mask'],
        self.model_outputs_cache['real_durations'],  # log-scaled
        self.model_outputs_cache['predicted_durations'],  # log-scaled
    )

    outputs = {
        "real_durations": self.model_outputs_cache['real_durations'],  # log-scaled
        "predicted_durations": self.model_outputs_cache['predicted_durations'],  # log-scaled
    }

    # Compute the discriminator loss in full precision, outside AMP autocast.
    with autocast(enabled=False):
        loss_dict = criterion[optimizer_idx](
            output_prob_for_real,
            output_probs_for_pred,
        )

    return outputs, loss_dict
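For reference (not the PR's actual criterion): the duration-discriminator loss in VITS-family models is typically a least-squares GAN objective; a minimal sketch, with dur_disc_loss as a hypothetical name:

    import torch

    def dur_disc_loss(output_prob_for_real, output_probs_for_pred):
        # Least-squares GAN objective: push scores for real durations
        # toward 1 and scores for predicted durations toward 0.
        loss_real = torch.mean((1.0 - output_prob_for_real) ** 2)
        loss_fake = torch.mean(output_probs_for_pred ** 2)
        return {"loss_dur_disc": loss_real + loss_fake}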

@eginhard
Member

eginhard commented Nov 1, 2024

Cool, would be happy to add Vits 2! Are you basing it on the initial work from @p0p4k in coqui-ai#3355?

It's impossible to say why something isn't working without seeing any code, but I'm fine with merging something that just works with one GPU for now; it could be improved later. The original VITS had some issues with multi-GPU as well (#103); are these the same issues?

@Marioando
Author

I'll try to fix it before opening a PR. I can say that VITS 2 is a massive improvement over VITS, at least to my ears; the model seems to be much more robust. In my implementation, a VITS model trained with Coqui can be trained as VITS 2 by re-initializing the duration predictor and text encoder at the beginning of training, which lets me compare the two models (see the sketch below).
I didn't use the prototype from @p0p4k; it was much easier to start from the original VITS in Coqui.
I'm currently busy trying to add @p0p4k's pflow implementation, and that is my priority, but I will try to work on this model as soon as possible.
Thank you for your time! I think the Coqui framework makes experimenting with TTS much faster! We appreciate your work maintaining this repo. Thank you.
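A rough sketch of that warm-start idea (not the PR's actual code; attribute names follow Coqui's Vits class but are assumptions here, and config is an existing VitsConfig):

    import torch
    from TTS.tts.models.vits import Vits  # Coqui's VITS implementation

    # Load a plain VITS checkpoint, then re-initialize the duration predictor
    # and text encoder so they can be retrained under the VITS 2 objective.
    model = Vits.init_from_config(config)
    state = torch.load("vits_checkpoint.pth", map_location="cpu")["model"]
    model.load_state_dict(state, strict=False)

    def reinit_module(module: torch.nn.Module) -> None:
        # Reset every submodule that knows how to re-initialize itself.
        for m in module.modules():
            if hasattr(m, "reset_parameters"):
                m.reset_parameters()

    reinit_module(model.duration_predictor)  # assumed attribute name
    reinit_module(model.text_encoder)        # assumed attribute name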

@p0p4k

p0p4k commented Nov 2, 2024

Thanks for doing this work, guys. If you need any other paper implementations or assistance with porting to Coqui, let me know.

@Marioando
Author

@p0p4k How do we know when to freeze the duration discriminator in VITS 2, and also when to remove the noise from MAS?

@p0p4k

p0p4k commented Nov 4, 2024

@Marioando For the duration discriminator, do you mean freezing it before it starts to train, or freezing it after MAS has trained for some time and gives accurate results?
The number of steps before removing the noise from MAS is probably something to find experimentally; I'd say 10k steps should be fine.
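An illustrative noise schedule for MAS (the initial scale and decay rate are assumptions; only the idea of annealing the noise to zero comes from the thread):

    def mas_noise_scale(step: int, initial: float = 0.01, decay: float = 2e-6) -> float:
        # Linearly anneal the noise added to the MAS log-likelihoods to zero.
        return max(0.0, initial - decay * step)

    # Inside the alignment search (sketch):
    #   log_p = log_p + mas_noise_scale(global_step) * torch.randn_like(log_p)
    #   attn = maximum_path(log_p, attn_mask)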

@Marioando
Author

Marioando commented Nov 4, 2024

@p0p4k I thought the VITS 2 paper said they trained the duration discriminator for 30k steps, but rereading the paper, it was the duration predictor. So we don't need to freeze the duration discriminator; we just freeze the duration predictor after we get good results. Right!?
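Freezing a module in PyTorch is just a matter of disabling gradients; a minimal sketch, with freeze_duration_predictor as a hypothetical helper and the 30k threshold taken from the comment above:

    def freeze_duration_predictor(model, global_step: int, freeze_at: int = 30_000) -> None:
        # After `freeze_at` steps, stop updating the duration predictor.
        if global_step >= freeze_at:
            for p in model.duration_predictor.parameters():
                p.requires_grad = False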

@p0p4k

p0p4k commented Nov 4, 2024

Right. I was thinking that initially MAS is still waiting for the text embeddings to get to a reasonable place and give the right ground-truth durations, so we could wait for it to stabilize first and only then start training the duration discriminator.

@Marioando
Author

Marioando commented Nov 7, 2024

@eginhard I have made a PR for VITS 2. Here is some audio from the model:
vits2_audio_samples.zip.tar.gz
It's still not perfect, but I think we can improve it. Concerning the multi-GPU training issue: would computing loss_1 together with loss_2, i.e. using only two optimizers, help?

@Marioando
Author

I trained the model using d-vectors.

@eginhard eginhard added the "enhancement" label Nov 8, 2024
@eginhard eginhard changed the title from "Vits 2 doesnt work on multigpu training" to "Add VITS 2 model" Nov 8, 2024
@eginhard eginhard linked a pull request Nov 8, 2024 that will close this issue