Nan loss during training #10
Comments
Update: I trained the small version, and everything is fine.
Hi, I also encountered NaN loss during training (especially when testing fp16 training), but the final configuration (2x80GB A100 GPUs with the configured learning rate and batch size) trained successfully for me without any NaNs.
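For readers wondering why fp16 in particular tends to produce NaN, here is a minimal sketch using only the Python standard library. It is not taken from this repository; `fits_in_fp16` and `to_fp16` are hypothetical helper names. It shows that half precision overflows just above 65504 and flushes very small values to zero, and that arithmetic on the resulting infinities yields NaN.

```python
import math
import struct

def fits_in_fp16(x: float) -> bool:
    """Return True if x can be stored as an IEEE 754 half-precision float."""
    try:
        struct.pack("<e", x)  # "e" is the struct format code for binary16
        return True
    except OverflowError:
        return False

def to_fp16(x: float) -> float:
    """Round-trip x through half precision (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

largest_ok = fits_in_fp16(65504.0)   # largest finite fp16 value
too_large = fits_in_fp16(70000.0)    # overflows in fp16
tiny = to_fp16(1e-8)                 # underflows to zero in fp16

# Once a value has overflowed to infinity, later arithmetic can produce NaN.
nan_from_inf = float("inf") - float("inf")
print(largest_ok, too_large, tiny, math.isnan(nan_from_inf))
```

Large activations or loss values can hit the overflow case and small gradients can hit the underflow case, which is one common reason fp16 runs diverge where fp32 or bf16 runs do not.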
I am still facing the same issue with train.sh (41GB of GPU memory). Also, is it normal for training to run this slowly? I increased num_workers to 32 and set max_grad_norm=5e-4, but I still get NaN loss.
I would suggest setting num_workers=0 and also applying the other fix from my message above.
We've followed your training instructions to avoid NaN loss, but now we're encountering exploding gradients after 20k training steps. Will you be releasing the pretrained model? It would greatly assist us in reproducing the results outlined in the paper.
We do not release weights because of licensing issues. I'd be happy to help with any reproduction issues. How exactly did you get the exploding gradients? I never encountered them, and I think that is what the max_grad_norm parameter is meant to prevent.
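To make the role of max_grad_norm concrete, here is a minimal pure-Python sketch of global-norm gradient clipping. It assumes max_grad_norm is a threshold on the global L2 norm of the gradients, which is the usual meaning in training frameworks; the thread does not confirm how this repository applies it, and `clip_by_global_norm` is a hypothetical name.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale grads so their global L2 norm does not exceed max_norm.

    Returns the (possibly scaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Never scale up: the factor is capped at 1.0.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# A large but finite gradient is scaled down to roughly the threshold.
clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm, clipped)

# Clipping does not repair non-finite values: a NaN gradient stays NaN,
# and an infinite gradient becomes NaN because inf * 0.0 is NaN.
nan_case, nan_norm = clip_by_global_norm([float("nan"), 1.0], max_norm=1.0)
inf_case, inf_norm = clip_by_global_norm([float("inf"), 1.0], max_norm=1.0)
print(nan_case, inf_case)
```

Under this assumption, clipping bounds large finite gradients but cannot recover from a step whose loss or gradients are already NaN or infinite, so logging the pre-clipping norm is a useful way to see which case is occurring.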
Hi team, thanks for sharing this great work. I have a problem when training with train.sh on a 40GB A100. I set batch_size=2 and gradient_accumulation_steps=16 and tried LR=5e-5 and LR=2.5e-5. The training loss becomes NaN with both learning rates.
Do you have any suggestions?
Thanks!
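When comparing this setup with other configurations, it can help to compute the effective batch size per optimizer step. The sketch below assumes a single GPU, since the post mentions one 40GB A100; `effective_batch_size` is a hypothetical helper, not part of the repository.

```python
def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of samples contributing to one optimizer step."""
    return per_device_batch * accumulation_steps * num_devices

# Settings from the post above: batch_size=2, gradient_accumulation_steps=16.
# num_devices=1 is an assumption.
samples_per_step = effective_batch_size(2, 16, num_devices=1)
print(samples_per_step)
```

If the effective batch size differs from the configuration the learning rate was tuned for, the learning rate may need to be adjusted along with it.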