Loss NaN Value #11
Comments
I didn't encounter this problem. Did you use 16-bit precision training?
Is it normal to run into NaN when training with fp16?
Yes, that's normal. You can take a checkpoint from an earlier epoch and continue training in 32-bit precision.
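To decide which earlier checkpoint to resume from, one option is to take the last epoch logged before the loss first became non-finite. This is only a minimal sketch: `last_good_epoch` and the per-epoch loss list are hypothetical names, not part of this repository, and it assumes one checkpoint is saved per epoch.

```python
import math
from typing import Optional, Sequence


def last_good_epoch(epoch_losses: Sequence[float]) -> Optional[int]:
    """Return the index of the last epoch before the first NaN/inf loss.

    Returns None if there is no usable epoch (empty log, or the very
    first epoch is already non-finite). If every loss is finite, the
    last epoch is returned.
    """
    for epoch, loss in enumerate(epoch_losses):
        if not math.isfinite(loss):
            # Everything from this epoch on is suspect; step back one.
            return epoch - 1 if epoch > 0 else None
    return len(epoch_losses) - 1 if epoch_losses else None


# Hypothetical loss log: the run diverges during epoch 2.
losses = [0.91, 0.54, float("nan"), float("nan")]
resume_from = last_good_epoch(losses)  # epoch 1 in this log
```

The checkpoint for `resume_from` would then be loaded with mixed precision turned off, so the rest of the run stays in fp32.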
@xxchauncey You can use bf16. Its performance is slightly worse than fp16, but it rarely runs into NaN.
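A rough illustration of why bf16 overflows less often than fp16: fp16 tops out at 65504, while bf16 keeps the 8-bit exponent of float32 and gives up mantissa bits instead. The sketch below uses only the standard library. `fits_fp16` and `to_bf16` are hypothetical helper names, and bf16 is emulated by truncating the float32 encoding (real hardware usually rounds to nearest), which is enough to show the range difference.

```python
import math
import struct


def fits_fp16(x: float) -> bool:
    """Return True if the finite value x can be stored as a finite fp16."""
    if not math.isfinite(x):
        return False
    try:
        struct.pack("<e", x)  # "e" is IEEE 754 half precision
        return True
    except OverflowError:
        return False


def to_bf16(x: float) -> float:
    """Emulate bf16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


big = 70000.0                  # e.g. a large intermediate activation
print(fits_fp16(big))          # False: above the fp16 maximum of 65504
print(to_bf16(big))            # finite, but only about 2-3 significant digits
```

Once a value like this overflows to inf in fp16, later operations such as inf - inf turn it into NaN. In bf16 the same value stays finite at the cost of precision.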
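Thanks. I only recently started working on audio separation. Over the past while I switched between several backbones, and every one of them produced NaN midway through training; on V100 cards the only fix was to switch back to 32-bit precision and continue training. I never ran into this with ASR or small NLP models, so I was curious.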
I am getting NaN as the loss value, along with a CUDA error, while training.