Loss Nan Value #11

Open
PriyankaPaud opened this issue May 30, 2023 · 5 comments

Comments

@PriyankaPaud

The loss value becomes NaN during training, and I also get a CUDA error.

@quancs
Member

quancs commented Jun 3, 2023

I haven't encountered this problem. Are you training with 16-bit precision?

@xxchauncey

Is it normal to run into NaN when training with fp16?

@quancs
Member

quancs commented Oct 10, 2023

Yes, that is normal. You can take a checkpoint from an earlier epoch and continue training from it with 32-bit precision.
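
The recovery described above can be sketched in plain PyTorch. This is a minimal illustration under stated assumptions, not this project's training code: the helper names (`save_checkpoint`, `resume_in_fp32`, `train_step`), the checkpoint layout, the file name `epoch_3.pt`, and the toy `nn.Linear` model are all hypothetical.

```python
import os
import tempfile

import torch
from torch import nn


def save_checkpoint(path, model, optimizer, epoch):
    # Hypothetical checkpoint layout: epoch index plus model/optimizer state.
    torch.save(
        {"epoch": epoch, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
        path,
    )


def resume_in_fp32(path, model, optimizer):
    # Restore an earlier (pre-NaN) checkpoint and make sure weights are float32.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    model.float()  # no-op if the weights were already stored in float32
    return ckpt["epoch"] + 1  # epoch to continue from


def train_step(model, optimizer, x, y):
    # Plain float32 step: no autocast context and no gradient scaler.
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    if not torch.isfinite(loss):
        raise RuntimeError("non-finite loss; roll back to an earlier checkpoint")
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)

# Stand-in for the state saved at the end of an earlier epoch.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
ckpt_path = os.path.join(tempfile.mkdtemp(), "epoch_3.pt")
save_checkpoint(ckpt_path, model, optimizer, epoch=3)

# Fresh objects, as in a restarted training job.
resumed_model = nn.Linear(4, 1)
resumed_optimizer = torch.optim.SGD(resumed_model.parameters(), lr=0.01)
start_epoch = resume_in_fp32(ckpt_path, resumed_model, resumed_optimizer)
loss = train_step(resumed_model, resumed_optimizer, torch.randn(8, 4), torch.randn(8, 1))
```

If training runs through PyTorch Lightning (an assumption; the thread does not say), the same idea would probably be expressed by setting the `Trainer`'s `precision` to 32-bit and passing the checkpoint via `ckpt_path` to `trainer.fit`.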

@quancs
Member

quancs commented Oct 10, 2023

@xxchauncey You can use bf16. Its performance is slightly worse than fp16, but it rarely runs into NaN.
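
One likely reason for this trade-off is numeric range versus precision: fp16 overflows above 65504, while bf16 keeps roughly the float32 exponent range at the cost of fewer mantissa bits. The snippet below is only an illustration of that general property, not a diagnosis of this project; the value `70000.0` is an arbitrary example.

```python
import torch

# Largest finite value and machine epsilon for each 16-bit format.
fp16_info = torch.finfo(torch.float16)
bf16_info = torch.finfo(torch.bfloat16)

# An illustrative activation value just above the fp16 range.
x = torch.tensor([70000.0])

as_fp16 = x.to(torch.float16)    # overflows to inf
as_bf16 = x.to(torch.bfloat16)   # stays finite, but is rounded coarsely

# Once an inf appears, ordinary arithmetic can turn it into NaN (inf - inf).
nan_from_overflow = as_fp16 - as_fp16

print("fp16 max:", fp16_info.max, "eps:", fp16_info.eps)
print("bf16 max:", bf16_info.max, "eps:", bf16_info.eps)
print("fp16 cast:", as_fp16, "bf16 cast:", as_bf16)
print("inf - inf:", nan_from_overflow)
```

The larger `eps` of bf16 is consistent with slightly worse results, and its much larger `max` is consistent with NaN being rare.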

@xxchauncey

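> @xxchauncey You can use bf16. Its performance is slightly worse than fp16, but it rarely runs into NaN.

Thanks. I only recently started working on audio separation. A while ago I tried several different backbones, and every one of them produced NaN partway through training. On V100 cards, the only fix was to switch back to 32-bit precision and continue training. I never ran into this with ASR or small NLP models, so I was curious.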
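
The V100 situation can be captured in a small precision picker. This is a sketch under assumptions: the function name `choose_precision` and the returned labels are hypothetical, and the rule relies on native bf16 support starting with compute capability 8.0 (Ampere), whereas the V100 is 7.0.

```python
import torch


def choose_precision(capability):
    """Pick a training precision from a CUDA compute capability tuple.

    `capability` is (major, minor) as returned by
    torch.cuda.get_device_capability(), or None when no GPU is present.
    The returned labels are illustrative, not a specific framework's flags.
    """
    if capability is None:
        return "fp32"
    major, _minor = capability
    # Native bf16 arithmetic is available from compute capability 8.0 onward.
    return "bf16" if major >= 8 else "fp32"


# Query the current device if there is one; otherwise fall back to None.
current = torch.cuda.get_device_capability() if torch.cuda.is_available() else None
print("suggested precision:", choose_precision(current))
```

For example, `choose_precision((7, 0))` (a V100) yields `"fp32"`, matching the workaround described above.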
