System Info
8 × A100 GPUs in a Docker environment
Information
🐛 Describe the bug
Training always aborts after saving the checkpoint for step 249999. I presume that the model-saving process on rank 0 somehow disrupts NCCL communication. According to the logs, however, saving takes nowhere near the NCCL timeout threshold (which should be 30 minutes by default). Any advice on how to resolve this issue would be helpful!
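To make the hypothesis above easier to check, here is a minimal sketch of how the save time could be compared against the timeout, and how the timeout could be raised explicitly. It is not taken from the training script in question. The helper names (`save_exceeds_timeout`, `init_distributed`, `timed_rank0_save`) and the two-hour value are hypothetical, and the 30-minute figure is only the default assumed in this report. The only library calls it relies on are `torch.distributed.init_process_group` (with its `timeout` argument) and `torch.distributed.barrier`.

```python
import time
from datetime import timedelta

# Assumption from this report: the NCCL timeout is 30 minutes by default.
ASSUMED_NCCL_TIMEOUT = timedelta(minutes=30)


def save_exceeds_timeout(save_seconds, timeout=ASSUMED_NCCL_TIMEOUT):
    """Return True if a checkpoint save of `save_seconds` reaches the timeout."""
    return timedelta(seconds=save_seconds) >= timeout


def init_distributed(timeout=timedelta(hours=2)):
    """Initialise the process group with an explicit, longer timeout.

    The two-hour value is a hypothetical choice for debugging only.
    """
    import torch.distributed as dist  # imported lazily; requires PyTorch

    dist.init_process_group(backend="nccl", timeout=timeout)


def timed_rank0_save(save_fn, rank):
    """Save on rank 0 only, then make every rank wait at a barrier.

    Returns the seconds this rank spent in the save-plus-barrier region,
    which can be logged per rank to see which rank is stalling.
    """
    import torch.distributed as dist  # imported lazily; requires PyTorch

    start = time.monotonic()
    if rank == 0:
        save_fn()  # hypothetical callable that writes the checkpoint
    dist.barrier()  # all ranks wait here until rank 0 has finished saving
    return time.monotonic() - start
```

If the per-rank durations stay well below the timeout and the job still aborts, the save itself is probably not the cause, and the collective that runs right after the checkpoint step would be the next thing to inspect.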
Error logs
Expected behavior
NCCL timeout error after a certain number of training steps