NCCL error when saving with DDP #109

Open
1 of 2 tasks
Vindicator645 opened this issue Jul 1, 2024 · 2 comments

Comments

Vindicator645 commented Jul 1, 2024

System Info

8×A100 GPUs in a Docker environment

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

Training always aborts after saving the checkpoint at step 249999. I presume the model-saving process on rank 0 somehow disrupts NCCL communication. According to the logs, the save takes nowhere near the NCCL timeout threshold (which should be 30 min by default). Any advice on how to resolve this issue would be helpful!
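For readers hitting the same symptom, here is a minimal sketch of one common way to keep a rank-0 save from racing with collectives on the other ranks. It is not the project's actual saving code. The helper name `save_checkpoint_synced` and its parameters are hypothetical; the save and barrier functions are passed in so the pattern can be read (and tried) without a GPU.

```python
from typing import Any, Callable


def save_checkpoint_synced(
    state: Any,
    path: str,
    rank: int,
    save_fn: Callable[[Any, str], None],
    barrier: Callable[[], None],
) -> bool:
    """Write a checkpoint on rank 0 while every rank waits at the same barriers.

    Hypothetical helper: in a real DDP job, `save_fn` would typically be
    `torch.save` and `barrier` would be `torch.distributed.barrier`.
    Returns True if this rank wrote the checkpoint.
    """
    # All ranks meet here first, so no rank is still inside a gradient
    # all-reduce when rank 0 starts writing.
    barrier()

    wrote = False
    if rank == 0:
        save_fn(state, path)
        wrote = True

    # All ranks meet again, so no rank runs ahead into the next step's
    # collectives while rank 0 is still writing.
    barrier()
    return wrote
```

Two caveats. The barrier is itself a collective, so a save that outlasts the process-group timeout would still abort; the timeout can be set explicitly through the `timeout` argument of `torch.distributed.init_process_group`, which may be worthwhile because the default has differed between PyTorch releases. And if the save path triggers a collective on rank 0 only, the other ranks never join it, which this sketch does not address.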

Error logs

image

Expected behavior

NCCL timeout error after a certain number of training steps

cnlinxi commented Jul 9, 2024

Same problem. Do you have any solution?

@zhangron013

Same problem here. Do you have any solution?
