Question about DDP #31

Open
hlh2023214 opened this issue Dec 19, 2024 · 2 comments

Comments

@hlh2023214

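Hi, your open-source project has helped me a lot, but I recently ran into a problem with DDP.
I'm using A6000 GPUs, and single-GPU training works fine. With multiple GPUs, however, training always hangs, whether I launch it with deepspeed or with torchrun. From my own checks, the hang may occur during the data distribution stage, but I don't know how to fix it. Has anyone else run into this, and if so, how did you solve it?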

[Screenshot: 微信图片_20241219192719]
@jingyaogong
Owner

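Sorry, I haven't run into this myself.
The screenshot doesn't make it clear what the problem is.

Perhaps you have already solved it, or someone else has had a similar experience.
Feel free to follow up.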

@lizc2003

On a multi-GPU 4090 machine running Ubuntu 22.04, I hit a hang when calling DistributedDataParallel. Setting this environment variable fixed it: export NCCL_IB_DISABLE=1
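
If exporting the variable in the shell is inconvenient, the same workaround can probably be applied from inside the training script. The following is only a minimal sketch: `configure_nccl_env` is a hypothetical helper name, not part of this project or of PyTorch, and the optional `NCCL_DEBUG=INFO` setting is an addition of mine for locating where a hang occurs. Whether disabling the InfiniBand transport helps on a given machine is an assumption that has to be tested.

```python
import os


def configure_nccl_env(disable_ib: bool = True, debug: bool = False) -> dict:
    """Set NCCL-related environment variables for the current process.

    Hypothetical helper for illustration; not part of the project or PyTorch.
    """
    applied = {}
    if disable_ib:
        # Same effect as `export NCCL_IB_DISABLE=1` in the shell:
        # NCCL will not use the IB/RoCE transport.
        applied["NCCL_IB_DISABLE"] = "1"
    if debug:
        # Optional: make NCCL log its initialization, which can help
        # show where a multi-GPU run gets stuck.
        applied["NCCL_DEBUG"] = "INFO"
    os.environ.update(applied)
    return applied


# Call this before the NCCL process group is created, e.g. before
# torch.distributed.init_process_group(backend="nccl") in the training script.
applied = configure_nccl_env(disable_ib=True, debug=True)
print(applied)
```

The variables must be set in every worker process before NCCL initializes, so the call belongs at the very top of the script that torchrun or deepspeed launches.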
