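Hello, your open-source project has helped me a lot, but I recently ran into a problem with DDP. My GPUs are A6000s, and single-GPU runs work fine. With multiple GPUs, however, training always hangs, whether I launch with deepspeed or torchrun. From my own debugging, the blockage may be in the data distribution stage, but I don't know how to fix it. Has anyone else run into this, and how did you solve it?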
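Sorry, I haven't run into this. The screenshot doesn't make clear what the problem is.
Perhaps it has already been resolved, or someone else has had a similar experience. Feel free to follow up.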
On a multi-GPU 4090 machine running Ubuntu 22.04, my run hung at the call to DistributedDataParallel. Setting an environment variable fixed it: export NCCL_IB_DISABLE=1
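For anyone who would rather apply this workaround from inside the training script than from the shell, here is a minimal sketch. It assumes the variables are set before the process group is created, since NCCL reads them when it initializes. The helper name `configure_nccl_env` is hypothetical and not part of this project or of PyTorch. `NCCL_DEBUG=INFO` is an optional extra that makes NCCL log its transport selection, which may help show where a hang occurs. Whether disabling InfiniBand helps on a given machine is not guaranteed; it worked for the 4090 setup described above.

```python
import os


def configure_nccl_env(disable_ib: bool = True, debug: bool = False) -> dict:
    """Hypothetical helper: set NCCL environment variables for this process.

    Call it at the very top of the training script, before
    torch.distributed.init_process_group(...) and before the model is
    wrapped in DistributedDataParallel.
    """
    if disable_ib:
        # Same effect as `export NCCL_IB_DISABLE=1` in the shell:
        # NCCL skips the InfiniBand transport and falls back to others.
        os.environ["NCCL_IB_DISABLE"] = "1"
    if debug:
        # Optional: verbose NCCL logging to help locate a hang.
        os.environ["NCCL_DEBUG"] = "INFO"
    # Return the NCCL-related settings now visible to this process.
    return {k: v for k, v in os.environ.items() if k.startswith("NCCL_")}


if __name__ == "__main__":
    settings = configure_nccl_env(disable_ib=True, debug=True)
    for name, value in sorted(settings.items()):
        print(f"{name}={value}")
    # The usual distributed setup would follow here, e.g. initializing
    # the process group and wrapping the model in DistributedDataParallel.
```

With torchrun or deepspeed, every worker process runs the script, so each one sets the variables for itself. Exporting them in the shell before launching remains the simpler option when the launch command is under your control.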