Hi,
I've tried to reproduce training of MegaPose on Jean Zay, and it failed with this error:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1646756402876/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
I made some fixes to the codebase so that it runs on Jean Zay (JZ); they are in this commit: https://github.com/ponimatkin/happypose/commit/44aacdb79e0557ae50ea84716338e322c6ebe239
Do you happen to know what causes this NCCL error?