
NCCL problem occurred when multiple GPU cards are saving model.safetensors #160

Open
pyh314 opened this issue Feb 2, 2025 · 0 comments

pyh314 commented Feb 2, 2025

I successfully trained the model Qwen/Qwen2.5-1.5B-Instruct through sft.py. However, when the model.safetensors file was being saved, I ran into the problem below:

[rank4]:[E202 22:11:23.884267610 ProcessGroupNCCL.cpp:616] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800018 milliseconds before timing out.
[rank4]:[E202 22:11:23.885041639 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 4] Exception (either an error or timeout) detected by watchdog at work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank6]:[E202 22:11:23.891257576 ProcessGroupNCCL.cpp:616] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800026 milliseconds before timing out.
[rank6]:[E202 22:11:23.891933152 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 6] Exception (either an error or timeout) detected by watchdog at work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank2]:[E202 22:11:23.894245484 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800028 milliseconds before timing out.
[rank2]:[E202 22:11:23.894987441 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank5]:[E202 22:11:23.895523353 ProcessGroupNCCL.cpp:616] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800030 milliseconds before timing out.
[rank5]:[E202 22:11:23.896311292 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 5] Exception (either an error or timeout) detected by watchdog at work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank3]:[E202 22:11:23.911192076 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800045 milliseconds before timing out.
[rank3]:[E202 22:11:23.912724212 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank7]:[E202 22:11:23.919983167 ProcessGroupNCCL.cpp:616] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800055 milliseconds before timing out.
[rank7]:[E202 22:11:23.920710697 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 7] Exception (either an error or timeout) detected by watchdog at work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank1]:[E202 22:11:23.955180326 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800088 milliseconds before timing out.
[rank1]:[E202 22:11:23.955725285 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.

model.safetensors:  69%|██████▉   | 2.14G/3.09G [29:51<12:35, 1.26MB/s][rank6]:[E202 22:11:24.040743363 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 6] Timeout at NCCL work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank6]:[E202 22:11:24.040764409 ProcessGroupNCCL.cpp:630] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E202 22:11:24.040772046 ProcessGroupNCCL.cpp:636] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E202 22:11:24.041934582 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 4] Timeout at NCCL work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank4]:[E202 22:11:24.041963680 ProcessGroupNCCL.cpp:630] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E202 22:11:24.041969880 ProcessGroupNCCL.cpp:636] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E202 22:11:24.042016965 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800026 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f278d61b446 in /home/yhpeng/anaconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f278e92e772 in /home/yhpeng/anaconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f278e935bb3 in /home/yhpeng/anaconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f278e93761d in /home/yhpeng/anaconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f27d72c45c0 in /home/yhpeng/anaconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f27eba48609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f27eb813353 in /lib/x86_64-linux-gnu/libc.so.6)

It seems the problem is an NCCL timeout that happens while multiple GPUs are trying to save the model at the same time. How can I change the code to address this issue? Thanks!
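
For reference, here is a minimal sketch of the two changes I am considering, assuming a standard transformers/TRL Trainer setup; `trainer` and the config object below are placeholders for whatever sft.py actually constructs, not the repository's real code:

```python
# A minimal sketch, not the repository's actual fix; `trainer` and the SFT
# config object are placeholders for whatever sft.py really builds.
from datetime import timedelta

import torch.distributed as dist

# Option 1: raise the collective timeout above the default 30 minutes, since
# writing a multi-GB model.safetensors can keep the other ranks waiting at the
# all-gather. With recent transformers versions the Trainer exposes this as
# ddp_timeout (in seconds):
#   training_args = SFTConfig(..., ddp_timeout=7200)
# If the process group is created manually instead:
def init_with_long_timeout():
    dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))

# Option 2: keep the save path collective. Every rank must call save_model(),
# because sharded setups (DeepSpeed ZeRO-3 / FSDP) all-gather the weights
# inside it; the Trainer itself only writes files on the main process.
def save_checkpoint(trainer, output_dir):
    trainer.save_model(output_dir)      # called on ALL ranks
    if dist.is_initialized():
        dist.barrier()                  # resynchronize before continuing
```

I am not sure which of the two is the intended fix here, or whether both are needed.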
