
NCCL problem occurred when multiple GPU cards are saving model.safetensors #160

Open
pyh314 opened this issue Feb 2, 2025 · 0 comments

pyh314 commented Feb 2, 2025

I successfully trained the model Qwen/Qwen2.5-1.5B-Instruct through sft.py. However, when the model.safetensors file was being saved, I ran into the problem below:

[rank4]:[E202 22:11:23.884267610 ProcessGroupNCCL.cpp:616] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800018 milliseconds before timing out.
[rank4]:[E202 22:11:23.885041639 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 4] Exception (either an error or timeout) detected by watchdog at work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank6]:[E202 22:11:23.891257576 ProcessGroupNCCL.cpp:616] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800026 milliseconds before timing out.
[rank6]:[E202 22:11:23.891933152 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 6] Exception (either an error or timeout) detected by watchdog at work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank2]:[E202 22:11:23.894245484 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800028 milliseconds before timing out.
[rank2]:[E202 22:11:23.894987441 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank5]:[E202 22:11:23.895523353 ProcessGroupNCCL.cpp:616] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800030 milliseconds before timing out.
[rank5]:[E202 22:11:23.896311292 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 5] Exception (either an error or timeout) detected by watchdog at work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank3]:[E202 22:11:23.911192076 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800045 milliseconds before timing out.
[rank3]:[E202 22:11:23.912724212 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank7]:[E202 22:11:23.919983167 ProcessGroupNCCL.cpp:616] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800055 milliseconds before timing out.
[rank7]:[E202 22:11:23.920710697 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 7] Exception (either an error or timeout) detected by watchdog at work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank1]:[E202 22:11:23.955180326 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800088 milliseconds before timing out.
[rank1]:[E202 22:11:23.955725285 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.

model.safetensors:  69%|██████▉   | 2.14G/3.09G [29:51<12:35, 1.26MB/s][rank6]:[E202 22:11:24.040743363 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 6] Timeout at NCCL work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank6]:[E202 22:11:24.040764409 ProcessGroupNCCL.cpp:630] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E202 22:11:24.040772046 ProcessGroupNCCL.cpp:636] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E202 22:11:24.041934582 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 4] Timeout at NCCL work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank4]:[E202 22:11:24.041963680 ProcessGroupNCCL.cpp:630] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E202 22:11:24.041969880 ProcessGroupNCCL.cpp:636] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E202 22:11:24.042016965 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800026 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f278d61b446 in /home/yhpeng/anaconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f278e92e772 in /home/yhpeng/anaconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f278e935bb3 in /home/yhpeng/anaconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f278e93761d in /home/yhpeng/anaconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f27d72c45c0 in /home/yhpeng/anaconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f27eba48609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f27eb813353 in /lib/x86_64-linux-gnu/libc.so.6)

It seems the problem is an NCCL timeout that happens while multiple GPUs are trying to save the model at the same time. How can I change the code to address this issue? Thanks!
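
For reference, here is a minimal sketch of the two changes I am considering, assuming a standard transformers/TRL Trainer setup; `trainer` and the config object below are placeholders for whatever sft.py actually constructs, not the repository's real code:

```python
# A minimal sketch, not the repository's actual fix; `trainer` and the SFT
# config object are placeholders for whatever sft.py really builds.
from datetime import timedelta

import torch.distributed as dist

# Option 1: raise the collective timeout above the default 30 minutes, since
# writing a multi-GB model.safetensors can keep the other ranks waiting at the
# all-gather. With recent transformers versions the Trainer exposes this as
# ddp_timeout (in seconds):
#   training_args = SFTConfig(..., ddp_timeout=7200)
# If the process group is created manually instead:
def init_with_long_timeout():
    dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))

# Option 2: keep the save path collective. Every rank must call save_model(),
# because sharded setups (DeepSpeed ZeRO-3 / FSDP) all-gather the weights
# inside it; the Trainer itself only writes files on the main process.
def save_checkpoint(trainer, output_dir):
    trainer.save_model(output_dir)      # called on ALL ranks
    if dist.is_initialized():
        dist.barrier()                  # resynchronize before continuing
```

I am not sure which of the two is the intended fix here, or whether both are needed.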
