I successfully trained the Qwen/Qwen2.5-1.5B-Instruct model with sft.py. However, when saving model.safetensors I ran into the problem below:
[rank4]:[E202 22:11:23.884267610 ProcessGroupNCCL.cpp:616] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800018 milliseconds before timing out.
[rank4]:[E202 22:11:23.885041639 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 4] Exception (either an error or timeout) detected by watchdog at work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank6]:[E202 22:11:23.891257576 ProcessGroupNCCL.cpp:616] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800026 milliseconds before timing out.
[rank6]:[E202 22:11:23.891933152 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 6] Exception (either an error or timeout) detected by watchdog at work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank2]:[E202 22:11:23.894245484 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800028 milliseconds before timing out.
[rank2]:[E202 22:11:23.894987441 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank5]:[E202 22:11:23.895523353 ProcessGroupNCCL.cpp:616] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800030 milliseconds before timing out.
[rank5]:[E202 22:11:23.896311292 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 5] Exception (either an error or timeout) detected by watchdog at work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank3]:[E202 22:11:23.911192076 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800045 milliseconds before timing out.
[rank3]:[E202 22:11:23.912724212 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank7]:[E202 22:11:23.919983167 ProcessGroupNCCL.cpp:616] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800055 milliseconds before timing out.
[rank7]:[E202 22:11:23.920710697 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 7] Exception (either an error or timeout) detected by watchdog at work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank1]:[E202 22:11:23.955180326 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800088 milliseconds before timing out.
[rank1]:[E202 22:11:23.955725285 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
model.safetensors: 69%|██████▉ | 2.14G/3.09G [29:51<12:35, 1.26MB/s][rank6]:[E202 22:11:24.040743363 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 6] Timeout at NCCL work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank6]:[E202 22:11:24.040764409 ProcessGroupNCCL.cpp:630] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E202 22:11:24.040772046 ProcessGroupNCCL.cpp:636] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E202 22:11:24.041934582 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 4] Timeout at NCCL work: 59709, last enqueued NCCL work: 59711, last completed NCCL work: 59708.
[rank4]:[E202 22:11:24.041963680 ProcessGroupNCCL.cpp:630] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E202 22:11:24.041969880 ProcessGroupNCCL.cpp:636] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E202 22:11:24.042016965 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=59709, OpType=_ALLGATHER_BASE, NumelIn=29171712, NumelOut=233373696, Timeout(ms)=1800000) ran for 1800026 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f278d61b446 in /home/yhpeng/anaconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f278e92e772 in /home/yhpeng/anaconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f278e935bb3 in /home/yhpeng/anaconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f278e93761d in /home/yhpeng/anaconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f27d72c45c0 in /home/yhpeng/anaconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f27eba48609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f27eb813353 in /lib/x86_64-linux-gnu/libc.so.6)
It seems the problem is an NCCL timeout while multiple GPUs are trying to gather and save the model at the same time. How can I change the code to address this issue? Thanks!
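For reference, one common workaround is to raise the distributed timeout above the 1800 s default that appears in the log (Timeout(ms)=1800000), so the save-time all-gather has enough headroom. This is only a minimal sketch, assuming sft.py builds its arguments from TRL's SFTConfig (a subclass of transformers.TrainingArguments); the output path below is hypothetical:

```python
# Minimal sketch (assumption: sft.py uses TRL's SFTConfig / transformers.TrainingArguments).
# The NCCL watchdog default of 1800 s matches Timeout(ms)=1800000 in the log above;
# raising ddp_timeout gives the parameter all-gather during the final save more time.
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="outputs/qwen2.5-1.5b-sft",  # hypothetical path
    ddp_timeout=7200,                       # seconds; the default is 1800
    # ... keep the rest of your existing sft.py arguments ...
)
```

If the slow step is actually the Hub upload (the ~1.26 MB/s progress bar suggests this), another option is to save locally with push_to_hub=False and push the checkpoint afterwards, so the collective does not have to wait on the network.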