I keep getting NCCL timeouts during train_stage. Could you offer any advice? #224

@SHAWEGG

ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1182] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=1800000) ran for 1800650 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x759f6ca12d87 in /root/anaconda3/envs/hallo1/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x759f2129c6e6 in /root/anaconda3/envs/hallo1/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x759f2129fc3d in /root/anaconda3/envs/hallo1/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x759f212a0839 in /root/anaconda3/envs/hallo1/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xecdb4 (0x759f722ecdb4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x9caa4 (0x759f71e9caa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x129c3c (0x759f71f29c3c in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=1800000) ran for 1800650 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x759f6ca12d87 in /root/anaconda3/envs/hallo1/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x759f2129c6e6 in /root/anaconda3/envs/hallo1/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x759f2129fc3d in /root/anaconda3/envs/hallo1/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x759f212a0839 in /root/anaconda3/envs/hallo1/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xecdb4 (0x759f722ecdb4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x9caa4 (0x759f71e9caa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x129c3c (0x759f71f29c3c in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x759f6ca12d87 in /root/anaconda3/envs/hallo1/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xdf6b11 (0x759f20ff6b11 in /root/anaconda3/envs/hallo1/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xecdb4 (0x759f722ecdb4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x9caa4 (0x759f71e9caa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x129c3c (0x759f71f29c3c in /usr/lib/x86_64-linux-gnu/libc.so.6)

05/10/2025 22:51:25 - ERROR - root - Failed to execute the training process: invalid load key, '\x9c'. | 0/40 [00:00<?, ?it/s]
[rank3]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1182] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=1800000) ran for 1800724 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7b99ff869d87 in /root/anaconda3/envs/hallo1/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, s

This is the log directory.
