
add tp example #964

Open · wants to merge 4 commits into master

Conversation

inkcherry

FYI, @hwchen2017

Signed-off-by: inkcherry <[email protected]>
Signed-off-by: inkcherry <[email protected]>
Signed-off-by: inkcherry <[email protected]>
github-merge-queue bot pushed a commit to deepspeedai/DeepSpeed that referenced this pull request Apr 7, 2025
The release versions are now available. Update from the master branch to use the minimum required versions instead.
Also link the example: deepspeedai/DeepSpeedExamples#964

---------

Signed-off-by: inkcherry <[email protected]>
@ekg

ekg commented Apr 17, 2025

I'm unable to get this to work.

First I run: bash run.sh zero2 (all of the options fail with the same error)

Time to load fused_adam op: 0.1688675880432129 seconds
[rank4]:[E417 20:43:19.177767953 ProcessGroupNCCL.cpp:616] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=131076096, NumelOut=131076096, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
[rank4]:[E417 20:43:19.178407423 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 4] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 291, last completed NCCL work: -1.
[rank4]:[E417 20:43:19.178477173 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 4] Timeout at NCCL work: 1, last enqueued NCCL work: 291, last completed NCCL work: -1.
[rank4]:[E417 20:43:19.178491283 ProcessGroupNCCL.cpp:630] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E417 20:43:19.178503043 ProcessGroupNCCL.cpp:636] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E417 20:43:19.180988563 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=131076096, NumelOut=131076096, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1729647429097/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a237f9c8446 in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7a232e5f14d2 in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a232e5f8913 in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a232e5fa37d in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7a2386b785c0 in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9caa4 (0x7a238769caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7a2387729c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 1 PG GUID 1 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=131076096, NumelOut=131076096, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.

What am I doing wrong?

@inkcherry
Author

inkcherry commented Apr 18, 2025


Hi ekg, if standard ZeRO-1/2 still fails to run properly, it may be due to an incorrect configuration of your CUDA and NCCL versions.
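
Since the timeout hits the very first collective in the log (SeqNum=1, a BROADCAST), one way to check the environment independently of the example is a standalone NCCL smoke test. The sketch below is not part of this PR; the filename nccl_check.py and the launch command are only illustrative, and it assumes torchrun plus a working CUDA/NCCL install.

import os
import torch
import torch.distributed as dist

def main():
    # Initialize the NCCL process group; torchrun supplies RANK/WORLD_SIZE/LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    rank = dist.get_rank()

    # A tiny broadcast is enough to tell whether NCCL collectives work at all;
    # the failing op in the log above is also a broadcast, just much larger.
    t = torch.full((1024,), float(rank), device="cuda")
    dist.broadcast(t, src=0)

    # After the broadcast every rank should hold rank 0's values (all zeros).
    assert torch.all(t == 0), f"rank {rank}: broadcast returned unexpected data"
    print(f"rank {rank}: NCCL broadcast OK")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launching it as, for example, NCCL_DEBUG=INFO torchrun --nproc_per_node=<num_gpus> nccl_check.py also prints which transports NCCL negotiates; if this test hangs too, the problem lies in the CUDA/NCCL setup (driver mismatch, blocked P2P/IB transport, etc.) rather than in the example.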

@inkcherry
Author

@hwchen2017 just a reminder in case you missed this. Thanks!
