debug model training hangs on NVIDIA B200 with >1 GPU #810

Open
@vkuzo

Description

When I try to train the debug model in torchtitan on a single B200 machine, it only trains correctly when limited to a single GPU. Details:

//
// 1 GPU: it works!
//
(pytorch) [[email protected] ~/local/torchtitan (main)]$ with-proxy CONFIG_FILE="./train_configs/debug_model.toml" NGPU=1 CUDA_VISIBLE_DEVICES=0 ./run_llama_train.sh
...
trains as usual!

//
// 2 GPUs: NCCL error
//
(pytorch) [[email protected] ~/local/torchtitan (main)]$ with-proxy CONFIG_FILE="./train_configs/debug_model.toml" NGPU=2 CUDA_VISIBLE_DEVICES=0,1 ./run_llama_train.sh
...
[rank0]:2025-01-28 13:23:00,193 - root - INFO - Training starts at step 1, with local batch size 8, global batch size 16, sequence length 2048, total steps 10 (warmup 2)                    
[rank0]:[rank0]:[E128 13:23:00.788972321 ProcessGroupNCCL.cpp:1897] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
[rank0]:CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.                                                              
[rank0]:For debugging consider passing CUDA_LAUNCH_BLOCKING=1                               
[rank0]:Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                                                                                                                  
[rank0]:                                                                      
[rank0]:Exception raised from c10_cuda_check_implementation at /data/users/vasiliy/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):                                           
[rank0]:frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fdfa3b8d6c8 in /data/users/vasiliy/pytorch/torch/lib/libc10.so)                                                                                                                                                                       
[rank0]:frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe4 (0x7fdfa3b23426 in /data/users/vasiliy/pytorch/torch/lib/libc10.so)                                                                                                                                   
[rank0]:frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3be (0x7fdfa3fb9f4e in /data/users/vasiliy/pytorch/torch/lib/libc10_cuda.so)                                                                                                                                                                                                      
[rank0]:frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fdf86956606 in /data/users/vasiliy/pytorch/torch/lib/libtorch_cuda.so)                                                                                                                                                                                                                 
[rank0]:frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7fdf86965770 in /data/users/vasiliy/pytorch/torch/lib/libtorch_cuda.so)                                                                                                                                                                                                                                        
[rank0]:frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x606 (0x7fdf86966b96 in /data/users/vasiliy/pytorch/torch/lib/libtorch_cuda.so)                                                                                                                                                                                                                                             
[rank0]:frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x148 (0x7fdf86967b78 in /data/users/vasiliy/pytorch/torch/lib/libtorch_cuda.so)                                                                                                                                                                                                                                            
[rank0]:frame #7: <unknown function> + 0xdbbf4 (0x7fdf85adbbf4 in /home/vasiliy/.conda/envs/pytorch/lib/libstdc++.so.6)                                                                                                                                                                                                                                                                    
[rank0]:frame #8: <unknown function> + 0x89e92 (0x7fdfa4889e92 in /lib64/libc.so.6)                                                                                                          
[rank0]:frame #9: <unknown function> + 0x10ef20 (0x7fdfa490ef20 in /lib64/libc.so.6)                                                                                                         
[rank0]:                                                                                                                                                                                     
[rank0]:Fatal Python error: Aborted                                                                                                                                                          
[rank0]:                                                                                                                                                                                     
[rank0]:Thread 0x00007fdcbce00640 (most recent call first):                                                                                                                                  
[rank0]:  File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/threading.py", line 324 in wait                                                                                             
[rank0]:  File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/threading.py", line 622 in wait                                                                                                                                                                                                                                                                                           
[rank0]:  File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run                                                                             
[rank0]:  File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
[rank0]:  File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/threading.py", line 995 in _bootstrap
[rank0]:                                                                
[rank0]:Thread 0x00007fdfa4a89400 (most recent call first):                    
[rank0]:  File "/data/users/vasiliy/pytorch/torch/autograd/grad_mode.py", line 85 in __exit__                                                                                                
[rank0]:  File "/data/users/vasiliy/pytorch/torch/utils/_contextlib.py", line 115 in decorate_context                                                                                        
[rank0]:  File "/data/users/vasiliy/pytorch/torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py", line 288 in wait_for_unshard
[rank0]:  File "/data/users/vasiliy/pytorch/torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py", line 335 in pre_forward                                                               
[rank0]:  File "/data/users/vasiliy/pytorch/torch/distributed/fsdp/_fully_shard/_fsdp_state.py", line 230 in _pre_forward                                                                                                                                                                                                                                                                  
[rank0]:  File "/data/users/vasiliy/pytorch/torch/_dynamo/eval_frame.py", line 745 in _fn                                                                                                    
[rank0]:  File "/data/users/vasiliy/pytorch/torch/distributed/fsdp/_fully_shard/_fsdp_state.py", line 62 in fsdp_hook_wrapper
[rank0]:  File "/data/users/vasiliy/pytorch/torch/nn/modules/module.py", line 1782 in inner                                                                                                  
[rank0]:  File "/data/users/vasiliy/pytorch/torch/nn/modules/module.py", line 1855 in _call_impl                                                                                             
[rank0]:  File "/data/users/vasiliy/pytorch/torch/nn/modules/module.py", line 1749 in _wrapped_call_impl                                                                                     
[rank0]:  File "/data/users/vasiliy/torchtitan/train.py", line 309 in main                                                                                                                   
[rank0]:  File "/data/users/vasiliy/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355 in wrapper                                                               
[rank0]:  File "/data/users/vasiliy/torchtitan/train.py", line 436 in <module>

// full error
P1720698802: https://www.internalfb.com/intern/paste/P1720698802/
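
The 2 GPU trace points at FSDP2's pre-forward all-gather (wait_for_unshard), so a stripped-down fully_shard script outside of torchtitan might be enough to reproduce this. A minimal sketch (untested here; the fully_shard import path assumes a recent build, and fsdp2_repro.py is just a placeholder name):

# Run with e.g.: torchrun --nproc_per_node=2 fsdp2_repro.py   (placeholder filename)
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard  # FSDP2

def main():
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    dist.init_process_group(backend="nccl")
    # A tiny stack of layers, sharded per-layer plus a root wrap,
    # roughly mirroring how torchtitan applies fully_shard per block.
    model = nn.Sequential(*[nn.Linear(256, 256) for _ in range(4)]).cuda()
    for layer in model:
        fully_shard(layer)
    fully_shard(model)
    x = torch.randn(8, 256, device="cuda")
    y = model(x)          # first forward issues the all-gather that errors/hangs above
    y.sum().backward()
    torch.cuda.synchronize()
    if dist.get_rank() == 0:
        print("fully_shard forward/backward completed")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()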

//
// 4 GPUs: hangs indefinitely on first forward
//
(pytorch) [[email protected] ~/local/torchtitan (main)]$ with-proxy CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4 CUDA_VISIBLE_DEVICES=0,1,2,3 ./run_llama_train.sh
...
[rank0]:2025-01-28 13:24:27,764 - root - INFO - CUDA memory usage for model: 0.01GiB(0.01%)                                                                                                                                                                                                                                                                                                
[rank0]:2025-01-28 13:24:27,764 - root - INFO - Training starts at step 1, with local batch size 8, global batch size 32, sequence length 2048, total steps 10 (warmup 2)                                                                                                                                                                                                                  
...hangs here!....

// full error
https://gist.github.com/vkuzo/ce6547e13740b437ee93a1ebf58f7dc4

//
// 8 GPUs: same as 4 GPUs
//
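
To isolate this further: if a bare NCCL collective, with no FSDP2 and no torchtitan involved, also fails or hangs with 2+ ranks on this machine, the problem is below PyTorch's sharding code. A sketch of such a check (nccl_sanity.py is a placeholder name):

# Run with e.g.: NCCL_DEBUG=INFO torchrun --nproc_per_node=2 nccl_sanity.py   (placeholder filename)
import os
import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["RANK"])
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    dist.init_process_group(backend="nccl")
    x = torch.full((1024, 1024), float(rank + 1), device="cuda")
    dist.all_reduce(x)                 # sum across ranks
    torch.cuda.synchronize()
    expected = sum(range(1, dist.get_world_size() + 1))
    print(f"rank {rank}: all_reduce ok, x[0,0]={x[0,0].item()} (expected {expected})")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()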

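Versions are probably relevant too, since multi-GPU Blackwell support depends on fairly recent CUDA and NCCL builds; a quick snippet to dump what this environment is running (output not collected yet):

import torch

print("torch:", torch.__version__)
print("cuda (build):", torch.version.cuda)
print("nccl:", torch.cuda.nccl.version())
print("device:", torch.cuda.get_device_name(0),
      "capability:", torch.cuda.get_device_capability(0))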