When I try to train the debug model in torchtitan on a single B200 node, it only trains correctly when limited to one GPU; every multi-GPU run either hits an NCCL/CUDA error or hangs. Details:
//
// 1 GPU: it works!
//
(pytorch) [[email protected] ~/local/torchtitan (main)]$ with-proxy CONFIG_FILE="./train_configs/debug_model.toml" NGPU=1 CUDA_VISIBLE_DEVICES=0 ./run_llama_train.sh
...
trains as usual!
//
// 2 GPUs: NCCL error
//
(pytorch) [[email protected] ~/local/torchtitan (main)]$ with-proxy CONFIG_FILE="./train_configs/debug_model.toml" NGPU=2 CUDA_VISIBLE_DEVICES=0,1 ./run_llama_train.sh
...
[rank0]:2025-01-28 13:23:00,193 - root - INFO - Training starts at step 1, with local batch size 8, global batch size 16, sequence length 2048, total steps 10 (warmup 2)
[rank0]:[rank0]:[E128 13:23:00.788972321 ProcessGroupNCCL.cpp:1897] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
[rank0]:CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]:For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]:Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank0]:
[rank0]:Exception raised from c10_cuda_check_implementation at /data/users/vasiliy/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
[rank0]:frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fdfa3b8d6c8 in /data/users/vasiliy/pytorch/torch/lib/libc10.so)
[rank0]:frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe4 (0x7fdfa3b23426 in /data/users/vasiliy/pytorch/torch/lib/libc10.so)
[rank0]:frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3be (0x7fdfa3fb9f4e in /data/users/vasiliy/pytorch/torch/lib/libc10_cuda.so)
[rank0]:frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fdf86956606 in /data/users/vasiliy/pytorch/torch/lib/libtorch_cuda.so)
[rank0]:frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7fdf86965770 in /data/users/vasiliy/pytorch/torch/lib/libtorch_cuda.so)
[rank0]:frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x606 (0x7fdf86966b96 in /data/users/vasiliy/pytorch/torch/lib/libtorch_cuda.so)
[rank0]:frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x148 (0x7fdf86967b78 in /data/users/vasiliy/pytorch/torch/lib/libtorch_cuda.so)
[rank0]:frame #7: <unknown function> + 0xdbbf4 (0x7fdf85adbbf4 in /home/vasiliy/.conda/envs/pytorch/lib/libstdc++.so.6)
[rank0]:frame #8: <unknown function> + 0x89e92 (0x7fdfa4889e92 in /lib64/libc.so.6)
[rank0]:frame #9: <unknown function> + 0x10ef20 (0x7fdfa490ef20 in /lib64/libc.so.6)
[rank0]:
[rank0]:Fatal Python error: Aborted
[rank0]:
[rank0]:Thread 0x00007fdcbce00640 (most recent call first):
[rank0]: File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/threading.py", line 324 in wait
[rank0]: File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/threading.py", line 622 in wait
[rank0]: File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
[rank0]: File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
[rank0]: File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/threading.py", line 995 in _bootstrap
[rank0]:
[rank0]:Thread 0x00007fdfa4a89400 (most recent call first):
[rank0]: File "/data/users/vasiliy/pytorch/torch/autograd/grad_mode.py", line 85 in __exit__
[rank0]: File "/data/users/vasiliy/pytorch/torch/utils/_contextlib.py", line 115 in decorate_context
[rank0]: File "/data/users/vasiliy/pytorch/torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py", line 288 in wait_for_unshard
[rank0]: File "/data/users/vasiliy/pytorch/torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py", line 335 in pre_forward
[rank0]: File "/data/users/vasiliy/pytorch/torch/distributed/fsdp/_fully_shard/_fsdp_state.py", line 230 in _pre_forward
[rank0]: File "/data/users/vasiliy/pytorch/torch/_dynamo/eval_frame.py", line 745 in _fn
[rank0]: File "/data/users/vasiliy/pytorch/torch/distributed/fsdp/_fully_shard/_fsdp_state.py", line 62 in fsdp_hook_wrapper
[rank0]: File "/data/users/vasiliy/pytorch/torch/nn/modules/module.py", line 1782 in inner
[rank0]: File "/data/users/vasiliy/pytorch/torch/nn/modules/module.py", line 1855 in _call_impl
[rank0]: File "/data/users/vasiliy/pytorch/torch/nn/modules/module.py", line 1749 in _wrapped_call_impl
[rank0]: File "/data/users/vasiliy/torchtitan/train.py", line 309 in main
[rank0]: File "/data/users/vasiliy/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355 in wrapper
[rank0]: File "/data/users/vasiliy/torchtitan/train.py", line 436 in <module>
// full error
P1720698802: https://www.internalfb.com/intern/paste/P1720698802/
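For context, the rank0 stack above dies inside FSDP2's `wait_for_unshard` during the first pre-forward, i.e. while waiting on the parameter all-gather. A minimal sketch of that code path outside torchtitan (my own hypothetical repro, assuming `fully_shard` is importable from `torch.distributed.fsdp` as in recent builds) would look like:

```python
# fsdp2_min_repro.py -- hypothetical standalone sketch, not part of torchtitan.
# Launch with: torchrun --nproc_per_node=2 fsdp2_min_repro.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # FSDP2 per-module API


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Tiny stand-in for the debug model: a few linear layers, each its own FSDP group.
    model = nn.Sequential(*[nn.Linear(256, 256) for _ in range(4)]).cuda()
    for layer in model:
        fully_shard(layer)
    fully_shard(model)  # root wrapper

    x = torch.randn(8, 256, device="cuda")
    out = model(x)        # first forward: pre_forward -> wait_for_unshard (all-gather)
    out.sum().backward()  # backward: reduce-scatter of gradients

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```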
//
// 4 GPUs: hangs indefinitely on first forward
//
(pytorch) [[email protected] ~/local/torchtitan (main)]$ with-proxy CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4 CUDA_VISIBLE_DEVICES=0,1,2,3 ./run_llama_train.sh
...
[rank0]:2025-01-28 13:24:27,764 - root - INFO - CUDA memory usage for model: 0.01GiB(0.01%)
[rank0]:2025-01-28 13:24:27,764 - root - INFO - Training starts at step 1, with local batch size 8, global batch size 32, sequence length 2048, total steps 10 (warmup 2)
...hangs here!....
// full error
https://gist.github.com/vkuzo/ce6547e13740b437ee93a1ebf58f7dc4
//
// 8 GPUs: same as 4 GPUs
//
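Since the 4- and 8-GPU runs hang without any traceback, one low-effort way to see where each rank is stuck (assuming it is acceptable to add a couple of lines near the top of torchtitan's train.py; this is my own debugging aid, not something torchtitan ships) is to register a faulthandler signal so the Python stacks of a hung rank can be dumped on demand:

```python
# Hypothetical debugging aid, not part of torchtitan: add near the top of train.py.
import faulthandler
import signal

# `kill -USR1 <pid>` on a hung rank then prints every thread's Python stack to stderr.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```

This is pure stdlib, so it does not perturb the NCCL setup; CUDA_LAUNCH_BLOCKING=1, as suggested in the 2-GPU error output above, is the other low-effort knob for localizing the failing kernel.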