Bug description
Hi, I'm testing TorchComms (ncclx and nccl) with Torchtitan using:
TEST_BACKEND=ncclx
TRAIN_FILE=torchtitan.experiments.torchcomms.train
CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_8b.toml
./run_train.sh
and the same for:
TEST_BACKEND=nccl
TorchComms reports successful backend registration:
[TC] Backend ncclx is registered
but when Torchtitan begins initializing the device mesh, the run hangs indefinitely at:
[rank0]: sliced_mesh_layout = self._get_slice_mesh_layout(mesh_dim_names)
Nothing proceeds after that: no error is thrown, the run simply stalls.
This happens for both backends (ncclx and nccl) when using my custom llama3_8b.toml config.
However, if I switch to the small debug model:
CONFIG_FILE=debug_model.toml
then both ncclx and nccl run successfully end-to-end.
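For reference, the hang appears to be inside DeviceMesh slicing. Below is a minimal sketch of the call path as I understand it, assuming the standard init_device_mesh API; the dimension names and sizes are illustrative placeholders, not my actual parallelism config:

from torch.distributed.device_mesh import init_device_mesh

# Build a 2-D mesh roughly the way torchtitan sets up its parallel dims;
# the names/sizes below are placeholders for illustration only.
mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp_shard", "tp"))

# Slicing a sub-mesh by name is what ends up calling _get_slice_mesh_layout
# internally; this is the point where my llama3_8b runs stall.
dp_mesh = mesh["dp_shard"]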
But even when using the provided llama3_8b.toml config, it looks like TorchComms is not actually selected as the ProcessGroup backend; the run still prints PyTorch warnings such as:
[rank0]:/pscratch/sd/s/sk2693/sysml-project/lib/python3.10/site-packages/torch/distributed/device_mesh.py:603: UserWarning: Slicing a flattened dim from root mesh will be deprecated in PT 2.11. Users need to bookkeep the flattened mesh directly.
[rank0]: sliced_mesh_layout = self._get_slice_mesh_layout(mesh_dim_names)
[rank0]:[titan] 2025-12-10 12:34:26,884 - root - WARNING - 5 CUDA memory allocation retries.
[rank0]:[titan] 2025-12-10 12:34:26,884 - root - INFO - step: 1 loss: 12.2314 grad_norm: 4.2100 memory: 35.57GiB(90.32%) tps: 759 tflops: 43.96 mfu: 14.09%
[rank0]:[titan] 2025-12-10 12:34:26,885 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:/pscratch/sd/s/sk2693/LLM-Collectives-Profiler/torchtitan/torchtitan/distributed/utils.py:387: UserWarning: Set timeout is now only supported for either nccl or gloo.
[rank0]: torch.distributed.distributed_c10d._set_pg_timeout(timeout, group)
which suggests Torchtitan falls back to PyTorch NCCL, not TorchComms.
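As a quick sanity check on which backend the default process group actually got, I can add a probe like the one below right after initialization (a minimal sketch, assuming torch.distributed is already initialized; the exact name TorchComms registers under is my assumption):

import torch.distributed as dist

# Print the backend name of the default process group; if TorchComms were
# actually selected I would expect something other than plain "nccl" here.
print(f"[rank {dist.get_rank()}] default PG backend: {dist.get_backend()}")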
Summary
- debug_model.toml → runs fine end-to-end for both ncclx and nccl
- custom llama3_8b.toml → both backends hang at sliced_mesh_layout = self._get_slice_mesh_layout(mesh_dim_names)
- llama3_8b.toml → UserWarning: Set timeout is now only supported for either nccl or gloo. (raised from torch.distributed.distributed_c10d._set_pg_timeout(timeout, group); see the workaround sketch after this list)
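If it helps, one possible workaround for the timeout warning would be to guard the call by backend before invoking the private setter. This is only a sketch of the idea, not the actual torchtitan code; the helper name is hypothetical:

from datetime import timedelta
import torch.distributed as dist

def maybe_set_pg_timeout(group, timeout: timedelta) -> None:
    # _set_pg_timeout currently only supports nccl/gloo, so skip the call
    # for any other registered backend (e.g. a TorchComms-provided one).
    if dist.get_backend(group) in ("nccl", "gloo"):
        dist.distributed_c10d._set_pg_timeout(timeout, group)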
Happy to provide full logs or help test fixes. I can share my custom llama3_8b.toml as well!
Versions
System: NERSC Perlmutter (4× NVIDIA A100-SXM4-40GB)
Torchtitan: 0.2.0 (installed via pip install -e .)
TorchComms: nightly build from pip (cu128)
- Installed from PyTorch nightly wheels: pip install --pre torch torchcomms --index-url https://download.pytorch.org/whl/nightly/cu128
PyTorch: 2.10.0.dev20251208+cu128
Python: 3.10.19
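If a fuller environment dump is useful, I can also attach the output of torch's built-in collector:

python -m torch.utils.collect_env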