TorchComms backend (NCCLX and TorchComms NCCL) initializes but is ignored / not bound to PG in Torchtitan FSDP + DeviceMesh pipeline → large model hangs #2139

Bug description

Hi, I'm testing TorchComms (ncclx and nccl) with Torchtitan using:

TEST_BACKEND=ncclx
TRAIN_FILE=torchtitan.experiments.torchcomms.train
CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_8b.toml
./run_train.sh

and the same for:

TEST_BACKEND=nccl

TorchComms reports successful backend registration:

[TC] Backend ncclx is registered

but when Torchtitan begins initializing the device mesh, the run hangs indefinitely at:

[rank0]: sliced_mesh_layout = self._get_slice_mesh_layout(mesh_dim_names)

Nothing proceeds after that: no error is thrown, the run simply stalls.

This happens for both backends (ncclx and nccl) when using my custom llama3_8b.toml config.
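To help narrow down where the hang sits, I can patch the train entry point with a small debugging sketch of my own (not something Torchtitan does) that dumps Python stacks from a stalled rank using only the standard library:

# Debugging sketch (my addition, not part of Torchtitan): dump Python stacks
# so a hung rank shows whether it is stuck in _get_slice_mesh_layout or in a
# blocked collective underneath it.
import faulthandler
import signal
import sys

# Automatically dump all thread stacks to stderr every 5 minutes while stalled.
faulthandler.dump_traceback_later(300, repeat=True, file=sys.stderr)

# Also allow an on-demand dump via `kill -USR1 <pid>` on a hung rank.
faulthandler.register(signal.SIGUSR1, all_threads=True, chain=False)

I can rerun with this in place and attach the resulting stack dumps if that helps.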

However, if I switch to the small debug model:

CONFIG_FILE=debug_model.toml

then both ncclx and nccl run successfully end-to-end.

But even with the provided llama3_8b.toml config, TorchComms does not appear to be actually selected as the ProcessGroup backend; the run still prints PyTorch warnings such as:

[rank0]:/pscratch/sd/s/sk2693/sysml-project/lib/python3.10/site-packages/torch/distributed/device_mesh.py:603: UserWarning: Slicing a flattened dim from root mesh will be deprecated in PT 2.11. Users need to bookkeep the flattened mesh directly.
[rank0]: sliced_mesh_layout = self._get_slice_mesh_layout(mesh_dim_names)
[rank0]:[titan] 2025-12-10 12:34:26,884 - root - WARNING - 5 CUDA memory allocation retries.
[rank0]:[titan] 2025-12-10 12:34:26,884 - root - INFO - step: 1 loss: 12.2314 grad_norm: 4.2100 memory: 35.57GiB(90.32%) tps: 759 tflops: 43.96 mfu: 14.09%
[rank0]:[titan] 2025-12-10 12:34:26,885 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:/pscratch/sd/s/sk2693/LLM-Collectives-Profiler/torchtitan/torchtitan/distributed/utils.py:387: UserWarning: Set timeout is now only supported for either nccl or gloo.
[rank0]: torch.distributed.distributed_c10d._set_pg_timeout(timeout, group)

which suggests Torchtitan falls back to PyTorch NCCL, not TorchComms.
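To verify this, I plan to add a small check right after the device mesh is constructed (world_mesh below is just a placeholder name for whatever Torchtitan returns) that prints which backend each ProcessGroup is actually bound to:

import torch.distributed as dist

def report_pg_backends(mesh):
    # Backend of the default (world) process group.
    print(f"default PG backend: {dist.get_backend()}")
    # Backend behind each mesh dimension's sub-group.
    for dim_name in mesh.mesh_dim_names or ():
        pg = mesh.get_group(dim_name)
        print(f"mesh dim {dim_name!r}: backend = {dist.get_backend(pg)}")

# e.g. report_pg_backends(world_mesh) right after the mesh is built

If this prints plain nccl everywhere even with TEST_BACKEND=ncclx, that would confirm the registered TorchComms backend is never bound to the process groups.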

Summary

debug_model.toml → runs fine end-to-end for both ncclx and nccl
custom llama3_8b.toml → both backends hang at sliced_mesh_layout = self._get_slice_mesh_layout(mesh_dim_names)
provided llama3_8b.toml → falls back to PyTorch NCCL and warns: "Set timeout is now only supported for either nccl or gloo." at torch.distributed.distributed_c10d._set_pg_timeout(timeout, group)

Happy to provide full logs or help test fixes. I can share my custom llama3_8b.toml as well!

Versions

System: NERSC Perlmutter (4× NVIDIA A100-SXM4-40GB)
Torchtitan: 0.2.0 (installed via pip install -e .)
TorchComms: nightly build from pip (cu128)
PyTorch: 2.10.0.dev20251208+cu128
Python: 3.10.19
