TorchComms backend (NCCLX and TorchComms NCCL) initializes but is ignored / not bound to PG in Torchtitan FSDP + DeviceMesh pipeline → large model hangs #2139

Bug description

Hi, I'm testing TorchComms (ncclx and nccl) with Torchtitan using:

TEST_BACKEND=ncclx
TRAIN_FILE=torchtitan.experiments.torchcomms.train
CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_8b.toml
./run_train.sh

and the same for:

TEST_BACKEND=nccl

TorchComms reports successful backend registration:

[TC] Backend ncclx is registered

but when Torchtitan begins initializing the device mesh, the run hangs indefinitely at:

[rank0]: sliced_mesh_layout = self._get_slice_mesh_layout(mesh_dim_names)

Nothing proceeds after that: no error is thrown, the run simply stalls.

This happens for both backends (ncclx and nccl) when using my custom llama3_8b.toml config.
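To help narrow down where the hang sits, I can patch the train entry point with a small debugging sketch of my own (not something Torchtitan does) that dumps Python stacks from a stalled rank using only the standard library:

# Debugging sketch (my addition, not part of Torchtitan): dump Python stacks
# so a hung rank shows whether it is stuck in _get_slice_mesh_layout or in a
# blocked collective underneath it.
import faulthandler
import signal
import sys

# Automatically dump all thread stacks to stderr every 5 minutes while stalled.
faulthandler.dump_traceback_later(300, repeat=True, file=sys.stderr)

# Also allow an on-demand dump via `kill -USR1 <pid>` on a hung rank.
faulthandler.register(signal.SIGUSR1, all_threads=True, chain=False)

I can rerun with this in place and attach the resulting stack dumps if that helps.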

However, if I switch to the small debug model:

CONFIG_FILE=debug_model.toml

then both ncclx and nccl run successfully end-to-end.

But even with the provided llama3_8b.toml config, TorchComms does not appear to be actually selected as the ProcessGroup backend; the run still prints PyTorch warnings such as:

[rank0]:/pscratch/sd/s/sk2693/sysml-project/lib/python3.10/site-packages/torch/distributed/device_mesh.py:603: UserWarning: Slicing a flattened dim from root mesh will be deprecated in PT 2.11. Users need to bookkeep the flattened mesh directly.
[rank0]: sliced_mesh_layout = self._get_slice_mesh_layout(mesh_dim_names)
[rank0]:[titan] 2025-12-10 12:34:26,884 - root - WARNING - 5 CUDA memory allocation retries.
[rank0]:[titan] 2025-12-10 12:34:26,884 - root - INFO - step: 1 loss: 12.2314 grad_norm: 4.2100 memory: 35.57GiB(90.32%) tps: 759 tflops: 43.96 mfu: 14.09%
[rank0]:[titan] 2025-12-10 12:34:26,885 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:/pscratch/sd/s/sk2693/LLM-Collectives-Profiler/torchtitan/torchtitan/distributed/utils.py:387: UserWarning: Set timeout is now only supported for either nccl or gloo.
[rank0]: torch.distributed.distributed_c10d._set_pg_timeout(timeout, group)

which suggests Torchtitan falls back to PyTorch NCCL, not TorchComms.
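To verify this, I plan to add a small check right after the device mesh is constructed (world_mesh below is just a placeholder name for whatever Torchtitan returns) that prints which backend each ProcessGroup is actually bound to:

import torch.distributed as dist

def report_pg_backends(mesh):
    # Backend of the default (world) process group.
    print(f"default PG backend: {dist.get_backend()}")
    # Backend behind each mesh dimension's sub-group.
    for dim_name in mesh.mesh_dim_names or ():
        pg = mesh.get_group(dim_name)
        print(f"mesh dim {dim_name!r}: backend = {dist.get_backend(pg)}")

# e.g. report_pg_backends(world_mesh) right after the mesh is built

If this prints plain nccl everywhere even with TEST_BACKEND=ncclx, that would confirm the registered TorchComms backend is never bound to the process groups.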

Summary

debug_model.toml → runs fine end-to-end for both ncclx and nccl
custom llama3_8b.toml → both backends hang at sliced_mesh_layout = self._get_slice_mesh_layout(mesh_dim_names)
provided llama3_8b.toml → falls back to PyTorch NCCL and warns: "Set timeout is now only supported for either nccl or gloo." at torch.distributed.distributed_c10d._set_pg_timeout(timeout, group)

Happy to provide full logs or help test fixes. I can share my custom llama3_8b.toml as well!

Versions

System: NERSC Perlmutter (4× NVIDIA A100-SXM4-40GB)
Torchtitan: 0.2.0 (installed via pip install -e .)
TorchComms: nightly build from pip (cu128)
PyTorch: 2.10.0.dev20251208+cu128
Python: 3.10.19
