Hello! I'm fine-tuning with TorchTitan using MSCCLPP on Perlmutter A100 GPUs, and I'm hitting the error below (output shortened).
The commands I'm running on a node:
.../torchtitan $ export MSCCLPP_LIB=$HOME/project/mscclpp/build/lib/libmscclpp_nccl.so
.../torchtitan $ export LD_PRELOAD=$MSCCLPP_LIB
.../torchtitan $ NGPU=4 CONFIG_FILE="../trace_gen/deepseek-workload-card.toml" ./run_train.sh
I also tried setting MSCCLPP_FORCE_DISABLE_NVLS to true before re-running the workload (Perlmutter does not have NVLS), but I still get the same error.
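For anyone reproducing this, here is a minimal sketch of one way the variable could be exported and checked so that it reaches the processes launched by run_train.sh. The `sh -c` check is only an illustrative assumption about how to confirm inheritance; it is not part of the run above and does not show whether MSCCLPP actually honors the setting.

```shell
# Export the variable in the same shell that later launches the training script,
# so that child processes inherit it.
export MSCCLPP_FORCE_DISABLE_NVLS=true

# Illustrative check: a child shell should see the exported value.
sh -c 'echo "MSCCLPP_FORCE_DISABLE_NVLS=${MSCCLPP_FORCE_DISABLE_NVLS:-unset}"'
```

If the child shell reports `unset`, the variable was set without `export` or in a different shell than the one that starts the job.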
Error:
[rank0]:[titan] 2026-04-16 19:00:24,722 - root - INFO - Profiling active. Traces will be saved at ./outputs/profile_trace
[rank0]:[rank0]: Traceback (most recent call last):
[rank0]:[rank0]: File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:[rank0]: File "<frozen runpy>", line 88, in _run_code
[rank0]:[rank0]: File "/global/u2/j/user/project/torchtitan-opus/torchtitan/train.py", line 682, in <module>
[rank0]:[rank0]: trainer.train()
[rank0]:[rank0]: File "/global/u2/j/user/project/venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 362, in wrapper
[rank0]:[rank0]: return f(*args, **kwargs)
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/global/u2/j/user/project/torchtitan-opus/torchtitan/train.py", line 608, in train
[rank0]:[rank0]: self.train_step(data_iterator)
[rank0]:[rank0]: File "/global/u2/j/user/project/torchtitan-opus/torchtitan/train.py", line 508, in train_step
[rank0]:[rank0]: loss = self.forward_backward_step(input_dict, labels)
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/global/u2/j/user/project/torchtitan-opus/torchtitan/train.py", line 484, in forward_backward_step
[rank0]:[rank0]: pred = model_parts[0](inputs)
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/global/u2/j/user/project/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
[rank0]:[rank0]: return self._call_impl(*args, **kwargs)
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/global/u2/j/user/project/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1882, in _call_impl
[rank0]:[rank0]: return inner()
[rank0]:[rank0]: ^^^^^^^
[rank0]:[rank0]: File "/global/u2/j/user/project/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1830, in inner
[rank0]:[rank0]: result = forward_call(*args, **kwargs)
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/global/u2/j/user/project/torchtitan-opus/torchtitan/models/deepseek_v3/model/model.py", line 386, in forward
[rank0]:[rank0]: h = self.tok_embeddings(tokens) if self.tok_embeddings is not None else tokens
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/global/u2/j/user/project/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
[rank0]:[rank0]: return self._call_impl(*args, **kwargs)
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/global/u2/j/user/project/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1882, in _call_impl
[rank0]:[rank0]: return inner()
[rank0]:[rank0]: ^^^^^^^
[rank0]:[rank0]: File "/global/u2/j/user/project/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1809, in inner
[rank0]:[rank0]: args_kwargs_result = hook(self, args, kwargs) # type: ignore[misc]
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/global/u2/j/user/project/venv/lib/python3.11/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_state.py", line 62, in fsdp_hook_wrapper
[rank0]:[rank0]: return torch._dynamo.disable(
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/global/u2/j/user/project/venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1181, in _fn
[rank0]:[rank0]: return fn(*args, **kwargs)
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/global/u2/j/user/project/venv/lib/python3.11/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_state.py", line 253, in _pre_forward
[rank0]:[rank0]: args, kwargs = self._fsdp_param_group.pre_forward(module, args, kwargs)
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/global/u2/j/user/project/venv/lib/python3.11/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py", line 448, in pre_forward
[rank0]:[rank0]: self.unshard(self.unshard_async_op)
[rank0]:[rank0]: File "/global/u2/j/user/project/venv/lib/python3.11/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py", line 338, in unshard
[rank0]:[rank0]: self._all_gather_result = foreach_all_gather(
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/global/u2/j/user/project/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
[rank0]:[rank0]: return func(*args, **kwargs)
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/global/u2/j/user/project/venv/lib/python3.11/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_collectives.py", line 275, in foreach_all_gather
[rank0]:[rank0]: all_gather_work = all_gather_comm(
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/global/u2/j/user/project/venv/lib/python3.11/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_collectives.py", line 89, in __call__
[rank0]:[rank0]: return dist.all_gather_into_tensor(
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/global/u2/j/user/project/venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank0]:[rank0]: return func(*args, **kwargs)
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/global/u2/j/user/project/venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4125, in all_gather_into_tensor
[rank0]:[rank0]: work = group._allgather_base(output_tensor, input_tensor, opts)
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: RuntimeError: cuMemMap is used in env without NVLS support (mscclpp failure: InvalidUsage)
[rank0]:[rank0]:[W416 19:00:57.904349038 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:terminate called without an active exception
W0416 19:00:57.727000 1704011 torch/distributed/elastic/multiprocessing/api.py:1010] Sending process 1704093 closing signal SIGTERM
W0416 19:00:57.744000 1704011 torch/distributed/elastic/multiprocessing/api.py:1010] Sending process 1704094 closing signal SIGTERM
W0416 19:00:57.749000 1704011 torch/distributed/elastic/multiprocessing/api.py:1010] Sending process 1704095 closing signal SIGTERM