
Megatron FSDP does not work with CPU_OFFLOADING #1986

@Skylion007

Description

Describe the bug

Enabling optimizer CPU offloading together with Megatron-FSDP and expert parallelism (EP8) crashes during optimizer setup. The parameters handed to HybridDeviceOptimizer are DTensors, so the pin_memory call on them goes through DTensor op dispatch, which tries to propagate sharding for the pinning ops and fails because aten.is_pinned.default has no sharding strategy registered.

2025-10-27 19:28:30 WARNING: one_logger package is required to enable e2e metrics tracking. please go to https://confluence.nvidia.com/display/MLWFO/Package+Repositories for details to install it
2025-10-27 19:39:50 [rank2047]: Traceback (most recent call last):
2025-10-27 19:39:50 [rank2047]:   File "/mnt/sharefs/users/runner/joshcopy/a2aoverlap/Megatron-MoE-ModelZoo/Megatron-LM/pretrain_gpt.py", line 233, in <module>
2025-10-27 19:39:50 [rank2047]:     pretrain(
2025-10-27 19:39:50 [rank2047]:   File "/mnt/sharefs/users/runner/joshcopy/a2aoverlap/Megatron-MoE-ModelZoo/Megatron-LM/megatron/training/training.py", line 666, in pretrain
2025-10-27 19:39:50 [rank2047]:     model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
2025-10-27 19:39:50 [rank2047]:                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-27 19:39:50 [rank2047]:   File "/mnt/sharefs/users/runner/joshcopy/a2aoverlap/Megatron-MoE-ModelZoo/Megatron-LM/megatron/training/training.py", line 1104, in setup_model_and_optimizer
2025-10-27 19:39:50 [rank2047]:     optimizer = get_megatron_optimizer(
2025-10-27 19:39:50 [rank2047]:                 ^^^^^^^^^^^^^^^^^^^^^^^
2025-10-27 19:39:50 [rank2047]:   File "/mnt/sharefs/users/runner/joshcopy/a2aoverlap/Megatron-MoE-ModelZoo/Megatron-LM/megatron/core/optimizer/__init__.py", line 565, in get_megatron_optimizer
2025-10-27 19:39:50 [rank2047]:     _get_megatron_optimizer_based_on_param_groups(
2025-10-27 19:39:50 [rank2047]:   File "/mnt/sharefs/users/runner/joshcopy/a2aoverlap/Megatron-MoE-ModelZoo/Megatron-LM/megatron/core/optimizer/__init__.py", line 341, in _get_megatron_optimizer_based_on_param_groups
2025-10-27 19:39:50 [rank2047]:     optimizer = HybridDeviceOptimizer(
2025-10-27 19:39:50 [rank2047]:                 ^^^^^^^^^^^^^^^^^^^^^^
2025-10-27 19:39:50 [rank2047]:   File "/mnt/sharefs/users/runner/joshcopy/a2aoverlap/Megatron-MoE-ModelZoo/Megatron-LM/megatron/core/optimizer/cpu_offloading/hybrid_optimizer.py", line 80, in __init__
2025-10-27 19:39:50 [rank2047]:     self._init_sub_optimizers()
2025-10-27 19:39:50 [rank2047]:   File "/mnt/sharefs/users/runner/joshcopy/a2aoverlap/Megatron-MoE-ModelZoo/Megatron-LM/megatron/core/optimizer/cpu_offloading/hybrid_optimizer.py", line 188, in _init_sub_optimizers
2025-10-27 19:39:50 [rank2047]:     ) = self._get_sub_optimizer_param_groups(self.offload_fraction)
2025-10-27 19:39:50 [rank2047]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-27 19:39:50 [rank2047]:   File "/mnt/sharefs/users/runner/joshcopy/a2aoverlap/Megatron-MoE-ModelZoo/Megatron-LM/megatron/core/optimizer/cpu_offloading/hybrid_optimizer.py", line 274, in _get_sub_optimizer_param_groups
2025-10-27 19:39:50 [rank2047]:     param = param.detach().clone().cpu().pin_memory()
2025-10-27 19:39:50 [rank2047]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-27 19:39:50 [rank2047]:   File "/usr/local/lib/python3.12/dist-packages/torch/_compile.py", line 51, in inner
2025-10-27 19:39:50 [rank2047]:     return disable_fn(*args, **kwargs)
2025-10-27 19:39:50 [rank2047]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-27 19:39:50 [rank2047]:   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 850, in _fn
2025-10-27 19:39:50 [rank2047]:     return fn(*args, **kwargs)
2025-10-27 19:39:50 [rank2047]:            ^^^^^^^^^^^^^^^^^^^
2025-10-27 19:39:50 [rank2047]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_api.py", line 350, in __torch_dispatch__
2025-10-27 19:39:50 [rank2047]:     return DTensor._op_dispatcher.dispatch(
2025-10-27 19:39:50 [rank2047]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-27 19:39:50 [rank2047]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_dispatch.py", line 160, in dispatch
2025-10-27 19:39:50 [rank2047]:     self.sharding_propagator.propagate(op_info)
2025-10-27 19:39:50 [rank2047]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_sharding_prop.py", line 266, in propagate
2025-10-27 19:39:50 [rank2047]:     OutputSharding, self.propagate_op_sharding(op_info.schema)
2025-10-27 19:39:50 [rank2047]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-27 19:39:50 [rank2047]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_sharding_prop.py", line 45, in __call__
2025-10-27 19:39:50 [rank2047]:     return self.cache(*args, **kwargs)
2025-10-27 19:39:50 [rank2047]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-27 19:39:50 [rank2047]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_sharding_prop.py", line 486, in propagate_op_sharding_non_cached
2025-10-27 19:39:50 [rank2047]:     raise NotImplementedError(
2025-10-27 19:39:50 [rank2047]: NotImplementedError: Operator aten.is_pinned.default does not have a sharding strategy registered.

Steps/Code to reproduce bug

Pretrain a MoE model through Megatron-MoE-ModelZoo's pretrain_gpt.py with Megatron-FSDP, EP8, and optimizer CPU offloading (HybridDeviceOptimizer) enabled. The crash happens in setup_model_and_optimizer, before the first training step. A minimal standalone sketch of the failing pattern follows.
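The sketch below is an assumption-laden stand-in for the real job, not the original launch config: the mesh size, tensor shape, script name, and torchrun invocation are made up for illustration. It only exercises the DTensor + pin_memory dispatch path that the traceback shows, not the full EP8 training run.

# repro_pin_memory_dtensor.py -- hypothetical standalone repro, not the original job.
# Assumed launch: torchrun --nproc_per_node=2 repro_pin_memory_dtensor.py
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor


def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())
    mesh = init_device_mesh("cuda", (dist.get_world_size(),))

    # A sharded parameter, standing in for what Megatron-FSDP hands the optimizer.
    param = distribute_tensor(torch.randn(1024, 1024, device="cuda"), mesh, [Shard(0)])

    # Same chain as hybrid_optimizer.py line 274. pin_memory() queries
    # aten.is_pinned.default through DTensor dispatch, which has no sharding
    # strategy registered, so this raises NotImplementedError.
    pinned = param.detach().clone().cpu().pin_memory()
    print(pinned.shape)


if __name__ == "__main__":
    main()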

Expected behavior

Optimizer setup completes and training starts: the CPU-offload optimizer should be able to make pinned host copies of Megatron-FSDP's sharded (DTensor) parameters without hitting an unsupported DTensor op.

Additional context

The failure is at megatron/core/optimizer/cpu_offloading/hybrid_optimizer.py line 274, in _get_sub_optimizer_param_groups, which runs param.detach().clone().cpu().pin_memory(). Under Megatron-FSDP the param is a DTensor, so pin_memory is routed through DTensor.__torch_dispatch__, which first queries aten.is_pinned.default; no sharding strategy is registered for that op, hence the NotImplementedError. A possible direction is sketched below.
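One possible direction, shown as an unverified sketch rather than a proposed patch: unwrap the DTensor to its local shard before making the pinned CPU copy, so pin_memory/is_pinned never go through DTensor dispatch. The helper name is hypothetical, and it assumes the offloaded copy only needs to mirror this rank's local shard.

# Hypothetical helper for the copy done in _get_sub_optimizer_param_groups.
# Unverified sketch: assumes a per-rank local-shard copy is what the
# CPU-offload optimizer actually needs here.
import torch
from torch.distributed.tensor import DTensor


def _detached_pinned_cpu_copy(param: torch.Tensor) -> torch.Tensor:
    t = param.detach()
    if isinstance(t, DTensor):
        # Drop to the plain local shard; sharding metadata stays with the
        # original DTensor parameter, only the offloaded copy is unwrapped.
        t = t.to_local()
    return t.clone().cpu().pin_memory()

The alternative would be for PyTorch to register sharding strategies for aten.is_pinned.default (and the pinning ops) so that pin_memory works on DTensors directly.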
