Open
Labels: bug (Something isn't working)
Description
Describe the bug
Megatron-FSDP with EP8 tries to shard the pin_memory ops: during optimizer setup, the CPU-offloading HybridDeviceOptimizer calls param.detach().clone().cpu().pin_memory() on parameters that are DTensors, and DTensor sharding propagation fails with NotImplementedError because aten.is_pinned.default has no registered sharding strategy. Log output:
2025-10-27 19:28:30 WARNING: one_logger package is required to enable e2e metrics tracking. please go to https://confluence.nvidia.com/display/MLWFO/Package+Repositories for details to install it
2025-10-27 19:39:50 [rank2047]: Traceback (most recent call last):
2025-10-27 19:39:50 [rank2047]:   File "/mnt/sharefs/users/runner/joshcopy/a2aoverlap/Megatron-MoE-ModelZoo/Megatron-LM/pretrain_gpt.py", line 233, in <module>
2025-10-27 19:39:50 [rank2047]:     pretrain(
2025-10-27 19:39:50 [rank2047]:   File "/mnt/sharefs/users/runner/joshcopy/a2aoverlap/Megatron-MoE-ModelZoo/Megatron-LM/megatron/training/training.py", line 666, in pretrain
2025-10-27 19:39:50 [rank2047]:     model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
2025-10-27 19:39:50 [rank2047]:                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-27 19:39:50 [rank2047]:   File "/mnt/sharefs/users/runner/joshcopy/a2aoverlap/Megatron-MoE-ModelZoo/Megatron-LM/megatron/training/training.py", line 1104, in setup_model_and_optimizer
2025-10-27 19:39:50 [rank2047]:     optimizer = get_megatron_optimizer(
2025-10-27 19:39:50 [rank2047]:                 ^^^^^^^^^^^^^^^^^^^^^^^
2025-10-27 19:39:50 [rank2047]:   File "/mnt/sharefs/users/runner/joshcopy/a2aoverlap/Megatron-MoE-ModelZoo/Megatron-LM/megatron/core/optimizer/__init__.py", line 565, in get_megatron_optimizer
2025-10-27 19:39:50 [rank2047]:     _get_megatron_optimizer_based_on_param_groups(
2025-10-27 19:39:50 [rank2047]:   File "/mnt/sharefs/users/runner/joshcopy/a2aoverlap/Megatron-MoE-ModelZoo/Megatron-LM/megatron/core/optimizer/__init__.py", line 341, in _get_megatron_optimizer_based_on_param_groups
2025-10-27 19:39:50 [rank2047]:     optimizer = HybridDeviceOptimizer(
2025-10-27 19:39:50 [rank2047]:                 ^^^^^^^^^^^^^^^^^^^^^^
2025-10-27 19:39:50 [rank2047]:   File "/mnt/sharefs/users/runner/joshcopy/a2aoverlap/Megatron-MoE-ModelZoo/Megatron-LM/megatron/core/optimizer/cpu_offloading/hybrid_optimizer.py", line 80, in __init__
2025-10-27 19:39:50 [rank2047]:     self._init_sub_optimizers()
2025-10-27 19:39:50 [rank2047]:   File "/mnt/sharefs/users/runner/joshcopy/a2aoverlap/Megatron-MoE-ModelZoo/Megatron-LM/megatron/core/optimizer/cpu_offloading/hybrid_optimizer.py", line 188, in _init_sub_optimizers
2025-10-27 19:39:50 [rank2047]:     ) = self._get_sub_optimizer_param_groups(self.offload_fraction)
2025-10-27 19:39:50 [rank2047]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-27 19:39:50 [rank2047]:   File "/mnt/sharefs/users/runner/joshcopy/a2aoverlap/Megatron-MoE-ModelZoo/Megatron-LM/megatron/core/optimizer/cpu_offloading/hybrid_optimizer.py", line 274, in _get_sub_optimizer_param_groups
2025-10-27 19:39:50 [rank2047]:     param = param.detach().clone().cpu().pin_memory()
2025-10-27 19:39:50 [rank2047]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-27 19:39:50 [rank2047]:   File "/usr/local/lib/python3.12/dist-packages/torch/_compile.py", line 51, in inner
2025-10-27 19:39:50 [rank2047]:     return disable_fn(*args, **kwargs)
2025-10-27 19:39:50 [rank2047]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-27 19:39:50 [rank2047]:   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 850, in _fn
2025-10-27 19:39:50 [rank2047]:     return fn(*args, **kwargs)
2025-10-27 19:39:50 [rank2047]:            ^^^^^^^^^^^^^^^^^^^
2025-10-27 19:39:50 [rank2047]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_api.py", line 350, in __torch_dispatch__
2025-10-27 19:39:50 [rank2047]:     return DTensor._op_dispatcher.dispatch(
2025-10-27 19:39:50 [rank2047]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-27 19:39:50 [rank2047]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_dispatch.py", line 160, in dispatch
2025-10-27 19:39:50 [rank2047]:     self.sharding_propagator.propagate(op_info)
2025-10-27 19:39:50 [rank2047]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_sharding_prop.py", line 266, in propagate
2025-10-27 19:39:50 [rank2047]:     OutputSharding, self.propagate_op_sharding(op_info.schema)
2025-10-27 19:39:50 [rank2047]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-27 19:39:50 [rank2047]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_sharding_prop.py", line 45, in __call__
2025-10-27 19:39:50 [rank2047]:     return self.cache(*args, **kwargs)
2025-10-27 19:39:50 [rank2047]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-27 19:39:50 [rank2047]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_sharding_prop.py", line 486, in propagate_op_sharding_non_cached
2025-10-27 19:39:50 [rank2047]:     raise NotImplementedError(
2025-10-27 19:39:50 [rank2047]: NotImplementedError: Operator aten.is_pinned.default does not have a sharding strategy registered.
Steps/Code to reproduce bug
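A minimal reproduction sketch, assuming the failure can be triggered outside Megatron-LM by pinning a CPU copy of any DTensor parameter (the script, mesh size, and shapes below are hypothetical, not the ModelZoo configuration):

```python
# Hypothetical standalone repro: pin_memory() on a CPU copy of a DTensor goes
# through DTensor dispatch, and aten.is_pinned.default has no sharding strategy.
# Launch with e.g.: torchrun --nproc_per_node=2 repro_dtensor_pin_memory.py
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor


def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank())
    mesh = init_device_mesh("cuda", (dist.get_world_size(),))

    # Sharded parameter standing in for an expert weight under Megatron-FSDP + EP8.
    param = distribute_tensor(
        torch.randn(1024, 1024, device="cuda"), mesh, placements=[Shard(0)]
    )

    # Same call chain as hybrid_optimizer.py line 274; expected to raise
    # NotImplementedError: Operator aten.is_pinned.default does not have a
    # sharding strategy registered.
    param.detach().clone().cpu().pin_memory()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```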
Expected behavior
Optimizer setup should complete: the HybridDeviceOptimizer's CPU offloading should be able to create pinned host copies of parameters sharded by Megatron-FSDP with EP8, instead of failing inside DTensor sharding propagation.
Additional context
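A possible workaround, offered only as an untested sketch: since the error comes from DTensor's sharding propagation for aten.is_pinned.default rather than from pinning itself, the offload path could pin the local shard instead of the DTensor wrapper. The helper below is hypothetical and not part of Megatron-LM:

```python
# Untested sketch: materialize the local shard of a DTensor before pinning so
# pin_memory()/is_pinned() never go through DTensor op dispatch.
import torch
from torch.distributed.tensor import DTensor


def pinned_cpu_copy(param: torch.Tensor) -> torch.Tensor:
    if isinstance(param, DTensor):
        param = param.to_local()  # drop the DTensor wrapper, keep the local shard
    return param.detach().clone().cpu().pin_memory()
```

This sidesteps the dispatch failure but loses the DTensor placement metadata, so the optimizer would need to reattach it if required; registering a sharding strategy for aten.is_pinned.default (and whichever pin_memory op follows it) upstream in PyTorch would be the more general fix.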