Add DCP compatibility for FSDP2-TP sharding in TransformerEngine. #2713

Open · cspades wants to merge 7 commits into NVIDIA:main from cspades:cye/fsdp2-tp-dcp

Conversation

@cspades (Member) commented Feb 26, 2026

Summary

  • Support (H/F)SDP2 x TP strided sharding, and DTensor FP8 parameters for Torch DCP checkpointing, across all TransformerEngineBaseModule(s).
    • Except GroupedLinear, pending FSDP2 standalone pipe-cleaning. All other modules under transformer_engine.pytorch.modules are supported.
    • FusibleOperation support is also a WIP, except for LayerNorm and RMSNorm, which are TE modules.
  • Associated with BioNeMo-Recipes Llama3 TP: Enable TransformerEngine-backed Tensor Parallelism with Llama3. bionemo-framework#1483
    • Notably, TransformerEngine TP can be easily mixed with DTensor-based TP when unified by Torch DCP! In the Llama3 recipe, we use DTensor-based TP on the torch.nn.Embedding, TransformerEngine-based TP on the LM head, and weight-tie the LM head to the torch.nn.Embedding, which is why we do not need to call set_device_mesh for the LM head!
  • Credit to @pstjohn for coming up with this idea!

Usage / Documentation

(tp_mesh and weight_mesh can also be passed in TEModule.__init__.)

    def set_device_mesh(
        self,
        tp_mesh: Optional[DeviceMesh] = None,
        weight_mesh: Optional[DeviceMesh] = None,
    ) -> None:
        """
        Set DeviceMesh(s) used for sharding weights and convert main weights into DTensor
        depending on the TransformerEngine class to support FSDP-TP sharding with FSDP2.

        TransformerEngine manages tensor parallel mechanics, while DTensor offers seamless
        integration with Torch DCP checkpointing. This method should only be invoked when
        using DTensor parameters, e.g. when using FSDP2 or DCP.

        When FSDP2 fully_shard() encounters any DTensor Shard(s), it will automatically
        convert them into FSDP-TP strided or non-strided shards depending on the current
        sharding dimension and factor of the DTensor. When the sharding dimension of FSDP
        matches that of TP, FSDP uses a _StridedShard placement type instead of Shard.
        This experimental FSDP-TP logic resides in this FSDP2 initialization function:
        ``torch.distributed.fsdp._fully_shard._fsdp_param._init_sharded_param``

        Parameters
        ----------
        tp_mesh : Optional[DeviceMesh]
            A 1-D DeviceMesh containing a TP mesh dimension, e.g. device_mesh["tp"].
            Only required when using TP with DTensor parameters, e.g. for FSDP2 or DCP.
        weight_mesh : Optional[DeviceMesh]
            A 1-D DeviceMesh containing a weight-sharding mesh dimension. Only required
            when using the FP8 Current (per-tensor) Scaling recipe on sharded DTensor
            parameters and if the DTensor DeviceMesh includes dimensions that do not
            shard weights, such as in the case of HSDP (DP-Replicate x DP-Shard).
            For example:
                - device_mesh["dp"] for FSDP.
                - device_mesh["dp_cp"] if using CP ranks in FSDP.
                - device_mesh["dp_shard"] if using HSDP ("dp_replicate", "dp_shard").
                - device_mesh["tp"] if using TP.
                - device_mesh["dp_cp_tp"] if strided-sharding with FSDP-TP.
        """

Details

DTensor Lifecycle in TransformerEngine

  • Initialization
    • __init__
      • TransformerEngine model parameters are initialized either on device or meta device with the appropriate tp_size and TP sharding strategy, e.g. parallel_mode and sequence_parallel.
    • TransformerEngineModule.set_device_mesh(tp_mesh, weight_mesh)
      • Converts parameters to DTensor with appropriate TP placement(s) based on the TP sharding strategy specified in __init__, using transformer_engine.pytorch.distributed._convert_param_to_dtensor_param.
        • tp_mesh is a 1-D DeviceMesh containing the TP ProcessGroup that will be registered with the TransformerEngine module.
        • weight_mesh is the 1-D DeviceMesh containing the ProcessGroup that shards TransformerEngine module weights, the flattened combination of groups such as FSDP and TP. Specifically, it excludes non-weight groups such as DP-Replicate when using HSDP or HSDP-TP and is mainly required for per-Tensor scaling recipes like Float8CurrentScaling.
      • Needs to be invoked prior to fully_shard (which responds to the TP placements) and prior to reset_parameters(defer_init=False), which quantizes parameters.
      • Can also be directly invoked during __init__(tp_mesh, weight_mesh) for supported TransformerEngine modules.
    • fully_shard shards the TransformerEngine model with FSDP2.
      • When fully_shard encounters TP sharding on dim=0, it uses a _StridedShard for DP. Put simply, FSDP2 "pre-shards" the data before sharding on the current placement, then concatenates the pre-shards into strided shards that are re-sharded by the next placement. This effectively reverses the sharding order when processing the placements left-to-right, distributing shards as if we had sharded on TP first and then FSDP, as required, even though DP appears before TP in the DeviceMesh and DTensor.placements. (See the Appendix for a visualization of this sharding strategy.)
    • reset_parameters is called if using meta device initialization.
  • Training
    • Pre-forward, FSDP2 all-gathers the sharded DTensor "main" weight that it registered during fully_shard. (Note that this weight shares essentially the same properties as the compute weight besides shape; supporting tools such as FusedAdam must be used to properly handle high-precision main weights.)
      • When using FSDP2 x TP, the all-gathered Tensor is actually a TP-sharded DTensor, deviating from the original FSDP2 paradigm in which the all-gathered Tensor is fully unsharded and the DTensor wrapping is discarded. To support these DTensor compute weights in TransformerEngine modules, we use transformer_engine.pytorch.distributed._extract_trainable_tensor_from_dtensor to localize the DTensor and to inherit the requires_grad attribute from the DTensor parameter, since for FP8 parameters specifically the local Tensor has requires_grad unset after DTensor.from_local(Tensor). (A simplified sketch of this pattern follows this list.)
    • Post-backward, the Tensor gradient is converted to DTensor and attached to the DTensor.grad attribute. This is handled by DTensor <> Tensor autograd conversion functions and, in the case of FusibleOperation, cast during the backward implementation.
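
To illustrate the localization pattern above, here is a simplified sketch of the DTensor <> Tensor autograd bridge. This is not the actual TE implementation; _extract_trainable_tensor_from_dtensor additionally handles object identity and FP8 requires_grad propagation:

# Forward: DTensor -> local shard. Backward: local grad -> DTensor.
import torch
from torch.distributed.tensor import DTensor

class _ToLocalSketch(torch.autograd.Function):
    @staticmethod
    def forward(ctx, dtensor):
        # Hand the local shard to compute kernels by object identity, so
        # FSDP2's in-place all-gather updates remain visible.
        ctx.mesh = dtensor.device_mesh
        ctx.placements = dtensor.placements
        return dtensor._local_tensor

    @staticmethod
    def backward(ctx, grad_local):
        # Re-wrap the local gradient as a DTensor so it lands on
        # DTensor.grad; run_check=False skips a redundant collective.
        return DTensor.from_local(
            grad_local, ctx.mesh, ctx.placements, run_check=False
        )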

Bugs

  • Fix a bug where "shard" was assumed to be the weight-sharding sub-mesh name in the DTensor.device_mesh. Users can now precisely specify their own custom weight-sharding DeviceMesh for the per-tensor amax_reduction_group via the set_device_mesh(weight_mesh) API.
  • Fix TransformerEngineBaseModule to initialize its quantizers with lists instead of dicts: self.quantizers = {"scaling_fwd": [], "scaling_bwd": []}

Testing

# TransformerEngine Main
[Rank 0] (after 1 iterations) memory (MB) | allocated: 23511.65 | max allocated: 25189.68 | reserved: 25678.00 | max reserved: 25678.00
 [2026-03-02 09:55:17.189564] iteration       99/15258789 | consumed samples:        12672 | elapsed time per iteration (ms): 12715.7 | throughput per GPU (TFLOP/s/GPU): 530.6 | learning rate: 4.866046E-07 | global batch size:   128 | lm loss: 1.124915E+00 | loss scale: 1.0 | grad norm: 5.474 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2026-03-02 09:55:29.768521] iteration      100/15258789 | consumed samples:        12800 | elapsed time per iteration (ms): 12578.7 | throughput per GPU (TFLOP/s/GPU): 536.4 | learning rate: 4.915198E-07 | global batch size:   128 | lm loss: 1.143806E+00 | loss scale: 1.0 | grad norm: 5.366 | number of skipped iterations:   0 | number of nan iterations:   0 |

# Post-DCP Modifications (This PR)
[Rank 0] (after 2 iterations) memory (MB) | allocated: 23511.65 | max allocated: 29783.24 | reserved: 25678.00 | max reserved: 31510.00
 [2026-03-02 09:29:36.550070] iteration       99/15258789 | consumed samples:        12672 | elapsed time per iteration (ms): 12556.5 | throughput per GPU (TFLOP/s/GPU): 537.3 | learning rate: 4.866046E-07 | global batch size:   128 | lm loss: 1.124463E+00 | loss scale: 1.0 | grad norm: 5.471 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2026-03-02 09:29:49.216068] iteration      100/15258789 | consumed samples:        12800 | elapsed time per iteration (ms): 12665.7 | throughput per GPU (TFLOP/s/GPU): 532.7 | learning rate: 4.915198E-07 | global batch size:   128 | lm loss: 1.142863E+00 | loss scale: 1.0 | grad norm: 5.355 | number of skipped iterations:   0 | number of nan iterations:   0 |
  • NOTE(@cspades): DelayedScaling has DCP save/load disparity issues, i.e. discrepancies on the order of +/-1 in the uint8 parameter checkpoint!

Appendix

_StridedShard - Using FSDP2 x TP Strided-Sharding

# (DP=4, TP=2)
(_StridedShard(dim=0, sf=2), Shard(dim=0))

┌───┬───┐
│ 0 │ 4 │ ← DP=0
├───┼───┤
│ 1 │ 5 │ ← DP=1
├───┼───┤          FSDP all-gather happens across the DP ranks,
│ 2 │ 6 │ ← DP=2   so we need to form the 0-3 and 4-7 TP shards!
├───┼───┤
│ 3 │ 7 │ ← DP=3
└───┴───┘
  ↑   ↑
TP=0 TP=1

When redistributing a global DTensor to (_StridedShard(dim=0, sf=2), Shard(dim=0)), DTensor performs the following steps:

  • Pre-shard the Tensor data with respect to the stride / shard factor, which is defined as the product of the parallelism sizes of all Shard placements to the right of _StridedShard. (In the above example, since TP=2, the factor is 2.)
    • [0 1 2 3 4 5 6 7] -> [0 1 2 3] and [4 5 6 7].
    • In the context of this PR and fully_shard, this has already been done via initializing the TransformerEngine module with TP and calling _convert_param_to_dtensor_param!
  • Shard the pre-shards for _StridedShard.
    • [0] [1] [2] [3] and [4] [5] [6] [7]
  • Concatenate the strided shards.
    • [0 4] [1 5] [2 6] [3 7], which are assigned to the _StridedShard ranks.
    • Note that this is very different if we did left-to-right-sharding, which would have given us [0 1] [2 3] [4 5] [6 7]!
  • Finally, each strided shard is sharded on the Shard placement.
    • [0] [4] / [1] [5] / [2] [6] / [3] [7], which are assigned to the Shard ranks.
    • Note that this is very different if we did left-to-right sharding, which would have given us [0] [1] / [2] [3] / [4] [5] / [6] [7]!

PyTorch also supports the inverse / un-sharding of this redistribute, which is literally the inverse of these simple operations! (Though things get a bit more complicated with uneven shards from odd dimension sizes.)
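
The strided assignment above can be sanity-checked with plain torch.chunk, no process group required. A toy sketch; dp and tp here are local sizes, not mesh handles:

# (DP=4, TP=2) strided-shard simulation over an 8-row tensor.
import torch

x = torch.arange(8)  # rows 0..7
dp, tp = 4, 2

# Step 1: pre-shard by the shard factor (= TP size): [0..3] and [4..7].
pre = torch.chunk(x, tp)

# Steps 2-3: shard each pre-shard DP ways, then concatenate stride-wise,
# so DP rank r holds rows [r, r + dp] -- the _StridedShard layout.
dp_shards = [torch.cat([p.chunk(dp)[r] for p in pre]) for r in range(dp)]
print(dp_shards)  # [tensor([0, 4]), tensor([1, 5]), tensor([2, 6]), tensor([3, 7])]

# Step 4: each strided shard is then split across the TP ranks.
tp_shards = [[s.chunk(tp)[t] for t in range(tp)] for s in dp_shards]
print(tp_shards[0])  # [tensor([0]), tensor([4])] for DP=0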


greptile-apps bot (Contributor) commented Mar 4, 2026

Greptile Summary

This PR adds full Torch DCP checkpoint compatibility for FSDP2 × TP sharding across all TransformerEngineBaseModule subclasses (except GroupedLinear). It introduces a set_device_mesh(tp_mesh, weight_mesh) API that converts TE module parameters into DTensors with correct TP placements (Shard(dim=0) for column-parallel, Shard(dim=1) for row-parallel, Replicate for biases/norms), enabling FSDP2's _StridedShard mechanism for strided FSDP×TP sharding. A _ToLocalIdentity autograd function preserves object-identity between the DTensor's _local_tensor and the compute tensor handed to TE C++ kernels, ensuring FSDP2 in-place all-gather updates remain visible while still routing gradients back as DTensors. Also fixed: the quantizers dict was changed from {} to [], the amax_reduction_group bug for per-tensor scaling with HSDP, a pre-existing isinstance check on ctx.fc1_weight_quantizer vs ctx.fc1_weight in the _LayerNormMLP backward, and DTensor-awareness in float8_tensor.py, mxfp8_tensor.py, and nvfp4_tensor.py for FSDP2 all-gather. A new end-to-end DCP save/load round-trip test is added alongside FSDP-TP mesh setup.

Key items noted:

  • DTensor.from_local in the LayerNorm and RMSNorm op backwards omit run_check=False, adding a small unnecessary collective per backward call (see inline comments).
  • NVFP4Tensor.untyped_storage() has a copy-paste docstring error that still references "MXFP8Tensor".
  • save/load are imported from torch.distributed.checkpoint in run_fsdp2_model.py but the actual calls use the fully-qualified module path, making the direct imports unused.

Confidence Score: 4/5

  • PR is safe to merge; the new DTensor lifecycle is well-designed and parity-tested, with only minor style and overhead issues remaining.
  • The core logic — parameter-to-DTensor conversion, _ToLocalIdentity identity-preserving extraction, FSDP2 StridedShard interaction, and full DCP round-trip — is carefully designed and backed by Megatron-LM parity tests. Bugs found are minor: a copy-paste docstring error in NVFP4Tensor, unused imports in the test file, and missing run_check=False in LayerNorm/RMSNorm op backward (extra collective overhead, not a correctness issue). No correctness regressions detected.
  • transformer_engine/pytorch/ops/basic/layer_norm.py and transformer_engine/pytorch/ops/basic/rmsnorm.py — both have unnecessary collectives in the backward pass due to missing run_check=False.

Important Files Changed

Filename — Overview

transformer_engine/pytorch/distributed.py — Adds _convert_param_to_dtensor_param (wraps a parameter as a DTensor, copying user-set attributes) and _ToLocalIdentity / _extract_trainable_tensor_from_dtensor (object-identity-preserving DTensor localization that properly propagates requires_grad and routes gradients back as DTensors for FSDP2 compatibility). Logic is sound and well-motivated by the FSDP2 in-place update semantics.
transformer_engine/pytorch/module/base.py — Fixes the quantizers dict to use lists instead of dicts, adds DTensor input localization in the forward pre-hook, improves Float8CurrentScalingQuantizer amax-reduction-group handling (only sets the group when not already set by set_device_mesh), and switches to _convert_param_to_dtensor_param for DTensor parameter recreation after quantization to correctly carry over user attributes.
transformer_engine/pytorch/module/linear.py — Adds set_device_mesh with TP-sharded DTensor conversion, _get_bias_tensors to correctly handle DTensor biases, refactors _get_weight_and_bias_tensors to use the new helpers, and moves TP attribute setting into _set_tensor_parallel_attributes. Logic is clean; DTensor extraction path in _get_weight_tensors is consistent with other modules.
transformer_engine/pytorch/module/layernorm_mlp.py — Adds set_device_mesh with correct TP placements (FC1 column-parallel Shard(0), FC2 row-parallel Shard(1), LN/FC2-bias Replicate), adds _get_bias_tensors / _get_layernorm_weight_and_bias helpers, and fixes a pre-existing bug where isinstance(ctx.fc1_weight_quantizer, QuantizedTensorStorage) was incorrectly checking the quantizer instead of the weight tensor.
transformer_engine/pytorch/module/layernorm_linear.py — Adds set_device_mesh mirroring the Linear pattern, with the addition of DTensor conversion for layer_norm_weight/layer_norm_bias (Replicate for column-parallel, Shard(0) for row-parallel). Introduces _get_bias_tensors and _get_layernorm_weight_and_bias helpers and threads them through both the main forward and onnx_forward paths.
transformer_engine/pytorch/ops/basic/layer_norm.py — Correctly handles DTensor weight/bias in both the forward (via to_local()) and backward (by wrapping computed gradients back into DTensors matching the parameter's placement). Minor: run_check=True (default) in the two DTensor.from_local calls in backward adds an unnecessary collective; should use run_check=False consistent with the rest of the PR.
transformer_engine/pytorch/ops/basic/rmsnorm.py — Same DTensor handling pattern as layer_norm.py, correctly wrapping the computed grad_weight back into a DTensor in backward. Same run_check=False omission as layer_norm.py.
transformer_engine/pytorch/tensor/nvfp4_tensor.py — Adds untyped_storage() identical in logic to MXFP8Tensor. Minor docstring copy-paste error: still says "MXFP8Tensor" instead of "NVFP4Tensor" in two places.
tests/pytorch/distributed/run_fsdp2_model.py — Substantial addition: implements full DCP round-trip (save → corrupt → load → parity assert) for both model and optimizer states, adds FSDP2-TP mesh setup, and an AppState Stateful container. The save/load imports from torch.distributed.checkpoint are unused (calls use the fully-qualified path).
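
For context, the AppState container in the test file follows the standard Torch DCP Stateful recipe. A hedged sketch of that pattern (not the PR's actual test code; model and optimizer stand for an already-sharded module and its optimizer on an initialized process group):

# DCP save/load round trip via a Stateful application-state container.
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict
from torch.distributed.checkpoint.stateful import Stateful

class AppState(Stateful):
    def __init__(self, model, optimizer):
        self.model, self.optimizer = model, optimizer

    def state_dict(self):
        # Collect DTensor-aware, FSDP2-compatible state dicts.
        model_sd, optim_sd = get_state_dict(self.model, self.optimizer)
        return {"model": model_sd, "optim": optim_sd}

    def load_state_dict(self, state_dict):
        # Restore both model and optimizer in one shot.
        set_state_dict(
            self.model,
            self.optimizer,
            model_state_dict=state_dict["model"],
            optim_state_dict=state_dict["optim"],
        )

state = {"app": AppState(model, optimizer)}  # model/optimizer: placeholders
dcp.save(state, checkpoint_id="/tmp/te_fsdp2_tp_ckpt")
dcp.load(state, checkpoint_id="/tmp/te_fsdp2_tp_ckpt")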

Sequence Diagram

sequenceDiagram
    participant User
    participant TEModule
    participant SDM as set_device_mesh
    participant FSDP2 as fully_shard
    participant Fwd as forward
    participant DCP as DCP checkpoint

    User->>TEModule: init with tp_mesh and weight_mesh
    TEModule->>SDM: set_device_mesh called
    SDM->>SDM: wrap params as DTensor with Shard or Replicate placements
    SDM->>SDM: configure amax_reduction_group via weight_mesh
    TEModule->>TEModule: reset_parameters - quantize and recreate DTensors
    User->>FSDP2: fully_shard model on dp_mesh
    Note over FSDP2: DTensor Shard dim=0 triggers StridedShard for FSDP-TP
    User->>Fwd: model forward pass
    Fwd->>Fwd: extract local tensor from DTensor weight
    Note over Fwd: ToLocalIdentity preserves object identity and requires_grad
    Fwd->>Fwd: run TE C++ kernels with local tensor
    Note over Fwd: backward wraps grad_local back to DTensor grad
    User->>DCP: save AppState
    Note over DCP: evicts _extra_state, clears empty optimizer states
    DCP-->>User: per-rank checkpoint shards written
    User->>DCP: load AppState
    DCP-->>User: model and optimizer restored to pre-save state

Comments Outside Diff (1)

  1. transformer_engine/pytorch/ops/basic/layer_norm.py, lines 271-295

    Missing run_check=False adds unnecessary collective in backward

    Both DTensor.from_local calls in this backward method (for grad_weight and grad_bias) omit run_check=False. The default run_check=True triggers an all-reduce across ranks to verify tensor shape/stride consistency, which is unnecessary overhead in the hot backward path since the shapes are already guaranteed to match the saved self.weight / self.bias placements.

    The existing _ToLocalIdentity.backward in distributed.py correctly uses run_check=False as the reference pattern. The same issue exists in transformer_engine/pytorch/ops/basic/rmsnorm.py for grad_weight.

Last reviewed commit: 5dd64ea

@cspades force-pushed the cye/fsdp2-tp-dcp branch from 4ec2947 to dbb9d14 on March 4, 2026
@cspades force-pushed the cye/fsdp2-tp-dcp branch from fcdd5bd to c912f5b on March 5, 2026
@cspades force-pushed the cye/fsdp2-tp-dcp branch 5 times, most recently from bc82f02 to 267f1df on March 10, 2026
@vthumbe1503 (Collaborator) commented:

/te-ci L1 pytorch

@cspades force-pushed the cye/fsdp2-tp-dcp branch 2 times, most recently from 5c87a4f to 7769c00 on March 12, 2026
cspades and others added 6 commits on March 11, 2026.
