
[TP] reorder MXFP8 wrapper over DTensor#4010

Open
pianpwk wants to merge 8 commits intopytorch:mainfrom
pianpwk:mxfp8-tp-fix

Conversation

@pianpwk

@pianpwk pianpwk commented Mar 5, 2026

For pytorch/pytorch#177059, #3985

The original attempt to handle TP + MXFP8 wrapped DTensor over the MXFP8 subclass. The MXFP8 subclass intends to capture ops at the __torch_function__ level and use custom autograd functions to control fwd/bwd behavior. Because DTensor has no fwd/bwd coupling, and because it CIA-decomposes aten::linear, this ordering does not work: the MXFP8 tensor never sees aten::linear at the __torch_function__ level.

This PR reverses the order to MXFP8(DTensor), allowing aten::linear interception and fwd/bwd control. Relies on pytorch/pytorch#177234 landing in pytorch.

I understand other dtypes still need reordering?

More details in discussion of pytorch/pytorch#177059
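As a toy illustration of the ordering problem (plain Python, not the actual torchao/DTensor machinery): in subclass dispatch, only the outermost wrapper's hook sees the high-level call, so whichever tensor sits on the outside "wins" at the __torch_function__ level.

```python
# Toy model of outermost-wrapper dispatch (plain Python, not real torch code).
# Mirrors how __torch_function__ resolution hands the call to the outermost
# tensor subclass: the inner wrapper never sees F.linear at all.
class Wrapper:
    def __init__(self, inner, name):
        self.inner = inner
        self.name = name

    def handle(self, op):
        # The outermost wrapper intercepts; the inner wrapper is skipped.
        return f"{self.name} intercepted {op}"

def linear(weight):
    # Stand-in for F.linear dispatching to the outermost subclass.
    return weight.handle("aten::linear")

# Old order, DTensor(MXFP8(...)): DTensor dispatches first and
# CIA-decomposes linear before MXFP8 can apply its autograd functions.
old = Wrapper(Wrapper("w", "MXFP8"), "DTensor")
# New order, MXFP8(DTensor(...)): MXFP8 intercepts linear as intended.
new = Wrapper(Wrapper("w", "DTensor"), "MXFP8")
print(linear(old))  # DTensor intercepted aten::linear
print(linear(new))  # MXFP8 intercepted aten::linear
```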

@pytorch-bot

pytorch-bot bot commented Mar 5, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4010

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f31cd57 with merge base 1d75a07:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 5, 2026
@danielvegamyhre
Contributor

thanks for investigating this further @pianpwk! i added a couple notes to assist

@pianpwk
Author

pianpwk commented Mar 10, 2026

@danielvegamyhre updated the PR to cover more spots where we should reorder. I didn't review carefully, but at a high level it should make more sense?

Also had to override a DTensor sharding rule for scaled_mm - I'll upstream that to pytorch instead, if it turns out to be valid.

@danielvegamyhre
Copy link
Contributor

nice @pianpwk, qq, why did you need to register a custom sharding strategy for scaled_mm, does the one in core have a bug?

run_check=False,
shape=data_lp.size(),
stride=data_lp.stride(),
)
Author

consequence of reversing order?

Contributor

makes sense, this would rewrap as "DTensor(MXTensor(...))", which is the opposite order of what we are doing now. nice that all this can be removed now, cleaner

@pianpwk pianpwk changed the title fix tensor parallelism by reordering subclass wrapping [TP] reorder MXFP8 wrapper over DTensor Mar 12, 2026
@pianpwk pianpwk marked this pull request as ready for review March 12, 2026 06:20
@danielvegamyhre danielvegamyhre added this to the MXFP8 Training milestone Mar 12, 2026
@pianpwk pianpwk added the module: training quantize_ api training flow label Mar 13, 2026
pianpwk added a commit to pytorch/pytorch that referenced this pull request Mar 13, 2026
also for pytorch/ao#4010, adds custom handler to check local tensor

[ghstack-poisoned]
pianpwk added a commit to pytorch/pytorch that referenced this pull request Mar 13, 2026
also for pytorch/ao#4010, adds custom handler to check local tensor

[ghstack-poisoned]
pianpwk added a commit to pytorch/pytorch that referenced this pull request Mar 13, 2026
For pytorch/ao#4010, the existing strategy incorrectly copies scaled_mm input strategies (2d) onto the scale tensor (1d)

[ghstack-poisoned]
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Mar 14, 2026
For pytorch/ao#4010, the existing strategy incorrectly copies scaled_mm input strategies (2d) onto the scale tensor (1d)
Pull Request resolved: #177234
Approved by: https://github.com/danielvegamyhre
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Mar 14, 2026
also for pytorch/ao#4010, adds custom handler to check local tensor
Pull Request resolved: #177235
Approved by: https://github.com/Skylion007, https://github.com/wconstab
ghstack dependencies: #177234
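To see why copying a 2d input's placement onto the 1d scale was wrong, here is a minimal shape check (toy code, not the actual DTensor sharding-rule API): a Shard(dim) placement is only valid when `dim` is an actual dimension of the tensor.

```python
# Toy check of the bug pytorch/pytorch#177234 fixed: Shard(dim) only makes
# sense if `dim` exists on the tensor. Copying the 2d scaled_mm input's
# Shard(1) onto a 1d scale tensor therefore produces an invalid placement.
def shard_is_valid(tensor_ndim, shard_dim):
    return 0 <= shard_dim < tensor_ndim

print(shard_is_valid(2, 1))  # True: Shard(1) is fine for a 2d scaled_mm input
print(shard_is_valid(1, 1))  # False: Shard(1) is invalid for a 1d scale tensor
```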
Contributor

@danielvegamyhre danielvegamyhre left a comment

nice, looks much cleaner now with that dtensor 1d scale sharding fix landed in core! couple minor comments

del t
tests.append(_test_mxfp8_mlp_tensor_parallelism_auto)
except Exception:
print("Skipping auto test: mxfp8_quantize CUDA kernel not available")
Contributor

if _mxfp8_cuda_kernels_available let's just append the test case without doing the kernel dispatch test


# For MXFP8: parallelize first, then quantize.
# This puts MXFP8 wrapper on top of DTensor so __torch_function__
# intercepts F.linear before DTensor can trigger premature all-gathers.
Contributor

@danielvegamyhre danielvegamyhre Mar 16, 2026

@vkuzo this wrapping order change to MXFP8WeightWrapperTensor(DTensor(...)) is a pretty fundamental one, if you'd like to review as well. see the comment above; it is working cleanly now after @pianpwk landed a fix to DTensor sharding rules for 1d/flattened scale factors for torch._scaled_mm: pytorch/pytorch#177234

pianpwk added 8 commits March 19, 2026 00:11
When DTensor wraps MXFP8 (quantize first, then parallelize), DTensor
dispatches first on F.linear and performs premature Shard→Replicate
all-gather before MXFP8's __torch_function__ can intercept, causing
both ranks to see identical full weights and producing wrong numerics.

Fix: reverse the wrapping order for MXFP8 so MXFP8 sits on top of
DTensor (parallelize first, then quantize). MXFP8's __torch_function__
intercepts F.linear first, unwraps to get the DTensor via a
differentiable unwrap_weight() helper, and DTensor handles sharding
at the aten op level.

Changes:
- tensor.py: add scatter_ to preserved ops (TP weight distribution),
  fix pin_memory for DTensor, narrow linear override to func name
  "linear" only, use unwrap_weight(B) in both grouped_mm and linear
  paths, add _UnwrapWeight autograd function
- utils.py: transpose DTensor placements in _to_mxfp8_dim1_kernel_wrapper
  to match transposed local data
- dtensor_utils.py: make parallelize/quantize order conditional on
  config type (MXFP8: parallelize first; Float8: quantize first),
  use SQNR-based assertions for MXFP8, bf16 model/inputs
- test_mx_dtensor.py: update to MXFP8 (was FP4), split into emulated
  and auto tests with CUDA kernel availability guard
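A minimal sketch of the differentiable unwrap pattern the commit message describes (the `_UnwrapWeight` name comes from the message; this stand-in returns a clone of a plain tensor rather than the inner DTensor, since the point is only that gradients flow through the unwrap):

```python
import torch

# Sketch of a differentiable unwrap helper in the spirit of _UnwrapWeight.
# The real helper would return the inner DTensor of the MXFP8 wrapper; this
# stand-in clones a plain tensor so the identity-backward behavior is visible.
class UnwrapWeight(torch.autograd.Function):
    @staticmethod
    def forward(ctx, weight):
        # Real version: return the wrapped (inner) tensor here.
        return weight.clone()

    @staticmethod
    def backward(ctx, grad_out):
        # Identity backward: gradient flows straight to the wrapped weight.
        return grad_out

w = torch.randn(4, 4, requires_grad=True)
UnwrapWeight.apply(w).sum().backward()
print(torch.allclose(w.grad, torch.ones(4, 4)))  # True
```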

to_blocked,
in_placements=(t_placements,),
out_placements=t_placements,
)(t)
Contributor

to confirm my understanding, local_map just runs the function (to_blocked) on each local shard as if it were a plain tensor, and then rewraps the output in a dtensor according to out_placements right

Author

Yep!
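A rough conceptual stand-in for that local_map behavior (toy code operating on plain lists of shards, not real DTensors; the actual API is torch.distributed.tensor.experimental.local_map):

```python
# Conceptual sketch of local_map: run `fn` on each local shard as if it were
# a plain tensor, then rewrap the outputs according to out_placements. This
# toy version uses dicts of shard lists instead of DTensors on a device mesh.
def toy_local_map(fn, out_placements):
    def wrapped(dtensor_like):
        local_outs = [fn(shard) for shard in dtensor_like["shards"]]
        return {"shards": local_outs, "placements": out_placements}
    return wrapped

blocked = toy_local_map(lambda s: s * 2, out_placements=("Shard(0)",))
out = blocked({"shards": [1, 2], "placements": ("Shard(0)",)})
print(out)  # {'shards': [2, 4], 'placements': ('Shard(0)',)}
```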

rule_shard_dim1 = (
[Replicate(), Shard(1), Replicate(), Shard(0)],
[Shard(1)] + non_tensor_args,
)
Contributor

why did these dim1 quantization kernel sharding rules need to be updated?

also, i thought the rule tuple order was (inputs, outputs) but it seems like this is the opposite, do i have it backwards?


@pianpwk pianpwk requested a review from danielvegamyhre March 19, 2026 19:14
@danielvegamyhre
Contributor

@pianpwk when i use latest torch nightly, checkout this PR and run ./test/prototype/mx_formats/test_mx_dtensor.sh the test fails with:

[rank0]:   File "/home/dev/ao/torchao/prototype/mx_formats/mx_tensor.py", line 409, in to_dtype
[rank0]:     data_hp = data_hp * s_fp
[rank0]:               ~~~~~~~~^~~~~~
[rank0]: RuntimeError: The size of tensor a (8) must match the size of tensor b (4) at non-singleton dimension 0

@pianpwk
Author

pianpwk commented Mar 19, 2026

@pianpwk when i use latest torch nightly, checkout this PR and run ./test/prototype/mx_formats/test_mx_dtensor.sh the test fails with:

[rank0]:   File "/home/dev/ao/torchao/prototype/mx_formats/mx_tensor.py", line 409, in to_dtype
[rank0]:     data_hp = data_hp * s_fp
[rank0]:               ~~~~~~~~^~~~~~
[rank0]: RuntimeError: The size of tensor a (8) must match the size of tensor b (4) at non-singleton dimension 0

hmm not able to repro this
