
[not for land] [mxfp8 training] fix TP bug#3985

Open
danielvegamyhre wants to merge 2 commits into main from tpmarch3

Conversation

@danielvegamyhre
Contributor

@danielvegamyhre danielvegamyhre commented Mar 4, 2026

No description provided.

@pytorch-bot

pytorch-bot bot commented Mar 4, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3985

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures

As of commit 3dd03f5 with merge base b8708a2:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 4, 2026
@pianpwk

pianpwk commented Mar 4, 2026

On my side the [LINEAR rank=*] logs are only triggered from the non-parallelized model; this patch gets SQNR from 23 -> 50:

```diff
+++ b/torchao/prototype/moe_training/tensor.py
@@ -21,6 +21,7 @@ from torchao.prototype.moe_training.config import (
 )
 from torchao.prototype.moe_training.utils import _quantize_then_scaled_grouped_mm
 from torchao.prototype.mx_formats.mx_linear import _to_mxfp8_then_scaled_mm
+from torchao.prototype.mx_formats.mx_tensor import MXTensor
 from torchao.utils import TorchAOBaseTensor
 
 aten = torch.ops.aten
@@ -357,6 +359,37 @@ class MXFP8TrainingWeightWrapperTensor(TrainingWeightWrapperBaseTensor):
             with torch._C.DisableTorchFunctionSubclass():
                 return func(*args, **kwargs)
 
+    @classmethod
+    def __torch_dispatch__(cls, func, types, args, kwargs={}):
+        # Intercept aten.mm.default to apply MXFP8 quantization.
+        # This is needed because DTensor decomposes F.linear (a CompositeImplicitAutograd op)
+        # into aten.t + aten.mm, bypassing our __torch_function__ override for "linear".
+        # Without this, the TP/SP model would compute with regular BF16 mm.
+        if func == torch.ops.aten.mm.default:
+            A, B = args[0], args[1]
+
+            if not isinstance(A, cls) and isinstance(B, cls):
+                config = B.config
+
+                if isinstance(config, MXFP8TrainingOpConfig):
+                    input_hp = A
+                    # B._data is the transposed weight (from aten.t in linear decomposition).
+                    # _to_mxfp8_then_scaled_mm expects weight in [out, in] layout and
+                    # internally computes input @ weight.t(), so we pass B._data.t()
+                    # to recover the original weight layout:
+                    #   input @ (B._data.t()).t() = input @ B._data
+                    # which matches the aten.mm(input, weight_t) semantics.
+                    weight_hp = B._data.t().contiguous()
+                    return _to_mxfp8_then_scaled_mm(
+                        input_hp,
+                        weight_hp,
+                        kernel_preference=config.kernel_preference,
+                        scale_calculation_mode=config.scale_calculation_mode,
+                        wgrad_with_hp=config.wgrad_with_hp,
+                    )
+
+        return super().__torch_dispatch__(func, types, args, kwargs)
```
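For context on the SQNR numbers being discussed: SQNR (signal-to-quantization-noise ratio, in dB) is conventionally computed as 10 * log10 of signal power over error power, so 23 -> 50 is a large accuracy improvement. A minimal dependency-free sketch (the helper name `sqnr_db` and the toy values are illustrative, not from torchao):

```python
import math

def sqnr_db(ref, approx):
    # SQNR in dB: 10 * log10(signal_power / noise_power),
    # where noise is the elementwise error vs. the reference.
    signal = sum(r * r for r in ref)
    noise = sum((r - a) ** 2 for r, a in zip(ref, approx))
    return 10 * math.log10(signal / noise)

ref = [1.0, -2.0, 3.0, -4.0]
quantized = [1.01, -1.99, 3.02, -3.98]
print(round(sqnr_db(ref, quantized), 1))  # small error -> high SQNR (~44.8 dB)
```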

@danielvegamyhre
Contributor Author

> On my side the [LINEAR rank=*] logs are only triggered from the non-parallelized model; this patch gets SQNR from 23 -> 50: [...]

thanks for looking at this @pianpwk, knowing that linear is decomposed into aten.t + aten.mm is super useful. quick question: i thought __torch_dispatch__ runs below autograd, so we can't run autograd functions there, otherwise backward won't run properly? is that not the case? this is why i have the autograd functions in __torch_function__ instead
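One way to see the autograd/dispatch ordering empirically is with a `TorchDispatchMode` that logs every aten op: since autograd runs above the dispatch level, the backward mm ops also show up in the log. A rough sketch, assuming a recent PyTorch (the `OpLogger` name is made up; this is not code from the PR):

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class OpLogger(TorchDispatchMode):
    """Records every aten op seen at the __torch_dispatch__ level."""
    def __init__(self):
        super().__init__()
        self.ops = []

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        self.ops.append(str(func))
        return func(*args, **(kwargs or {}))

x = torch.randn(8, 4, requires_grad=True)
w = torch.randn(4, 4, requires_grad=True)
with OpLogger() as logger:
    out = torch.mm(x, w)
    out.sum().backward()

# Both the forward mm and the backward mm ops (for grad_x and grad_w)
# appear here, since dispatch sits below autograd.
mm_ops = [op for op in logger.ops if "mm" in op]
print(mm_ops)
```

This is consistent with the observation in the thread: by the time `__torch_dispatch__` sees an op, autograd has already recorded it, which is why autograd.Function overrides are usually installed at the `__torch_function__` level instead.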

@danielvegamyhre
Contributor Author

danielvegamyhre commented Mar 4, 2026

@pianpwk

> it needs to intercept aten::mm at the torch_dispatch level instead.

that makes sense, but in the past when i tried to use autograd functions in __torch_dispatch__, i found only the forward pass ran, never the backward pass, which is problematic for this use case since we need to control the backward pass as well. my understanding was that autograd is not captured at the __torch_dispatch__ level, am i mistaken on that?

when DTensor decomposes linear into aten.t + aten.mm, does it skip torch_function entirely and go straight to dispatch? is there no way to intercept in torch_function?

@pianpwk

pianpwk commented Mar 4, 2026

sorry, I think I was also partly mistaken. I'm not 100% sure what causes the (linear -> t + view + mm) decomposition for the DTensor + MXFP8 composition case, but you should also be able to intercept mm for it at the torch_function level as well.

Let me dig a bit more...

@danielvegamyhre
Contributor Author

> sorry, I think I was also partly mistaken. I don't think DTensor CIA handling is why the (linear -> t + view + mm) decomposition happens anymore, and you should also be able to intercept mm at the torch_function level as well.
>
> Let me dig a bit more...

ok thanks again for your help - for what it's worth, i've been trying to figure out how to intercept mm at the torch_dispatch level as well for the TP case, but i don't see any meaningful func names, just __get__ at the function level followed by mm.default at the dispatch level:

```
TP case
[DISPATCH rank=0] aten.t.default args=['Wrapper(torch.Size([64, 64]), mean=-0.001793)']
func __get__
[DISPATCH rank=0] aten.mm.default args=['Tensor(torch.Size([256, 64]), mean=0.496094)', 'Wrapper(torch.Size([64, 64]), mean=-0.001793)']
not preserving subclass aten.mm.default
[DISPATCH rank=0] aten.t.default args=['Wrapper(torch.Size([64, 64]), mean=-0.000147)']
func __get__
[DISPATCH rank=0] aten.mm.default args=['Tensor(torch.Size([256, 64]), mean=0.496094)', 'Wrapper(torch.Size([64, 64]), mean=-0.000147)']
not preserving subclass aten.mm.default
[DISPATCH rank=0] aten.t.default args=['Wrapper(torch.Size([64, 64]), mean=0.000112)']
func __get__
[DISPATCH rank=0] aten.mm.default args=['Tensor(torch.Size([256, 64]), mean=-0.000790)', 'Wrapper(torch.Size([64, 64]), mean=0.000112)']
not preserving subclass aten.mm.default
```

@pianpwk

pianpwk commented Mar 4, 2026

> but i don't see any meaningful func names, just __get__ at function level followed by mm.default at dispatch level:

my bad, this only triggered because I had a custom dispatch mode on (DebugMode), otherwise it's at dispatch level. I get your point about autograd though, let me ask around. I tried reversing (parallelize before quantize) but that hit other composability issues.

@danielvegamyhre
Contributor Author

> but i don't see any meaningful func names, just __get__ at function level followed by mm.default at dispatch level:
>
> my bad, this only triggered because I had a custom dispatch mode on (DebugMode), otherwise it's at dispatch level. I get your point about autograd though, let me ask around. I tried reversing (parallelize before quantize) but that hit other composability issues.

sounds good, thanks again - i will keep debugging on my side as well

@pianpwk

pianpwk commented Mar 5, 2026

@danielvegamyhre I vibecoded this reordering fix which seems to work? #4010

@danielvegamyhre
Contributor Author

> @danielvegamyhre I vibecoded this reordering fix which seems to work? #4010

@pianpwk per our discussion on that PR and elsewhere, do you know of a way to intercept the linear op when the wrapping is DTensor(MXFP8TrainingTensor(..))? that is the fundamental issue i am trying to solve; from logging, it seems to be decomposed into:

  • __get__ (not sure what this is)
  • t
  • mm
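As a sanity check on the t + mm decomposition itself: F.linear with a weight in [out, in] layout computes input @ weight.T, which is exactly mm(input, t(weight)), so intercepting aten.mm with the transposed weight argument is algebraically equivalent to intercepting linear. The double transpose in the patch comment can be verified with a dependency-free sketch (helper names are made up):

```python
def transpose(m):
    # swap rows and columns of a nested-list matrix
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    # naive matrix multiply for nested lists
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

x = [[1.0, 2.0], [3.0, 4.0]]   # activations
w = [[5.0, 6.0], [7.0, 8.0]]   # original weight, [out, in] layout

# What aten.mm receives after the linear -> t + mm decomposition:
w_t = transpose(w)
mm_out = matmul(x, w_t)        # aten.mm(x, w_t)

# The patch recovers the [out, in] layout by transposing again (B._data.t());
# a scaled-mm helper that re-transposes internally then computes
#   x @ (w_t.T).T == x @ w_t, matching the aten.mm semantics.
recovered = transpose(w_t)
scaled_mm_style = matmul(x, transpose(recovered))

assert scaled_mm_style == mm_out
print(mm_out)  # [[17.0, 23.0], [39.0, 53.0]]
```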

Contributor

@andrewor14 andrewor14 left a comment


stamping to unblock CI, please address the comments before landing

```python
@classmethod
def __torch_function__(cls, func, types, args, kwargs={}):
    # grouped_mm op override
    print("[TORCH_FUNCTION]", func.__name__)
```
Contributor


remove these before landing?

```python
)
assert data_hp.is_contiguous(), "unsupported"
if not data_hp.is_contiguous():
    assert data_hp.is_contiguous(), "unsupported"
```
Contributor


a bit confused by this, if it's not contiguous it would fail like before, so is there a reason behind this change?

Contributor Author

@danielvegamyhre danielvegamyhre Mar 11, 2026


@andrewor14 sorry, i linked the wrong PR - this is not the one that will address the test failures. this is a WIP draft for an issue we are still trying to find a proper solution for, so please disregard this PR.

```python
torch.ops.aten.transpose.int,
torch.ops.aten.t.default,
# required for TP - scatter_ is used to distribute weights
torch.ops.c10d.scatter_.default,
```
Contributor


so this is the real fix right? Do we need the other ops you commented out?

@danielvegamyhre danielvegamyhre changed the title [mxfp8 training] fix TP bug [not for land] [mxfp8 training] fix TP bug Mar 11, 2026