[ROCm] Enable MXFP8/MXFP4 emulation tests on ROCm (MI300+)#4041
Open
brucechanglongxu wants to merge 1 commit into pytorch:main from
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4041
Note: links to docs will display an error until the docs builds have completed.
❌ 2 New Failures as of commit 9c1764c with merge base 4ae435e.
The MX emulation path (`KernelPreference.EMULATED`) performs quantization
and matmul entirely via PyTorch ops -- no native MX hardware kernels are
involved. Despite this, most MX tests were gated behind CUDA SM checks
(`is_sm_at_least_89/90/100`) or blanket `@skip_if_rocm` decorators that
prevented them from running on any ROCm GPU, including MI300X where the
emulation path works correctly.
This patch makes the skip conditions ROCm-aware across four test files so
that emulation-path tests run on ROCm while native-kernel tests remain
correctly gated.
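The core of the change can be sketched as a small predicate. This is an illustrative stand-in, not the actual torchao test code: `should_skip`, `emulate`, and `cuda_sm_ok` are hypothetical names, with `emulate` mirroring the `KernelPreference.EMULATED` configs and `cuda_sm_ok` standing in for the `is_sm_at_least_*` helpers.

```python
# Hypothetical sketch of the gating change described above.
def should_skip(emulate: bool, cuda_sm_ok: bool) -> bool:
    if emulate:
        # Emulated path is pure PyTorch ops: runs on any GPU, CUDA or ROCm.
        return False
    # Native-kernel path keeps its hardware gate.
    return not cuda_sm_ok
```

The old decorator-level gates effectively ignored `emulate` and skipped whenever the CUDA capability check failed, which is what blocked ROCm.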
test_inference_workflow.py:
Remove `@skip_if_rocm("ROCm float4 gemm require gfx950")` from
`test_inference_workflow_mx`. The decorator was intended for the native
float4/float8 gemm path but also blocked the `emulate=True` parameter
sweep. Replace with in-body logic: skip native path unless MI350
(gfx950), preserve the mxfp4+compile skip, let everything else run.
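That in-body logic might look roughly like the following sketch. Function and parameter names here are hypothetical (the real test reads the arch from torch at runtime), and whether the mxfp4+compile skip is ROCm-only or unconditional is not spelled out above; this sketch applies it unconditionally.

```python
def skip_reason(emulate: bool, is_rocm: bool, gfx_arch: str,
                elem_dtype: str, compiled: bool):
    """Return a skip message or None; hypothetical sketch of the
    in-body gating described above."""
    if is_rocm and not emulate and gfx_arch != "gfx950":
        # Native float4/float8 gemm path needs MI350 (gfx950).
        return "native gemm requires gfx950"
    if elem_dtype == "mxfp4" and compiled:
        # Preserved pre-existing skip.
        return "mxfp4 + compile not supported"
    return None  # everything else, including emulated configs, runs
```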
test_mx_tensor.py:
Widen the `is_sm_at_least_89/90` skip conditions on four tests to also
pass on ROCm: `test_to_mx_from_mx_compile_numerics` (float8 compile
numerics), `test_to_mx_inductor_single_kernel` (inductor fusion),
`test_index_select` (3D MXTensor indexing, no compile involved), and
`test_cast_to_float8_e4m3fn_saturation_behavior` (triton float8 cast).
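The widening amounts to adding a ROCm disjunct to each skip condition, roughly as below. `is_rocm` is a stand-in for however the suite detects ROCm (e.g. checking `torch.version.hip`); the function name is illustrative.

```python
def run_allowed(is_rocm: bool, sm_ok: bool) -> bool:
    # Before: only CUDA SM89/SM90+ qualified.
    # After: ROCm also passes, since these tests exercise the
    # emulated / compile path rather than native MX kernels.
    return is_rocm or sm_ok
```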
test_mx_serialization.py:
The `is_sm_at_least_100` decorator-level skip prevented the mxfp8
recipe from running on ROCm even though it uses `EMULATED` mode and
just tests checkpoint save/load. Move the skip into the test body so
that mxfp8 runs on ROCm while nvfp4 remains gated on SM100.
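Moving the gate into the test body lets it branch per recipe, along these hypothetical lines (names are illustrative, not the actual test code):

```python
def serialization_skip(recipe: str, sm100: bool):
    # mxfp8 uses EMULATED mode and only round-trips a checkpoint,
    # so it runs everywhere; nvfp4 keeps its SM100 gate.
    if recipe == "nvfp4" and not sm100:
        return "nvfp4 requires SM100"
    return None
```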
test_mxfp8_allgather.py:
Widen the SM90 assert to also accept ROCm. The allgather test is pure
tensor data transfer with no compute dependency on SM version.
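The widened assert reduces to an `or` over the two backends, sketched here with stand-in flags (the real test derives them from the torch device properties):

```python
def check_platform(is_rocm: bool, sm90: bool) -> None:
    # All-gather only moves tensor bytes, so any ROCm GPU or a
    # CUDA SM90+ device qualifies.
    assert is_rocm or sm90, "requires CUDA SM90+ or ROCm"
```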
Verified on MI300X (gfx942): 16 inference workflow tests pass, 90 basic
MX tensor tests pass, serialization and index_select pass, existing
test_mx_linear TORCH-cast tests still pass (123 passed, 0 regressions).
Force-pushed from b7874f4 to 9c1764c.
danielvegamyhre approved these changes on Mar 11, 2026.
To add the ciflow label: this helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
Warning: unknown label. Please add the new label to .github/pytorch-probot.yml
Files not changed (with rationale):
test_mx_linear.py: already runs TORCH-cast eager tests on ROCm with no skip; the `is_sm_at_least_89` gate only applies to non-TORCH cast kernels.
test_mx_dtensor.py: already runs emulated tests; the dim1 triton/cuda tests correctly require SM100 since they use PTX inline assembly.
test_kernels.py: the triton mxfp8 kernels are CUDA-only (PTX).
test_mx_mm.py: tests native `scaled_mm`, which requires SM100 hardware.
All native-path tests correctly skip on MI300X.