[training] skip Dtensor/TP integration test pending solution#4059

Open
danielvegamyhre wants to merge 1 commit into main from mxex

@danielvegamyhre danielvegamyhre commented Mar 11, 2026

Summary

pytorch-bot bot commented Mar 11, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4059

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 6 Pending

As of commit 4264158 with merge base 77f23d0 (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Mar 11, 2026
@danielvegamyhre danielvegamyhre added module: training quantize_ api training flow and removed the CLA Signed, ciflow/rocm-mi300, and module: rocm labels Mar 11, 2026
pytorch-bot bot commented Mar 11, 2026

Warning: Unknown label ciflow/rocm-mi300.
Currently recognized labels are

  • ciflow/benchmark
  • ciflow/tutorials
  • ciflow/rocm
  • ciflow/4xh100
  • ciflow/xpu

Please add the new label to .github/pytorch-probot.yml

@meta-cla meta-cla bot added the CLA Signed label Mar 11, 2026
@danielvegamyhre danielvegamyhre changed the title [training] skip rocm and distributed tests pending solution [training] skip rocm and Dtensor/TP integration test pending solution Mar 11, 2026
@danielvegamyhre danielvegamyhre force-pushed the mxex branch 2 times, most recently from 8f27938 to ad5b9a7 March 11, 2026 22:24
… on ROCm (#3992)

* [ROCm] Enable FSDP2 Float8 and affine quantized tensor parallel tests on ROCm

Remove blanket ROCm test skips and fix FP8 hardware capability gates
to support AMD MI300/MI350 GPUs alongside NVIDIA SM89+/SM90+.

test/float8/test_fsdp2/test_fsdp2.py:
- Replace dual module-level skip (is_sm_at_least_89 + ROCm skip) with
  a single gate: is_sm_at_least_89() or is_MI300() or is_MI350()
- Import e4m3_dtype from config and use it in test_amax_allreduce_device_mesh
  instead of hardcoded torch.float8_e4m3fn (MI300 uses float8_e4m3fnuz)
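
To illustrate the dtype point above, here is a minimal sketch of choosing the e4m3 variant per platform rather than hardcoding one. It is pure Python with strings standing in for the torch dtypes; `e4m3_dtype_name` and `amax_dtype_matches` are hypothetical names, and this is not how `e4m3_dtype` is actually defined in the config module, which the commit message does not show.

```python
# Illustrative only: strings stand in for torch.float8_e4m3fn / torch.float8_e4m3fnuz.
# The function names are hypothetical and are not part of the code under test.


def e4m3_dtype_name(is_mi300: bool) -> str:
    """Return the e4m3 dtype name a test should expect on this platform."""
    # Per the commit message, MI300 uses float8_e4m3fnuz; everything else in
    # this sketch falls back to float8_e4m3fn.
    return "float8_e4m3fnuz" if is_mi300 else "float8_e4m3fn"


def amax_dtype_matches(observed: str, is_mi300: bool) -> bool:
    """Compare an observed dtype name against the platform-appropriate one."""
    return observed == e4m3_dtype_name(is_mi300)


# A hardcoded "float8_e4m3fn" expectation would reject a correct MI300 result.
mi300_result = "float8_e4m3fnuz"
hardcoded_ok = mi300_result == "float8_e4m3fn"
platform_aware_ok = amax_dtype_matches(mi300_result, is_mi300=True)
```

The design choice is simply to route every dtype comparison through one platform-aware value, so the same test body runs unchanged on both vendors.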

test/dtypes/test_affine_quantized_tensor_parallel.py:
- Remove module-level pytest.skip on ROCm that blocked all TP tests
  (Int8wo, Int4wo, Int8dq) even though they have no FP8 dependency
- Fix Float8 TP class gate: use is_sm_at_least_90() instead of raw
  get_device_capability() >= (9, 0), which incorrectly passes on ROCm
  where gfx90a (MI250X) maps to (9, 0) despite lacking FP8 support
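
The capability-gate bug described above can be sketched without a GPU. The example below models a device as a small record and compares a raw tuple check against a vendor-aware one. `Device`, `raw_gate`, and `vendor_aware_gate` are hypothetical names, and the assumption that the fixed gate rejects ROCm devices outright is only a guess at what `is_sm_at_least_90()` does internally.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Device:
    """Hypothetical stand-in for the facts a hardware gate looks at."""
    name: str
    capability: Tuple[int, int]  # what get_device_capability() would report
    is_rocm: bool


def raw_gate(dev: Device) -> bool:
    """Raw tuple comparison: the pattern the commit message replaces."""
    return dev.capability >= (9, 0)


def vendor_aware_gate(dev: Device) -> bool:
    """Assumed behaviour of the fixed gate: only NVIDIA SM90+ passes."""
    return (not dev.is_rocm) and dev.capability >= (9, 0)


h100 = Device("H100 (sm_90)", (9, 0), is_rocm=False)
mi250x = Device("MI250X (gfx90a)", (9, 0), is_rocm=True)

# gfx90a maps to (9, 0), so the raw gate wrongly admits MI250X.
raw_results = {d.name: raw_gate(d) for d in (h100, mi250x)}
fixed_results = {d.name: vendor_aware_gate(d) for d in (h100, mi250x)}
```

In this sketch the raw gate passes both devices, while the vendor-aware gate passes only the H100, matching the "Float8 TP classes correctly not defined on non-FP8 hardware" outcome reported for MI250X.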

Validated on MI250X (gfx90a, 8 GPUs):
- FSDP2 Float8: correctly skipped (MI250X lacks FP8)
- Affine quantized TP: 4 passed, 2 skipped (Int8wo 3/3, Int8dq 1/1)
- Float8 TP classes correctly not defined on non-FP8 hardware

* Fix ruff F401: remove unused pytest import in test_affine_quantized_tensor_parallel.py

The pytest import was left over after removing the module-level
pytest.skip on ROCm.

* Fix ruff format: break long pytest.skip line
@danielvegamyhre danielvegamyhre changed the title [training] skip rocm and Dtensor/TP integration test pending solution [training] skip Dtensor/TP integration test pending solution Mar 11, 2026
Labels

ciflow/rocm, ciflow/rocm-mi300, ciflow/4xh100, CLA Signed, module: rocm, module: training quantize_ api training flow
