
[GPUHeuristics] Remove MNT boost for VeryLargeGemm on CDNA4#23876

Merged
Yu-Zhewen merged 1 commit into iree-org:main from Yu-Zhewen:revert_verylarge_gemm
Mar 21, 2026

@Yu-Zhewen

Fixes #23831.

#23652 added boostMNTileCountPerSubgroup=32 for CDNA4 LargeGemm but
also applied the same boost to VeryLargeGemm, even though that PR only
benchmarked LargeGemm shapes.

For LargeGemm shapes, the heuristic selects the MFMA_F32_16x16x32
intrinsic, for which MNT=32 (the MN tile count per subgroup) fits within
the register budget. For VeryLargeGemm shapes (e.g. 16384x16384x16384),
however, the heuristic prefers the larger MFMA_F32_32x32x16 intrinsic,
and the boosted MNT=32 causes VGPR spilling, resulting in a ~10x
regression on mi355x:

| Metric             | Before        | After            |
|--------------------|---------------|------------------|
| Time               | 10 ms         | 104 ms           |
| Scratch Allocation | 0 B/work-item | 1208 B/work-item |
| VGPRs              | 216           | 256 (max)        |
| VMEM instructions  | 71M           | 618M (8.7x)      |
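As a rough check on why MNT=32 fits for one intrinsic but not the other, the sketch below counts the 32-bit registers each lane needs just to hold the f32 accumulators. It assumes wave64 subgroups, one register per f32 accumulator element, and that each of the MNT tiles holds one full M x N intrinsic output. It ignores operand registers, AGPRs, and any other live values, so it is an estimate, not IREE's actual register accounting; the names are hypothetical.

```python
# Rough accumulator-register estimate (hypothetical helper, not IREE code).
# Assumptions: wave64 subgroups, f32 accumulators (one 32-bit register per
# element), and each of the MNT tiles holds one full M x N intrinsic output.

WAVE_SIZE = 64  # lanes per subgroup on CDNA (assumed wave64)

# Output tile shape (M, N) of each intrinsic named in the PR.
INTRINSIC_OUTPUT = {
    "MFMA_F32_16x16x32": (16, 16),
    "MFMA_F32_32x32x16": (32, 32),
}

# The "256 (max)" VGPR ceiling reported in the table above.
VGPR_LIMIT = 256


def acc_regs_per_lane(intrinsic: str, mn_tile_count: int) -> int:
    """Registers each lane needs for the accumulators alone."""
    m, n = INTRINSIC_OUTPUT[intrinsic]
    regs_per_tile = (m * n) // WAVE_SIZE  # 256/64 = 4 or 1024/64 = 16
    return regs_per_tile * mn_tile_count


for name in INTRINSIC_OUTPUT:
    regs = acc_regs_per_lane(name, 32)  # MNT=32, the boosted value
    print(f"{name}: {regs} accumulator regs, fits={regs <= VGPR_LIMIT}")
```

Under these assumptions, MFMA_F32_16x16x32 needs 4 registers per tile (128 for MNT=32), while MFMA_F32_32x32x16 needs 16 per tile (512 for MNT=32), twice the 256-register ceiling before any operands are counted, which is consistent with the spilling in the table.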

This patch removes boostMNTileCountPerSubgroup and
minUtilizationThreshold from the VeryLargeGemm CDNA4 seeds, reverting
them to their defaults. LargeGemm seeds are unchanged.
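The change amounts to dropping the VeryLargeGemm overrides so that seed lookups fall through to the defaults. A minimal sketch of that lookup, using a hypothetical dictionary of per-category overrides; these names are illustrative, not IREE's actual (C++) data structures. The LargeGemm minUtilizationThreshold value is not stated in the PR, so it is left out rather than guessed.

```python
from enum import Enum, auto
from typing import Any, Dict


class GemmSize(Enum):
    """Hypothetical stand-ins for the heuristic's GEMM size categories."""
    LARGE = auto()
    VERY_LARGE = auto()


# Hypothetical CDNA4 seed overrides after this PR. A key absent from a
# category's dict falls back to the default seed. LargeGemm's
# minUtilizationThreshold override is omitted: its value is not in the PR.
CDNA4_OVERRIDES: Dict[GemmSize, Dict[str, Any]] = {
    GemmSize.LARGE: {"boostMNTileCountPerSubgroup": 32},  # unchanged
    GemmSize.VERY_LARGE: {},  # both overrides removed: use defaults
}


def seed_value(size: GemmSize, name: str, default: Any) -> Any:
    """Return the CDNA4 override for `name` if one exists, else `default`."""
    return CDNA4_OVERRIDES[size].get(name, default)


if __name__ == "__main__":
    for size in GemmSize:
        boost = seed_value(size, "boostMNTileCountPerSubgroup", "default")
        print(f"{size.name}: boostMNTileCountPerSubgroup={boost}")
```

Keeping the overrides as sparse per-category entries means removing a regression-causing boost only touches the affected category, mirroring how this patch leaves the LargeGemm seeds alone.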

Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>
Yu-Zhewen marked this pull request as ready for review March 20, 2026 17:03
Yu-Zhewen requested review from jerryyin and lialan March 20, 2026 17:03
Yu-Zhewen merged commit 7a5fc38 into iree-org:main Mar 21, 2026
53 of 57 checks passed

Development

Successfully merging this pull request may close these issues.

[GPUHeuristics] ~10x regression for 16384x16384x16384 bf16 matmul on mi355x after #23652
