
[GPUHeuristics] Further tune LargeGemm perf#23652

Merged
lialan merged 15 commits into main from users/lialan/large_gemm_tuning
Mar 18, 2026

Conversation

@lialan
Contributor

@lialan lialan commented Mar 4, 2026

  • Adds an additional, optional utilization rate as a guard in the heuristic seeds.
  • Tunes the `LargeGemm` MMA schedule heuristic seeds from sg=4/MNT=16 to sg=8/MNT=32 and upgrades the workgroup count adjustment guard to be utilization-aware.
  • Inflates the `LargeGemm` heuristic seeds for gfx950 (sg=8, MNT=32) to better utilize CDNA4 compute resources, based on tuner results.
  • Benchmarked on 75 `LargeGemm` bf16 shapes from mi355x: +7.0% geomean improvement, with 28 shapes improved (up to 30%) and no regressions >5%.

Here are some representative `LargeGemm` results:

| M×N×K | Change |
|---|---|
| 150000×4096×16384 | **+11%** |
| 150000×16384×4096 | **+10%** |
| 150000×2048×8192 | **+12%** |
| 150000×4096×2268 | **+10%** |
| 21760×3840×3840 | **+30%** |
| 16640×3840×3840 | **+26%** |
| 11520×3840×3840 | **+17%** |
| 6400×3840×3840 | **+13%** |
| 24576×2048×1536 | **+10%** |
| 4096×8192×2048 | **+6%** |
| 3840×3840×4352 | neutral |
| 4096×16384×150000 (large-K) | neutral |
| 2268×4096×150000 | neutral |
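The utilization-aware guard described in the bullets above can be sketched roughly as follows. This is an illustrative sketch with made-up names (`allowSeedInflation`, `estimateWorkgroupCount`, and the tile/CU parameters are all assumptions), not IREE's actual API: the larger seeds are only applied when the resulting schedule would still launch enough workgroups to keep the compute units busy.

```cpp
// Hypothetical sketch of a utilization-aware seed-inflation guard.
// All names and the utilization model are illustrative assumptions,
// not the actual IREE implementation.
#include <cassert>
#include <cstdint>
#include <optional>

// Number of workgroups a candidate schedule would launch for an MxN problem
// tiled by tileM x tileN (ceiling division in each dimension).
static int64_t estimateWorkgroupCount(int64_t problemM, int64_t problemN,
                                      int64_t tileM, int64_t tileN) {
  return ((problemM + tileM - 1) / tileM) * ((problemN + tileN - 1) / tileN);
}

// Utilization = launched workgroups / available compute units, capped at 1.
static double estimateUtilization(int64_t wgCount, int64_t numCUs) {
  double u = static_cast<double>(wgCount) / static_cast<double>(numCUs);
  return u > 1.0 ? 1.0 : u;
}

// Guard: inflate the seeds only if utilization stays above the threshold.
// A disengaged threshold preserves the pre-existing behavior.
static bool allowSeedInflation(int64_t problemM, int64_t problemN,
                               int64_t tileM, int64_t tileN, int64_t numCUs,
                               std::optional<double> minUtilization) {
  if (!minUtilization)
    return true;
  int64_t wgs = estimateWorkgroupCount(problemM, problemN, tileM, tileN);
  return estimateUtilization(wgs, numCUs) >= *minUtilization;
}
```

Under this model, a huge-M shape like 150000×4096 saturates the device and passes the guard, while a small 512×512 problem produces too few workgroups and keeps the original seeds.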

@lialan lialan requested a review from yzhang93 March 4, 2026 22:57
@lialan lialan force-pushed the users/lialan/large_gemm_tuning branch 4 times, most recently from 688299a to c10dd6a on March 5, 2026 01:56
@yzhang93
Contributor

yzhang93 commented Mar 5, 2026

I like the idea of using a utilization-aware guard when increasing the workgroup counts. The main concern about this PR is generalizability. It would be good to have thorough benchmarking of all convolution and GEMM shapes on CDNA4. It would also be nice to check the performance impact on other GPUs, e.g. CDNA3 and the RDNA series.

@lialan lialan force-pushed the users/lialan/large_gemm_tuning branch from c10dd6a to cd8aa7a on March 6, 2026 22:01
@lialan
Contributor Author

lialan commented Mar 6, 2026

> I like the idea of using a utilization-aware guard when increasing the workgroup counts. The main concern about this PR is generalizability. It would be good to have thorough benchmarking of all convolution and GEMM shapes on CDNA4. It would also be nice to check the performance impact on other GPUs, e.g. CDNA3 and the RDNA series.

Simply applying this idea to all archs would inevitably regress some of them. So this patch:

  • Only tweaks CDNA4/gfx950 at the moment, and nothing else. The new adjuster function can be implemented for other architectures later; until then, we do not change any heuristic results on CDNA3 or the RDNA series at all, so their performance is identical before and after.
  • Only tweaks gfx950's `LargeGemm`s, so small GEMMs will not change at all. I have included the large GEMM results in the PR description. There is one regression case among the general improvements.

However, I haven't tried its impact on convolutions; maybe we should team up and try that out on convs, because I was focusing on GEMMs. @yzhang93

@yzhang93
Contributor

yzhang93 commented Mar 7, 2026

@yzhang93 This patch should only change large GEMMs, so I think the conv changes you saw are noise.

As for the regression in GEMM, I think it is hit or miss. We saw improvements and a few regressions.

@lialan
Contributor Author

lialan commented Mar 8, 2026

> @yzhang93 This patch should only change large GEMMs, so I think the conv changes you saw are noise.
>
> As for the regression in GEMM, I think it is hit or miss. We saw improvements and a few regressions.

@yzhang93 Oops... I'm sorry, it seems I accidentally modified your comment instead of replying to it.

@yzhang93
Contributor

yzhang93 commented Mar 9, 2026

> @yzhang93 This patch should only change large GEMMs, so I think the conv changes you saw are noise.
>
> As for the regression in GEMM, I think it is hit or miss. We saw improvements and a few regressions.

What I shared are 1x1 convs, and they should fall into the large GEMM category. Could you double-check?

@yzhang93
Contributor

yzhang93 commented Mar 9, 2026

> @yzhang93 This patch should only change large GEMMs, so I think the conv changes you saw are noise.
>
> As for the regression in GEMM, I think it is hit or miss. We saw improvements and a few regressions.

> @yzhang93 Oops... I'm sorry, it seems I accidentally modified your comment instead of replying to it.

My original comment should be in your email notification. Could you check?

@lialan lialan force-pushed the users/lialan/large_gemm_tuning branch 2 times, most recently from 3d73cca to 21d1d36 on March 9, 2026 21:01
@lialan lialan changed the title from "[GPUHeuristics] Tune LargeGemm MMA schedule seeds and utilization guard" to "[GPUHeuristics] Add SeedAdjustFn hook to tune LargeGemm seeds" Mar 10, 2026
@lialan
Contributor Author

lialan commented Mar 10, 2026

> @yzhang93 Oops... I'm sorry, it seems I accidentally modified your comment instead of replying to it.

> My original comment should be in your email notification. Could you check?

Ahh, I turned off email notifications for GitHub, as there are too many...

Anyway, I have updated the code to make sure it will not regress your conv case.

@lialan lialan force-pushed the users/lialan/large_gemm_tuning branch from 21d1d36 to 29213c0 on March 10, 2026 18:27
Comment thread compiler/src/iree/compiler/Codegen/Common/GPU/GPUHeuristics.cpp Outdated
@lialan lialan requested a review from Yu-Zhewen March 10, 2026 22:59
@lialan lialan marked this pull request as ready for review March 11, 2026 00:09
@yzhang93
Contributor

LGTM. My last request is to add some large GEMM tests for the expected lowering config after the change.

lialan and others added 15 commits March 17, 2026 12:54
* Refactor `deduceMMASchedule` to accept `TargetAttr` instead of
  `wgpCount` + boolean flag; rename `adjustSeedsForWgpCount` to
  `adjustSeeds` which now extracts arch/chip info internally.
* Inflate LargeGemm seeds (sg=8, MNT=32) on CDNA4 (gfx950) for
  better subgroup utilization; non-CDNA4 retains original seeds
  (sg=4, MNT=16).
* Add utilization-aware seed reduction for CDNA4 LargeGemm/VeryLargeGemm:
  jointly reduce subgroup count and MN tile count until CU utilization
  exceeds 80%, with additional large-K reduction to avoid register spills.
* Extract `reduceSeeds` lambda to deduplicate MNT/subgroup reduction
  logic between the utilization loop and the large-K loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
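The joint reduction described in the commit message above might look roughly like this. This is a sketch under assumptions (made-up names, a simplified utilization model where each halving of the seeds doubles the workgroup count for a fixed problem), not IREE's actual `reduceSeeds` code:

```cpp
// Illustrative sketch: jointly halve the MN-tile-count and subgroup-count
// seeds until the estimated CU utilization clears a threshold. Names and
// the utilization model are assumptions, not the actual IREE implementation.
#include <cassert>
#include <cstdint>

struct Seeds {
  int64_t numSubgroups;            // e.g. 8 after the CDNA4 inflation
  int64_t mnTileCountPerSubgroup;  // e.g. 32 after the CDNA4 inflation
};

// Utilization = workgroups / compute units, capped at 1.
static double estimateUtilization(int64_t workgroups, int64_t numCUs) {
  double u = static_cast<double>(workgroups) / numCUs;
  return u > 1.0 ? 1.0 : u;
}

static Seeds reduceSeedsForUtilization(Seeds seeds, int64_t baseWorkgroups,
                                       int64_t numCUs, double threshold) {
  // Shared reduction step, mirroring the deduplicated `reduceSeeds` lambda
  // the commit message describes: prefer shrinking MNT, then subgroups.
  auto reduceOnce = [&]() {
    if (seeds.mnTileCountPerSubgroup > 1)
      seeds.mnTileCountPerSubgroup /= 2;
    else if (seeds.numSubgroups > 1)
      seeds.numSubgroups /= 2;
    baseWorkgroups *= 2;  // Smaller tiles per workgroup => more workgroups.
  };
  while (estimateUtilization(baseWorkgroups, numCUs) < threshold &&
         (seeds.mnTileCountPerSubgroup > 1 || seeds.numSubgroups > 1))
    reduceOnce();
  return seeds;
}
```

For a problem that initially fills only 32 of 256 CUs, this halves MNT 32 → 16 → 8 → 4 until utilization crosses 0.8, while a problem that already saturates the device keeps the inflated seeds untouched.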
…4 tuning.

* Replace SeedAdjustFn callback approach with a single adjustSeedsForTarget
  function parameterized by target architecture.
* Pass IREE::GPU::TargetAttr to deduceMMASchedule instead of wgpCount,
  letting the adjustment function extract what it needs internally.
* CDNA4 large gemm tuning: boost MNT to 32 for balanced-K problems, then
  apply utilization-aware MNT reduction (threshold 0.50).
* Remove per-architecture function pointers from GPUMMAHeuristicSeeds.
* Add comments explaining empirical constants (targetMNT=32,
  kMinUtilizationThreshold=0.50) and their derivation on MI355X.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add SeedAdjustFn callback type and adjustSeeds() utility function.
* Define per-arch callbacks in KnownTargets.cpp (adjustDefault, adjustCDNA4)
  that call adjustSeeds() with architecture-specific parameters.
* Wire callbacks into ArchSeedSet and thread through deduceMMASchedule.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ld field

Replace the GPUMMAHeuristicSeedAdjustmentCallback function pointer with
an optional minUtilizationThreshold field in GPUMMAHeuristicSeeds. The
CDNA4-specific MNT boost and utilization-based halving logic is now
inline in the shared adjustSeedsForTarget function, gated by the
presence of this threshold. CDNA4 seeds set it to 0.50; other
architectures leave it as nullopt, preserving existing behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Replace hardcoded kTargetMNT=32 with per-architecture
  boostMNTileCountPerSubgroup field in GPUMMAHeuristicSeeds.
* Decouple MNT boost and utilization guard into independent checks.
* Remove unused Chipset.h include from GPUHeuristics.cpp (fixes Bazel).
* Add assert for workgroup size in computeEstimatedWorkgroupCount.
* Label minUtilizationThreshold and boostMNTileCountPerSubgroup fields
  in kCDNA4Seeds for readability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
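Pulling the commit messages above together, the seeds struct might carry the two new optional knobs like this. The field names `minUtilizationThreshold` and `boostMNTileCountPerSubgroup` follow the commit messages; everything else (the other field names, defaults, and the helper) is illustrative:

```cpp
// Condensed sketch of a seeds struct with the two optional per-architecture
// knobs described in the commit messages. Only the two optional field names
// come from the PR; the rest is an illustrative assumption.
#include <cassert>
#include <cstdint>
#include <optional>

struct GPUMMAHeuristicSeeds {
  int64_t bestSubgroupCountPerWorkgroup = 4;
  int64_t bestMNTileCountPerSubgroup = 16;
  // Only set for architectures that opt into the utilization guard;
  // std::nullopt preserves the pre-existing behavior.
  std::optional<double> minUtilizationThreshold;
  // Only set for architectures that want the MNT boost (CDNA4 uses 32).
  std::optional<int64_t> boostMNTileCountPerSubgroup;
};

// Per the commit messages: CDNA4 LargeGemm seeds set both knobs (threshold
// 0.50, boost 32); other architectures leave them unset.
static GPUMMAHeuristicSeeds makeCDNA4LargeGemmSeeds() {
  return {/*bestSubgroupCountPerWorkgroup=*/8,
          /*bestMNTileCountPerSubgroup=*/32,
          /*minUtilizationThreshold=*/0.50,
          /*boostMNTileCountPerSubgroup=*/32};
}
```

Gating on the presence of the threshold (rather than a per-arch function pointer) is what lets the adjustment logic live in one shared function while non-CDNA4 targets keep their original behavior.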
@lialan lialan force-pushed the users/lialan/large_gemm_tuning branch from 9f3e984 to 359bf98 Compare March 17, 2026 20:28
@lialan lialan merged commit 954eddc into main Mar 18, 2026
58 of 59 checks passed
@lialan lialan deleted the users/lialan/large_gemm_tuning branch March 18, 2026 01:23
lialan added a commit that referenced this pull request Mar 18, 2026
efric added a commit that referenced this pull request Mar 18, 2026
This reverts commit 954eddc.

Signed-off-by: Eric Feng <Eric.Feng@amd.com>
lialan added a commit that referenced this pull request Mar 18, 2026
This reverts commit 954eddc.

Need to look into gfx942 regression.
lialan added a commit that referenced this pull request Mar 18, 2026
Here is the previous commit message.
---
* Adds additional, optional utilization rate as a guard in the heuristic
seeds.
* Tunes `LargeGemm` MMA schedule heuristic seeds from sg=4/MNT=16 to
sg=8/MNT=32 and upgrades the workgroup count adjustment guard to be
utilization-aware.
* Inflates LargeGemm heuristic seeds for gfx950 (sg=8, MNT=32) to better
utilize CDNA4 compute resources. This is based on tuner results.
* Benchmarked on 75 LargeGemm bf16 shapes from mi355x: +7.0% geomean
improvement with 28 shapes improved (up to 30%) and no regressions >5%.

Here are some representative `LargeGemm` results:
| M×N×K | Change |
  |---|---|
  | 150000×4096×16384 | **+11%** |
  | 150000×16384×4096 | **+10%** |
  | 150000×2048×8192 | **+12%** |
  | 150000×4096×2268 | **+10%** |
  | 21760×3840×3840 | **+30%** |
  | 16640×3840×3840 | **+26%** |
  | 11520×3840×3840 | **+17%** |
  | 6400×3840×3840 | **+13%** |
  | 24576×2048×1536 | **+10%** |
  | 4096×8192×2048 | **+6%** |
  | 3840×3840×4352 | neutral |
  | 4096×16384×150000 (large-K) | neutral |
  | 2268×4096×150000 | neutral |

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
lialan added a commit that referenced this pull request Mar 18, 2026
* Adds additional, optional utilization rate as a guard in the heuristic
seeds.
* Tunes `LargeGemm` MMA schedule heuristic seeds from sg=4/MNT=16 to
sg=8/MNT=32 and upgrades the workgroup count adjustment guard to be
utilization-aware.
* Inflates LargeGemm heuristic seeds for gfx950 (sg=8, MNT=32) to better
utilize CDNA4 compute resources. This is based on tuner results.
* Benchmarked on 75 LargeGemm bf16 shapes from mi355x: +7.0% geomean
improvement with 28 shapes improved (up to 30%) and no regressions >5%.

Here are some representative `LargeGemm` results:
| M×N×K | Change |
  |---|---|
  | 150000×4096×16384 | **+11%** |
  | 150000×16384×4096 | **+10%** |
  | 150000×2048×8192 | **+12%** |
  | 150000×4096×2268 | **+10%** |
  | 21760×3840×3840 | **+30%** |
  | 16640×3840×3840 | **+26%** |
  | 11520×3840×3840 | **+17%** |
  | 6400×3840×3840 | **+13%** |
  | 24576×2048×1536 | **+10%** |
  | 4096×8192×2048 | **+6%** |
  | 3840×3840×4352 | neutral |
  | 4096×16384×150000 (large-K) | neutral |
  | 2268×4096×150000 | neutral |

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
lialan added a commit that referenced this pull request Mar 19, 2026
)

Here is the previous commit message.
---
* Adds additional, optional utilization rate as a guard in the heuristic
seeds.
* Tunes `LargeGemm` MMA schedule heuristic seeds from sg=4/MNT=16 to
sg=8/MNT=32 and upgrades the workgroup count adjustment guard to be
utilization-aware.
* Inflates LargeGemm heuristic seeds for gfx950 (sg=8, MNT=32) to better
utilize CDNA4 compute resources. This is based on tuner results.
* Benchmarked on 75 LargeGemm bf16 shapes from mi355x: +7.0% geomean
improvement with 28 shapes improved (up to 30%) and no regressions >5%.

Here are some representative `LargeGemm` results:

| M×N×K | Change |
  |---|---|
  | 150000×4096×16384 | **+11%** |
  | 150000×16384×4096 | **+10%** |
  | 150000×2048×8192 | **+12%** |
  | 150000×4096×2268 | **+10%** |
  | 21760×3840×3840 | **+30%** |
  | 16640×3840×3840 | **+26%** |
  | 11520×3840×3840 | **+17%** |
  | 6400×3840×3840 | **+13%** |
  | 24576×2048×1536 | **+10%** |
  | 4096×8192×2048 | **+6%** |
  | 3840×3840×4352 | neutral |
  | 4096×16384×150000 (large-K) | neutral |
  | 2268×4096×150000 | neutral |

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
efric added a commit that referenced this pull request Mar 20, 2026
efric added a commit that referenced this pull request Mar 20, 2026
…" (#23836)"

This reverts commit ea39ece.

Signed-off-by: Eric Feng <Eric.Feng@amd.com>
Yu-Zhewen added a commit that referenced this pull request Mar 21, 2026
Fixes #23831.

#23652 added `boostMNTileCountPerSubgroup=32` for CDNA4 LargeGemm but
applied the same boost to VeryLargeGemm. That PR only benchmarked
LargeGemm shapes and didn't cover VeryLargeGemm.

For LargeGemm the heuristic selects the `MFMA_F32_16x16x32` intrinsic
where MNT=32 fits within register limits. However, for VeryLargeGemm
shapes (e.g. 16384x16384x16384), the heuristic prefers the larger
`MFMA_F32_32x32x16` intrinsic, and the boosted MNT=32 results in VGPR
spilling, causing a ~10x regression on mi355x:

| Metric | Before | After |
|---|---|---|
| Time | 10 ms | 104 ms |
| Scratch Allocation | 0 B/work-item | 1208 B/work-item |
| VGPRs | 216 | 256 (max) |
| VMEM instructions | 71M | 618M (8.7x) |

This patch removes `boostMNTileCountPerSubgroup` and
`minUtilizationThreshold` from VeryLargeGemm CDNA4 seeds, reverting
to default. LargeGemm seeds are unchanged.

Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>
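As a back-of-envelope check of the spilling explanation above (my own arithmetic, not from the PR): on a 64-lane wavefront with f32 accumulators, each M×N MFMA accumulator tile costs M·N/64 VGPRs per lane, so the boosted MNT=32 multiplies a per-intrinsic cost that is 4x larger for the 32×32 intrinsic than for the 16×16 one:

```cpp
// Rough VGPR estimate for MFMA accumulators, assuming f32 accumulators,
// 64-lane wavefronts, and one VGPR per f32 value per lane. This is my own
// back-of-envelope model, not a claim about the compiler's register
// allocation.
#include <cassert>
#include <cstdint>

static int64_t accumulatorVGPRs(int64_t intrinsicM, int64_t intrinsicN,
                                int64_t mnTileCountPerSubgroup) {
  constexpr int64_t kWavefrontLanes = 64;
  // Each MxN f32 accumulator tile occupies M*N/64 VGPRs per lane; a
  // subgroup keeps mnTileCountPerSubgroup such tiles live at once.
  return mnTileCountPerSubgroup * (intrinsicM * intrinsicN) / kWavefrontLanes;
}
```

With MNT=32, the 16×16 intrinsic needs 32·(256/64) = 128 accumulator VGPRs and fits under the 256-VGPR budget, while the 32×32 intrinsic needs 32·(1024/64) = 512, well past the limit — consistent with the 256-VGPR max and scratch allocation reported in the table above.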