
[GPUHeuristics] Further tune LargeGemm perf#23652

Merged
lialan merged 15 commits into main from users/lialan/large_gemm_tuning
Mar 18, 2026

Conversation

@lialan
Contributor

@lialan lialan commented Mar 4, 2026

  • Adds an additional, optional utilization rate as a guard in the heuristic seeds.
  • Tunes the `LargeGemm` MMA schedule heuristic seeds from sg=4/MNT=16 to sg=8/MNT=32 and upgrades the workgroup count adjustment guard to be utilization-aware.
  • Inflates the `LargeGemm` heuristic seeds for gfx950 (sg=8, MNT=32) to better utilize CDNA4 compute resources, based on tuner results.
  • Benchmarked on 75 `LargeGemm` bf16 shapes from mi355x: +7.0% geomean improvement, with 28 shapes improved (up to 30%) and no regressions >5%.

Here are some representative `LargeGemm` results:

| M×N×K | Change |
|---|---|
| 150000×4096×16384 | **+11%** |
| 150000×16384×4096 | **+10%** |
| 150000×2048×8192 | **+12%** |
| 150000×4096×2268 | **+10%** |
| 21760×3840×3840 | **+30%** |
| 16640×3840×3840 | **+26%** |
| 11520×3840×3840 | **+17%** |
| 6400×3840×3840 | **+13%** |
| 24576×2048×1536 | **+10%** |
| 4096×8192×2048 | **+6%** |
| 3840×3840×4352 | neutral |
| 4096×16384×150000 (large-K) | neutral |
| 2268×4096×150000 | neutral |
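The utilization-aware guard described in the bullets above can be sketched roughly as follows. This is an illustrative sketch with made-up names (`allowSeedInflation`, `estimateWorkgroupCount`, and the tile/CU parameters are all assumptions), not IREE's actual API: the larger seeds are only applied when the resulting schedule would still launch enough workgroups to keep the compute units busy.

```cpp
// Hypothetical sketch of a utilization-aware seed-inflation guard.
// All names and the utilization model are illustrative assumptions,
// not the actual IREE implementation.
#include <cassert>
#include <cstdint>
#include <optional>

// Number of workgroups a candidate schedule would launch for an MxN problem
// tiled by tileM x tileN (ceiling division in each dimension).
static int64_t estimateWorkgroupCount(int64_t problemM, int64_t problemN,
                                      int64_t tileM, int64_t tileN) {
  return ((problemM + tileM - 1) / tileM) * ((problemN + tileN - 1) / tileN);
}

// Utilization = launched workgroups / available compute units, capped at 1.
static double estimateUtilization(int64_t wgCount, int64_t numCUs) {
  double u = static_cast<double>(wgCount) / static_cast<double>(numCUs);
  return u > 1.0 ? 1.0 : u;
}

// Guard: inflate the seeds only if utilization stays above the threshold.
// A disengaged threshold preserves the pre-existing behavior.
static bool allowSeedInflation(int64_t problemM, int64_t problemN,
                               int64_t tileM, int64_t tileN, int64_t numCUs,
                               std::optional<double> minUtilization) {
  if (!minUtilization)
    return true;
  int64_t wgs = estimateWorkgroupCount(problemM, problemN, tileM, tileN);
  return estimateUtilization(wgs, numCUs) >= *minUtilization;
}
```

Under this model, a huge-M shape like 150000×4096 saturates the device and passes the guard, while a small 512×512 problem produces too few workgroups and keeps the original seeds.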

@lialan lialan requested a review from yzhang93 March 4, 2026 22:57
@lialan lialan force-pushed the users/lialan/large_gemm_tuning branch 4 times, most recently from 688299a to c10dd6a on March 5, 2026 01:56
@yzhang93
Contributor

yzhang93 commented Mar 5, 2026

I like the idea of using a utilization-aware guard when increasing the workgroup counts. The main concern about this PR is generalizability. It would be good to have thorough benchmarking of all convolution and GEMM shapes on CDNA4. It would also be nice to check the performance impact on other GPUs, e.g. CDNA3 and the RDNA series.

@lialan lialan force-pushed the users/lialan/large_gemm_tuning branch from c10dd6a to cd8aa7a on March 6, 2026 22:01
@lialan
Contributor Author

lialan commented Mar 6, 2026

> I like the idea of using a utilization-aware guard when increasing the workgroup counts. The main concern about this PR is generalizability. It would be good to have thorough benchmarking of all convolution and GEMM shapes on CDNA4. It would also be nice to check the performance impact on other GPUs, e.g. CDNA3 and the RDNA series.

Simply applying this idea to all archs would inevitably regress some of them. So this patch:

  • Only tweaks CDNA4/gfx950 at the moment, and nothing else. The new adjuster function can be implemented for other architectures later; until then, we do not change any heuristic results on CDNA3 or the RDNA series at all, so their performance is identical before and after.
  • Only tweaks gfx950's `LargeGemm`s, so small GEMMs will not change at all. I have included the large GEMM results in the PR description. There is one regression case among the general improvements.

However, I haven't tried its impact on convolutions; maybe we should team up and try that out on convs, because I was focusing on GEMMs. @yzhang93

@yzhang93
Contributor

yzhang93 commented Mar 7, 2026

@yzhang93 This patch should only change large GEMMs, so I think the conv changes you saw are noise.

As for the regression in GEMM, I think it is hit or miss. We saw improvements and a few regressions.

@lialan
Contributor Author

lialan commented Mar 8, 2026

> @yzhang93 This patch should only change large GEMMs, so I think the conv changes you saw are noise.
>
> As for the regression in GEMM, I think it is hit or miss. We saw improvements and a few regressions.

@yzhang93 Oops... I'm sorry, it seems I accidentally modified your comment instead of replying to it.

@yzhang93
Contributor

yzhang93 commented Mar 9, 2026

> @yzhang93 This patch should only change large GEMMs, so I think the conv changes you saw are noise.
>
> As for the regression in GEMM, I think it is hit or miss. We saw improvements and a few regressions.

What I shared are 1x1 convs, and they should fall into the large GEMM category. Could you double-check?

@yzhang93
Contributor

yzhang93 commented Mar 9, 2026

> @yzhang93 This patch should only change large GEMMs, so I think the conv changes you saw are noise.
>
> As for the regression in GEMM, I think it is hit or miss. We saw improvements and a few regressions.

> @yzhang93 Oops... I'm sorry, it seems I accidentally modified your comment instead of replying to it.

My original comment should be in your email notification. Could you check?

@lialan lialan force-pushed the users/lialan/large_gemm_tuning branch 2 times, most recently from 3d73cca to 21d1d36 on March 9, 2026 21:01
@lialan lialan changed the title from "[GPUHeuristics] Tune LargeGemm MMA schedule seeds and utilization guard" to "[GPUHeuristics] Add SeedAdjustFn hook to tune LargeGemm seeds" Mar 10, 2026
@lialan
Contributor Author

lialan commented Mar 10, 2026

> @yzhang93 Oops... I'm sorry, it seems I accidentally modified your comment instead of replying to it.

> My original comment should be in your email notification. Could you check?

Ahh, I turned off email notifications for GitHub, as there are too many...

Anyway, I have updated the code to make sure it will not regress your conv case.

@lialan lialan force-pushed the users/lialan/large_gemm_tuning branch from 21d1d36 to 29213c0 on March 10, 2026 18:27
Comment thread compiler/src/iree/compiler/Codegen/Common/GPU/GPUHeuristics.cpp Outdated
@lialan lialan requested a review from Yu-Zhewen March 10, 2026 22:59
@lialan lialan marked this pull request as ready for review March 11, 2026 00:09
@yzhang93
Contributor

LGTM. My last request is to add some large GEMM tests for the expected lowering config after the change.

lialan and others added 15 commits March 17, 2026 12:54
* Refactor `deduceMMASchedule` to accept `TargetAttr` instead of
  `wgpCount` + boolean flag; rename `adjustSeedsForWgpCount` to
  `adjustSeeds` which now extracts arch/chip info internally.
* Inflate LargeGemm seeds (sg=8, MNT=32) on CDNA4 (gfx950) for
  better subgroup utilization; non-CDNA4 retains original seeds
  (sg=4, MNT=16).
* Add utilization-aware seed reduction for CDNA4 LargeGemm/VeryLargeGemm:
  jointly reduce subgroup count and MN tile count until CU utilization
  exceeds 80%, with additional large-K reduction to avoid register spills.
* Extract `reduceSeeds` lambda to deduplicate MNT/subgroup reduction
  logic between the utilization loop and the large-K loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
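The joint reduction described in the commit message above might look roughly like this. This is a sketch under assumptions (made-up names, a simplified utilization model where each halving of the seeds doubles the workgroup count for a fixed problem), not IREE's actual `reduceSeeds` code:

```cpp
// Illustrative sketch: jointly halve the MN-tile-count and subgroup-count
// seeds until the estimated CU utilization clears a threshold. Names and
// the utilization model are assumptions, not the actual IREE implementation.
#include <cassert>
#include <cstdint>

struct Seeds {
  int64_t numSubgroups;            // e.g. 8 after the CDNA4 inflation
  int64_t mnTileCountPerSubgroup;  // e.g. 32 after the CDNA4 inflation
};

// Utilization = workgroups / compute units, capped at 1.
static double estimateUtilization(int64_t workgroups, int64_t numCUs) {
  double u = static_cast<double>(workgroups) / numCUs;
  return u > 1.0 ? 1.0 : u;
}

static Seeds reduceSeedsForUtilization(Seeds seeds, int64_t baseWorkgroups,
                                       int64_t numCUs, double threshold) {
  // Shared reduction step, mirroring the deduplicated `reduceSeeds` lambda
  // the commit message describes: prefer shrinking MNT, then subgroups.
  auto reduceOnce = [&]() {
    if (seeds.mnTileCountPerSubgroup > 1)
      seeds.mnTileCountPerSubgroup /= 2;
    else if (seeds.numSubgroups > 1)
      seeds.numSubgroups /= 2;
    baseWorkgroups *= 2;  // Smaller tiles per workgroup => more workgroups.
  };
  while (estimateUtilization(baseWorkgroups, numCUs) < threshold &&
         (seeds.mnTileCountPerSubgroup > 1 || seeds.numSubgroups > 1))
    reduceOnce();
  return seeds;
}
```

For a problem that initially fills only 32 of 256 CUs, this halves MNT 32 → 16 → 8 → 4 until utilization crosses 0.8, while a problem that already saturates the device keeps the inflated seeds untouched.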
…4 tuning.

* Replace SeedAdjustFn callback approach with a single adjustSeedsForTarget
  function parameterized by target architecture.
* Pass IREE::GPU::TargetAttr to deduceMMASchedule instead of wgpCount,
  letting the adjustment function extract what it needs internally.
* CDNA4 large gemm tuning: boost MNT to 32 for balanced-K problems, then
  apply utilization-aware MNT reduction (threshold 0.50).
* Remove per-architecture function pointers from GPUMMAHeuristicSeeds.
* Add comments explaining empirical constants (targetMNT=32,
  kMinUtilizationThreshold=0.50) and their derivation on MI355X.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add SeedAdjustFn callback type and adjustSeeds() utility function.
* Define per-arch callbacks in KnownTargets.cpp (adjustDefault, adjustCDNA4)
  that call adjustSeeds() with architecture-specific parameters.
* Wire callbacks into ArchSeedSet and thread through deduceMMASchedule.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ld field

Replace the GPUMMAHeuristicSeedAdjustmentCallback function pointer with
an optional minUtilizationThreshold field in GPUMMAHeuristicSeeds. The
CDNA4-specific MNT boost and utilization-based halving logic is now
inline in the shared adjustSeedsForTarget function, gated by the
presence of this threshold. CDNA4 seeds set it to 0.50; other
architectures leave it as nullopt, preserving existing behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Replace hardcoded kTargetMNT=32 with per-architecture
  boostMNTileCountPerSubgroup field in GPUMMAHeuristicSeeds.
* Decouple MNT boost and utilization guard into independent checks.
* Remove unused Chipset.h include from GPUHeuristics.cpp (fixes Bazel).
* Add assert for workgroup size in computeEstimatedWorkgroupCount.
* Label minUtilizationThreshold and boostMNTileCountPerSubgroup fields
  in kCDNA4Seeds for readability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
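Pulling the commit messages above together, the seeds struct might carry the two new optional knobs like this. The field names `minUtilizationThreshold` and `boostMNTileCountPerSubgroup` follow the commit messages; everything else (the other field names, defaults, and the helper) is illustrative:

```cpp
// Condensed sketch of a seeds struct with the two optional per-architecture
// knobs described in the commit messages. Only the two optional field names
// come from the PR; the rest is an illustrative assumption.
#include <cassert>
#include <cstdint>
#include <optional>

struct GPUMMAHeuristicSeeds {
  int64_t bestSubgroupCountPerWorkgroup = 4;
  int64_t bestMNTileCountPerSubgroup = 16;
  // Only set for architectures that opt into the utilization guard;
  // std::nullopt preserves the pre-existing behavior.
  std::optional<double> minUtilizationThreshold;
  // Only set for architectures that want the MNT boost (CDNA4 uses 32).
  std::optional<int64_t> boostMNTileCountPerSubgroup;
};

// Per the commit messages: CDNA4 LargeGemm seeds set both knobs (threshold
// 0.50, boost 32); other architectures leave them unset.
static GPUMMAHeuristicSeeds makeCDNA4LargeGemmSeeds() {
  return {/*bestSubgroupCountPerWorkgroup=*/8,
          /*bestMNTileCountPerSubgroup=*/32,
          /*minUtilizationThreshold=*/0.50,
          /*boostMNTileCountPerSubgroup=*/32};
}
```

Gating on the presence of the threshold (rather than a per-arch function pointer) is what lets the adjustment logic live in one shared function while non-CDNA4 targets keep their original behavior.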
@lialan lialan force-pushed the users/lialan/large_gemm_tuning branch from 9f3e984 to 359bf98 Compare March 17, 2026 20:28
@lialan lialan merged commit 954eddc into main Mar 18, 2026
58 of 59 checks passed
@lialan lialan deleted the users/lialan/large_gemm_tuning branch March 18, 2026 01:23
lialan added a commit that referenced this pull request Mar 18, 2026
efric added a commit that referenced this pull request Mar 18, 2026
This reverts commit 954eddc.

Signed-off-by: Eric Feng <Eric.Feng@amd.com>
lialan added a commit that referenced this pull request Mar 18, 2026
This reverts commit 954eddc.

Need to look into gfx942 regression.
lialan added a commit that referenced this pull request Mar 18, 2026
Here is the previous commit message.
---
* Adds additional, optional utilization rate as a guard in the heuristic
seeds.
* Tunes `LargeGemm` MMA schedule heuristic seeds from sg=4/MNT=16 to
sg=8/MNT=32 and upgrades the workgroup count adjustment guard to be
utilization-aware.
* Inflates LargeGemm heuristic seeds for gfx950 (sg=8, MNT=32) to better
utilize CDNA4 compute resources. This is based on tuner results.
* Benchmarked on 75 LargeGemm bf16 shapes from mi355x: +7.0% geomean
improvement with 28 shapes improved (up to 30%) and no regressions >5%.

Here are some representative `LargeGemm` results:
| M×N×K | Change |
  |---|---|
  | 150000×4096×16384 | **+11%** |
  | 150000×16384×4096 | **+10%** |
  | 150000×2048×8192 | **+12%** |
  | 150000×4096×2268 | **+10%** |
  | 21760×3840×3840 | **+30%** |
  | 16640×3840×3840 | **+26%** |
  | 11520×3840×3840 | **+17%** |
  | 6400×3840×3840 | **+13%** |
  | 24576×2048×1536 | **+10%** |
  | 4096×8192×2048 | **+6%** |
  | 3840×3840×4352 | neutral |
  | 4096×16384×150000 (large-K) | neutral |
  | 2268×4096×150000 | neutral |

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
lialan added a commit that referenced this pull request Mar 18, 2026
* Adds additional, optional utilization rate as a guard in the heuristic
seeds.
* Tunes `LargeGemm` MMA schedule heuristic seeds from sg=4/MNT=16 to
sg=8/MNT=32 and upgrades the workgroup count adjustment guard to be
utilization-aware.
* Inflates LargeGemm heuristic seeds for gfx950 (sg=8, MNT=32) to better
utilize CDNA4 compute resources. This is based on tuner results.
* Benchmarked on 75 LargeGemm bf16 shapes from mi355x: +7.0% geomean
improvement with 28 shapes improved (up to 30%) and no regressions >5%.

Here are some representative `LargeGemm` results:
| M×N×K | Change |
  |---|---|
  | 150000×4096×16384 | **+11%** |
  | 150000×16384×4096 | **+10%** |
  | 150000×2048×8192 | **+12%** |
  | 150000×4096×2268 | **+10%** |
  | 21760×3840×3840 | **+30%** |
  | 16640×3840×3840 | **+26%** |
  | 11520×3840×3840 | **+17%** |
  | 6400×3840×3840 | **+13%** |
  | 24576×2048×1536 | **+10%** |
  | 4096×8192×2048 | **+6%** |
  | 3840×3840×4352 | neutral |
  | 4096×16384×150000 (large-K) | neutral |
  | 2268×4096×150000 | neutral |

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
lialan added a commit that referenced this pull request Mar 19, 2026
)

Here is the previous commit message.
---
* Adds additional, optional utilization rate as a guard in the heuristic
seeds.
* Tunes `LargeGemm` MMA schedule heuristic seeds from sg=4/MNT=16 to
sg=8/MNT=32 and upgrades the workgroup count adjustment guard to be
utilization-aware.
* Inflates LargeGemm heuristic seeds for gfx950 (sg=8, MNT=32) to better
utilize CDNA4 compute resources. This is based on tuner results.
* Benchmarked on 75 LargeGemm bf16 shapes from mi355x: +7.0% geomean
improvement with 28 shapes improved (up to 30%) and no regressions >5%.

Here are some representative `LargeGemm` results:

| M×N×K | Change |
  |---|---|
  | 150000×4096×16384 | **+11%** |
  | 150000×16384×4096 | **+10%** |
  | 150000×2048×8192 | **+12%** |
  | 150000×4096×2268 | **+10%** |
  | 21760×3840×3840 | **+30%** |
  | 16640×3840×3840 | **+26%** |
  | 11520×3840×3840 | **+17%** |
  | 6400×3840×3840 | **+13%** |
  | 24576×2048×1536 | **+10%** |
  | 4096×8192×2048 | **+6%** |
  | 3840×3840×4352 | neutral |
  | 4096×16384×150000 (large-K) | neutral |
  | 2268×4096×150000 | neutral |

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
efric added a commit that referenced this pull request Mar 20, 2026
efric added a commit that referenced this pull request Mar 20, 2026
…" (#23836)"

This reverts commit ea39ece.

Signed-off-by: Eric Feng <Eric.Feng@amd.com>
Yu-Zhewen added a commit that referenced this pull request Mar 21, 2026
Fixes #23831.

#23652 added `boostMNTileCountPerSubgroup=32` for CDNA4 LargeGemm but
applied the same boost to VeryLargeGemm. That PR only benchmarked
LargeGemm shapes and didn't cover VeryLargeGemm.

For LargeGemm the heuristic selects the `MFMA_F32_16x16x32` intrinsic
where MNT=32 fits within register limits. However, for VeryLargeGemm
shapes (e.g. 16384x16384x16384), the heuristic prefers the larger
`MFMA_F32_32x32x16` intrinsic, and the boosted MNT=32 results in VGPR
spilling, causing a ~10x regression on mi355x:

| Metric | Before | After |
|---|---|---|
| Time | 10 ms | 104 ms |
| Scratch Allocation | 0 B/work-item | 1208 B/work-item |
| VGPRs | 216 | 256 (max) |
| VMEM instructions | 71M | 618M (8.7x) |

This patch removes `boostMNTileCountPerSubgroup` and
`minUtilizationThreshold` from VeryLargeGemm CDNA4 seeds, reverting
to default. LargeGemm seeds are unchanged.

Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>
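As a back-of-envelope check of the spilling explanation above (my own arithmetic, not from the PR): on a 64-lane wavefront with f32 accumulators, each M×N MFMA accumulator tile costs M·N/64 VGPRs per lane, so the boosted MNT=32 multiplies a per-intrinsic cost that is 4x larger for the 32×32 intrinsic than for the 16×16 one:

```cpp
// Rough VGPR estimate for MFMA accumulators, assuming f32 accumulators,
// 64-lane wavefronts, and one VGPR per f32 value per lane. This is my own
// back-of-envelope model, not a claim about the compiler's register
// allocation.
#include <cassert>
#include <cstdint>

static int64_t accumulatorVGPRs(int64_t intrinsicM, int64_t intrinsicN,
                                int64_t mnTileCountPerSubgroup) {
  constexpr int64_t kWavefrontLanes = 64;
  // Each MxN f32 accumulator tile occupies M*N/64 VGPRs per lane; a
  // subgroup keeps mnTileCountPerSubgroup such tiles live at once.
  return mnTileCountPerSubgroup * (intrinsicM * intrinsicN) / kWavefrontLanes;
}
```

With MNT=32, the 16×16 intrinsic needs 32·(256/64) = 128 accumulator VGPRs and fits under the 256-VGPR budget, while the 32×32 intrinsic needs 32·(1024/64) = 512, well past the limit — consistent with the 256-VGPR max and scratch allocation reported in the table above.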