[GPUHeuristics] Further tune LargeGemm perf #23652
Conversation
Force-pushed 688299a to c10dd6a.
I like the idea of using a utilization-aware guard when increasing the workgroup counts. The main concern about this PR is generalizability. It would be good to have a thorough benchmarking of all convolution and gemm shapes on CDNA4. It would also be nice to check the performance impact on other GPUs, e.g. CDNA3 and the RDNA series.
Force-pushed c10dd6a to cd8aa7a.
Simply applying this idea to all archs will inevitably have regressions on some. So this patch:
However, I haven't tried its impact on convolutions. Maybe we should team up and try it out on convs, since I was focusing on gemms. @yzhang93
@yzhang93 This patch should only change large gemms, so I think the conv differences you saw are noise. As for the regressions in GEMM, I think it is hit or miss: we saw improvements and a few regressions.
@yzhang93 Oops... I'm sorry, it seems I accidentally modified your comment instead of replying to it.
What I shared are 1x1 convs, and they should fall into the large GEMM category. Could you double check?
My original comment should still be in your email. Could you check it?
Force-pushed 3d73cca to 21d1d36.
Ahh, I turned off email notifications for GitHub as there are too many... Anyway, I have updated the code to make sure it will not regress your conv case.
Force-pushed 21d1d36 to 29213c0.
LGTM. My last request is to add some large GEMM tests for expected lowering config after the change. |
* Refactor `deduceMMASchedule` to accept `TargetAttr` instead of `wgpCount` + a boolean flag; rename `adjustSeedsForWgpCount` to `adjustSeeds`, which now extracts arch/chip info internally.
* Inflate LargeGemm seeds (sg=8, MNT=32) on CDNA4 (gfx950) for better subgroup utilization; non-CDNA4 retains the original seeds (sg=4, MNT=16).
* Add utilization-aware seed reduction for CDNA4 LargeGemm/VeryLargeGemm: jointly reduce the subgroup count and MN tile count until CU utilization exceeds 80%, with an additional large-K reduction to avoid register spills.
* Extract a `reduceSeeds` lambda to deduplicate the MNT/subgroup reduction logic between the utilization loop and the large-K loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
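The joint reduction in the third bullet can be sketched as follows. This is a simplified Python model of the described C++ logic, not IREE's actual implementation: the workgroup-count estimate, the helper names, the 256-CU device, and the concrete tile counts are illustrative assumptions; the 80% threshold follows the commit message, while the halve-MNT-first order is just one plausible reduction order.

```python
import math

def estimate_workgroups(total_mn_tiles: int, sg: int, mnt: int) -> int:
    """Workgroups needed if each workgroup covers sg * mnt MMA tiles."""
    return math.ceil(total_mn_tiles / (sg * mnt))

def reduce_seeds(total_mn_tiles: int, cu_count: int,
                 sg: int, mnt: int, threshold: float = 0.80):
    """Shrink (sg, mnt) until estimated CU utilization exceeds `threshold`.

    Mirrors the `reduceSeeds` idea: fewer tiles per workgroup means more
    workgroups, hence better occupancy across a large CU count.
    """
    def utilization() -> float:
        wgs = estimate_workgroups(total_mn_tiles, sg, mnt)
        return min(1.0, wgs / cu_count)

    while utilization() < threshold:
        if mnt > 1:    # one plausible order: shrink MNT first,
            mnt //= 2
        elif sg > 1:   # then fall back to fewer subgroups
            sg //= 2
        else:
            break
    return sg, mnt

# Example: a 2048x2048 output tiled 16x16 -> 128*128 = 16384 MMA tiles,
# on a hypothetical 256-CU device, starting from inflated CDNA4 seeds.
print(reduce_seeds(16384, 256, sg=8, mnt=32))  # -> (8, 8)
```

With the inflated seeds, only 64 workgroups would be launched on 256 CUs (25% utilization); halving MNT twice yields 256 workgroups and full occupancy, which is the behavior the utilization guard is meant to enforce.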
…4 tuning.

* Replace the SeedAdjustFn callback approach with a single `adjustSeedsForTarget` function parameterized by target architecture.
* Pass `IREE::GPU::TargetAttr` to `deduceMMASchedule` instead of `wgpCount`, letting the adjustment function extract what it needs internally.
* CDNA4 large gemm tuning: boost MNT to 32 for balanced-K problems, then apply utilization-aware MNT reduction (threshold 0.50).
* Remove per-architecture function pointers from `GPUMMAHeuristicSeeds`.
* Add comments explaining the empirical constants (targetMNT=32, kMinUtilizationThreshold=0.50) and their derivation on MI355X.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add a `SeedAdjustFn` callback type and an `adjustSeeds()` utility function.
* Define per-arch callbacks in KnownTargets.cpp (`adjustDefault`, `adjustCDNA4`) that call `adjustSeeds()` with architecture-specific parameters.
* Wire the callbacks into `ArchSeedSet` and thread them through `deduceMMASchedule`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…acks."

This reverts commit 477e627.
…ld field

Replace the `GPUMMAHeuristicSeedAdjustmentCallback` function pointer with an optional `minUtilizationThreshold` field in `GPUMMAHeuristicSeeds`. The CDNA4-specific MNT boost and utilization-based halving logic is now inline in the shared `adjustSeedsForTarget` function, gated by the presence of this threshold. CDNA4 seeds set it to 0.50; other architectures leave it as nullopt, preserving existing behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
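The optional-field gating described here can be illustrated with a small sketch. The names mirror the C++ identifiers but the dataclass, the workgroup estimate, and the example tile counts are stand-ins, not IREE's real code; `None` plays the role of C++ `nullopt`.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Seeds:
    """Simplified stand-in for GPUMMAHeuristicSeeds."""
    num_subgroups: int
    mn_tile_count_per_subgroup: int
    # None ~ nullopt: no utilization guard, existing behavior preserved.
    min_utilization_threshold: Optional[float] = None

def adjust_seeds_for_target(seeds: Seeds, total_mn_tiles: int,
                            cu_count: int) -> Seeds:
    if seeds.min_utilization_threshold is None:
        return seeds  # guard absent -> seeds untouched
    # Guard present (CDNA4 sets 0.50): halve MNT while the estimated
    # CU utilization stays below the threshold.
    while seeds.mn_tile_count_per_subgroup > 1:
        tiles_per_wg = (seeds.num_subgroups *
                        seeds.mn_tile_count_per_subgroup)
        wgs = math.ceil(total_mn_tiles / tiles_per_wg)
        if min(1.0, wgs / cu_count) >= seeds.min_utilization_threshold:
            break
        seeds.mn_tile_count_per_subgroup //= 2
    return seeds

# Non-CDNA4 seeds pass through unchanged; CDNA4 seeds get halved from
# MNT=32 to MNT=16 on this example problem (16384 MMA tiles, 256 CUs).
default = adjust_seeds_for_target(Seeds(4, 16), 16384, 256)
cdna4 = adjust_seeds_for_target(Seeds(8, 32, 0.50), 16384, 256)
print(default.mn_tile_count_per_subgroup, cdna4.mn_tile_count_per_subgroup)
```

The design point is that a single shared adjustment function replaces per-architecture function pointers: the data (an optional threshold) selects the behavior, rather than a callback.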
* Replace the hardcoded kTargetMNT=32 with a per-architecture `boostMNTileCountPerSubgroup` field in `GPUMMAHeuristicSeeds`.
* Decouple the MNT boost and the utilization guard into independent checks.
* Remove the unused Chipset.h include from GPUHeuristics.cpp (fixes Bazel).
* Add an assert for workgroup size in `computeEstimatedWorkgroupCount`.
* Label the `minUtilizationThreshold` and `boostMNTileCountPerSubgroup` fields in `kCDNA4Seeds` for readability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed 9f3e984 to 359bf98.
Here is the previous commit message.

---

* Adds an additional, optional utilization rate as a guard in the heuristic seeds.
* Tunes `LargeGemm` MMA schedule heuristic seeds from sg=4/MNT=16 to sg=8/MNT=32 and upgrades the workgroup count adjustment guard to be utilization-aware.
* Inflates LargeGemm heuristic seeds for gfx950 (sg=8, MNT=32) to better utilize CDNA4 compute resources. This is based on tuner results.
* Benchmarked on 75 LargeGemm bf16 shapes from mi355x: +7.0% geomean improvement, with 28 shapes improved (up to 30%) and no regressions >5%.

Here are some representative `LargeGemm` results:

| M×N×K | Change |
|---|---|
| 150000×4096×16384 | **+11%** |
| 150000×16384×4096 | **+10%** |
| 150000×2048×8192 | **+12%** |
| 150000×4096×2268 | **+10%** |
| 21760×3840×3840 | **+30%** |
| 16640×3840×3840 | **+26%** |
| 11520×3840×3840 | **+17%** |
| 6400×3840×3840 | **+13%** |
| 24576×2048×1536 | **+10%** |
| 4096×8192×2048 | **+6%** |
| 3840×3840×4352 | neutral |
| 4096×16384×150000 (large-K) | neutral |
| 2268×4096×150000 | neutral |

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Fixes #23831.

#23652 added `boostMNTileCountPerSubgroup=32` for CDNA4 LargeGemm but applied the same boost to VeryLargeGemm. That PR only benchmarked LargeGemm shapes and did not cover VeryLargeGemm.

For LargeGemm, the heuristic selects the `MFMA_F32_16x16x32` intrinsic, where MNT=32 fits within register limits. However, for VeryLargeGemm shapes (e.g. 16384×16384×16384), the heuristic prefers the larger `MFMA_F32_32x32x16` intrinsic, and the boosted MNT=32 results in VGPR spilling, causing a ~10x regression on mi355x:

| Metric | Before | After |
|---|---|---|
| Time | 10 ms | 104 ms |
| Scratch allocation | 0 B/work-item | 1208 B/work-item |
| VGPRs | 216 | 256 (max) |
| VMEM instructions | 71M | 618M (8.7x) |

This patch removes `boostMNTileCountPerSubgroup` and `minUtilizationThreshold` from the VeryLargeGemm CDNA4 seeds, reverting them to the defaults. LargeGemm seeds are unchanged.

Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>
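The register pressure gap between the two intrinsics can be estimated with a back-of-envelope model. This is a sketch, not IREE's register allocator: it assumes wave64 execution with f32 accumulators, so an M×N MFMA result tile occupies M·N/64 VGPRs per lane, and it ignores operand, address, and scheduling registers.

```python
# Back-of-envelope accumulator VGPR estimate, assuming wave64 and f32
# accumulators: an MxN MFMA result tile spreads M*N floats across the
# 64 lanes of a wavefront, i.e. M*N/64 VGPRs per lane per tile.
VGPR_BUDGET = 256  # per-lane architectural VGPR limit

def accum_vgprs(m: int, n: int, mn_tile_count: int) -> int:
    return (m * n // 64) * mn_tile_count

for name, (m, n) in {
    "MFMA_F32_16x16x32": (16, 16),  # LargeGemm intrinsic
    "MFMA_F32_32x32x16": (32, 32),  # VeryLargeGemm intrinsic
}.items():
    used = accum_vgprs(m, n, mn_tile_count=32)  # boosted MNT=32
    verdict = "fits" if used <= VGPR_BUDGET else "spills"
    print(f"{name}: MNT=32 -> {used} accumulator VGPRs ({verdict})")
```

Under these assumptions, MNT=32 needs about 128 accumulator VGPRs with the 16×16 intrinsic but about 512 with the 32×32 one, double the 256-register budget before any other registers are counted, which is consistent with the maxed-out VGPRs and scratch allocation in the table above.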