[GPU][Codegen] Implement virtual dense mfma (VDMFMA) #23677
Merged
efric commented Mar 26, 2026
```cpp
int64_t subgroupSize = getSubgroupSize();
int64_t physicalLanesPerThread =
    subgroupSize / llvm::product_of(lhsLayout.thread);
if (isVDMFMAIntrinsic(intrinsic) && physicalLanesPerThread > 1) {
```
Member
Author
This `isVDMFMAIntrinsic` check is actually redundant; it was a suggestion to make it explicit in the code that this part is only relevant for VDMFMA.
efric added a commit that referenced this pull request on Mar 26, 2026
… < subgroupsize (#23657)

When `prod(thread) < subgroupSize` in `MMASingleSubgroupLayout`, there is implied broadcasting: multiple threads map to the same index in the thread layout and get the same data. This patch adds an optional `physicalLanesPerThread` parameter (default 1) to `populateCanonicalOffsetsSizesAndStrides`, enabling callers to opt into element splitting so that broadcast lanes load disjoint slices rather than duplicate data. Existing callsites are unaffected since they use the default.

`physicalLanesPerThread` is deliberately a caller-provided parameter rather than derived from `MMASingleSubgroupLayout`, keeping the layout struct a pure hardware description without encoding downstream splitting policy.

For a concrete example, consider a layout with `subgroupSize = 64` and `outer = {1, 1}, thread = {8, 4}, tstrides = {2, 16}, element = {1, 16}`. Here `prod(thread) = 32`, so `physicalLanesPerThread = 2`. The thread assignment looks like:

```
t0t1   t16t17 t32t33 t48t49
t2t3   t18t19 t34t35 t50t51
t4t5   t20t21 t36t37 t52t53
t6t7   t22t23 t38t39 t54t55
t8t9   t24t25 t40t41 t56t57
t10t11 t26t27 t42t43 t58t59
t12t13 t28t29 t44t45 t60t61
t14t15 t30t31 t46t47 t62t63
```

Without `physicalLanesPerThread`, each pair (e.g., t0 and t1) loads the same 16 elements. With `physicalLanesPerThread = 2`, each loads 8 unique elements instead.

The only existing hardware intrinsics which naturally have this broadcast property are the WMMAR3 LHS and RHS operands (`thread = {16, 1}` or `{1, 16}`, `subgroupSize = 32`) and, for the F16/BF16 variants, the accumulator as well. This mechanism is intended for virtual intrinsics such as VDMFMA (#23677), where broadcast lanes will be assigned disjoint K-slices.

Assisted by: Claude

---------

Signed-off-by: Eric Feng <Eric.Feng@amd.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
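The lane-to-coordinate mapping and the element split described above can be sketched as a small standalone model. This is an illustrative reconstruction, not the IREE API: `Layout`, `threadCoord`, and `splitOffset` are hypothetical names chosen for this sketch, and the logic assumes the stride-based lane mapping implied by the example layout.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical standalone model of an MMASingleSubgroupLayout-style layout.
struct Layout {
  std::array<int64_t, 2> thread;   // thread grid sizes per dimension
  std::array<int64_t, 2> tstrides; // lane stride per dimension
  std::array<int64_t, 2> element;  // elements owned per thread per dimension
};

// Coordinate of `lane` in the thread grid. Lanes whose ids differ only below
// the smallest stride collapse to the same coordinate -- that is the implied
// broadcast when prod(thread) < subgroupSize.
std::array<int64_t, 2> threadCoord(const Layout &l, int64_t lane) {
  return {(lane / l.tstrides[0]) % l.thread[0],
          (lane / l.tstrides[1]) % l.thread[1]};
}

// With element splitting enabled, each of the `physicalLanesPerThread`
// broadcast siblings takes a disjoint slice of the per-thread elements
// instead of a duplicate copy.
int64_t splitOffset(const Layout &l, int64_t lane,
                    int64_t physicalLanesPerThread) {
  int64_t sub = lane % physicalLanesPerThread; // which broadcast sibling
  return sub * (l.element[1] / physicalLanesPerThread);
}
```

For the example layout, t0 and t1 map to the same thread coordinate, but with `physicalLanesPerThread = 2` their element offsets become 0 and 8, so each loads 8 unique elements.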
Signed-off-by: Eric Feng <Eric.Feng@amd.com>
Applies the VDMFMA-1 changes: renames FP8/BF8 enum variants to F8E4M3FNUZ/F8E5M2FNUZ, switches expand/collapse accumulator to use vector.interleave/deinterleave, adds isVDMFMAIntrinsic helper and header declarations, and fixes getDistributedTileTypes broadcastFactor logic. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Eric Feng <Eric.Feng@amd.com>
Signed-off-by: Eric Feng <Eric.Feng@amd.com>
Signed-off-by: Eric Feng <Eric.Feng@amd.com>
Member
Author
@krzysz00 updated with
krzysz00 approved these changes Apr 7, 2026
krzysz00 (Contributor) left a comment
I think this design makes a lot of sense and I don't see any real issues with it. One question (aka can we abstract out some magic constants into a function) and then let's ship it!
Signed-off-by: Eric Feng <Eric.Feng@amd.com>
This PR introduces virtual dense MFMAs (VDMFMA), which implement the sparse trick for skinny GEMM (M=8) workloads, inspired by the Hugging Face article "Creating custom kernels for the AMD MI300". This patch implements VDMFMA for all CDNA3 variants of the sparse MFMA intrinsic. Further plumbing and support for CDNA4 variants will come in future PRs.
Sparse MFMA (V_SMFMAC) instructions perform MMA on an imbalanced pair of operands: a 4:2 structured-sparse matrix A and a dense matrix B. The instruction also takes a sparsity index that encodes which 2 of every 4 elements along K are non-zero within the sparse matrix A. The trick exploits this by pairing even/odd lanes to jointly describe a full dense row.
The benefit over the current approach of padding M=8 to M=16 for dense MFMA is that sparse MFMA processes twice the K-depth of the equivalent dense MFMA in the same number of cycles.
Performance comparison for FP16/FP8 (coupled with optimizing out redundant accumulator processing):
Assisted by: Claude