[GPU][Codegen] Implement virtual dense mfma (VDMFMA)#23677

Merged
efric merged 7 commits into main from users/efric/vdmfma
Apr 8, 2026
Conversation

@efric efric commented Mar 6, 2026

This PR introduces virtual dense MFMAs (VDMFMA), which implement the sparse trick for skinny GEMM (M=8) workloads, inspired by the Hugging Face article "Creating custom kernels for the AMD MI300". This patch implements VDMFMA for all CDNA3 variants of the sparse MFMA intrinsic. Further plumbing and support for the CDNA4 variants will come in future PRs.

Sparse MFMA (V_SMFMAC) instructions perform MMA on an imbalanced pair of operands: a 4:2 structured-sparse matrix A and a dense matrix B. The instruction also takes a sparsity index that encodes which 2 of every 4 elements along K are non-zero within the sparse matrix A. The trick exploits this by pairing even/odd lanes to jointly describe a full dense row.

The benefit over the current approach of padding M=8 to M=16 for dense MFMA is that sparse MFMA processes twice the K-depth of the equivalent dense MFMA in the same number of cycles.
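To make the lane-pairing trick concrete, here is a minimal standalone sketch (a conceptual model, not the IREE codegen): a sparse MFMA consumes, per 4-wide K group, 2 stored values from A plus a sparsity index giving their K positions. Pairing an even lane (holding positions 0 and 1) with its odd neighbor (holding positions 2 and 3) lets the pair jointly describe a full dense row at twice the K depth.

```python
def smfmac_dot(a_vals, a_idx, b):
    """Dot product of a 4:2 sparse A row against a dense B column.
    a_vals: the 2 stored values per group of 4; a_idx: their K positions."""
    return sum(v * b[k] for v, k in zip(a_vals, a_idx))

# A dense A row of K=4 and a dense B column.
a_dense = [1.0, 2.0, 3.0, 4.0]
b_col = [10.0, 20.0, 30.0, 40.0]

# Even lane claims K positions 0 and 1; odd lane claims positions 2 and 3.
even = smfmac_dot([a_dense[0], a_dense[1]], [0, 1], b_col)
odd = smfmac_dot([a_dense[2], a_dense[3]], [2, 3], b_col)

# Together the pair computes the full dense dot product.
dense = sum(a * b for a, b in zip(a_dense, b_col))
assert even + odd == dense
```

Each physical lane still issues an ordinary sparse MFMA; only the operand packing and index assignment differ, which is why the virtual intrinsic costs the same cycles as the sparse one.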

Performance comparison for FP16/FP8 (coupled with optimizing out redundant accumulator processing):

shape              vdmfma  baseline  %
-----------------  ------  --------  -----
f16_8x13312x16384  214 us  217 us    +1.4%
f16_8x13312x8192   123 us  122 us    -
f16_8x2304x16384   137 us  145 us    +5.5%
f16_8x2304x8192    114 us  115 us    -
f16_8x6656x16384   144 us  145 us    -
f16_8x6656x8192    116 us  123 us    +5.7%
fp8_8x13312x16384  135 us  133 us    -
fp8_8x13312x8192   113 us  106 us    -6.6%
fp8_8x2304x16384   113 us  126 us    +10.3%
fp8_8x2304x8192    92.6 us 98.6 us   +6.1%
fp8_8x6656x16384   114 us  133 us    +14.3%
fp8_8x6656x8192    97.8 us 108 us    +9.4%

Assisted by: Claude

@efric efric force-pushed the users/efric/vdmfma branch from b4b0f2d to 1818058 Compare March 6, 2026 02:53
@efric efric force-pushed the users/efric/vdmfma branch 4 times, most recently from 64ffaca to b2a719d Compare March 8, 2026 22:44
@efric efric changed the title [do not review] vdmfma [GPU][Codegen] Implement virtual dense mfma (VDMFMA) Mar 8, 2026
@efric efric marked this pull request as ready for review March 9, 2026 05:59
@efric efric requested a review from kuhar March 9, 2026 06:00
@efric efric force-pushed the users/efric/vdmfma branch from 97dd26b to b4235ac Compare March 26, 2026 18:33
```cpp
int64_t subgroupSize = getSubgroupSize();
int64_t physicalLanesPerThread =
    subgroupSize / llvm::product_of(lhsLayout.thread);
if (isVDMFMAIntrinsic(intrinsic) && physicalLanesPerThread > 1) {
```
efric (author): This `isVDMFMAIntrinsic` check is actually redundant; it was added at a reviewer's suggestion to make it explicit in the code that this part is only relevant for VDMFMA.
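A quick sketch of the arithmetic in the guard above, using the example layout from the related commit (#23657): with `subgroupSize = 64` and `thread = {8, 4}`, the layout covers only 32 lanes, so each logical thread is backed by 2 physical lanes.

```python
from math import prod

subgroup_size = 64
thread = [8, 4]  # corresponds to lhsLayout.thread in the C++ hunk

# Mirrors: subgroupSize / llvm::product_of(lhsLayout.thread)
physical_lanes_per_thread = subgroup_size // prod(thread)
assert physical_lanes_per_thread == 2  # broadcast pairs exist; guard fires

# A layout that fully covers the subgroup takes the other branch.
assert subgroup_size // prod([16, 4]) == 1
```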

efric added a commit that referenced this pull request Mar 26, 2026
… < subgroupsize (#23657)

When `prod(thread) < subgroupSize` in `MMASingleSubgroupLayout`, there
is implied broadcasting: multiple threads map to the same index in the
thread layout and get the same data.

This patch adds an optional `physicalLanesPerThread` parameter (default 1) to `populateCanonicalOffsetsSizesAndStrides`, enabling callers to opt into element splitting so that broadcast lanes load disjoint slices rather than duplicate data. Existing call sites are unaffected since they use the default. `physicalLanesPerThread` is deliberately a caller-provided parameter rather than derived from `MMASingleSubgroupLayout`, keeping the layout struct a pure hardware description that does not encode downstream splitting policy.

For a concrete example, consider a layout with `subgroupSize = 64` and
`outer = {1, 1}, thread = {8, 4}, tstrides = {2, 16}, element = {1, 16}`

Here `prod(thread) = 32`, so `physicalLanesPerThread = 2`. The thread
assignment looks like:

```
t0,t1    t16,t17    t32,t33    t48,t49
t2,t3    t18,t19    t34,t35    t50,t51
t4,t5    t20,t21    t36,t37    t52,t53
t6,t7    t22,t23    t38,t39    t54,t55
t8,t9    t24,t25    t40,t41    t56,t57
t10,t11  t26,t27    t42,t43    t58,t59
t12,t13  t28,t29    t44,t45    t60,t61
t14,t15  t30,t31    t46,t47    t62,t63
```

Without `physicalLanesPerThread`, each pair (e.g., t0 and t1) loads the
same 16 elements. With `physicalLanesPerThread = 2`, each loads 8 unique
elements instead.
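The splitting described above can be sketched as follows (assumed indexing for illustration, not the actual MLIR lowering): with `element = {1, 16}`, each logical thread owns 16 contiguous elements, and `physicalLanesPerThread = 2` physical lanes map to the same logical thread. Splitting hands each lane a disjoint half.

```python
ELEMENTS = 16  # per-thread element count along the split dimension
SPLIT = 2      # physicalLanesPerThread

def lane_slice(lane_id, split=SPLIT, elements=ELEMENTS):
    """Element indices loaded by one physical lane within its logical thread."""
    sub = lane_id % split      # position within the broadcast pair
    chunk = elements // split  # 8 elements each instead of 16
    return list(range(sub * chunk, (sub + 1) * chunk))

# Without splitting, t0 and t1 would both load elements 0..15.
# With splitting, the pair covers 0..15 in disjoint halves:
assert lane_slice(0) == list(range(0, 8))
assert lane_slice(1) == list(range(8, 16))
assert set(lane_slice(0)) | set(lane_slice(1)) == set(range(ELEMENTS))
```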

The only existing hardware intrinsics that naturally have this broadcast property are the WMMAR3 LHS and RHS operands (`thread = {16, 1}` or `{1, 16}`, `subgroupSize = 32`) and, for the F16/BF16 variants, the accumulator as well. This mechanism is intended for virtual intrinsics such as VDMFMA (#23677), where broadcast lanes will be assigned disjoint K-slices.

Assisted by: Claude

---------

Signed-off-by: Eric Feng <Eric.Feng@amd.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Base automatically changed from users/efric/splitlaneloads to main March 26, 2026 20:07
efric and others added 6 commits April 5, 2026 17:34
Signed-off-by: Eric Feng <Eric.Feng@amd.com>
Signed-off-by: Eric Feng <Eric.Feng@amd.com>
Applies the VDMFMA-1 changes: renames FP8/BF8 enum variants to
F8E4M3FNUZ/F8E5M2FNUZ, switches expand/collapse accumulator to use
vector.interleave/deinterleave, adds isVDMFMAIntrinsic helper and
header declarations, and fixes getDistributedTileTypes broadcastFactor
logic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Eric Feng <Eric.Feng@amd.com>
Signed-off-by: Eric Feng <Eric.Feng@amd.com>
Signed-off-by: Eric Feng <Eric.Feng@amd.com>
Signed-off-by: Eric Feng <Eric.Feng@amd.com>
@efric efric force-pushed the users/efric/vdmfma branch from b4235ac to 851eaa0 Compare April 6, 2026 23:02

efric commented Apr 6, 2026

@krzysz00 updated with `util.hoistable_conversion`

@krzysz00 krzysz00 left a comment


I think this design makes a lot of sense and I don't see any real issues with it. One question (aka can we abstract out some magic constants into a function) and then let's ship it!

Comment thread on compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.cpp (outdated)
Signed-off-by: Eric Feng <Eric.Feng@amd.com>
@efric efric merged commit 4aa6196 into main Apr 8, 2026
63 of 65 checks passed
@efric efric deleted the users/efric/vdmfma branch April 8, 2026 00:26