[GPU][Codegen] Implement virtual dense mfma (VDMFMA) #23677
Merged
efric commented Mar 26, 2026
```cpp
int64_t subgroupSize = getSubgroupSize();
int64_t physicalLanesPerThread =
    subgroupSize / llvm::product_of(lhsLayout.thread);
if (isVDMFMAIntrinsic(intrinsic) && physicalLanesPerThread > 1) {
```
Member
Author
This `isVDMFMAIntrinsic` check is actually redundant; it was a suggestion to make it explicit in the code that this part is only relevant for VDMFMA.
efric added a commit that referenced this pull request on Mar 26, 2026
… < subgroupsize (#23657)

When `prod(thread) < subgroupSize` in `MMASingleSubgroupLayout`, there is implied broadcasting: multiple threads map to the same index in the thread layout and get the same data. This patch adds an optional `physicalLanesPerThread` parameter (default 1) to `populateCanonicalOffsetsSizesAndStrides`, enabling callers to opt into element splitting so that broadcast lanes load disjoint slices rather than duplicate data. Existing callsites are unaffected since they use the default.

`physicalLanesPerThread` is deliberately a caller-provided parameter rather than derived from `MMASingleSubgroupLayout`, keeping the layout struct a pure hardware description without encoding downstream splitting policy.

For a concrete example, consider a layout with `subgroupSize = 64` and `outer = {1, 1}, thread = {8, 4}, tstrides = {2, 16}, element = {1, 16}`. Here `prod(thread) = 32`, so `physicalLanesPerThread = 2`. The thread assignment looks like:

```
t0t1   t16t17 t32t33 t48t49
t2t3   t18t19 t34t35 t50t51
t4t5   t20t21 t36t37 t52t53
t6t7   t22t23 t38t39 t54t55
t8t9   t24t25 t40t41 t56t57
t10t11 t26t27 t42t43 t58t59
t12t13 t28t29 t44t45 t60t61
t14t15 t30t31 t46t47 t62t63
```

Without `physicalLanesPerThread`, each pair (e.g., t0 and t1) loads the same 16 elements. With `physicalLanesPerThread = 2`, each loads 8 unique elements instead.

The only existing hardware intrinsics which naturally have this broadcast property are the WMMAR3 LHS and RHS operands (`thread = {16, 1}` or `{1, 16}`, `subgroupSize = 32`) and, for the F16/BF16 variants, the accumulator as well. This mechanism is intended for virtual intrinsics such as VDMFMA (#23677), where broadcast lanes will be assigned disjoint K-slices.

Assisted by: Claude

---------

Signed-off-by: Eric Feng <Eric.Feng@amd.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
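The lane-to-coordinate mapping and the element split described above can be sketched as a small standalone model. This is an illustrative reconstruction, not the IREE API: `Layout`, `threadCoord`, and `splitOffset` are hypothetical names chosen for this sketch, and the logic assumes the stride-based lane mapping implied by the example layout.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical standalone model of an MMASingleSubgroupLayout-style layout.
struct Layout {
  std::array<int64_t, 2> thread;   // thread grid sizes per dimension
  std::array<int64_t, 2> tstrides; // lane stride per dimension
  std::array<int64_t, 2> element;  // elements owned per thread per dimension
};

// Coordinate of `lane` in the thread grid. Lanes whose ids differ only below
// the smallest stride collapse to the same coordinate -- that is the implied
// broadcast when prod(thread) < subgroupSize.
std::array<int64_t, 2> threadCoord(const Layout &l, int64_t lane) {
  return {(lane / l.tstrides[0]) % l.thread[0],
          (lane / l.tstrides[1]) % l.thread[1]};
}

// With element splitting enabled, each of the `physicalLanesPerThread`
// broadcast siblings takes a disjoint slice of the per-thread elements
// instead of a duplicate copy.
int64_t splitOffset(const Layout &l, int64_t lane,
                    int64_t physicalLanesPerThread) {
  int64_t sub = lane % physicalLanesPerThread; // which broadcast sibling
  return sub * (l.element[1] / physicalLanesPerThread);
}
```

For the example layout, t0 and t1 map to the same thread coordinate, but with `physicalLanesPerThread = 2` their element offsets become 0 and 8, so each loads 8 unique elements.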
Signed-off-by: Eric Feng <Eric.Feng@amd.com>
Applies the VDMFMA-1 changes: renames FP8/BF8 enum variants to F8E4M3FNUZ/F8E5M2FNUZ, switches expand/collapse accumulator to use vector.interleave/deinterleave, adds isVDMFMAIntrinsic helper and header declarations, and fixes getDistributedTileTypes broadcastFactor logic. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Eric Feng <Eric.Feng@amd.com>
Signed-off-by: Eric Feng <Eric.Feng@amd.com>
Signed-off-by: Eric Feng <Eric.Feng@amd.com>
Member
Author
@krzysz00 updated with
krzysz00 approved these changes Apr 7, 2026
krzysz00 (Contributor) left a comment
I think this design makes a lot of sense and I don't see any real issues with it. One question (aka can we abstract out some magic constants into a function) and then let's ship it!
Signed-off-by: Eric Feng <Eric.Feng@amd.com>
This PR introduces virtual dense MFMAs (VDMFMA), which implement the sparse trick for skinny GEMM (M=8) workloads, inspired by the Hugging Face article "Creating custom kernels for the AMD MI300". This patch implements VDMFMA for all CDNA3 variants of the sparse MFMA intrinsic. Further plumbing and support for CDNA4 variants will come in future PRs.
Sparse MFMA (V_SMFMAC) instructions perform MMA on an imbalanced pair of operands: a 4:2 structured-sparse matrix A and a dense matrix B. The instruction also takes a sparsity index that encodes which 2 of every 4 elements along K are non-zero within the sparse matrix A. The trick exploits this by pairing even/odd lanes to jointly describe a full dense row.
The benefit over the current approach of padding M=8 to M=16 for dense MFMA is that sparse MFMA processes twice the K-depth of the equivalent dense MFMA in the same number of cycles.
Performance comparison for FP16/FP8 (coupled with optimizing out redundant accumulator processing):
Assisted by: Claude