Commit 4aa6196
[GPU][Codegen] Implement virtual dense mfma (VDMFMA) (#23677)
This PR introduces virtual dense MFMAs (VDMFMA), which implement the
sparse trick for skinny GEMM (M=8) workloads, inspired by the Hugging
Face article [Creating custom kernels for the AMD
MI300](https://huggingface.co/blog/mi300kernels). This patch implements
VDMFMA for all CDNA3 variants of the sparse MFMA intrinsic. Further
plumbing and support for CDNA4 variants will follow in future PRs.
Sparse MFMA (V_SMFMAC) instructions perform MMA on an imbalanced pair of
operands: a 4:2 structured-sparse matrix A and a dense matrix B. The
instruction also takes a sparsity index that encodes which 2 of every 4
elements along K are non-zero within the sparse matrix A. The trick
exploits this by pairing even/odd lanes to jointly describe a full dense
row.
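The pairing idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the lowering this PR emits: the helper names (`sparse_dot`, `split_dense_row`, `vdmfma_dot`) are hypothetical, and the choice that the even lane keeps positions 0-1 and the odd lane keeps positions 2-3 of each group of 4 is assumed for clarity. The actual lane and index layout is defined by the hardware and the codegen.

```python
def sparse_dot(compressed, index, b_col):
    """Emulate one 4:2 structured-sparse row of A times one dense column of B.

    compressed: the 2 kept values for every group of 4 along K.
    index: one (p0, p1) pair per group, giving the positions (0..3)
           that the two kept values occupy inside that group.
    """
    total = 0
    for g, (p0, p1) in enumerate(index):
        total += compressed[2 * g] * b_col[4 * g + p0]
        total += compressed[2 * g + 1] * b_col[4 * g + p1]
    return total


def split_dense_row(row):
    """Split a dense row into two 4:2 sparse rows (assumed even/odd lane layout)."""
    groups = len(row) // 4
    even_vals, odd_vals = [], []
    for g in range(groups):
        even_vals += row[4 * g : 4 * g + 2]      # positions 0 and 1 of the group
        odd_vals += row[4 * g + 2 : 4 * g + 4]   # positions 2 and 3 of the group
    return (even_vals, [(0, 1)] * groups), (odd_vals, [(2, 3)] * groups)


def vdmfma_dot(row, b_col):
    """Dense dot product recovered as the sum of the two sparse halves."""
    (ev, ei), (ov, oi) = split_dense_row(row)
    return sparse_dot(ev, ei, b_col) + sparse_dot(ov, oi, b_col)


row = [1, 2, 3, 4, 5, 6, 7, 8]
b_col = [8, 7, 6, 5, 4, 3, 2, 1]
dense = sum(a * b for a, b in zip(row, b_col))
```

Each sparse half satisfies the 2-of-4 constraint by construction, and together the two halves cover every element of the dense row, so their partial results sum to the dense dot product.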
The benefit over the current approach of padding M=8 to M=16 for dense
MFMA is that sparse MFMA processes twice the K-depth of the equivalent
dense MFMA in the same number of cycles.
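As a rough back-of-the-envelope sketch, the K-depth argument can be expressed as an instruction-step count. The per-instruction K depths used below (16 for dense, 32 for sparse) are assumptions chosen to illustrate the 2x ratio for an f16 tile, and `k_steps` is a hypothetical helper, not part of this patch.

```python
def k_steps(k, k_per_instr):
    """Number of MFMA issues needed to cover a reduction of length k."""
    return -(-k // k_per_instr)  # ceiling division


K = 16384
DENSE_K = 16    # assumed K depth of one dense MFMA
SPARSE_K = 32   # assumed K depth of the matching sparse MFMA (twice the dense depth)

dense_steps = k_steps(K, DENSE_K)     # M=8 padded to M=16; half the rows are padding
sparse_steps = k_steps(K, SPARSE_K)   # even/odd lanes pair up, so all 16 rows carry data
ratio = dense_steps / sparse_steps
```

Under these assumptions the sparse path issues half as many instructions along K for the same output tile, which is where the theoretical benefit comes from. End-to-end gains are smaller because memory traffic and other overheads are unchanged.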
Performance comparison for FP16/FP8 (coupled with optimizing out
redundant accumulator processing):
```
shape              vdmfma   baseline  time saved
-----------------  -------  --------  ----------
f16_8x13312x16384  214 us   217 us    +1.4%
f16_8x13312x8192   123 us   122 us    -
f16_8x2304x16384   137 us   145 us    +5.5%
f16_8x2304x8192    114 us   115 us    -
f16_8x6656x16384   144 us   145 us    -
f16_8x6656x8192    116 us   123 us    +5.7%
fp8_8x13312x16384  135 us   133 us    -
fp8_8x13312x8192   113 us   106 us    -6.6%
fp8_8x2304x16384   113 us   126 us    +10.3%
fp8_8x2304x8192    92.6 us  98.6 us   +6.1%
fp8_8x6656x16384   114 us   133 us    +14.3%
fp8_8x6656x8192    97.8 us  108 us    +9.4%
```
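The last column of the table is consistent with the reduction in runtime relative to the baseline, `(baseline - vdmfma) / baseline`. The PR does not state this convention, so it is inferred from the numbers. A small sketch that recomputes it for a few rows (`percent_saved` is a hypothetical helper name):

```python
def percent_saved(vdmfma_us, baseline_us):
    """Runtime reduction relative to the baseline, in percent (positive = faster)."""
    return 100.0 * (baseline_us - vdmfma_us) / baseline_us


# (shape, vdmfma us, baseline us) copied from the table above
rows = [
    ("f16_8x2304x16384", 137.0, 145.0),
    ("fp8_8x13312x8192", 113.0, 106.0),
    ("fp8_8x6656x16384", 114.0, 133.0),
    ("fp8_8x6656x8192", 97.8, 108.0),
]

results = {shape: round(percent_saved(v, b), 1) for shape, v, b in rows}
```

Rows where the two timings differ by less than about 1% are shown as `-` in the table.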
Assisted by: Claude
---------
Signed-off-by: Eric Feng <Eric.Feng@amd.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent f02ed7d commit 4aa6196
5 files changed
Lines changed: 739 additions & 9 deletions
File tree
- compiler/src/iree/compiler/Codegen/Dialect/GPU
- IR
- test
- TransformExtensions/test