Commit 4aa6196

efric and claude authored
[GPU][Codegen] Implement virtual dense mfma (VDMFMA) (#23677)
This PR introduces virtual dense MFMAs (VDMFMA), which implement the sparse trick for skinny GEMM (M=8) workloads, inspired by the Hugging Face article [Creating custom kernels for the AMD MI300](https://huggingface.co/blog/mi300kernels). This patch implements VDMFMA for all CDNA3 variants of the sparse MFMA intrinsic. Further plumbing and support for CDNA4 variants will come in future PRs.

Sparse MFMA (V_SMFMAC) instructions perform MMA on an imbalanced pair of operands: a 4:2 structured-sparse matrix A and a dense matrix B. The instruction also takes a sparsity index that encodes which 2 of every 4 elements along K are non-zero within the sparse matrix A. The trick exploits this by pairing even/odd lanes to jointly describe a full dense row. The benefit over the current approach of padding M=8 to M=16 for dense MFMA is that sparse MFMA processes twice the K-depth of the equivalent dense MFMA in the same number of cycles.

Performance comparison for FP16/FP8 (coupled with optimizing out redundant accumulator processing; a positive % means VDMFMA is faster than the baseline):

```
shape              vdmfma   baseline  %
-----------------  -------  --------  ------
f16_8x13312x16384  214 us   217 us    +1.4%
f16_8x13312x8192   123 us   122 us    -
f16_8x2304x16384   137 us   145 us    +5.5%
f16_8x2304x8192    114 us   115 us    -
f16_8x6656x16384   144 us   145 us    -
f16_8x6656x8192    116 us   123 us    +5.7%
fp8_8x13312x16384  135 us   133 us    -
fp8_8x13312x8192   113 us   106 us    -6.6%
fp8_8x2304x16384   113 us   126 us    +10.3%
fp8_8x2304x8192    92.6 us  98.6 us   +6.1%
fp8_8x6656x16384   114 us   133 us    +14.3%
fp8_8x6656x8192    97.8 us  108 us    +9.4%
```

Assisted by: Claude

---------

Signed-off-by: Eric Feng <Eric.Feng@amd.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
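The arithmetic behind the sparse trick can be sketched in plain Python. This is a numerical model only, not the lane layout or the code in this patch: it assumes each dense row of A is split into an "even" half (positions 0 and 1 of every group of 4 along K) and an "odd" half (positions 2 and 3), that each half is a valid 4:2-sparse row with its own index, and that the two partial products are summed to recover the dense result. All function names here are hypothetical.

```python
GROUP = 4  # 4:2 sparsity: 2 of every 4 elements along K are stored


def split_dense_row(row):
    """Split one dense row into two compressed 4:2-sparse rows.

    Returns ((even_vals, even_idx), (odd_vals, odd_idx)), where each idx
    entry is the position (0..3) of the value inside its group of 4.
    The even/odd split is an assumption made for illustration.
    """
    assert len(row) % GROUP == 0
    even_vals, even_idx, odd_vals, odd_idx = [], [], [], []
    for g in range(0, len(row), GROUP):
        even_vals += [row[g], row[g + 1]]
        even_idx += [0, 1]
        odd_vals += [row[g + 2], row[g + 3]]
        odd_idx += [2, 3]
    return (even_vals, even_idx), (odd_vals, odd_idx)


def sparse_row_times_dense(vals, idx, b):
    """Multiply one compressed sparse row by a dense K x N matrix b."""
    n = len(b[0])
    out = [0] * n
    for j, (v, i) in enumerate(zip(vals, idx)):
        k = (j // 2) * GROUP + i  # recover the original K position
        for c in range(n):
            out[c] += v * b[k][c]
    return out


def dense_via_sparse(a, b):
    """Compute a @ b by pairing two sparse rows per dense row of a."""
    result = []
    for row in a:
        (ev, ei), (ov, oi) = split_dense_row(row)
        even = sparse_row_times_dense(ev, ei, b)
        odd = sparse_row_times_dense(ov, oi, b)
        result.append([x + y for x, y in zip(even, odd)])
    return result


def dense_matmul(a, b):
    """Reference dense matrix multiply for comparison."""
    return [
        [sum(a[m][k] * b[k][n] for k in range(len(b))) for n in range(len(b[0]))]
        for m in range(len(a))
    ]
```

In this model an M=8 dense operand occupies 16 sparse rows, so the padding rows that a dense M=16 MFMA would waste carry the second half of K instead.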
1 parent f02ed7d commit 4aa6196

5 files changed

Lines changed: 739 additions & 9 deletions
