diff --git a/docs/website/docs/community/blog/.authors.yml b/docs/website/docs/community/blog/.authors.yml
index 7e63d6074a00..f5dc78553e93 100644
--- a/docs/website/docs/community/blog/.authors.yml
+++ b/docs/website/docs/community/blog/.authors.yml
@@ -16,6 +16,12 @@ authors:
avatar: https://github.com/bjacob.png
url: https://github.com/bjacob
+ efric:
+ name: Eric Feng
+ description: Software Engineer
+ avatar: https://github.com/efric.png
+ url: https://github.com/efric
+
hanhanW:
name: Han-Chung Wang
description: Software Engineer
diff --git a/docs/website/docs/community/blog/posts/vdmfma_canvas.md b/docs/website/docs/community/blog/posts/vdmfma_canvas.md
new file mode 100644
index 000000000000..1609f1655829
--- /dev/null
+++ b/docs/website/docs/community/blog/posts/vdmfma_canvas.md
@@ -0,0 +1,395 @@
+---
+date: 2026-05-28
+authors:
+ - efric
+categories:
+ - Performance
+tags:
+ - GPU
+ - Codegen
+---
+
+# Virtual Dense MFMAs for Skinny GEMM
+
+When we have a GEMM `A * B = C`, and it is the situation that `A` has a
+small number of rows and many columns, we classify this problem as a skinny
+GEMM. The decode phase of LLM inference is a common sight of this problem: a
+small batch of tokens multiplies against a large weight matrix. Skinny GEMMs are
+less convenient for modern GPU architectures than their non-skinny cousins. One
+reason is because modern GPUs take advantage of matrix core units which offer
+instructions that are specifically designed for matrix multiplication and
+operate on fixed tile sizes, and skinny GEMMs are too small to utilize them to
+their intended size.
+
+On AMDGPUs and in particular on the MI3XX Instinct (CDNA) series, these
+instructions are known as MFMA instructions; for example,
+`V_MFMA_F32_16x16x16_F16`. One useful part of the name is the `MxNxK`
+tile shape consumed, where `M` is the number of rows of the left hand matrix,
+`N` is the number of columns of the right hand matrix, and `K` is the shared
+dimension of both.
+
+
+
+For the ordinary dense GEMM MFMA path available to AMDGPU CDNA series, the
+relevant 16-bit and 8-bit MFMAs have at least 16 rows in M. Consider M=8, which
+is larger than the path we take in IREE for GEMV-like problems, but evidently
+smaller than 16. The previous codegen path in IREE handled this by padding the
+workgroup `M` tile to 16 and
+using the ordinary dense MFMA configuration. The IR snippet below shows this
+directly: the logical `M=8` operation is configured with `padding = [16, ...]`,
+a dense `mma_layout`, and a `workgroup` tile of 16 rows.
+
+
+IR with padding
+
+```mlir
+%10 = linalg.generic {
+ indexing_maps = [
+ affine_map<(d0, d1, d2) -> (d0, d2)>,
+ affine_map<(d0, d1, d2) -> (d1, d2)>,
+ affine_map<(d0, d1, d2) -> (d0, d1)>
+ ],
+ iterator_types = ["parallel", "parallel", "reduction"]
+} ins(%6, %7 : tensor<8x16384xf16>, tensor<13312x16384xf16>)
+ outs(%9 : tensor<8x13312xf32>)
+ attrs = {
+ lowering_config = #iree_gpu.lowering_config<{
+ mma_kind = #iree_gpu.mma_layout,
+ padding = [16, 64, 128],
+ promote_operands = [0, 1],
+ reduction = [0, 0, 8],
+ subgroup = [1, 2, 0],
+ workgroup = [16, 64, 0]
+ }>
+ } {
+ ...
+}
+```
+
+
+
+
+Padding is simple and robust, but we would be wasting cycles on rows that are
+not present in the original matrix. The question is whether we can use the 16
+physical rows of the hardware instruction more carefully.
+
+## Removing Padding with Sparse MFMA
+
+AMD sparse MFMA instructions, `V_SMFMAC`, are matrix-core accumulate
+instructions for a 4:2 structured-sparse `A` matrix and a dense `B` matrix. The
+old `D` value is the accumulator, and the encoded third source is sparse index
+metadata, not a separate `C` matrix operand. The 4:2 structured-sparse
+operand is defined along `K`: in each group of four `K` positions, the sparse
+index metadata tells the
+instruction which two positions are non-zero.
+
+On CDNA3/gfx942, the relevant sparse instruction has the same physical `16x16`
+output tile and the same number of cycles. For
+F16/BF16, dense `V_MFMA_F32_16X16X16_F16` and sparse
+`V_SMFMAC_F32_16X16X32_F16` are both 16-cycle instructions on gfx942. For
+8-bit inputs, the analogous 16-cycle sparse instruction is `16x16x64`.
+
+The idea, described in the Hugging Face
+[MI300 kernel article](https://huggingface.co/blog/mi300kernels), is to make
+two sparse rows represent one dense row. One lane selects positions
+`{0, 1}` in each group of four. Its paired lane selects positions `{2, 3}`.
+Together, the two lanes cover the dense `K` positions for one logical row. The
+benefit, in addition to removing padding, is that a 16-cycle sparse
+instruction covers twice the logical `K` depth of the corresponding dense
+16-cycle F16/BF16 MFMA.
+
+
+
+Figure: "Using sparsity for skinny inputs" from
+[Creating custom kernels for the AMD MI300](https://huggingface.co/blog/mi300kernels).
+
+After the sparse MFMA, the four-element native accumulator contains pairs of
+partial sums for the same logical rows. The lowering adds those pairs together,
+so the result again has the normal dense `M=8` meaning.
+
+## Original HuggingFace Approach
+
+On the standard path for processing data enroute to MFMA instructions, we go
+through global memory -> LDS/Shared memory -> Registers -> MFMA instruction*.
+In the original Hugging Face skinny GEMM kernel, data from matrix `A` is shuffled
+on the way into LDS. The shuffle is necessary to meet the semantics of using the
+sparse trick. If we were to use even lanes to select positions `{0,1}` and odd
+lanes to select positions `{2,3}`, then for a load with 8 contiguous elements along
+`K`:
+
+```text
+K0 K1 K2 K3 K4 K5 K6 K7
+```
+
+We would want even lanes to hold:
+
+```text
+K0 K1 K4 K5
+```
+
+and odd lanes to hold
+
+```text
+K2 K3 K6 K7
+```
+
+In other words, the data loaded from LDS looks exactly like:
+
+```text
+lane 0: K0 K1 _ _ K4 K5 _ _
+lane 1: _ _ K2 K3 _ _ K6 K7
+```
+
+Together (as an even/odd pair), and across all threads in the subgroup, these
+precisely reconstruct the original dense rows. Following the loop around the
+inner K tile, these partials are then reduced to yield the full dense result.
+
+??? note "* Shared-memory hierarchy note"
+
+ This path is a simplified storyline. The actual shared-memory hierarchy has
+ more detail than is useful for the VDMFMA discussion; refer to the AMDGPU ISA
+ documentation for the full memory hierarchy and instruction-level behavior.
+
+## Adaptation in IREE as VDMFMA
+
+The HF kernel makes `A` sparse-trick "friendly" before we read it from shared
+memory. If IREE wanted to materialize that shuffled `A` form as a
+compiler-owned tensor or storage layout, the natural existing mechanism would be
+data tiling: attach an encoding, carry the encoded tensor type through the
+producer/consumer boundary, and materialize the layout change with
+packing/unpacking or other physical layout operations when needed. That is the
+model described in IREE's [data-tiling path](data-tiling-walkthrough.md). In the
+GPU data-tiling path, encoded contractions reach
+`#iree_gpu.data_tiled_mma_layout` on `iree_codegen.inner_tiled`.
+
+Instead, we take advantage of "virtual" MMAs in IREE. Virtual MMAs in IREE
+represent a lowering which is intended to match real MFMAs in the same way but
+are otherwise composed of or are a modification of ordinary MFMAs.
+`#iree_gpu.virtual_mma_layout` is an MMA/inner-tile descriptor: it supplies the
+semantic tile shape, distributed thread layout, and target lowering, while the
+promoted/shared-memory layouts remain unchanged. The
+subgroup level MMA lowering keeps `A` as is when loaded from LDS and performs a
+per-lane shuffle of the `B` matrix register data. Choosing to shuffle `B` in
+registers keeps this part local to the virtual MMA; shuffling `A` into LDS would
+also need a matching promotion/read layout for that operand. The final assembly
+forms generates `ds_read2_b64` LDS reads, which incidentally loads
+twice as much data from LDS as the HF kernel.
+
+With VDMFMA, we give flexibility and keep the sparse trick from becoming a
+skinny-only tensor layout. The current selector still uses it conservatively,
+only when the problem's total
+`M` fits in the virtual `M=8` tile and total `K` is divisible by the VDMFMA
+selection tile. But the abstraction is an `8`-row virtual MMA, not an encoded
+storage format for an entire matmul. A future selector could tile a larger
+multiple-of-8 `M` problem into VDMFMA-sized pieces.
+
+Concretely, we represent VDMFMA in the following form:
+
+```mlir
+#iree_gpu.virtual_mma_layout
+```
+
+Read this as a dense `8x16x64` virtual operation with F16 inputs and F32
+accumulation. The trailing `x2` says that, on the
+CDNA3 F16 path, the virtual operation lowers to two native sparse MFMA
+instructions along `K`.
+
+At the virtual MMA level, each lane sees dense fragments:
+
+```text
+A : vector<8xf16>
+B : vector<16xf16>
+Acc : vector<2xf32>
+```
+
+The sparse instruction wants a different physical view:
+
+```text
+A : vector<4xf16>
+B : vector<8xf16>
+Acc/D : vector<4xf32>
+SparseIndex : vector<4xi8>
+```
+
+VDMFMA is the adapter between these two views. It expands the accumulator,
+chooses sparse metadata from lane parity, slices `A` and `B`, shuffles the per-lane
+`B` register fragment, issues the sparse MFMAs, and collapses the accumulator
+back to the dense virtual shape.
+
+For one lane pair, the two instructions can be visualized as follows. The `K`
+numbering below is the numbering in the dense per-lane fragment after
+distribution. `--` marks `A` positions that are implied zero for that physical
+sparse row. The non-zero `A` samples are packed, and sparse index metadata maps
+them back to positions within each `K` group of four.
+
+```text
+ first smfmac second smfmac
+sparse indices 0 1 2 3 | 0 1 2 3 0 1 2 3 | 0 1 2 3
+L0, selector 0x44 K0 K1 -- --| K2 K3 -- -- K4 K5 -- --| K6 K7 -- --
+L1, selector 0xEE -- -- K8 K9| -- -- K10 K11 -- -- K12 K13| -- -- K14 K15
+B after shuffle B0 B1 B8 B9| B2 B3 B10 B11 B4 B5 B12 B13| B6 B7 B14 B15
+```
+
+The corresponding shuffle indices in the lowering are:
+
+```text
+first smfmac B shuffle: [0, 1, 8, 9, 2, 3, 10, 11]
+second smfmac B shuffle: [4, 5, 12, 13, 6, 7, 14, 15]
+```
+
+The lowering may thus be logically represented as:
+
+```text
+acc = [d0, d1] -> [d0, 0, d1, 0]
+
+sparse_index = (lane_id & 1) ? 0xEE : 0x44
+
+acc = smfmac(A[0:4], shuffle(B, [0, 1, 8, 9, 2, 3, 10, 11]), acc, sparse_index)
+acc = smfmac(A[4:8], shuffle(B, [4, 5, 12, 13, 6, 7, 14, 15]), acc, sparse_index)
+
+acc = [d0, d1, d2, d3] -> [d0 + d1, d2 + d3]
+```
+
+The accumulator conversions are wrapped in `util.hoistable_conversion`. In
+IREE, this marks temporary marshaling between the layout used by `inner_tiled`
+and the layout expected by the target intrinsic, so matching conversions can be
+moved out of loops or canceled when the surrounding IR permits it. For VDMFMA,
+that marshaling expands the logical two-element accumulator into the
+four-element SMFMAC form before the sparse MFMA chain, then collapses the native
+accumulator back by summing lane-pair partials.
+
+## Virtual MMA Layout in VDMFMA
+
+The virtual MMA layout uses `MMASingleSubgroupLayout`, so it is worth unpacking
+the terminology.
+
+A single subgroup layout describes how one operand of one subgroup-level matrix
+operation is distributed across lanes in IREE. More precisely, it maps a lane id
+and a per-lane vector element index to semantic operand dimensions such as `M`,
+`N`, and `K`. For each semantic operand dimension, it has:
+
+* `outer`: outer repetitions of element tiles in the logical per-thread operand
+ vector;
+* `thread`: the logical thread grid over all dimensions;
+* `tstrides`: the lane-id stride for moving by one element tile along that dimension;
+* `element`: the contiguous logical element tile within that vector
+
+For each dimension, `outer[i] * thread[i] * element[i]` is the semantic tile
+size. For the F16 VDMFMA LHS, IREE uses:
+
+```text
+outer = {1, 1}
+thread = {8, 4}
+tstrides = {2, 16}
+element = {1, 16}
+```
+
+The semantic dimensions are `M` and `K`, so this is an `8x64` LHS tile:
+`1 * 8 * 1 = 8` rows and `1 * 4 * 16 = 64` reduction elements. The thread-grid
+part can be visualized as adjacent lane pairs over the `8x4` M/K grid:
+
+```text
+ K thread coordinate
+ 0 1 2 3
+ M0 T0, T1 T16, T17 T32, T33 T48, T49
+ M1 T2, T3 T18, T19 T34, T35 T50, T51
+ M2 T4, T5 T20, T21 T36, T37 T52, T53
+ M3 T6, T7 T22, T23 T38, T39 T54, T55
+ M4 T8, T9 T24, T25 T40, T41 T56, T57
+ M5 T10, T11 T26, T27 T42, T43 T58, T59
+ M6 T12, T13 T28, T29 T44, T45 T60, T61
+ M7 T14, T15 T30, T31 T46, T47 T62, T63
+```
+
+For ordinary layouts, `prod(outer) * prod(element)` is the actual per-lane
+vector length. Here, the product of `thread` is 32, while the CDNA3 subgroup
+size is 64. This means that lanes `2p` and `2p+1` therefore share the same
+logical M/K thread-grid coordinates. IREE then splits the divisible element
+dimension, K, so lane `2p`
+receives the lower 8 elements of the 16-wide `K` element tile and lane `2p+1`
+receives the upper 8. The RHS and accumulator layouts have thread products of
+64, so their logical thread-grid positions already match the physical lanes.
+
+This is the layout-side part that gives VDMFMA the "virtual dense" behavior: the
+compiler still distributes a dense `8x64` LHS tile, but the physical lanes are
+grouped so that each even/odd lane pair owns the two dense halves that the sparse
+instruction trick will reinterpret.
+
+## Selecting VDMFMA
+
+VDMFMA is not selected for every matmul. IREE has multiple codegen pipelines,
+and the one which is relevant for the shape of skinny GEMMs belongs to
+TileAndFuse. TileAndFuse derives VDMFMA candidates from the target's concrete
+MFMA capabilities. On the CDNA3 F16 path, the
+virtual `VDMFMA_F32_8x16x64x2_F16` candidate is derived from
+`MFMA_F32_16x16x16_F16`.
+
+There is one tuning detail that is easy to miss. Since sparse MFMAs have twice
+the K-depth as dense MFMAs, the compute phase is shorter than the padded dense
+MFMA sequence it replaces.
+In a software-pipelined loop, that can reduce the amount of compute available
+to hide the next tile's memory latency. The final selection change scales the
+reduction tile count by the virtual intrinsic's K unroll factor to compensate
+for the shorter compute phase.
+
+With VDMFMA selected for the same shape, the new IR excerpt
+has no `M=16` padding. The workgroup `M` tile is 8, and the MMA kind is the
+virtual layout.
+
+
+IR with VDMFMA
+
+```mlir
+%10 = linalg.generic {
+ indexing_maps = [
+ affine_map<(d0, d1, d2) -> (d0, d2)>,
+ affine_map<(d0, d1, d2) -> (d1, d2)>,
+ affine_map<(d0, d1, d2) -> (d0, d1)>
+ ],
+ iterator_types = ["parallel", "parallel", "reduction"]
+} ins(%6, %7 : tensor<8x16384xf16>, tensor<13312x16384xf16>)
+ outs(%9 : tensor<8x13312xf32>)
+ attrs = {
+ lowering_config = #iree_gpu.lowering_config<{
+ mma_kind =
+ #iree_gpu.virtual_mma_layout,
+ promote_operands = [0, 1],
+ reduction = [0, 0, 4],
+ subgroup = [1, 2, 0],
+ workgroup = [8, 64, 0]
+ }>
+ } {
+ ...
+}
+```
+
+
+
+
+## Performance
+
+The first end-to-end 16-bit selection change reported the following numbers on
+CDNA3, compared with the padded dense baseline:
+
+| Shape | VDMFMA | Baseline | Improvement |
+| --- | ---: | ---: | ---: |
+| `f16_8x13312x16384` | 189 us | 206 us | +8.3% |
+| `f16_8x13312x8192` | 117 us | 116 us | - |
+| `f16_8x2304x16384` | 133 us | 138 us | +3.6% |
+| `f16_8x2304x8192` | 103 us | 110 us | +6.4% |
+| `f16_8x6656x16384` | 127 us | 130 us | +2.3% |
+| `f16_8x6656x8192` | 102 us | 109 us | +6.4% |
+
+## Conclusion
+
+VDMFMA is a small compiler abstraction around a target-specific instruction
+mapping. This is represented in the IR as a "virtual dense" `8x16xK` MMA.
+The generated code for the F16 kernel above uses paired `ds_read2_b64` LDS reads
+to form dense per-lane fragments; the virtual MMA lowering then uses lane
+parity, `B` register shuffling, sparse MFMA instructions and accumulator
+reduction to fulfill the conditions of the sparse trick for skinny GEMMs. At
+configuration time, it is currently selected only for skinny shapes where the
+total `M` fits within the virtual `M=8` tile and total `K`
+is divisible by the VDMFMA selection tile. The result is an end-to-end
+adaptation of a hand-written HIP optimization into IREE's AMDGPU codegen
+pipeline.