
[AMDGPU] comgr: Scratch Patches for B0-to-A0 Hotswap 2/2 -- Scaled WMMA#2731

Draft
jammm wants to merge 6 commits into amd-staging from users/jam/comgr-wmma-scale16-staging


@jammm jammm commented May 29, 2026

Re-create PR based on #2365

Summary

Add a COMGR hotswap scratch-patch pass for GFX1250 WMMA Scale16 instructions.

This decomposes v_wmma_scale16_f32_* encodings into A0-supported v_wmma_scale_f32_16x16x128_f8f6f4 sequences. The pass reduces block-16 scale operands to block-32 scale operands using a VALU preamble, rewrites the WMMA encoding, and routes expanded instruction sequences through a trampoline.

The pass now supports both Scale16 forms handled by this PR:

  • v_wmma_scale16_f32_16x16x128_f8f6f4
    • rewritten into one regular Scale WMMA
  • v_wmma_scale16_f32_32x16x128_f4
    • split into two 16x16 regular Scale WMMAs along the M dimension

Changes

  • Add comgr-hotswap-patch-wmma-scale16.cpp

    • Detects Scale16 VOP3PX3 instructions.
    • Extracts and rewrites SCALE_SRC0 / SCALE_SRC1 fields.
    • Emits a VALU scale-reduction preamble from block-16 to block-32 scale granularity.
    • Rewrites 16x16 Scale16 to regular Scale.
    • Splits 32x16 FP4 Scale16 into two 16x16 regular Scale WMMAs.
    • Preserves relevant scale format, scale opsel, source modifier, and src2 behavior.
    • Handles VGPR and inline-immediate src2 forms used by the split path.
    • Bakes scale_src2 = 0x100 for trampoline-emitted regular Scale instructions.
    • Adds idempotency protection for repeated hotswap rewrites.
  • Register the new scratch patch in the COMGR build and hotswap patch table.

  • Update GFX1250 VGPR scratch allocation granularity from 8 to 16 where required by this path.

  • Add lit coverage for:

    • 16x16 Scale16 decomposition.
    • 32x16 Scale16 split into two 16x16 WMMAs.
    • src2 preservation for float inline immediates.
    • Negative case where regular Scale is left unchanged.
    • Idempotent rewrite behavior.

Test Plan

Validated after rebasing on latest origin/amd-staging.

  • cmake --build /jam/TheRock-build/compiler/amd-comgr/build -- amd_comgr
  • /jam/TheRock-build/compiler/amd-llvm/build/bin/llvm-lit -v /jam/TheRock-build/compiler/amd-comgr/build/test-lit/hotswap-wmma-scale16.s
    • 1/1 passed
  • /jam/TheRock-build/compiler/amd-llvm/build/bin/llvm-lit -sv /jam/TheRock-build/compiler/amd-comgr/build/test-lit --filter='hotswap'
    • 42/42 selected tests passed
  • cmake --build /jam/TheRock-build/compiler/amd-comgr-asan -- amd_comgr
  • ASAN hotswap lit filter
    • 42/42 selected tests passed
  • FFM Scale16 comparison harness
    • all exact variants matched bit-for-bit
    • the existing s16_32x16_f4_diff case remains informational

suryajasper and others added 6 commits May 29, 2026 05:57
…6 → block-32)

Implement Case 3 of the B0-to-A0 scratch-patch pipeline: WMMA Scale16
(VOP3PX3) to regular Scale (VOP3PX2) decomposition.

GFX1250 B0 uses v_wmma_scale16_f32_16x16x128_f8f6f4 with 64-bit scale
operands at block-16 granularity. A0 only supports the VOP3PX2 variant
with 32-bit scale operands at block-32 granularity. The patch:

1. Emits a VALU preamble that reduces block-16 scales to block-32 via
   byte-pair max (max-exponent strategy for E8M0 scales).
2. Rewrites the 16-byte instruction encoding from VOP3PX3 to VOP3PX2
   (changes LD_SCALE opcode byte and SCALE_SRC fields).
3. Routes the expanded sequence through a trampoline.
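
The reduction in step 1 can be modeled on the host. What follows is a minimal sketch, not the VALU sequence the pass emits. It assumes the 64-bit operand packs eight E8M0 scale bytes and that adjacent bytes (2i, 2i+1) are the pair that collapses into one block-32 scale; the pairing and the name `reduceScale16ToScale32` are illustrative assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical host-side model of the byte-pair max reduction.
// Input:  eight E8M0 scale bytes (block-16 granularity), packed LSB first.
// Output: four E8M0 scale bytes (block-32 granularity), packed LSB first.
// ASSUMPTION: bytes 2*I and 2*I+1 form the pair for output byte I.
uint32_t reduceScale16ToScale32(uint64_t Scale16) {
  uint32_t Scale32 = 0;
  for (unsigned I = 0; I < 4; ++I) {
    uint8_t Lo = static_cast<uint8_t>(Scale16 >> (16 * I));
    uint8_t Hi = static_cast<uint8_t>(Scale16 >> (16 * I + 8));
    // E8M0 is a bare biased exponent, so the unsigned byte max selects
    // the larger of the two scales.
    Scale32 |= static_cast<uint32_t>(std::max(Lo, Hi)) << (8 * I);
  }
  return Scale32;
}
```

Taking the maximum exponent keeps the larger scale of each pair, so the elements that used the smaller scale lose precision relative to the Scale16 result; that is one plausible reason only "exact" variants are expected to match bit-for-bit.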

Also handles v_wmma_scale16_f32_32x16x128_f4 (B0-only) with error
logging since it has no A0 counterpart.

Provides a standalone applyScratchPatches strong symbol override so
this patch can land independently of other scratch-patch PRs.

Additionally removes the overly-strict output-size equality check in
hotswap-rewrite test helper, since trampoline patches legitimately
grow the ELF.

Made-with: Cursor
… 2×16x16 A0)

The v_wmma_scale16_f32_32x16x128_f4 instruction is B0-only with no A0
counterpart. Decompose it into two v_wmma_scale_f32_16x16x128_f8f6f4
(VOP3PX2) instructions by splitting along the M dimension:

- Half 0: rows 0-15, SCALE_OPSEL[0]=0 (threads 0-15 for A scale)
- Half 1: rows 16-31, SCALE_OPSEL[0]=1 (threads 16-31 for A scale)

Both halves share a single scale reduction preamble (block-16 → block-32
via byte-pair max) and the B matrix operand. The trampoline emits:
  [scale reduction A + B] → [WMMA half 0] → [WMMA half 1] → [s_branch back]

Preserves matrix_b_scale (SCALE_OPSEL_HI[0]) and scale format
(matrix_a_scale_fmt, matrix_b_scale_fmt) from the original encoding.
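
The M-split bookkeeping can be sketched as a small per-half plan. This is only an illustration: `HalfPlan` and `planHalf` are hypothetical names, and the sketch assumes the 16-VGPR destination divides into two contiguous 8-VGPR slices, one per half.

```cpp
#include <cassert>
#include <string>

// Hypothetical description of one half of the 32x16 split.
struct HalfPlan {
  unsigned FirstRow;    // first M row covered by this half
  unsigned LastRow;     // last M row covered by this half
  unsigned ScaleOpsel0; // SCALE_OPSEL[0]: 0 for half 0, 1 for half 1
  std::string DstRange; // destination VGPR slice as assembly text
};

// ASSUMPTION: half H writes the 8-VGPR slice starting at DstBase + 8*H.
HalfPlan planHalf(unsigned Half, unsigned DstBase) {
  unsigned Lo = DstBase + 8 * Half;
  return {16 * Half, 16 * Half + 15, Half,
          "v[" + std::to_string(Lo) + ":" + std::to_string(Lo + 7) + "]"};
}
```

Both plans would reuse the same reduced scale registers and the same B operand, which is why a single preamble suffices for the two WMMAs.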

Made-with: Cursor
Gfx1250VgprGranuleSize was 8 (the GFX10/11 wave32 value). On GFX1250
wave32 the VGPR encoding granule is 16 per AMDGPUBaseInfo::
getVGPREncodingGranule with Feature1024AddressableVGPRs, so
ElfView::getKernelVgprCount and updateKernelDescriptor were
mis-decoding COMPUTE_PGM_RSRC1.GRANULATED_WORKITEM_VGPR_COUNT and
under-reporting the kernel's actual VGPR count by ~half.

Concrete effect on a kernel with next_free_vgpr=44 (clang encodes
granulated=2 with granule=16): getKernelVgprCount returned (2+1)*8 =
24 instead of (2+1)*16 = 48. ScratchAllocator then picked the next
free VGPR from v24, which overlaps the kernel's live matrix-A VGPRs
v[16:31], and any patch using ScratchAllocator's preamble would
clobber matrix data before the WMMA consumed it.

Today only the in-flight wmma_scale16 patch uses ScratchAllocator, so
this manifested as misexecution of v_wmma_scale16_f32_32x16x128_f4
under hotswap on FFM. Any future scratch-using patch would hit the
same trap.
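
The arithmetic behind the mis-decode can be reproduced in a few lines. This is a minimal sketch with hypothetical helper names; it only models the round-up-and-subtract-one encoding implied by the numbers above, not the actual ElfView code.

```cpp
#include <cassert>

// Hypothetical model of GRANULATED_WORKITEM_VGPR_COUNT encoding:
// round the VGPR count up to a whole number of granules, then subtract one.
constexpr unsigned encodeGranulatedVgprs(unsigned NextFreeVgpr,
                                         unsigned Granule) {
  return NextFreeVgpr == 0 ? 0 : (NextFreeVgpr + Granule - 1) / Granule - 1;
}

// Inverse direction: the allocated VGPR count implied by the encoded field.
constexpr unsigned decodeVgprCount(unsigned Granulated, unsigned Granule) {
  return (Granulated + 1) * Granule;
}

// next_free_vgpr=44 encodes to 2 with granule 16.
static_assert(encodeGranulatedVgprs(44, 16) == 2, "encode");
// Decoding with the stale granule of 8 lands inside the live range.
static_assert(decodeVgprCount(2, 8) == 24, "stale granule");
// Decoding with granule 16 lands past every live VGPR.
static_assert(decodeVgprCount(2, 16) == 48, "correct granule");
```

With the stale granule the first "free" register is v24, inside v[16:31]; with granule 16 it is v48.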
… propagation

The 32x16 M-split emitter built each half's assembly with the
accumulator (src2) hardcoded to v[HalfD:HalfD+7]. This matched the
LIT test's source instruction (which uses v[0:15] for both D and C)
and the byte-level rewriter test, but it did not match what HIP-
compiled kernels actually produce: clang folds an all-zero
accumulator (the common `v16f acc16 = {0,...,0}` initializer) to an
inline-immediate 0 for src2, so the trampoline's WMMA was reading
arbitrary stale bytes from D's VGPR range as the accumulator input
-- garbage output on every realistic kernel.

Symmetrically the per-source neg_lo / neg_hi modifiers were dropped.
A wmma_scale16 with c_mod=NEG sets neg_lo on src2, which the printer
formats as `neg_lo:[0,0,1]`. The 32x16 path stripped these bits when
re-assembling the halves, so the hotswap path computed `+C` instead
of `-C`.

Adds:
  - extractSrc2: 9-bit src2 field from VOP3PX bits [114:122].
  - formatSrc2: emits the source operand as either a sliced VGPR
    range v[c:c+7] / v[c+8:c+15] (for VGPR src2) or the inline
    literal verbatim on both halves (M-split has no accumulator
    carry between halves). Covers the integer (128..208) and float
    (240..247) inline-imm encodings.
  - extractNegFlags: reads neg_lo (Inst{125-127} = byte[15] bits
    [7:5]) and neg_hi (Inst{72-74} = byte[9] bits [2:0]) from the
    WMMA uop.
  - per-half emission of `neg_lo:[a,b,c]` / `neg_hi:[a,b,c]` when
    any bit in the corresponding triple is set.
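
The field positions listed above can be checked with a generic bit reader. This is a sketch under stated assumptions: the 16-byte encoding is little-endian, so Inst{N} lives in byte N/8 at bit N%8 (consistent with Inst{125-127} being byte[15] bits [7:5]), and every function name here is hypothetical rather than the helper the commit adds.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Encoding = std::array<uint8_t, 16>; // one 128-bit VOP3PX instruction

// Read Width bits starting at instruction bit Lo.
// ASSUMPTION: Inst{N} is bit (N % 8) of byte (N / 8).
unsigned extractBits(const Encoding &Bytes, unsigned Lo, unsigned Width) {
  unsigned Value = 0;
  for (unsigned I = 0; I < Width; ++I) {
    unsigned Bit = Lo + I;
    unsigned Set = (Bytes[Bit / 8] >> (Bit % 8)) & 1u;
    Value |= Set << I;
  }
  return Value;
}

// Hypothetical wrappers using the field positions from the commit message.
unsigned sketchSrc2(const Encoding &B) { return extractBits(B, 114, 9); }
unsigned sketchNegLo(const Encoding &B) { return extractBits(B, 125, 3); }
unsigned sketchNegHi(const Encoding &B) { return extractBits(B, 72, 3); }
```

For example, an encoding whose byte[15] is 0x84 and byte[9] is 0x05 yields src2 = 0x100, neg_lo = 0b100, and neg_hi = 0b101 under this layout.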

LIT (hotswap-wmma-scale16.s) still uses the v[0:15]/v[0:15] D/C
layout and so didn't exercise these paths -- the regression was
only visible end-to-end on hipRTC-compiled FFM workloads. See the
new HIP-based wmma_scale16_test.{py,hip} driver in the FFM-test
directory for the variant matrix that catches both regressions.
…ncy guard

Two related cleanups that keep wmma_scale16-emitted trampolines
bit-identical across repeated rewrite invocations and free of false
SALU hazards on first execution.

1. Bake scale_src2 = 0x100 (VGPR0) into the LD_SCALE prefix of every
   trampoline-emitted v_wmma_scale_*. The applyVop3px2Src2Fix
   in-place pass already sets this field on user-emitted forms it
   finds in Decoded[], but trampoline bodies are not in Decoded[] on
   the first rewrite; on a second rewrite the trampolines have been
   appended to .text and the fix fires, producing different bytes
   than pass 1 and breaking idempotency. This is the same trick that
   PR #2's VOP3PX2 wrap pass uses in its LdScalePrefix bytes. It is
   applied symmetrically to both the 16x16 byte-level rewriter
   (rewriteScale16ToScale) and the 32x16 assembleSingleInst path
   (HalfBytes post-process).

2. Replace the dead `Decoded[Idx-1] == s_branch` idempotency check
   with the canonical OutTrampolines membership check (mirrors PR #2's
   wrap pass). The previous heuristic never fired meaningfully --
   Decoded[] is built from the original .text and the dispatcher's
   mnemonic narrowing already filters out sites the patch has
   rewritten on a re-rewrite. The new guard correctly catches the
   case where another patch class has claimed the same offset.
   Applied to both patchWmmaScale16_16x16 and patchWmmaScale16_32x16.

Lit (hotswap-wmma-scale16.s): the `cmp %t.out.elf %t.out2.elf`
idempotency check at the end of the test now passes; before this
commit the second rewrite differed at the scale_src2 bytes of every
trampoline-emitted half.
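
The membership-based guard from item 2 can be pictured as a set of claimed patch offsets. This is a minimal sketch only: the real OutTrampolines container and its key type are not shown in this PR description, so `ClaimedSites` and `tryClaimSite` are hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical stand-in for the trampoline bookkeeping: the set of .text
// offsets that some patch has already redirected to a trampoline.
using ClaimedSites = std::set<uint64_t>;

// Returns true if this call claimed the site, false if the offset was
// already claimed (by this patch class or another) and must be skipped.
bool tryClaimSite(ClaimedSites &Claimed, uint64_t Offset) {
  return Claimed.insert(Offset).second;
}
```

Unlike inspecting the preceding decoded instruction, a membership check does not depend on what the original .text looked like, so it behaves the same on the first and on any later rewrite.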
When splitting the 32x16 Scale16 form into two 16x16 Scale forms, formatSrc2 reprints inline float accumulators from the encoded SRC2 value. The float-immediate table was ordered as positives followed by negatives, but AMDGPU inline float encodings interleave sign by magnitude.

Correct encodings 240..247 to 0.5, -0.5, 1.0, -1.0, 2.0, -2.0, 4.0, -4.0, and add a lit check that src2=-0.5 is preserved in both split halves.
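
The corrected table can be sketched as a small decoder for the inline-constant ranges mentioned in this PR. `sketchInlineImm` is a hypothetical name, and the output spelling (for example "1.0") is an assumption rather than the exact text the pass feeds to the assembler.

```cpp
#include <cassert>
#include <string>

// Hypothetical decoder for AMDGPU inline-constant source encodings.
// Returns the assembly text, or an empty string for anything else.
std::string sketchInlineImm(unsigned Src) {
  if (Src >= 128 && Src <= 192) // 128 is 0, 129..192 are 1..64
    return std::to_string(Src - 128);
  if (Src >= 193 && Src <= 208) // 193..208 are -1..-16
    return "-" + std::to_string(Src - 192);
  // Floats interleave sign by magnitude: +x then -x for each magnitude.
  static const char *const Floats[8] = {"0.5", "-0.5", "1.0", "-1.0",
                                        "2.0", "-2.0", "4.0", "-4.0"};
  if (Src >= 240 && Src <= 247)
    return Floats[Src - 240];
  return "";
}
```

With a positives-then-negatives table, encoding 241 would have printed as 1.0 instead of -0.5, which is the mismatch the new lit check guards against.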