[AMDGPU] comgr: Scratch Patches for B0-to-A0 Hotswap 2/2 -- Scaled WMMA by jammm · Pull Request #2731 · ROCm/llvm-project

jammm · 2026-05-29T17:12:05Z

Re-create PR based on #2365

Summary

Add a COMGR hotswap scratch-patch pass for GFX1250 WMMA Scale16 instructions.

This decomposes v_wmma_scale16_f32_* encodings into A0-supported v_wmma_scale_f32_16x16x128_f8f6f4 sequences. The pass reduces block-16 scale operands to block-32 scale operands using a VALU preamble, rewrites the WMMA encoding, and routes expanded instruction sequences through a trampoline.

The pass now supports both Scale16 forms handled by this PR:

v_wmma_scale16_f32_16x16x128_f8f6f4
- rewritten into one regular Scale WMMA
v_wmma_scale16_f32_32x16x128_f4
- split into two 16x16 regular Scale WMMAs along the M dimension

Changes

Add comgr-hotswap-patch-wmma-scale16.cpp
- Detects Scale16 VOP3PX3 instructions.
- Extracts and rewrites SCALE_SRC0 / SCALE_SRC1 fields.
- Emits a VALU scale-reduction preamble from block-16 to block-32 scale granularity.
- Rewrites 16x16 Scale16 to regular Scale.
- Splits 32x16 FP4 Scale16 into two 16x16 regular Scale WMMAs.
- Preserves relevant scale format, scale opsel, source modifier, and src2 behavior.
- Handles VGPR and inline-immediate src2 forms used by the split path.
- Bakes scale_src2 = 0x100 for trampoline-emitted regular Scale instructions.
- Adds idempotency protection for repeated hotswap rewrites.
Register the new scratch patch in the COMGR build and hotswap patch table.
Update GFX1250 VGPR scratch allocation granularity from 8 to 16 where required by this path.
Add lit coverage for:
- 16x16 Scale16 decomposition.
- 32x16 Scale16 split into two 16x16 WMMAs.
- src2 preservation for float inline immediates.
- Negative case where regular Scale is left unchanged.
- Idempotent rewrite behavior.

Test Plan

Validated after rebasing on latest origin/amd-staging.

cmake --build /jam/TheRock-build/compiler/amd-comgr/build -- amd_comgr
/jam/TheRock-build/compiler/amd-llvm/build/bin/llvm-lit -v /jam/TheRock-build/compiler/amd-comgr/build/test-lit/hotswap-wmma-scale16.s
- 1/1 passed
/jam/TheRock-build/compiler/amd-llvm/build/bin/llvm-lit -sv /jam/TheRock-build/compiler/amd-comgr/build/test-lit --filter='hotswap'
- 42/42 selected tests passed
cmake --build /jam/TheRock-build/compiler/amd-comgr-asan -- amd_comgr
ASAN hotswap lit filter
- 42/42 selected tests passed
FFM Scale16 comparison harness
- all exact variants matched bit-for-bit
- the existing s16_32x16_f4_diff case remains informational

…6 → block-32)

Implement Case 3 of the B0-to-A0 scratch-patch pipeline: WMMA Scale16 (VOP3PX3) to regular Scale (VOP3PX2) decomposition.

GFX1250 B0 uses v_wmma_scale16_f32_16x16x128_f8f6f4 with 64-bit scale operands at block-16 granularity. A0 only supports the VOP3PX2 variant with 32-bit scale operands at block-32 granularity.

The patch:
1. Emits a VALU preamble that reduces block-16 scales to block-32 via byte-pair max (max-exponent strategy for E8M0 scales).
2. Rewrites the 16-byte instruction encoding from VOP3PX3 to VOP3PX2 (changes LD_SCALE opcode byte and SCALE_SRC fields).
3. Routes the expanded sequence through a trampoline.

Also handles v_wmma_scale16_f32_32x16x128_f4 (B0-only) with error logging since it has no A0 counterpart.

Provides a standalone applyScratchPatches strong symbol override so this patch can land independently of other scratch-patch PRs.

Additionally removes the overly-strict output-size equality check in hotswap-rewrite test helper, since trampoline patches legitimately grow the ELF.

Made-with: Cursor
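The byte-pair max reduction named in step 1 can be modelled on the host as below. This is a minimal scalar sketch, not the VALU sequence the pass emits: the helper name reduceScale16To32 is hypothetical, and it assumes that the eight block-16 scale bytes are packed so that bytes 2i and 2i+1 cover adjacent K blocks, which the commit message does not state.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper (not from the PR): collapse eight block-16 E8M0 scale
// bytes into four block-32 scale bytes by keeping the larger exponent of each
// adjacent byte pair.
// Assumption: bytes 2*I and 2*I+1 of Scale16 belong to neighbouring K blocks.
constexpr uint32_t reduceScale16To32(uint64_t Scale16) {
  uint32_t Scale32 = 0;
  for (unsigned I = 0; I < 4; ++I) {
    const uint8_t Lo = static_cast<uint8_t>(Scale16 >> (16 * I));
    const uint8_t Hi = static_cast<uint8_t>(Scale16 >> (16 * I + 8));
    // E8M0 is a bare biased exponent, so an unsigned max picks the larger scale.
    Scale32 |= static_cast<uint32_t>(std::max(Lo, Hi)) << (8 * I);
  }
  return Scale32;
}
```

Under these assumptions the input 0x0102030405060708 reduces to 0x02040608: each output byte is the larger of the two input bytes it replaces.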

… 2×16x16 A0)

The v_wmma_scale16_f32_32x16x128_f4 instruction is B0-only with no A0 counterpart. Decompose it into two v_wmma_scale_f32_16x16x128_f8f6f4 (VOP3PX2) instructions by splitting along the M dimension:
- Half 0: rows 0-15, SCALE_OPSEL[0]=0 (threads 0-15 for A scale)
- Half 1: rows 16-31, SCALE_OPSEL[0]=1 (threads 16-31 for A scale)

Both halves share a single scale reduction preamble (block-16 → block-32 via byte-pair max) and the B matrix operand. The trampoline emits:

[scale reduction A + B] → [WMMA half 0] → [WMMA half 1] → [s_branch back]

Preserves matrix_b_scale (SCALE_OPSEL_HI[0]) and scale format (matrix_a_scale_fmt, matrix_b_scale_fmt) from the original encoding.

Made-with: Cursor

Gfx1250VgprGranuleSize was 8 (the GFX10/11 wave32 value). On GFX1250 wave32 the VGPR encoding granule is 16 per AMDGPUBaseInfo::getVGPREncodingGranule with Feature1024AddressableVGPRs, so ElfView::getKernelVgprCount and updateKernelDescriptor were mis-decoding COMPUTE_PGM_RSRC1.GRANULATED_WORKITEM_VGPR_COUNT and under-reporting the kernel's actual VGPR count by ~half.

Concrete effect on a kernel with next_free_vgpr=44 (clang encodes granulated=2 with granule=16): getKernelVgprCount returned (2+1)*8 = 24 instead of (2+1)*16 = 48. ScratchAllocator then picked the next free VGPR from v24, which overlaps the kernel's live matrix-A VGPRs v[16:31], and any patch using ScratchAllocator's preamble would clobber matrix data before the WMMA consumed it.

Today only the in-flight wmma_scale16 patch uses ScratchAllocator, so this manifested as misexecution of v_wmma_scale16_f32_32x16x128_f4 under hotswap on FFM. Any future scratch-using patch would hit the same trap.
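The arithmetic in the next_free_vgpr=44 example can be reproduced with the small sketch below. Both function names are hypothetical stand-ins rather than the COMGR or LLVM helpers, and the encode side assumes the usual "round up to a whole granule, then subtract one" rule.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the compiler's encoding step: round the VGPR
// count up to a whole number of granules and store (granules - 1).
inline uint32_t encodeGranulatedVgprCount(uint32_t NextFreeVgpr,
                                          uint32_t Granule) {
  assert(Granule != 0 && "granule size must be non-zero");
  const uint32_t Count = NextFreeVgpr == 0 ? 1 : NextFreeVgpr;
  return (Count + Granule - 1) / Granule - 1;
}

// Hypothetical stand-in for the decode step: the field stores
// (granules - 1), so the VGPR count is (field + 1) * granule.
inline uint32_t decodeVgprCount(uint32_t Granulated, uint32_t Granule) {
  assert(Granule != 0 && "granule size must be non-zero");
  return (Granulated + 1) * Granule;
}
```

With granule 16, next_free_vgpr=44 encodes to 2. Decoding that field with granule 8 yields 24, which falls inside the live v[16:31] range, while decoding with granule 16 yields 48, which is past every VGPR the kernel uses.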

… propagation

The 32x16 M-split emitter built each half's assembly with the accumulator (src2) hardcoded to v[HalfD:HalfD+7]. This matched the LIT test's source instruction (which uses v[0:15] for both D and C) and the byte-level rewriter test, but it did not match what HIP-compiled kernels actually produce: clang folds an all-zero accumulator (the common `v16f acc16 = {0,...,0}` initializer) to an inline-immediate 0 for src2, so the trampoline's WMMA was reading arbitrary stale bytes from D's VGPR range as the accumulator input -- garbage output on every realistic kernel.

Symmetrically the per-source neg_lo / neg_hi modifiers were dropped. A wmma_scale16 with c_mod=NEG sets neg_lo on src2, which the printer formats as `neg_lo:[0,0,1]`. The 32x16 path stripped these bits when re-assembling the halves, so the hotswap path computed `+C` instead of `-C`.

Adds:
- extractSrc2: 9-bit src2 field from VOP3PX bytes [114:122].
- formatSrc2: emits the source operand as either a sliced VGPR range v[c:c+7] / v[c+8:c+15] (for VGPR src2) or the inline literal verbatim on both halves (M-split has no accumulator carry between halves). Covers the integer (128..208) and float (240..247) inline-imm encodings.
- extractNegFlags: reads neg_lo (Inst{125-127} = byte[15] bits [7:5]) and neg_hi (Inst{72-74} = byte[9] bits [2:0]) from the WMMA uop.
- per-half emission of `neg_lo:[a,b,c]` / `neg_hi:[a,b,c]` when any bit in the corresponding triple is set.

LIT (hotswap-wmma-scale16.s) still uses the v[0:15]/v[0:15] D/C layout and so didn't exercise these paths -- the regression was only visible end-to-end on hipRTC-compiled FFM workloads. See the new HIP-based wmma_scale16_test.{py,hip} driver in the FFM-test directory for the variant matrix that catches both regressions.
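The bit positions quoted for src2, neg_lo and neg_hi can be read out of the 16-byte encoding as sketched below. This is an illustration under assumptions, not the PR's extractSrc2 / extractNegFlags: the generic extractBits helper and the Encoding alias are hypothetical, and the layout assumes instruction bit N lives in byte N/8 at bit N%8, which agrees with the byte[15] and byte[9] mappings given in the commit message.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical alias for one 16-byte VOP3PX encoding.
using Encoding = std::array<uint8_t, 16>;

// Hypothetical helper: read Width bits starting at instruction bit Lsb.
// Assumption: instruction bit N is stored in byte N / 8 at bit N % 8.
inline uint32_t extractBits(const Encoding &Bytes, unsigned Lsb,
                            unsigned Width) {
  assert(Width <= 32 && Lsb + Width <= 128);
  uint32_t Value = 0;
  for (unsigned I = 0; I < Width; ++I) {
    const unsigned Bit = Lsb + I;
    Value |= static_cast<uint32_t>((Bytes[Bit / 8] >> (Bit % 8)) & 1u) << I;
  }
  return Value;
}

// Field positions as quoted in the commit message.
inline uint32_t src2Field(const Encoding &B) { return extractBits(B, 114, 9); }
inline uint32_t negLoField(const Encoding &B) { return extractBits(B, 125, 3); }
inline uint32_t negHiField(const Encoding &B) { return extractBits(B, 72, 3); }
```

In this model the src2 field straddles bytes 14 and 15, so a byte-aligned read would miss its top bits. Bit 2 of the neg_lo triple corresponds to the third element of the printed `neg_lo:[a,b,c]` list, assuming the printer lists sources in src0, src1, src2 order.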

…ncy guard

Two related cleanups that keep wmma_scale16-emitted trampolines bit-identical across repeated rewrite invocations and free of false SALU hazards on first execution.

1. Bake scale_src2 = 0x100 (VGPR0) into the LD_SCALE prefix of every trampoline-emitted v_wmma_scale_*. The applyVop3px2Src2Fix in-place pass already sets this field on user-emitted forms it finds in Decoded[], but trampoline bodies are not in Decoded[] on the first rewrite; on a second rewrite the trampolines have been appended to .text and the fix fires, producing different bytes than pass 1 and breaking idempotency. Same trick PR #2's VOP3PX2 wrap pass uses in its LdScalePrefix bytes. Applied symmetrically to both the 16x16 byte-level rewriter (rewriteScale16ToScale) and the 32x16 assembleSingleInst path (HalfBytes post-process).
2. Replace the dead `Decoded[Idx-1] == s_branch` idempotency check with the canonical OutTrampolines membership check (mirrors PR #2's wrap pass). The previous heuristic never fired meaningfully -- Decoded[] is built from the original .text and the dispatcher's mnemonic narrowing already filters out sites the patch has rewritten on a re-rewrite. The new guard correctly catches the case where another patch class has claimed the same offset. Applied to both patchWmmaScale16_16x16 and patchWmmaScale16_32x16.

Lit (hotswap-wmma-scale16.s): the `cmp %t.out.elf %t.out2.elf` idempotency check at the end of the test now passes; before this commit the second rewrite differed at the scale_src2 bytes of every trampoline-emitted half.

When splitting the 32x16 Scale16 form into two 16x16 Scale forms, formatSrc2 reprints inline float accumulators from the encoded SRC2 value. The float-immediate table was ordered as positives followed by negatives, but AMDGPU inline float encodings interleave sign by magnitude. Correct encodings 240..247 to 0.5, -0.5, 1.0, -1.0, 2.0, -2.0, 4.0, -4.0, and add a lit check that src2=-0.5 is preserved in both split halves.
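The src2 re-printing rule, including the corrected float table, can be sketched as follows. This is a hypothetical re-creation for illustration, not the PR's formatSrc2: the function name, the empty-string fallback and the exact text returned are assumptions. The encoding ranges follow the commit messages (VGPR0 is 0x100; integers 128..208; floats 240..247), with the integer range split as 128..192 for 0..64 and 193..208 for -1..-16.

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch: print the src2 operand for one half of the M-split.
// Half is 0 (rows 0-15) or 1 (rows 16-31).
inline std::string formatSrc2Sketch(uint32_t Src2, unsigned Half) {
  // Inline float encodings 240..247 interleave sign by magnitude.
  static const char *const Floats[8] = {"0.5", "-0.5", "1.0", "-1.0",
                                        "2.0", "-2.0", "4.0", "-4.0"};
  if (Src2 >= 256) {
    // VGPR operand: each half reads its own 8-register slice.
    const unsigned Base = (Src2 - 256) + 8 * Half;
    return "v[" + std::to_string(Base) + ":" + std::to_string(Base + 7) + "]";
  }
  // Inline immediates are repeated verbatim on both halves.
  if (Src2 >= 128 && Src2 <= 192)
    return std::to_string(Src2 - 128); // 0..64
  if (Src2 >= 193 && Src2 <= 208)
    return "-" + std::to_string(Src2 - 192); // -1..-16
  if (Src2 >= 240 && Src2 <= 247)
    return Floats[Src2 - 240];
  return ""; // Assumption: encodings outside these ranges are not handled here.
}
```

The key point of the fix is visible in the table order: encoding 241 maps to -0.5, not to 1.0 as a positives-then-negatives ordering would suggest, and the same text is produced for both halves.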

suryajasper and others added 6 commits May 29, 2026 05:57

jammm mentioned this pull request May 29, 2026

[AMDGPU] comgr: Scratch Patches for B0-to-A0 Hotswap 2/2 -- Scaled WMMA #2365

jammm wants to merge 6 commits into amd-staging from users/jam/comgr-wmma-scale16-staging
