[AMDGPU] comgr: Scratch Patches for B0-to-A0 Hotswap 2/2 -- Scaled WMMA#2731
Draft
jammm wants to merge 6 commits into
…6 → block-32)

Implement Case 3 of the B0-to-A0 scratch-patch pipeline: WMMA Scale16 (VOP3PX3) to regular Scale (VOP3PX2) decomposition.

GFX1250 B0 uses v_wmma_scale16_f32_16x16x128_f8f6f4 with 64-bit scale operands at block-16 granularity. A0 only supports the VOP3PX2 variant with 32-bit scale operands at block-32 granularity. The patch:

1. Emits a VALU preamble that reduces block-16 scales to block-32 via byte-pair max (max-exponent strategy for E8M0 scales).
2. Rewrites the 16-byte instruction encoding from VOP3PX3 to VOP3PX2 (changes LD_SCALE opcode byte and SCALE_SRC fields).
3. Routes the expanded sequence through a trampoline.

Also handles v_wmma_scale16_f32_32x16x128_f4 (B0-only) with error logging since it has no A0 counterpart.

Provides a standalone applyScratchPatches strong symbol override so this patch can land independently of other scratch-patch PRs.

Additionally removes the overly-strict output-size equality check in the hotswap-rewrite test helper, since trampoline patches legitimately grow the ELF.

Made-with: Cursor
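The byte-pair max in step 1 can be modeled on the host as below. This is a minimal sketch, not the patch's VALU preamble: `reduceBlock16ToBlock32` is a hypothetical name, and the pairing of adjacent bytes (2i, 2i+1) into one block-32 scale is an assumption about the operand layout. Because E8M0 is an unsigned biased exponent, the larger byte is the larger scale, so an unsigned max implements the max-exponent strategy.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical host-side model of the scale reduction (not the patch's code).
// Input:  eight E8M0 block-16 scale bytes packed into a 64-bit operand.
// Output: four E8M0 block-32 scale bytes packed into a 32-bit operand.
// Assumption: bytes 2*I and 2*I+1 of the input cover the same block-32 group.
uint32_t reduceBlock16ToBlock32(uint64_t Scale16) {
  uint32_t Scale32 = 0;
  for (unsigned I = 0; I < 4; ++I) {
    uint8_t Lo = static_cast<uint8_t>(Scale16 >> (16 * I));
    uint8_t Hi = static_cast<uint8_t>(Scale16 >> (16 * I + 8));
    // E8M0 is an unsigned exponent, so unsigned max picks the larger scale.
    Scale32 |= static_cast<uint32_t>(std::max(Lo, Hi)) << (8 * I);
  }
  return Scale32;
}
```

Taking the max rather than the min or mean keeps the merged block from overflowing the shared exponent, at the cost of precision for the half that had the smaller scale.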
… 2×16x16 A0)

The v_wmma_scale16_f32_32x16x128_f4 instruction is B0-only with no A0 counterpart. Decompose it into two v_wmma_scale_f32_16x16x128_f8f6f4 (VOP3PX2) instructions by splitting along the M dimension:

- Half 0: rows 0-15, SCALE_OPSEL[0]=0 (threads 0-15 for A scale)
- Half 1: rows 16-31, SCALE_OPSEL[0]=1 (threads 16-31 for A scale)

Both halves share a single scale reduction preamble (block-16 → block-32 via byte-pair max) and the B matrix operand. The trampoline emits:

[scale reduction A + B] → [WMMA half 0] → [WMMA half 1] → [s_branch back]

Preserves matrix_b_scale (SCALE_OPSEL_HI[0]) and scale format (matrix_a_scale_fmt, matrix_b_scale_fmt) from the original encoding.

Made-with: Cursor
Gfx1250VgprGranuleSize was 8 (the GFX10/11 wave32 value). On GFX1250 wave32 the VGPR encoding granule is 16 per AMDGPUBaseInfo::getVGPREncodingGranule with Feature1024AddressableVGPRs, so ElfView::getKernelVgprCount and updateKernelDescriptor were mis-decoding COMPUTE_PGM_RSRC1.GRANULATED_WORKITEM_VGPR_COUNT and under-reporting the kernel's actual VGPR count by roughly half.

Concrete effect on a kernel with next_free_vgpr=44 (clang encodes granulated=2 with granule=16): getKernelVgprCount returned (2+1)*8 = 24 instead of (2+1)*16 = 48. ScratchAllocator then picked the next free VGPR from v24, which overlaps the kernel's live matrix-A VGPRs v[16:31], and any patch using ScratchAllocator's preamble would clobber matrix data before the WMMA consumed it.

Today only the in-flight wmma_scale16 patch uses ScratchAllocator, so this manifested as misexecution of v_wmma_scale16_f32_32x16x128_f4 under hotswap on FFM. Any future scratch-using patch would hit the same trap.
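The arithmetic in that example can be checked with a few lines. This is only a sketch: `encodeGranulated` and `decodeVgprCount` are hypothetical helpers that mirror the (granulated+1)*granule formula quoted above, not functions from COMGR or LLVM.

```cpp
// Hypothetical helpers illustrating the granule arithmetic (not COMGR code).

// The descriptor field stores (number of granule-sized blocks) - 1.
constexpr unsigned encodeGranulated(unsigned NextFreeVgpr, unsigned Granule) {
  return ((NextFreeVgpr == 0 ? 1 : NextFreeVgpr) + Granule - 1) / Granule - 1;
}

// Decoding multiplies back by the granule the encoder used.
constexpr unsigned decodeVgprCount(unsigned Granulated, unsigned Granule) {
  return (Granulated + 1) * Granule;
}

// next_free_vgpr=44 with granule 16 encodes as granulated=2.
static_assert(encodeGranulated(44, 16) == 2, "encode with granule 16");
// Decoding with the stale granule of 8 under-reports the count ...
static_assert(decodeVgprCount(2, 8) == 24, "stale granule gives 24");
// ... while the matching granule of 16 covers all 44 live VGPRs.
static_assert(decodeVgprCount(2, 16) == 48, "correct granule gives 48");
```

With the stale value, the first "free" register is v24, which is inside the v[16:31] range the example kernel still uses.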
… propagation
The 32x16 M-split emitter built each half's assembly with the
accumulator (src2) hardcoded to v[HalfD:HalfD+7]. This matched the
LIT test's source instruction (which uses v[0:15] for both D and C)
and the byte-level rewriter test, but it did not match what HIP-
compiled kernels actually produce: clang folds an all-zero
accumulator (the common `v16f acc16 = {0,...,0}` initializer) to an
inline-immediate 0 for src2, so the trampoline's WMMA was reading
arbitrary stale bytes from D's VGPR range as the accumulator input
-- garbage output on every realistic kernel.
Symmetrically the per-source neg_lo / neg_hi modifiers were dropped.
A wmma_scale16 with c_mod=NEG sets neg_lo on src2, which the printer
formats as `neg_lo:[0,0,1]`. The 32x16 path stripped these bits when
re-assembling the halves, so the hotswap path computed `+C` instead
of `-C`.
Adds:
- extractSrc2: 9-bit src2 field from bits [114:122] of the VOP3PX
  encoding.
- formatSrc2: emits the source operand as either a sliced VGPR
  range v[c:c+7] / v[c+8:c+15] (for VGPR src2) or the inline
  literal verbatim on both halves (M-split has no accumulator
  carry between halves). Covers the integer (128..208) and float
  (240..247) inline-imm encodings.
- extractNegFlags: reads neg_lo (Inst{125-127} = byte[15] bits
  [7:5]) and neg_hi (Inst{72-74} = byte[9] bits [2:0]) from the
  WMMA uop.
- per-half emission of `neg_lo:[a,b,c]` / `neg_hi:[a,b,c]` when
  any bit in the corresponding triple is set.
LIT (hotswap-wmma-scale16.s) still uses the v[0:15]/v[0:15] D/C
layout and so didn't exercise these paths -- the regression was
only visible end-to-end on hipRTC-compiled FFM workloads. See the
new HIP-based wmma_scale16_test.{py,hip} driver in the FFM-test
directory for the variant matrix that catches both regressions.
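The field positions listed under "Adds:" can be illustrated with a small bit-extraction sketch. The helper names below (`extractBits`, `extractSrc2Sketch`, `extractNegLoSketch`, `extractNegHiSketch`) are hypothetical and are not the patch's functions; the only layout assumed is the one the commit message implies, where Inst{n} is bit n%8 of byte n/8 in the 16-byte encoding.

```cpp
#include <array>
#include <cstdint>

using Vop3pxBytes = std::array<uint8_t, 16>;

// Read Width bits starting at instruction bit Lo. Bit n is taken from
// byte n/8 at position n%8 (e.g. Inst{125-127} = byte[15] bits [7:5]).
unsigned extractBits(const Vop3pxBytes &B, unsigned Lo, unsigned Width) {
  unsigned V = 0;
  for (unsigned I = 0; I < Width; ++I) {
    unsigned Bit = Lo + I;
    V |= ((B[Bit / 8] >> (Bit % 8)) & 1u) << I;
  }
  return V;
}

// 9-bit src2 operand at Inst{114-122}.
unsigned extractSrc2Sketch(const Vop3pxBytes &B) {
  return extractBits(B, 114, 9);
}

// neg_lo triple at Inst{125-127}; bit 2 of the result is the src2 flag.
unsigned extractNegLoSketch(const Vop3pxBytes &B) {
  return extractBits(B, 125, 3);
}

// neg_hi triple at Inst{72-74}.
unsigned extractNegHiSketch(const Vop3pxBytes &B) {
  return extractBits(B, 72, 3);
}
```

For instance, an encoding whose byte[15] is 0x84 and whose other bytes are zero decodes to src2 = 0x100 (VGPR0) with only the src2 bit of neg_lo set, which corresponds to the `neg_lo:[0,0,1]` form mentioned above.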
…ncy guard

Two related cleanups that keep wmma_scale16-emitted trampolines bit-identical across repeated rewrite invocations and free of false SALU hazards on first execution.

1. Bake scale_src2 = 0x100 (VGPR0) into the LD_SCALE prefix of every trampoline-emitted v_wmma_scale_*. The applyVop3px2Src2Fix in-place pass already sets this field on user-emitted forms it finds in Decoded[], but trampoline bodies are not in Decoded[] on the first rewrite; on a second rewrite the trampolines have been appended to .text and the fix fires, producing different bytes than pass 1 and breaking idempotency. This is the same trick PR #2's VOP3PX2 wrap pass uses in its LdScalePrefix bytes. Applied symmetrically to both the 16x16 byte-level rewriter (rewriteScale16ToScale) and the 32x16 assembleSingleInst path (HalfBytes post-process).

2. Replace the dead `Decoded[Idx-1] == s_branch` idempotency check with the canonical OutTrampolines membership check (mirrors PR #2's wrap pass). The previous heuristic never fired meaningfully -- Decoded[] is built from the original .text and the dispatcher's mnemonic narrowing already filters out sites the patch has rewritten on a re-rewrite. The new guard correctly catches the case where another patch class has claimed the same offset. Applied to both patchWmmaScale16_16x16 and patchWmmaScale16_32x16.

Lit (hotswap-wmma-scale16.s): the `cmp %t.out.elf %t.out2.elf` idempotency check at the end of the test now passes; before this commit the second rewrite differed at the scale_src2 bytes of every trampoline-emitted half.
When splitting the 32x16 Scale16 form into two 16x16 Scale forms, formatSrc2 reprints inline float accumulators from the encoded SRC2 value. The float-immediate table was ordered as positives followed by negatives, but AMDGPU inline float encodings interleave sign by magnitude. Correct encodings 240..247 to 0.5, -0.5, 1.0, -1.0, 2.0, -2.0, 4.0, -4.0, and add a lit check that src2=-0.5 is preserved in both split halves.
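A minimal sketch of the operand printing described here is shown below. `formatSrc2Sketch` is a hypothetical stand-in for the patch's formatSrc2 and may differ from it in detail. The float table follows the order stated in this commit; the integer mapping (128 is 0, 129..192 are 1..64, 193..208 are -1..-16) and the VGPR range (256..511 are v0..v255) are the standard AMDGPU 9-bit source-operand encodings.

```cpp
#include <string>

// Hypothetical sketch (not the patch's formatSrc2): print a 9-bit src2
// encoding for one half (Half is 0 or 1) of the M-split.
std::string formatSrc2Sketch(unsigned Src2, unsigned Half) {
  // Inline floats interleave sign by magnitude: 240 -> 0.5, 241 -> -0.5, ...
  static const char *const FloatImm[8] = {"0.5", "-0.5", "1.0", "-1.0",
                                          "2.0", "-2.0", "4.0", "-4.0"};
  if (Src2 >= 256 && Src2 <= 511) {
    // VGPR accumulator: each half reads its own 8-register slice.
    unsigned C = (Src2 - 256) + 8 * Half;
    return "v[" + std::to_string(C) + ":" + std::to_string(C + 7) + "]";
  }
  // Inline immediates are repeated verbatim on both halves.
  if (Src2 >= 128 && Src2 <= 192)
    return std::to_string(Src2 - 128); // 0..64
  if (Src2 >= 193 && Src2 <= 208)
    return "-" + std::to_string(Src2 - 192); // -1..-16
  if (Src2 >= 240 && Src2 <= 247)
    return FloatImm[Src2 - 240];
  return ""; // Encoding not covered by this sketch.
}
```

With the pre-fix table order (positives followed by negatives), encoding 241 would have printed as 1.0 instead of -0.5, which is the case the new lit check targets.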
7 tasks
Re-create PR based on #2365
Summary
Add a COMGR hotswap scratch-patch pass for GFX1250 WMMA Scale16 instructions.
This decomposes `v_wmma_scale16_f32_*` encodings into A0-supported `v_wmma_scale_f32_16x16x128_f8f6f4` sequences. The pass reduces block-16 scale operands to block-32 scale operands using a VALU preamble, rewrites the WMMA encoding, and routes expanded instruction sequences through a trampoline.
The pass now supports both Scale16 forms handled by this PR:
- `v_wmma_scale16_f32_16x16x128_f8f6f4`
- `v_wmma_scale16_f32_32x16x128_f4`

Changes
- Add `comgr-hotswap-patch-wmma-scale16.cpp`
  - … `SCALE_SRC0`/`SCALE_SRC1` fields.
  - … `src2` behavior.
  - … `src2` forms used by the split path.
  - … `scale_src2 = 0x100` for trampoline-emitted regular Scale instructions.
- Register the new scratch patch in the COMGR build and hotswap patch table.
- Update GFX1250 VGPR scratch allocation granularity from 8 to 16 where required by this path.
- Add lit coverage for:
  - … `src2` preservation for float inline immediates.

Test Plan
Validated after rebasing on latest `origin/amd-staging`.
- `cmake --build /jam/TheRock-build/compiler/amd-comgr/build -- amd_comgr`
- `/jam/TheRock-build/compiler/amd-llvm/build/bin/llvm-lit -v /jam/TheRock-build/compiler/amd-comgr/build/test-lit/hotswap-wmma-scale16.s`
  - `1/1` passed
- `/jam/TheRock-build/compiler/amd-llvm/build/bin/llvm-lit -sv /jam/TheRock-build/compiler/amd-comgr/build/test-lit --filter='hotswap'`
  - `42/42` selected tests passed
- `cmake --build /jam/TheRock-build/compiler/amd-comgr-asan -- amd_comgr`
  - `42/42` selected tests passed
- … `s16_32x16_f4_diff` case remains informational