[AMDGPU] Cost packed-FP32 <2 x float> VGPR-pair setup on gfx9 #2730

Open
michaelselehov wants to merge 1 commit into amd-staging from amd/dev/mselehov/lcompiler-2268-amdgpu-packed-fp32-tti-cost

@michaelselehov

A <2 x float> feeding a v_pk_*_f32 op must occupy an aligned VGPR pair, so charge ~1 v_mov_b32 per lane in getVectorInstrCost(InsertElement) and getShuffleCost. This stops the SLP vectorizer over-vectorizing FP32 into packed form, which was widening VGPR live ranges and cutting occupancy.

Gated to f32 (at 32-bit width the only packed VOP3P ALU ops are v_pk_{add,mul,fma}_f32 - there is no packed 32-bit integer op) on the gfx9 generation; gfx12 packed-FP32 targets are left unchanged pending evaluation.

Fixes the rocFFT len-6561 regression (hot-kernel VGPR 172 -> ~117, occupancy 2 -> 4 waves/SIMD on MI350) with no rocRAND regression. Regenerates 4 min/max CostModel tests; the only semantic change is the <2 x float> +2 (the two lane-align inserts), with no packed min/max op to lose.

Ticket: LCOMPILER-2268

@michaelselehov michaelselehov requested a review from ronlieb May 29, 2026 16:31
@michaelselehov
Author

Summary

On gfx9 packed-FP32 targets (gfx90a / gfx94x / gfx950) the SLP vectorizer
treats building a <2 x float> from two scalars as free. It is not: a
v_pk_{add,mul,fma}_f32 source operand must occupy an aligned, adjacent VGPR
pair, so synthesizing such a pair from non-adjacent scalars costs roughly one
v_mov_b32 per lane to line the halves up. Because the cost model reports these
moves as free, SLP over-vectorizes FP32 code into packed form, widening VGPR
live ranges and cutting occupancy.

This is the root cause of the rocFFT len-6561 regression (ww-19 amd-staging
promo): the hot kernel's .vgpr_count rose 116 → 172, dropping occupancy from
4 to 2 waves/SIMD on MI350 and costing ~20-24% wall time on the worst subtest.
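
For reviewers who want to sanity-check the occupancy numbers, here is a rough sketch of the arithmetic. It assumes a budget of 512 VGPRs per SIMD lane, an allocation granule of 8 registers, and a cap of 8 waves per SIMD. None of those constants is stated in this PR, and `wavesPerSIMD` is a hypothetical helper, not an LLVM API.

```cpp
// Assumptions (not taken from this PR): 512 allocatable VGPRs per SIMD lane,
// registers allocated in granules of 8, at most 8 waves resident per SIMD.
constexpr unsigned kVGPRBudget = 512;
constexpr unsigned kGranule = 8;
constexpr unsigned kMaxWaves = 8;

// Hypothetical helper: how many waves fit on one SIMD for a given
// per-wave VGPR count.
constexpr unsigned wavesPerSIMD(unsigned VGPRCount) {
  // Round the request up to a whole number of granules.
  const unsigned Allocated = (VGPRCount + kGranule - 1) / kGranule * kGranule;
  if (Allocated == 0)
    return kMaxWaves;
  const unsigned Fit = kVGPRBudget / Allocated;
  return Fit < kMaxWaves ? Fit : kMaxWaves;
}

// The two register counts quoted above.
static_assert(wavesPerSIMD(116) == 4, "116 VGPRs: 4 waves/SIMD");
static_assert(wavesPerSIMD(172) == 2, "172 VGPRs: 2 waves/SIMD");
```

Under the same assumptions, any count from 105 to 128 registers still rounds to 4 waves, so the kernel does not need to return to exactly 116 to recover the lost occupancy.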

The patch teaches GCNTTIImpl to charge that setup cost in two hooks:

  • getVectorInstrCost(InsertElement, <2 x float>): 1 (the lane-align move).
  • getShuffleCost(<2 x float> ...): 1 per non-identity result lane
    (broadcast/permute that must move a lane into its pair slot).
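
The two charges above can be sketched as standalone functions. This is an illustration only, not the code in the patch: `insertSurcharge`, `shuffleSurcharge`, and `kLaneAlignCost` are hypothetical names, the `Gated` flag stands in for the target/type check, and treating an undefined (negative) mask index as free is an assumption of this sketch.

```cpp
#include <array>

// Hypothetical stand-in for the per-lane setup cost (about one v_mov_b32).
constexpr int kLaneAlignCost = 1;

// Extra cost of one InsertElement into a <2 x float> when the gate applies.
constexpr int insertSurcharge(bool Gated) {
  return Gated ? kLaneAlignCost : 0;
}

// Extra cost of a two-lane shuffle: charge every result lane whose mask
// index differs from its own position. Negative indices (undefined lanes)
// are treated as free here, which is an assumption of this sketch.
constexpr int shuffleSurcharge(const std::array<int, 2> &Mask, bool Gated) {
  if (!Gated)
    return 0;
  int Cost = 0;
  for (int Lane = 0; Lane < 2; ++Lane)
    if (Mask[Lane] >= 0 && Mask[Lane] != Lane)
      Cost += kLaneAlignCost;
  return Cost;
}

// Identity keeps both lanes in place; a broadcast of lane 0 moves one lane;
// a swap moves both.
static_assert(shuffleSurcharge(std::array<int, 2>{{0, 1}}, true) == 0, "identity");
static_assert(shuffleSurcharge(std::array<int, 2>{{0, 0}}, true) == 1, "broadcast");
static_assert(shuffleSurcharge(std::array<int, 2>{{1, 0}}, true) == 2, "swap");
```

Building a <2 x float> from two scalars takes two inserts, so the sketch charges 2 in total for the pair, which matches the +2 seen in the regenerated tests.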

With the cost model honest, SLP stops manufacturing packed pairs that do not pay
for themselves; the rocFFT kernel returns to the GOOD codegen shape.

Why the gate is f32 + gfx9 (and benchmark-independent)

Both narrowings are justified by the ISA, not by tuning to a benchmark:

  • f32 only. At 32-bit element width the only packed VOP3P ALU
    instructions are v_pk_add_f32, v_pk_mul_f32, v_pk_fma_f32. There is no
    packed 32-bit integer op (rg "V_PK_..._(I32|U32)" in
    VOP3PInstructions.td is empty). The cost models the pair-alignment a
    v_pk_*_f32 source needs; a <2 x i32> has no packed consumer at 32-bit, so
    building one carries no such penalty and must not be taxed.
  • gfx9 only (getGeneration() == AMDGPUSubtarget::GFX9). Packed FP32 also
    exists on gfx12 (gfx1250, Generation::GFX12). We scope this change to the
    validated gfx9 CDNA family and leave newer targets unchanged pending separate
    evaluation, to avoid unreviewed cost-model shifts there.
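
As a rough illustration of how the two narrowings combine, the gate can be written as a single predicate. The types below are simplified stand-ins, not the real subtarget or type classes, and `chargesPairSetup` is a hypothetical name.

```cpp
// Simplified stand-ins for the subtarget and vector type (hypothetical).
enum class Generation { GFX9, GFX10, GFX11, GFX12 };
enum class ElemKind { F16, F32, I32 };

struct Target {
  Generation Gen;
  bool HasPackedFP32; // true for gfx90a / gfx94x / gfx950 and for gfx1250
};

struct VecType {
  ElemKind Elem;
  unsigned NumElts;
};

// The setup cost applies only to <2 x float> on gfx9 targets that have
// packed FP32; everything else keeps its existing cost.
constexpr bool chargesPairSetup(Target T, VecType V) {
  return T.HasPackedFP32 && T.Gen == Generation::GFX9 &&
         V.Elem == ElemKind::F32 && V.NumElts == 2;
}

// gfx950-like target, <2 x float>: charged.
static_assert(chargesPairSetup({Generation::GFX9, true}, {ElemKind::F32, 2}),
              "gfx9 packed-FP32, <2 x float>");
// Same target, <2 x i32>: no packed consumer, so not charged.
static_assert(!chargesPairSetup({Generation::GFX9, true}, {ElemKind::I32, 2}),
              "<2 x i32> untouched");
// gfx1250-like target: packed FP32 exists but the gate excludes gfx12.
static_assert(!chargesPairSetup({Generation::GFX12, true}, {ElemKind::F32, 2}),
              "gfx12 untouched");
```

Keeping both conditions in one predicate makes it easy to widen the gate later (for example to gfx12) without touching the cost arithmetic itself.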

The patch uses plain integer InstructionCost values; it has no dependency on
PR llvm#178962 ("Always scale InstructionCost::Value") or any fractional-cost
infrastructure.

Validation

Reproducers were compiled offline (no GPU) with the patched clang -cc1, targeting gfx950:

| Reproducer | Metric | Baseline (BAD) | Patched | GOOD criterion | Result |
|---|---|---|---|---|---|
| rocFFT len-6561 (LCOMPILER-2268) | kernel .vgpr_count | 172 | 123 | ≤ 130 | regression fixed |
| rocRAND xorwow (LCOMPILER-2230) | phi <2 x i32> in .preheader.i | 3 | 2 | ≤ 2 | no regression (= GOOD) |

Notes:

  • rocRAND's hot tree is integer (<2 x i32>); it is already at GOOD on current
    amd-staging via the in-tree -slp-inst-count-check heuristic. This f32-only
    patch does not touch i32, so rocRAND stays at its GOOD value (vphi = 2; GOOD
    is ≤ 2, not 0).
  • ninja check-llvm on amd-staging + this patch: 0 unexpected failures
    (45231 passed, 92 XFAIL).
  • Analysis/CostModel/AMDGPU + Transforms/SLPVectorizer/AMDGPU: 84/84 pass.

Test changes — per-test rationale (why none is a regression)

The only cost numbers that move are the <2 x float> min/max intrinsics on the
packed-FP32 gfx9 targets, each by +2 (= the two lane-align InsertElements
the scalarizer must emit to assemble a <2 x float> result). No scalar,
wider-vector (<3/4/8/16 x float>), integer, or gfx900/generic/gfx12 costs
change. A GFX900-SIZE prefix was added to the non-packed gfx900
code-size RUN line so packed (gfx90a) and non-packed (gfx900) split cleanly
under -cost-kind=code-size.

Common reason none of these makes codegen worse: there is no packed min/max
instruction (v_pk_*_f32 exists only for add/mul/fma). A <2 x float> min or
max therefore never lowers to a packed op — it is always scalarized into two
v_{min,max}_f32. Raising the modeled cost of building that <2 x float>
only makes the scalarization overhead honest; it cannot suppress a profitable
vectorization, because there is no packed form to vectorize toward.

| Test | Changed line(s) | Cost before → after | Why not worse |
|---|---|---|---|
| CostModel/AMDGPU/maximum.ll | llvm.maximum.v2f32 | gfx950 2→4, gfx90a 20→22 (size: 2→4) | llvm.maximum (IEEE-2019, NaN-propagating) has no packed form; always scalarized. The +2 is exactly the two result-lane inserts; it reflects real v_mov_b32s, and cannot de-vectorize a packed op that does not exist. |
| CostModel/AMDGPU/minimum.ll | llvm.minimum.v2f32 | gfx950 2→4, gfx90a 20→22 (size: 2→4) | Same as maximum: scalarized minimum, +2 = two lane inserts, no packed min to lose. |
| CostModel/AMDGPU/maxnum.ll | llvm.maxnum.v2f32 | gfx90a 4→6 (size: 4→6) | maxnum lowers to scalar v_max_f32 per lane (no v_pk_max_f32 exists). The +2 captures the pair assembly; vector form was never cheaper than scalar, so nothing beneficial is suppressed. |
| CostModel/AMDGPU/minnum.ll | llvm.minnum.v2f32 | gfx90a 4→6 (size: 4→6) | Same as maxnum: scalar v_min_f32 per lane, +2 for pair assembly, no packed min to lose. |

(The four tests are auto-generated via utils/update_analyze_test_checks.py;
the diffs also include the routine prefix-materialization the script emits when
gfx90a/gfx950 diverge from the non-packed path. The only semantic change is
the +2 above.)
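
The +2 in each row can be reproduced with a one-line model: the old cost plus one insert per result lane. This is only a sketch of the arithmetic, with hypothetical names (`kPerLaneInsertCost`, `scalarizedCostAfter`); the real numbers come from the cost-model hooks and the regenerated CHECK lines.

```cpp
// Hypothetical per-lane insert charge (about one v_mov_b32).
constexpr int kPerLaneInsertCost = 1;

// Modeled cost after the patch: the previous cost plus one lane-align
// insert for each lane of the scalarized result.
constexpr int scalarizedCostAfter(int CostBefore, int NumLanes) {
  return CostBefore + NumLanes * kPerLaneInsertCost;
}

// The three before/after pairs reported for <2 x float> in the tests.
static_assert(scalarizedCostAfter(2, 2) == 4, "maximum/minimum, gfx950");
static_assert(scalarizedCostAfter(20, 2) == 22, "maximum/minimum, gfx90a");
static_assert(scalarizedCostAfter(4, 2) == 6, "maxnum/minnum, gfx90a");
```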

Scope / follow-ups
