[AMDGPU] Cost packed-FP32 <2 x float> VGPR-pair setup on gfx9 by michaelselehov · Pull Request #2730 · ROCm/llvm-project

michaelselehov · 2026-05-29T16:31:18Z

A <2 x float> feeding a v_pk_*_f32 op must occupy an aligned VGPR pair, so charge ~1 v_mov_b32 per lane in getVectorInstrCost(InsertElement) and getShuffleCost. This stops the SLP vectorizer over-vectorizing FP32 into packed form, which was widening VGPR live ranges and cutting occupancy.

Gated to f32 (at 32-bit width the only packed VOP3P ALU ops are v_pk_{add,mul,fma}_f32 - there is no packed 32-bit integer op) on the gfx9 generation; gfx12 packed-FP32 targets are left unchanged pending evaluation.

Fixes the rocFFT len-6561 regression (hot-kernel VGPR 172 -> ~117, occupancy 2 -> 4 waves/SIMD on MI350) with no rocRAND regression. Regenerates 4 min/max CostModel tests; the only semantic change is the <2 x float> +2 (the two lane-align inserts), with no packed min/max op to lose.

Ticket: LCOMPILER-2268


michaelselehov · 2026-05-29T16:31:52Z

Summary

On gfx9 packed-FP32 targets (gfx90a / gfx94x / gfx950) the SLP vectorizer
treats building a <2 x float> from two scalars as free. It is not: a
v_pk_{add,mul,fma}_f32 source operand must occupy an aligned, adjacent VGPR
pair, so synthesizing such a pair from non-adjacent scalars costs roughly one
v_mov_b32 per lane to line the halves up. Because the cost model reports these
moves as free, SLP over-vectorizes FP32 code into packed form, widening VGPR
live ranges and cutting occupancy.

This is the root cause of the rocFFT len-6561 regression (ww-19 amd-staging
promo): the hot kernel's .vgpr_count rose 116 → 172, dropping occupancy from
4 to 2 waves/SIMD on MI350 and costing ~20-24% wall time on the worst subtest.
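The occupancy numbers follow from simple register-file arithmetic. The sketch below is illustrative only: the 512-VGPR budget per SIMD lane, the allocation granule of 8, and the 8-wave cap are assumptions about gfx90a-class targets that this PR does not state, and `wavesPerSIMD` is a hypothetical helper, not an LLVM API.

```cpp
#include <algorithm>

// Assumed hardware parameters for gfx90a-class targets (not taken from this PR).
constexpr unsigned kTotalVGPRsPerSIMD = 512; // assumed VGPR budget per lane
constexpr unsigned kVGPRAllocGranule = 8;    // assumed allocation granularity
constexpr unsigned kMaxWavesPerSIMD = 8;     // assumed hardware wave limit

// Round the kernel's VGPR count up to the allocation granule, then count how
// many such allocations fit in the register file.
constexpr unsigned wavesPerSIMD(unsigned VGPRCount) {
  unsigned Allocated = (VGPRCount + kVGPRAllocGranule - 1) /
                       kVGPRAllocGranule * kVGPRAllocGranule;
  if (Allocated == 0) // a kernel using no VGPRs is limited only by the cap
    return kMaxWavesPerSIMD;
  return std::min(kMaxWavesPerSIMD, kTotalVGPRsPerSIMD / Allocated);
}
```

Under these assumptions, 116 VGPRs round up to 120 and fit 4 waves, while 172 VGPRs round up to 176 and fit only 2, which matches the drop described above.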

The patch teaches GCNTTIImpl to charge that setup cost in two hooks:

- getVectorInstrCost(InsertElement, <2 x float>) → 1 (the lane-align move).
- getShuffleCost(<2 x float> ...) → 1 per non-identity result lane
  (broadcast/permute that must move a lane into its pair slot).
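The two charging rules can be sketched as standalone functions. This is a simplified model, not the code in AMDGPUTargetTransformInfo.cpp: the function names are hypothetical, and the shuffle sketch assumes a single-source mask whose entries are 0 or 1 (no undef lanes, no second source vector).

```cpp
#include <array>

// Hypothetical stand-in for the InsertElement rule on an in-scope <2 x float>:
// one v_mov_b32 to place the scalar into its half of the aligned pair.
constexpr int insertElementCostV2F32() { return 1; }

// Hypothetical stand-in for the shuffle rule. Mask[i] is the source lane that
// feeds result lane i. A lane already in place (Mask[i] == i) is free; every
// other lane is charged one move.
constexpr int shuffleCostV2F32(std::array<int, 2> Mask) {
  int Cost = 0;
  for (int Lane = 0; Lane < 2; ++Lane)
    if (Mask[Lane] != Lane)
      ++Cost;
  return Cost;
}
```

In this model an identity mask `{0, 1}` costs 0, a broadcast `{0, 0}` costs 1 (only lane 1 moves), and a swap `{1, 0}` costs 2.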

With the cost model honest, SLP stops manufacturing packed pairs that do not pay
for themselves; the rocFFT kernel returns to the GOOD codegen shape.

Why the gate is `f32` + gfx9 (and benchmark-independent)

Both narrowings are justified by the ISA, not by tuning to a benchmark:

- f32 only. At 32-bit element width the only packed VOP3P ALU
  instructions are v_pk_add_f32, v_pk_mul_f32, v_pk_fma_f32. There is no
  packed 32-bit integer op (rg "V_PK_..._(I32|U32)" in
  VOP3PInstructions.td is empty). The cost models the pair-alignment a
  v_pk_*_f32 source needs; a <2 x i32> has no packed consumer at 32-bit, so
  building one carries no such penalty and must not be taxed.
- gfx9 only (getGeneration() == AMDGPUSubtarget::GFX9). Packed FP32 also
  exists on gfx12 (gfx1250, Generation::GFX12). We scope this change to the
  validated gfx9 CDNA family and leave newer targets unchanged pending separate
  evaluation, to avoid unreviewed cost-model shifts there.
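The gate described above can be sketched as a single predicate. The types and the `chargesPairSetup` name are hypothetical, simplified mirrors of what the real hook inspects. Only the `getGeneration() == AMDGPUSubtarget::GFX9` check is quoted from this PR, and the `HasPackedFP32` flag is an assumed stand-in for however the patch identifies packed-FP32 subtargets.

```cpp
// Hypothetical, simplified mirrors of the properties the gate inspects.
enum class Generation { GFX9, GFX10, GFX11, GFX12 };
enum class ElemKind { Float, Integer };

struct VecType {
  ElemKind Kind;
  unsigned ElemBits;
  unsigned NumElts;
};

struct Target {
  Generation Gen;
  bool HasPackedFP32; // assumed true for gfx90a / gfx94x / gfx950 and gfx1250
};

// Charge the pair-setup cost only for <2 x float> on gfx9 packed-FP32 targets.
constexpr bool chargesPairSetup(Target T, VecType Ty) {
  return T.Gen == Generation::GFX9 && T.HasPackedFP32 &&
         Ty.Kind == ElemKind::Float && Ty.ElemBits == 32 && Ty.NumElts == 2;
}
```

With this shape, a `<2 x i32>` on gfx950 and a `<2 x float>` on gfx1250 both fall outside the gate, as does any type on a gfx9 target without packed FP32 such as gfx900.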

The patch uses plain integer InstructionCost values; it has no dependency on
PR llvm#178962 ("Always scale InstructionCost::Value") or any fractional-cost
infrastructure.

Validation

Reproducers compiled offline (no GPU) with the patched clang -cc1 on gfx950:

| Reproducer | Metric | Baseline (BAD) | Patched | GOOD criterion | Result |
| --- | --- | --- | --- | --- | --- |
| rocFFT len-6561 (LCOMPILER-2268) | kernel `.vgpr_count` | 172 | 123 | ≤ 130 | regression fixed |
| rocRAND xorwow (LCOMPILER-2230) | `phi <2 x i32>` in `.preheader.i` | 3 | 2 | ≤ 2 | no regression (= GOOD) |

Notes:

- rocRAND's hot tree is integer (<2 x i32>); it is already at GOOD on current
  amd-staging via the in-tree -slp-inst-count-check heuristic. This f32-only
  patch does not touch i32, so rocRAND stays at its GOOD value (vphi = 2; GOOD
  is ≤ 2, not 0).
- ninja check-llvm on amd-staging + this patch: 0 unexpected failures
  (45231 passed, 92 XFAIL).
- Analysis/CostModel/AMDGPU + Transforms/SLPVectorizer/AMDGPU: 84/84 pass.

Test changes — per-test rationale (why none is a regression)

The only cost numbers that move are the <2 x float> min/max intrinsics on the
packed-FP32 gfx9 targets, each by +2 (= the two lane-align InsertElements
the scalarizer must emit to assemble a <2 x float> result). No scalar,
wider-vector (<3/4/8/16 x float>), integer, or gfx900/generic/gfx12 costs
change. A GFX900-SIZE prefix was added to the non-packed gfx900
code-size RUN line so that the packed (gfx90a) and non-packed (gfx900) checks
split cleanly under -cost-kind=code-size.

Common reason none of these makes codegen worse: there is no packed min/max
instruction (v_pk_*_f32 exists only for add/mul/fma). A <2 x float> min or
max therefore never lowers to a packed op — it is always scalarized into two
v_{min,max}_f32. Raising the modeled cost of building that <2 x float>
only makes the scalarization overhead honest; it cannot suppress a profitable
vectorization, because there is no packed form to vectorize toward.

| Test | Changed line(s) | Cost before → after | Why not worse |
| --- | --- | --- | --- |
| `CostModel/AMDGPU/maximum.ll` | `llvm.maximum.v2f32` | gfx950 2→4, gfx90a 20→22 (size: 2→4) | `llvm.maximum` (IEEE-2019, NaN-propagating) has no packed form; always scalarized. The +2 is exactly the two result-lane inserts; it reflects real `v_mov_b32`s, and cannot de-vectorize a packed op that does not exist. |
| `CostModel/AMDGPU/minimum.ll` | `llvm.minimum.v2f32` | gfx950 2→4, gfx90a 20→22 (size: 2→4) | Same as `maximum`: scalarized minimum, +2 = two lane inserts, no packed min to lose. |
| `CostModel/AMDGPU/maxnum.ll` | `llvm.maxnum.v2f32` | gfx90a 4→6 (size: 4→6) | `maxnum` lowers to scalar `v_max_f32` per lane (no `v_pk_max_f32` exists). The +2 captures the pair assembly; vector form was never cheaper than scalar, so nothing beneficial is suppressed. |
| `CostModel/AMDGPU/minnum.ll` | `llvm.minnum.v2f32` | gfx90a 4→6 (size: 4→6) | Same as `maxnum`: scalar `v_min_f32` per lane, +2 for pair assembly, no packed min to lose. |

(The four tests are auto-generated via utils/update_analyze_test_checks.py;
the diffs also include the routine prefix-materialization the script emits when
gfx90a/gfx950 diverge from the non-packed path. The only semantic change is
the +2 above.)
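The +2 in every row is the same arithmetic: the pre-patch cost plus one InsertElement per result lane. A minimal sketch, where `scalarizedV2F32Cost` is a hypothetical helper and the base costs are simply the "before" values from the table rather than anything derived here:

```cpp
// Illustrative only: BaseCost is the pre-patch cost of the scalarized
// <2 x float> intrinsic; the patch adds InsertCost for each result lane that
// must be placed into the aligned pair.
constexpr int scalarizedV2F32Cost(int BaseCost, int InsertCost = 1,
                                  int NumLanes = 2) {
  return BaseCost + NumLanes * InsertCost;
}
```

Feeding in the table's base costs 2, 20, and 4 gives 4, 22, and 6 respectively.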

Scope / follow-ups

- gfx12 (gfx1250) packed-FP32 is intentionally left unchanged here.
- Cherry-pick list for the TheRock pre-release: this single commit
  (AMDGPUTargetTransformInfo.cpp + 4 regenerated CostModel tests).
  llvm/llvm-project#178962 ("[Support] Always scale InstructionCost::Value")
  is not a prerequisite.

michaelselehov requested a review from ronlieb May 29, 2026 16:31

michaelselehov wants to merge 1 commit into amd-staging from amd/dev/mselehov/lcompiler-2268-amdgpu-packed-fp32-tti-cost
