[AMDGPU] Cost packed-FP32 <2 x float> VGPR-pair setup on gfx9 #2730
michaelselehov wants to merge 1 commit into
A <2 x float> feeding a v_pk_*_f32 op must occupy an aligned VGPR pair, so
charge ~1 v_mov_b32 per lane in getVectorInstrCost(InsertElement) and
getShuffleCost. This stops the SLP vectorizer over-vectorizing FP32 into
packed form, which was widening VGPR live ranges and cutting occupancy.
Gated to f32 (at 32-bit width the only packed VOP3P ALU ops are
v_pk_{add,mul,fma}_f32 - there is no packed 32-bit integer op) on the gfx9
generation; gfx12 packed-FP32 targets are left unchanged pending evaluation.
Fixes the rocFFT len-6561 regression (hot-kernel VGPR 172 -> ~117, occupancy
2 -> 4 waves/SIMD on MI350) with no rocRAND regression. Regenerates 4 min/max
CostModel tests; the only semantic change is the <2 x float> +2 (the two
lane-align inserts), with no packed min/max op to lose.
Ticket: LCOMPILER-2268
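The gating described above can be sketched as a small standalone model. This is an illustration only, not the patch itself: the `Target` and `ElemTy` enums and the `pairSetupCost` helper are hypothetical names, and the exact conditions checked in AMDGPUTargetTransformInfo.cpp may differ. It assumes the extra charge applies only to a two-element f32 vector on the gfx9 packed-FP32 targets.

```cpp
// Illustrative model only: these names are hypothetical and are not the
// identifiers used in AMDGPUTargetTransformInfo.cpp.
enum class Target { GFX900, GFX90A, GFX94X, GFX950, GFX1250 };
enum class ElemTy { F32, I32 };

// The gfx9 packed-FP32 targets named in this PR.
inline bool isGfx9PackedFP32(Target T) {
  return T == Target::GFX90A || T == Target::GFX94X || T == Target::GFX950;
}

// Extra cost charged for assembling a vector ahead of a packed op:
// ~1 v_mov_b32 per lane, but only for <2 x float> on the gated targets.
inline int pairSetupCost(Target T, ElemTy Ty, unsigned NumElts) {
  if (!isGfx9PackedFP32(T) || Ty != ElemTy::F32 || NumElts != 2)
    return 0;
  return static_cast<int>(NumElts); // one move per lane
}
```

Under this model a <2 x float> on gfx90a or gfx950 picks up +2, while gfx900, gfx1250, <2 x i32>, and wider float vectors are unchanged, which matches the scope stated in the description.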
Summary
On gfx9 packed-FP32 targets (gfx90a / gfx94x / gfx950) the SLP vectorizer
This is the root cause of the rocFFT len-6561 regression (ww-19 amd-staging
The patch teaches
With the cost model honest, SLP stops manufacturing packed pairs that do not pay
Why the gate is
| Reproducer | Metric | Baseline (BAD) | Patched | GOOD criterion | Result |
|---|---|---|---|---|---|
| rocFFT len-6561 (LCOMPILER-2268) | kernel .vgpr_count | 172 | 123 | ≤ 130 | regression fixed |
| rocRAND xorwow (LCOMPILER-2230) | phi <2 x i32> in .preheader.i | 3 | 2 | ≤ 2 | no regression (= GOOD) |
Notes:
- rocRAND's hot tree is integer (<2 x i32>); it is already at GOOD on current
amd-staging via the in-tree -slp-inst-count-check heuristic. This f32-only
patch does not touch i32, so rocRAND stays at its GOOD value (vphi = 2; GOOD
is ≤ 2, not 0).
- ninja check-llvm on amd-staging + this patch: 0 unexpected failures
(45231 passed, 92 XFAIL).
- Analysis/CostModel/AMDGPU + Transforms/SLPVectorizer/AMDGPU: 84/84 pass.
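The link between the VGPR counts and the occupancy figures quoted in the description can be checked with a rough calculation. The sketch below is a simplified model under assumptions that are not stated in this PR: a budget of 512 VGPRs per SIMD lane, an allocation granule of 8 registers, and a cap of 8 waves per SIMD, with .vgpr_count treated as the whole per-wave allocation. The constant and function names are hypothetical.

```cpp
#include <algorithm>

// Assumed hardware parameters for the gfx90a-class targets (not from the PR).
const unsigned kTotalVGPRs = 512; // per-lane VGPR budget on one SIMD
const unsigned kGranule = 8;      // VGPR allocation granularity
const unsigned kMaxWaves = 8;     // upper bound on waves per SIMD

// Round Value up to the next multiple of Granule.
inline unsigned alignUp(unsigned Value, unsigned Granule) {
  return (Value + Granule - 1) / Granule * Granule;
}

// Waves per SIMD that fit when each wave needs VgprCount registers.
inline unsigned wavesPerSIMD(unsigned VgprCount) {
  unsigned Alloc = alignUp(std::max(VgprCount, 1u), kGranule);
  return std::min(kMaxWaves, kTotalVGPRs / Alloc);
}
```

With these assumptions, 172 VGPRs round up to 176 and allow 2 waves, while 123 VGPRs round up to 128 and allow 4 waves, consistent with the 2 -> 4 waves/SIMD change reported for the rocFFT kernel.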
Test changes — per-test rationale (why none is a regression)
The only cost numbers that move are the <2 x float> min/max intrinsics on the
packed-FP32 gfx9 targets, each by +2 (= the two lane-align InsertElements
the scalarizer must emit to assemble a <2 x float> result). No scalar, no
wider vector (<3/4/8/16 x float>), no integer, no gfx900/generic/gfx12
cost changes. A GFX900-SIZE prefix was added to the non-packed gfx900
code-size RUN line so packed (gfx90a) and non-packed (gfx900) split cleanly
under -cost-kind=code-size.
Common reason none of these makes codegen worse: there is no packed min/max
instruction (v_pk_*_f32 exists only for add/mul/fma). A <2 x float> min or
max therefore never lowers to a packed op — it is always scalarized into two
v_{min,max}_f32. Raising the modeled cost of building that <2 x float>
only makes the scalarization overhead honest; it cannot suppress a profitable
vectorization, because there is no packed form to vectorize toward.
| Test | Changed line(s) | Cost before → after | Why not worse |
|---|---|---|---|
| CostModel/AMDGPU/maximum.ll | llvm.maximum.v2f32 | gfx950 2→4, gfx90a 20→22 (size: 2→4) | llvm.maximum (IEEE-2019, NaN-propagating) has no packed form; always scalarized. The +2 is exactly the two result-lane inserts; it reflects real v_mov_b32s, and cannot de-vectorize a packed op that does not exist. |
| CostModel/AMDGPU/minimum.ll | llvm.minimum.v2f32 | gfx950 2→4, gfx90a 20→22 (size: 2→4) | Same as maximum: scalarized minimum, +2 = two lane inserts, no packed min to lose. |
| CostModel/AMDGPU/maxnum.ll | llvm.maxnum.v2f32 | gfx90a 4→6 (size: 4→6) | maxnum lowers to scalar v_max_f32 per lane (no v_pk_max_f32 exists). The +2 captures the pair assembly; vector form was never cheaper than scalar, so nothing beneficial is suppressed. |
| CostModel/AMDGPU/minnum.ll | llvm.minnum.v2f32 | gfx90a 4→6 (size: 4→6) | Same as maxnum: scalar v_min_f32 per lane, +2 for pair assembly, no packed min to lose. |
(The four tests are auto-generated via utils/update_analyze_test_checks.py;
the diffs also include the routine prefix-materialization the script emits when
gfx90a/gfx950 diverge from the non-packed path. The only semantic change is
the +2 above.)
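As a quick cross-check of the table, the before/after pairs can be written down and compared against the claimed uniform +2 shift. This is only a bookkeeping sketch: the `CostDelta` struct and `allShiftBy` helper are hypothetical names, and the numbers are copied from the rows above rather than computed by the cost model.

```cpp
#include <cstddef>

// One before/after cost pair taken from the table (hypothetical struct).
struct CostDelta {
  const char *Intrinsic;
  const char *Target;
  int Before;
  int After;
};

// Values transcribed from the per-test table; code-size pairs included.
static const CostDelta kDeltas[] = {
    {"llvm.maximum.v2f32", "gfx950", 2, 4},
    {"llvm.maximum.v2f32", "gfx90a", 20, 22},
    {"llvm.maximum.v2f32", "size", 2, 4},
    {"llvm.minimum.v2f32", "gfx950", 2, 4},
    {"llvm.minimum.v2f32", "gfx90a", 20, 22},
    {"llvm.minimum.v2f32", "size", 2, 4},
    {"llvm.maxnum.v2f32", "gfx90a", 4, 6},
    {"llvm.maxnum.v2f32", "size", 4, 6},
    {"llvm.minnum.v2f32", "gfx90a", 4, 6},
    {"llvm.minnum.v2f32", "size", 4, 6},
};

// True if every listed cost moved by exactly Delta.
inline bool allShiftBy(int Delta) {
  for (std::size_t I = 0; I < sizeof(kDeltas) / sizeof(kDeltas[0]); ++I)
    if (kDeltas[I].After - kDeltas[I].Before != Delta)
      return false;
  return true;
}
```

Every row moves by the same two units, one per lane of the <2 x float> result, regardless of the baseline cost of the scalarized operation.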
Scope / follow-ups
- gfx12 (gfx1250) packed-FP32 is intentionally left unchanged here.
- Cherry-pick list for the TheRock pre-release: this single commit
(AMDGPUTargetTransformInfo.cpp + 4 regenerated CostModel tests). It does not
require llvm/llvm-project#178962 ([Support] Always scale InstructionCost::Value) as a prerequisite.