[WaveTransform] Some more AMDGPU intrinsic tests fixed by cdevadas · Pull Request #2717 · ROCm/llvm-project

cdevadas · 2026-05-29T06:30:21Z

Enabled -amdgpu-late-wave-transform=1 and fixed their FileCheck patterns for the codegen changes in the new pass flow.

Tests updated:

llvm.amdgcn.ballot.{i32,i64}.ll
llvm.amdgcn.cvt.{fp8.e5m3,fp8,pk.i16,pk.u16,pknorm.i16,pknorm.u16}.ll
llvm.amdgcn.ds.gws.barrier-{bundle,fastregalloc}.ll
llvm.amdgcn.dead.ll, llvm.amdgcn.exp.ll
llvm.amdgcn.fmad.ftz.ll,
llvm.amdgcn.iglp.opt.exp.{large,small}.mir
llvm.amdgcn.image.sample.noret.ll
llvm.amdgcn.init.whole.wave-w32.ll

Enabled -amdgpu-late-wave-transform=1 and fixed their FileCheck patterns for the codegen changes in the new pass flow. Tests updated: - llvm.amdgcn.ballot.{i32,i64}.ll - llvm.amdgcn.cvt.{fp8.e5m3,fp8,pk.i16,pk.u16,pknorm.i16,pknorm.u16}.ll - llvm.amdgcn.ds.gws.barrier-{bundle,fastregalloc}.ll - llvm.amdgcn.dead.ll, llvm.amdgcn.exp.ll - llvm.amdgcn.fmad.ftz.ll, - llvm.amdgcn.iglp.opt.exp.{large,small}.mir - llvm.amdgcn.image.sample.noret.ll - llvm.amdgcn.init.whole.wave-w32.ll

rocm-cciapp · 2026-05-29T06:32:00Z

PSDB Build Link: http://mlse-bdc-20dd129:8065/#/builders/10/builds/399

TejaX-Alaghari · 2026-05-29T07:10:56Z

+; GFX10_11-SDAG-NEXT:    s_wqm_b32 exec_lo, exec_lo
+; GFX10_11-SDAG-NEXT:    v_mov_b32_e32 v4, v0
+; GFX10_11-SDAG-NEXT:    s_and_b32 exec_lo, exec_lo, s12
+; GFX10_11-SDAG-NEXT:    image_sample v[0:3], v4, s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_1D


Both the SDAG and GISEL outputs match except for this line.

Can the GFX10_11 label be retained for all the instructions except for the above one?

The auto-gen script doesn't work that way. It uses the common prefix when the entire ISA matches for multiple RUN lines. If there is even a single line or even a register mismatch, it prefers separate prefixes.

TejaX-Alaghari · 2026-05-29T07:11:49Z

+; GFX12PLUS-SDAG-NEXT:    v_mov_b32_e32 v4, v0
+; GFX12PLUS-SDAG-NEXT:    s_and_b32 exec_lo, exec_lo, s12
+; GFX12PLUS-SDAG-NEXT:    image_sample v[0:3], v4, s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_1D
+; GFX12PLUS-SDAG-NEXT:    image_sample off, v4, s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_1D


Same comment as above.

TejaX-Alaghari · 2026-05-29T07:39:24Z

 ; DAGISEL12-NEXT:    s_mov_b32 s7, s4
 ; DAGISEL12-NEXT:    s_mov_b32 s6, s3
+; DAGISEL12-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; DAGISEL12-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 1, v0
+; DAGISEL12-NEXT:    s_xor_b32 s4, vcc_lo, exec_lo
 ; DAGISEL12-NEXT:    s_wait_alu depctr_sa_sdst(0)
-; DAGISEL12-NEXT:    s_and_saveexec_b32 s3, s8
-; DAGISEL12-NEXT:  ; %bb.1: ; %shader
+; DAGISEL12-NEXT:    s_xor_b32 s3, exec_lo, s4
+; DAGISEL12-NEXT:    s_mov_b32 exec_lo, s4


The below LWT CodeGen

(s8=old_exec, exec=-1) v_cndmask_b32_e64 v0, 0, 1, s8 v_cmp_ne_u32_e32 vcc_lo, 1, v0 (vcc_lo=~s8 -> ~old_exec) s_xor_b32 s4, vcc_lo, exec_lo (s4 =s8 -> old_exec) s_xor_b32 s3, exec_lo, s4 (s3 =~s8 -> ~old_exec) s_mov_b32 exec_lo, s4 (exec =s4 -> s8 -> old_exec)

Can be replaced with a single s_and_saveexec_b32 s3, s8 similar to St-WT

The V_cndmask+v_cmp_ne + s_xor + s_xor is significantly worse when branching based on lane mask (s8)

TejaX-Alaghari · 2026-05-29T09:58:30Z

-; CHECK-NEXT:    s_or_b64 s[0:1], s[0:1], s[4:5]
-; CHECK-NEXT:  ; %bb.4: ; %exit
+; CHECK-NEXT:    s_xor_b64 s[0:1], vcc, exec
+; CHECK-NEXT:    s_xor_b64 s[2:3], exec, s[0:1]


This s_xor is redundant - s[2:3] will be same as vcc

TejaX-Alaghari · 2026-05-29T10:03:05Z

-; CHECK-NEXT:    v_cmp_gt_u32_e64 s[0:1], 2, v2
-; CHECK-NEXT:    ; implicit-def: $vgpr2
-; CHECK-NEXT:  ; %bb.2: ; %Flow
-; CHECK-NEXT:    s_andn2_saveexec_b64 s[2:3], s[2:3]


Is lane mask being correctly flipped here?

Because exec=vcc & s[2:3]=~vcc before this instruction - And after the s_andn2_saveexec

exec=exec & ~s[2:3] =>exec=vcc (same as before)

LWT seems to correctly run the block %A with vcc and %B with ~vcc.

Please correct my understanding if I'm missing something in following the flow.

TejaX-Alaghari · 2026-05-29T10:06:16Z

 ; GFX1250-TRUE16-NEXT:    v_mov_b16_e32 v0.l, s2
-; GFX1250-TRUE16-NEXT:    v_mov_b16_e32 v0.h, s2
-; GFX1250-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-TRUE16-NEXT:    v_mov_b32_e32 v1, s2


This v1 doesn't seem to be used anywhere else?

TejaX-Alaghari · 2026-05-29T10:17:39Z

 ;
-; ASM-DAG-LABEL: dead_i32:
-; ASM-DAG:       ; %bb.0: ; %entry
-; ASM-DAG-NEXT:    s_wait_loadcnt_dscnt 0x0
-; ASM-DAG-NEXT:    s_wait_expcnt 0x0
-; ASM-DAG-NEXT:    s_wait_samplecnt 0x0
-; ASM-DAG-NEXT:    s_wait_bvhcnt 0x0
-; ASM-DAG-NEXT:    s_wait_kmcnt 0x0
-; ASM-DAG-NEXT:    v_mov_b32_e32 v4, v0
-; ASM-DAG-NEXT:    v_mov_b32_e32 v0, v1
-; ASM-DAG-NEXT:    s_mov_b32 s0, exec_lo
-; ASM-DAG-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
-; ASM-DAG-NEXT:    v_and_b32_e32 v1, 1, v4
-; ASM-DAG-NEXT:    v_cmpx_eq_u32_e32 1, v1
-; ASM-DAG-NEXT:    s_cbranch_execz .LBB0_2
-; ASM-DAG-NEXT:  ; %bb.1: ; %if.then
-; ASM-DAG-NEXT:    v_add_nc_u32_e32 v0, 1, v0
-; ASM-DAG-NEXT:    global_store_b32 v[2:3], v0, off
-; ASM-DAG-NEXT:    ; implicit-def: $vgpr0
-; ASM-DAG-NEXT:  .LBB0_2: ; %if.end
-; ASM-DAG-NEXT:    s_wait_alu depctr_sa_sdst(0)
-; ASM-DAG-NEXT:    s_or_b32 exec_lo, exec_lo, s0
-; ASM-DAG-NEXT:    s_setpc_b64 s[30:31]
+;
+;
+;
+;
+;


Can be removed.

TejaX-Alaghari · 2026-05-29T10:18:24Z

-; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx1200 -mattr=+real-true16 < %s | FileCheck -check-prefixes=ASM-DAG %s
-; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx1200 -mattr=-real-true16 < %s | FileCheck -check-prefixes=ASM-DAG %s
+; RUN: llc -amdgpu-late-wave-transform=1 -global-isel=0 -mtriple=amdgcn -mcpu=gfx1200 -mattr=+real-true16 < %s | FileCheck -check-prefixes=ASM-SDAG-TRUE16 %s
+; RUN: llc -amdgpu-late-wave-transform=1 -global-isel=0 -mtriple=amdgcn -mcpu=gfx1200 -mattr=-real-true16 < %s | FileCheck -check-prefixes=ASM-SDAG-FAKE16 %s


Both the TRUE and FAKE 16 labels seem to generate the same output.
Can both of them be replaced with an ASM-SDAG?

TejaX-Alaghari · 2026-05-29T10:21:02Z

@@ -1,56 +1,68 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx1200 -mattr=+real-true16 < %s | FileCheck -check-prefixes=ASM-DAG %s


This commit replaced DAG with SDAG.

Rebasing might have caused some issue as the DAG labels are still retained

TejaX-Alaghari · 2026-05-29T10:51:54Z

+; DAGISEL12-NEXT:    s_xor_b32 s10, vcc_lo, exec_lo
+; DAGISEL12-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; DAGISEL12-NEXT:    s_or_b32 s9, s9, s10
+; DAGISEL12-NEXT:    s_or_b32 s4, s4, vcc_lo
+; DAGISEL12-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; DAGISEL12-NEXT:    s_xor_b32 s10, exec_lo, s9
+; DAGISEL12-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; DAGISEL12-NEXT:    s_and_b32 s10, s10, exec_lo
+; DAGISEL12-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; DAGISEL12-NEXT:    s_or_b32 s8, s8, s10
+; DAGISEL12-NEXT:    s_mov_b32 exec_lo, s9


Many unnecessary ACCs happening - only the flip of exec (in s8) is needed in the reconv blocks

cdevadas requested review from TejaX-Alaghari, lalaniket8 and vg0204 May 29, 2026 06:30

TejaX-Alaghari reviewed May 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WaveTransform] Some more AMDGPU intrinsic tests fixed#2717

[WaveTransform] Some more AMDGPU intrinsic tests fixed#2717
cdevadas wants to merge 1 commit into
amd-feature/wave-transformfrom
public/amd/dev/cdevadas/wave-transform/more-control-flow-test-fixup3

cdevadas commented May 29, 2026

Uh oh!

rocm-cciapp Bot commented May 29, 2026

Uh oh!

TejaX-Alaghari May 29, 2026

Uh oh!

cdevadas May 29, 2026

Uh oh!

TejaX-Alaghari May 29, 2026

Uh oh!

TejaX-Alaghari May 29, 2026 •

edited

Loading

Uh oh!

TejaX-Alaghari May 29, 2026

Uh oh!

TejaX-Alaghari May 29, 2026 •

edited

Loading

Uh oh!

TejaX-Alaghari May 29, 2026

Uh oh!

TejaX-Alaghari May 29, 2026

Uh oh!

TejaX-Alaghari May 29, 2026 •

edited

Loading

Uh oh!

TejaX-Alaghari May 29, 2026

Uh oh!

TejaX-Alaghari May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -1,56 +1,68 @@
		; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
		; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx1200 -mattr=+real-true16 < %s \| FileCheck -check-prefixes=ASM-DAG %s

Conversation

cdevadas commented May 29, 2026

Uh oh!

rocm-cciapp Bot commented May 29, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TejaX-Alaghari May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TejaX-Alaghari May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TejaX-Alaghari May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

TejaX-Alaghari May 29, 2026 •

edited

Loading

TejaX-Alaghari May 29, 2026 •

edited

Loading

TejaX-Alaghari May 29, 2026 •

edited

Loading