Skip to content

[WaveTransform] Some more AMDGPU intrinsic tests fixed#2717

Open
cdevadas wants to merge 1 commit into
amd-feature/wave-transformfrom
public/amd/dev/cdevadas/wave-transform/more-control-flow-test-fixup3
Open

[WaveTransform] Some more AMDGPU intrinsic tests fixed#2717
cdevadas wants to merge 1 commit into
amd-feature/wave-transformfrom
public/amd/dev/cdevadas/wave-transform/more-control-flow-test-fixup3

Conversation

@cdevadas
Copy link
Copy Markdown

Enabled -amdgpu-late-wave-transform=1 and fixed their FileCheck patterns for the codegen changes in the new pass flow.

Tests updated:

  • llvm.amdgcn.ballot.{i32,i64}.ll
  • llvm.amdgcn.cvt.{fp8.e5m3,fp8,pk.i16,pk.u16,pknorm.i16,pknorm.u16}.ll
  • llvm.amdgcn.ds.gws.barrier-{bundle,fastregalloc}.ll
  • llvm.amdgcn.dead.ll, llvm.amdgcn.exp.ll
  • llvm.amdgcn.fmad.ftz.ll,
  • llvm.amdgcn.iglp.opt.exp.{large,small}.mir
  • llvm.amdgcn.image.sample.noret.ll
  • llvm.amdgcn.init.whole.wave-w32.ll

Enabled -amdgpu-late-wave-transform=1 and fixed their
FileCheck patterns for the codegen changes in the new
pass flow.

Tests updated:
- llvm.amdgcn.ballot.{i32,i64}.ll
- llvm.amdgcn.cvt.{fp8.e5m3,fp8,pk.i16,pk.u16,pknorm.i16,pknorm.u16}.ll
- llvm.amdgcn.ds.gws.barrier-{bundle,fastregalloc}.ll
- llvm.amdgcn.dead.ll, llvm.amdgcn.exp.ll
- llvm.amdgcn.fmad.ftz.ll,
- llvm.amdgcn.iglp.opt.exp.{large,small}.mir
- llvm.amdgcn.image.sample.noret.ll
- llvm.amdgcn.init.whole.wave-w32.ll
@rocm-cciapp
Copy link
Copy Markdown

rocm-cciapp Bot commented May 29, 2026

; GFX10_11-SDAG-NEXT: s_wqm_b32 exec_lo, exec_lo
; GFX10_11-SDAG-NEXT: v_mov_b32_e32 v4, v0
; GFX10_11-SDAG-NEXT: s_and_b32 exec_lo, exec_lo, s12
; GFX10_11-SDAG-NEXT: image_sample v[0:3], v4, s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_1D
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both the SDAG and GISEL outputs match except for this line.

Can the GFX10_11 label be retained for all the instructions except for the above one?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The auto-gen script doesn't work that way. It uses the common prefix when the entire ISA matches for multiple RUN lines. If there is even a single line or even a register mismatch, it prefers separate prefixes.

; GFX12PLUS-SDAG-NEXT: v_mov_b32_e32 v4, v0
; GFX12PLUS-SDAG-NEXT: s_and_b32 exec_lo, exec_lo, s12
; GFX12PLUS-SDAG-NEXT: image_sample v[0:3], v4, s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_1D
; GFX12PLUS-SDAG-NEXT: image_sample off, v4, s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_1D
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

Comment on lines 41 to +48
; DAGISEL12-NEXT: s_mov_b32 s7, s4
; DAGISEL12-NEXT: s_mov_b32 s6, s3
; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1)
; DAGISEL12-NEXT: v_cmp_ne_u32_e32 vcc_lo, 1, v0
; DAGISEL12-NEXT: s_xor_b32 s4, vcc_lo, exec_lo
; DAGISEL12-NEXT: s_wait_alu depctr_sa_sdst(0)
; DAGISEL12-NEXT: s_and_saveexec_b32 s3, s8
; DAGISEL12-NEXT: ; %bb.1: ; %shader
; DAGISEL12-NEXT: s_xor_b32 s3, exec_lo, s4
; DAGISEL12-NEXT: s_mov_b32 exec_lo, s4
Copy link
Copy Markdown

@TejaX-Alaghari TejaX-Alaghari May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The below LWT CodeGen

(s8=old_exec, exec=-1)
v_cndmask_b32_e64 v0, 0, 1, s8
v_cmp_ne_u32_e32 vcc_lo, 1, v0   (vcc_lo=~s8 -> ~old_exec)
s_xor_b32 s4, vcc_lo, exec_lo    (s4    =s8  -> old_exec)
s_xor_b32 s3, exec_lo, s4        (s3    =~s8 -> ~old_exec)
s_mov_b32 exec_lo, s4            (exec  =s4  -> s8 -> old_exec)
  • Can be replaced with a single s_and_saveexec_b32 s3, s8 similar to St-WT
  • The V_cndmask+v_cmp_ne + s_xor + s_xor is significantly worse when branching based on lane mask (s8)

; CHECK-NEXT: s_or_b64 s[0:1], s[0:1], s[4:5]
; CHECK-NEXT: ; %bb.4: ; %exit
; CHECK-NEXT: s_xor_b64 s[0:1], vcc, exec
; CHECK-NEXT: s_xor_b64 s[2:3], exec, s[0:1]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This s_xor is redundant - s[2:3] will be same as vcc

; CHECK-NEXT: v_cmp_gt_u32_e64 s[0:1], 2, v2
; CHECK-NEXT: ; implicit-def: $vgpr2
; CHECK-NEXT: ; %bb.2: ; %Flow
; CHECK-NEXT: s_andn2_saveexec_b64 s[2:3], s[2:3]
Copy link
Copy Markdown

@TejaX-Alaghari TejaX-Alaghari May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is lane mask being correctly flipped here?

Because exec=vcc & s[2:3]=~vcc before this instruction - And after the s_andn2_saveexec

  • exec=exec & ~s[2:3] =>exec=vcc (same as before)

LWT seems to correctly run the block %A with vcc and %B with ~vcc.

Please correct my understanding if I'm missing something in following the flow.

; GFX1250-TRUE16-NEXT: v_mov_b16_e32 v0.l, s2
; GFX1250-TRUE16-NEXT: v_mov_b16_e32 v0.h, s2
; GFX1250-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-TRUE16-NEXT: v_mov_b32_e32 v1, s2
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This v1 doesn't seem to be used anywhere else?

Comment on lines 9 to +14
;
; ASM-DAG-LABEL: dead_i32:
; ASM-DAG: ; %bb.0: ; %entry
; ASM-DAG-NEXT: s_wait_loadcnt_dscnt 0x0
; ASM-DAG-NEXT: s_wait_expcnt 0x0
; ASM-DAG-NEXT: s_wait_samplecnt 0x0
; ASM-DAG-NEXT: s_wait_bvhcnt 0x0
; ASM-DAG-NEXT: s_wait_kmcnt 0x0
; ASM-DAG-NEXT: v_mov_b32_e32 v4, v0
; ASM-DAG-NEXT: v_mov_b32_e32 v0, v1
; ASM-DAG-NEXT: s_mov_b32 s0, exec_lo
; ASM-DAG-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; ASM-DAG-NEXT: v_and_b32_e32 v1, 1, v4
; ASM-DAG-NEXT: v_cmpx_eq_u32_e32 1, v1
; ASM-DAG-NEXT: s_cbranch_execz .LBB0_2
; ASM-DAG-NEXT: ; %bb.1: ; %if.then
; ASM-DAG-NEXT: v_add_nc_u32_e32 v0, 1, v0
; ASM-DAG-NEXT: global_store_b32 v[2:3], v0, off
; ASM-DAG-NEXT: ; implicit-def: $vgpr0
; ASM-DAG-NEXT: .LBB0_2: ; %if.end
; ASM-DAG-NEXT: s_wait_alu depctr_sa_sdst(0)
; ASM-DAG-NEXT: s_or_b32 exec_lo, exec_lo, s0
; ASM-DAG-NEXT: s_setpc_b64 s[30:31]
;
;
;
;
;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be removed.

; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx1200 -mattr=+real-true16 < %s | FileCheck -check-prefixes=ASM-DAG %s
; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx1200 -mattr=-real-true16 < %s | FileCheck -check-prefixes=ASM-DAG %s
; RUN: llc -amdgpu-late-wave-transform=1 -global-isel=0 -mtriple=amdgcn -mcpu=gfx1200 -mattr=+real-true16 < %s | FileCheck -check-prefixes=ASM-SDAG-TRUE16 %s
; RUN: llc -amdgpu-late-wave-transform=1 -global-isel=0 -mtriple=amdgcn -mcpu=gfx1200 -mattr=-real-true16 < %s | FileCheck -check-prefixes=ASM-SDAG-FAKE16 %s
Copy link
Copy Markdown

@TejaX-Alaghari TejaX-Alaghari May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both the TRUE and FAKE 16 labels seem to generate the same output.
Can both of them be replaced with an ASM-SDAG?

@@ -1,56 +1,68 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx1200 -mattr=+real-true16 < %s | FileCheck -check-prefixes=ASM-DAG %s
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commit replaced DAG with SDAG.

Rebasing might have caused some issue as the DAG labels are still retained

Comment on lines +506 to +516
; DAGISEL12-NEXT: s_xor_b32 s10, vcc_lo, exec_lo
; DAGISEL12-NEXT: s_wait_alu depctr_sa_sdst(0)
; DAGISEL12-NEXT: s_or_b32 s9, s9, s10
; DAGISEL12-NEXT: s_or_b32 s4, s4, vcc_lo
; DAGISEL12-NEXT: s_wait_alu depctr_sa_sdst(0)
; DAGISEL12-NEXT: s_xor_b32 s10, exec_lo, s9
; DAGISEL12-NEXT: s_wait_alu depctr_sa_sdst(0)
; DAGISEL12-NEXT: s_and_b32 s10, s10, exec_lo
; DAGISEL12-NEXT: s_wait_alu depctr_sa_sdst(0)
; DAGISEL12-NEXT: s_or_b32 s8, s8, s10
; DAGISEL12-NEXT: s_mov_b32 exec_lo, s9
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many unnecessary ACCs happening - only the flip of exec (in s8) is needed in the reconv blocks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants