[WaveTransform] Some more AMDGPU intrinsic tests fixed#2717
Conversation
Enabled -amdgpu-late-wave-transform=1 and fixed their
FileCheck patterns for the codegen changes in the new
pass flow.
Tests updated:
- llvm.amdgcn.ballot.{i32,i64}.ll
- llvm.amdgcn.cvt.{fp8.e5m3,fp8,pk.i16,pk.u16,pknorm.i16,pknorm.u16}.ll
- llvm.amdgcn.ds.gws.barrier-{bundle,fastregalloc}.ll
- llvm.amdgcn.dead.ll, llvm.amdgcn.exp.ll
- llvm.amdgcn.fmad.ftz.ll,
- llvm.amdgcn.iglp.opt.exp.{large,small}.mir
- llvm.amdgcn.image.sample.noret.ll
- llvm.amdgcn.init.whole.wave-w32.ll
|
PSDB Build Link: http://mlse-bdc-20dd129:8065/#/builders/10/builds/399 |
| ; GFX10_11-SDAG-NEXT: s_wqm_b32 exec_lo, exec_lo | ||
| ; GFX10_11-SDAG-NEXT: v_mov_b32_e32 v4, v0 | ||
| ; GFX10_11-SDAG-NEXT: s_and_b32 exec_lo, exec_lo, s12 | ||
| ; GFX10_11-SDAG-NEXT: image_sample v[0:3], v4, s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_1D |
There was a problem hiding this comment.
Both the SDAG and GISEL outputs match except for this line.
Can the GFX10_11 label be retained for all the instructions except for the above one?
There was a problem hiding this comment.
The auto-gen script doesn't work that way. It uses the common prefix when the entire ISA matches for multiple RUN lines. If there is even a single line or even a register mismatch, it prefers separate prefixes.
| ; GFX12PLUS-SDAG-NEXT: v_mov_b32_e32 v4, v0 | ||
| ; GFX12PLUS-SDAG-NEXT: s_and_b32 exec_lo, exec_lo, s12 | ||
| ; GFX12PLUS-SDAG-NEXT: image_sample v[0:3], v4, s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_1D | ||
| ; GFX12PLUS-SDAG-NEXT: image_sample off, v4, s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_1D |
| ; DAGISEL12-NEXT: s_mov_b32 s7, s4 | ||
| ; DAGISEL12-NEXT: s_mov_b32 s6, s3 | ||
| ; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | ||
| ; DAGISEL12-NEXT: v_cmp_ne_u32_e32 vcc_lo, 1, v0 | ||
| ; DAGISEL12-NEXT: s_xor_b32 s4, vcc_lo, exec_lo | ||
| ; DAGISEL12-NEXT: s_wait_alu depctr_sa_sdst(0) | ||
| ; DAGISEL12-NEXT: s_and_saveexec_b32 s3, s8 | ||
| ; DAGISEL12-NEXT: ; %bb.1: ; %shader | ||
| ; DAGISEL12-NEXT: s_xor_b32 s3, exec_lo, s4 | ||
| ; DAGISEL12-NEXT: s_mov_b32 exec_lo, s4 |
There was a problem hiding this comment.
The below LWT CodeGen
(s8=old_exec, exec=-1)
v_cndmask_b32_e64 v0, 0, 1, s8
v_cmp_ne_u32_e32 vcc_lo, 1, v0 (vcc_lo=~s8 -> ~old_exec)
s_xor_b32 s4, vcc_lo, exec_lo (s4 =s8 -> old_exec)
s_xor_b32 s3, exec_lo, s4 (s3 =~s8 -> ~old_exec)
s_mov_b32 exec_lo, s4 (exec =s4 -> s8 -> old_exec)
- Can be replaced with a single
s_and_saveexec_b32 s3, s8similar to St-WT - The
V_cndmask+v_cmp_ne+s_xor+s_xoris significantly worse when branching based on lane mask (s8)
| ; CHECK-NEXT: s_or_b64 s[0:1], s[0:1], s[4:5] | ||
| ; CHECK-NEXT: ; %bb.4: ; %exit | ||
| ; CHECK-NEXT: s_xor_b64 s[0:1], vcc, exec | ||
| ; CHECK-NEXT: s_xor_b64 s[2:3], exec, s[0:1] |
There was a problem hiding this comment.
This s_xor is redundant - s[2:3] will be same as vcc
| ; CHECK-NEXT: v_cmp_gt_u32_e64 s[0:1], 2, v2 | ||
| ; CHECK-NEXT: ; implicit-def: $vgpr2 | ||
| ; CHECK-NEXT: ; %bb.2: ; %Flow | ||
| ; CHECK-NEXT: s_andn2_saveexec_b64 s[2:3], s[2:3] |
There was a problem hiding this comment.
Is lane mask being correctly flipped here?
Because exec=vcc & s[2:3]=~vcc before this instruction - And after the s_andn2_saveexec
exec=exec & ~s[2:3]=>exec=vcc(same as before)
LWT seems to correctly run the block %A with vcc and %B with ~vcc.
Please correct my understanding if I'm missing something in following the flow.
| ; GFX1250-TRUE16-NEXT: v_mov_b16_e32 v0.l, s2 | ||
| ; GFX1250-TRUE16-NEXT: v_mov_b16_e32 v0.h, s2 | ||
| ; GFX1250-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1) | ||
| ; GFX1250-TRUE16-NEXT: v_mov_b32_e32 v1, s2 |
There was a problem hiding this comment.
This v1 doesn't seem to be used anywhere else?
| ; | ||
| ; ASM-DAG-LABEL: dead_i32: | ||
| ; ASM-DAG: ; %bb.0: ; %entry | ||
| ; ASM-DAG-NEXT: s_wait_loadcnt_dscnt 0x0 | ||
| ; ASM-DAG-NEXT: s_wait_expcnt 0x0 | ||
| ; ASM-DAG-NEXT: s_wait_samplecnt 0x0 | ||
| ; ASM-DAG-NEXT: s_wait_bvhcnt 0x0 | ||
| ; ASM-DAG-NEXT: s_wait_kmcnt 0x0 | ||
| ; ASM-DAG-NEXT: v_mov_b32_e32 v4, v0 | ||
| ; ASM-DAG-NEXT: v_mov_b32_e32 v0, v1 | ||
| ; ASM-DAG-NEXT: s_mov_b32 s0, exec_lo | ||
| ; ASM-DAG-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1) | ||
| ; ASM-DAG-NEXT: v_and_b32_e32 v1, 1, v4 | ||
| ; ASM-DAG-NEXT: v_cmpx_eq_u32_e32 1, v1 | ||
| ; ASM-DAG-NEXT: s_cbranch_execz .LBB0_2 | ||
| ; ASM-DAG-NEXT: ; %bb.1: ; %if.then | ||
| ; ASM-DAG-NEXT: v_add_nc_u32_e32 v0, 1, v0 | ||
| ; ASM-DAG-NEXT: global_store_b32 v[2:3], v0, off | ||
| ; ASM-DAG-NEXT: ; implicit-def: $vgpr0 | ||
| ; ASM-DAG-NEXT: .LBB0_2: ; %if.end | ||
| ; ASM-DAG-NEXT: s_wait_alu depctr_sa_sdst(0) | ||
| ; ASM-DAG-NEXT: s_or_b32 exec_lo, exec_lo, s0 | ||
| ; ASM-DAG-NEXT: s_setpc_b64 s[30:31] | ||
| ; | ||
| ; | ||
| ; | ||
| ; | ||
| ; |
| ; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx1200 -mattr=+real-true16 < %s | FileCheck -check-prefixes=ASM-DAG %s | ||
| ; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx1200 -mattr=-real-true16 < %s | FileCheck -check-prefixes=ASM-DAG %s | ||
| ; RUN: llc -amdgpu-late-wave-transform=1 -global-isel=0 -mtriple=amdgcn -mcpu=gfx1200 -mattr=+real-true16 < %s | FileCheck -check-prefixes=ASM-SDAG-TRUE16 %s | ||
| ; RUN: llc -amdgpu-late-wave-transform=1 -global-isel=0 -mtriple=amdgcn -mcpu=gfx1200 -mattr=-real-true16 < %s | FileCheck -check-prefixes=ASM-SDAG-FAKE16 %s |
There was a problem hiding this comment.
Both the TRUE and FAKE 16 labels seem to generate the same output.
Can both of them be replaced with an ASM-SDAG?
| @@ -1,56 +1,68 @@ | |||
| ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5 | |||
| ; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx1200 -mattr=+real-true16 < %s | FileCheck -check-prefixes=ASM-DAG %s | |||
There was a problem hiding this comment.
This commit replaced DAG with SDAG.
Rebasing might have caused some issue as the DAG labels are still retained
| ; DAGISEL12-NEXT: s_xor_b32 s10, vcc_lo, exec_lo | ||
| ; DAGISEL12-NEXT: s_wait_alu depctr_sa_sdst(0) | ||
| ; DAGISEL12-NEXT: s_or_b32 s9, s9, s10 | ||
| ; DAGISEL12-NEXT: s_or_b32 s4, s4, vcc_lo | ||
| ; DAGISEL12-NEXT: s_wait_alu depctr_sa_sdst(0) | ||
| ; DAGISEL12-NEXT: s_xor_b32 s10, exec_lo, s9 | ||
| ; DAGISEL12-NEXT: s_wait_alu depctr_sa_sdst(0) | ||
| ; DAGISEL12-NEXT: s_and_b32 s10, s10, exec_lo | ||
| ; DAGISEL12-NEXT: s_wait_alu depctr_sa_sdst(0) | ||
| ; DAGISEL12-NEXT: s_or_b32 s8, s8, s10 | ||
| ; DAGISEL12-NEXT: s_mov_b32 exec_lo, s9 |
There was a problem hiding this comment.
Many unnecessary ACCs happening - only the flip of exec (in s8) is needed in the reconv blocks
Enabled -amdgpu-late-wave-transform=1 and fixed their FileCheck patterns for the codegen changes in the new pass flow.
Tests updated: