Skip to content

[MachineSink] Lower SplitEdgeProbabilityThreshold #127666

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: users/guy-david/machine-sink-powerpc
Choose a base branch
from

Conversation

guy-david
Copy link
Contributor

@guy-david guy-david commented Feb 18, 2025

Requires #128745.

Lower it slightly below the likeliness of a null-check to be true which is set to 37.5% (see PtrUntakenProb).
Otherwise, it will split the edge and create another basic-block (an else clause which wasn't there to begin with) and an unconditional branch which makes the CFG more complex and can result in a suboptimal block placement.
Note that if multiple instructions can be sinked from the same edge then a split will occur regardless of this change.

On M4 Pro:

$ ./utils/compare.py build-a/results1.json build-a/results2.json build-a/results3.json vs build-b/results1.json build-b/results2.json build-b/results3.json
Tests: 4314
Metric: exec_time
Program                                       exec_time             
                                              lhs       rhs    diff 
MultiSourc...chmarks/Prolangs-C/agrep/agrep     0.00      0.01 44.7%
MultiSourc...rks/McCat/03-testtrie/testtrie     0.01      0.01 31.7%
SingleSour...hmarks/Shootout/Shootout-lists     2.02      2.64 30.6%
SingleSour...ecute/GCC-C-execute-20170419-1     0.00      0.00 14.3%
SingleSour.../execute/GCC-C-execute-pr59101     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20040311-1     0.00      0.00 14.3%
SingleSour.../execute/GCC-C-execute-pr57124     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20031204-1     0.00      0.00 14.3%
SingleSour...xecute/GCC-C-execute-pr57344-3     0.00      0.00 14.3%
SingleSour.../execute/GCC-C-execute-pr57875     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20030811-1     0.00      0.00 14.3%
SingleSour.../execute/GCC-C-execute-pr58640     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20030408-1     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20030323-1     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20030203-1     0.00      0.00 14.3%
                           Geomean difference                   0.1%
           exec_time                            
l/r              lhs            rhs         diff
count  4314.000000    4314.000000    4294.000000
mean   453.919219     454.105532     0.002072   
std    10865.757400   10868.002426   0.043046   
min    0.000000       0.000000      -0.171642   
25%    0.000700       0.000700       0.000000   
50%    0.007400       0.007400       0.000000   
75%    0.047829       0.047950       0.000033   
max    321294.306703  321320.624713  0.447368

On Ryzen9 5950X:

$ ./utils/compare.py build-a/results1.json build-a/results2.json build-a/results3.json vs build-b/results1.json build-b/results2.json build-b/results3.json
Tests: 3326
Metric: exec_time
Program                                       exec_time               
                                              lhs       rhs     diff  
MemFunctio...mCmp<1, GreaterThanZero, None>   1741.26   1885.00 143.74
MemFunctio..._MemCmp<1, LessThanZero, Last>   1759.78   1873.93 114.15
MemFunctio...est:BM_MemCmp<1, EqZero, Last>   1747.19   1847.42 100.22
MemFunctio...Cmp<1, GreaterThanZero, First>   1750.17   1844.57  94.40
MemFunctio...mCmp<1, GreaterThanZero, Last>   1751.05   1844.68  93.63
MemFunctio...emCmp<1, GreaterThanZero, Mid>   1756.49   1849.62  93.13
MemFunctio..._MemCmp<1, LessThanZero, None>   1744.87   1835.22  90.35
MemFunctio...M_MemCmp<1, LessThanZero, Mid>   1757.53   1846.29  88.77
harris/har...est:BENCHMARK_HARRIS/1024/1024   5689.29   5754.88  65.59
MemFunctio...MemCmp<2, LessThanZero, First>   1123.00   1181.63  58.63
MemFunctio...test:BM_MemCmp<1, EqZero, Mid>   2524.93   2582.21  57.28
MemFunctio...est:BM_MemCmp<1, EqZero, None>   2525.97   2582.43  56.46
MemFunctio..._MemCmp<3, LessThanZero, Last>    869.04    924.66  55.62
MemFunctio...test:BM_MemCmp<3, EqZero, Mid>    878.39    932.53  54.14
MemFunctio...MemCmp<1, LessThanZero, First>   2528.37   2582.27  53.90
                                                                                                                                                   exec_time                             
l/r                                                                                                                                                      lhs            rhs          diff
Program                                                                                                                                                                                  
test-suite :: MicroBenchmarks/MemFunctions/MemFunctions.test:BM_MemCmp<1, GreaterThanZero, None>                                               1741.261663    1884.998860    143.737197  
test-suite :: MicroBenchmarks/MemFunctions/MemFunctions.test:BM_MemCmp<1, LessThanZero, Last>                                                  1759.779355    1873.926412    114.147056  
test-suite :: MicroBenchmarks/MemFunctions/MemFunctions.test:BM_MemCmp<1, EqZero, Last>                                                        1747.192734    1847.416650    100.223916  
test-suite :: MicroBenchmarks/MemFunctions/MemFunctions.test:BM_MemCmp<1, GreaterThanZero, First>                                              1750.171003    1844.569735    94.398732   
test-suite :: MicroBenchmarks/MemFunctions/MemFunctions.test:BM_MemCmp<1, GreaterThanZero, Last>                                               1751.049323    1844.682784    93.633461   
...                                                                                                                                                    ...            ...           ...  
test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test:benchVecWithRuntimeChecks4PointersDAfterA/1000                435033.995649  412835.347288 -22198.648362
test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test:benchVecWithRuntimeChecks4PointersAllDisjointDecreasing/1000  435136.829708  412921.450737 -22215.378970
test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test:benchVecWithRuntimeChecks4PointersAllDisjointIncreasing/1000  435136.457427  412908.677876 -22227.779551
test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test:benchVecWithRuntimeChecks4PointersDEqualsA/1000               435088.787446  412769.793042 -22318.994403
test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test:benchVecWithRuntimeChecks4PointersDBeforeA/1000               835721.265233  791510.926471 -44210.338762

[3326 rows x 3 columns]
           exec_time                             
l/r              lhs            rhs          diff
count  3326.000000    3326.000000    3326.000000 
mean   916.350942     873.987972    -42.362970   
std    20951.565020   19865.106212   1087.788132 
min    0.000000       0.000000      -44210.338762
25%    0.000000       0.000000      -0.000400    
50%    0.000400       0.000400       0.000000    
75%    1.774625       1.732975       0.000400    
max    835721.265233  791510.926471  143.737197

I looked into the disassembly of BM_MemCmp<1, GreaterThanZero, None> in MemFunctions.test and it has not changed.

@llvmbot
Copy link
Member

llvmbot commented Feb 25, 2025

@llvm/pr-subscribers-debuginfo
@llvm/pr-subscribers-backend-aarch64
@llvm/pr-subscribers-backend-x86
@llvm/pr-subscribers-backend-arm

@llvm/pr-subscribers-backend-powerpc

Author: Guy David (guy-david)

Changes

Requires #128745.

Lower it slightly below the likeliness of a null-check to be true which is set to 37.5% (see PtrUntakenProb).
Otherwise, it will split the edge and create another basic-block and with an unconditional branch which might make the CFG more complex and with a suboptimal block placement.
Note that if multiple instructions can be sinked from the same edge then a split will occur regardless of this change.

On M4 Pro:

$ ./utils/compare.py build-a/results1.json build-a/results2.json build-a/results3.json vs build-b/results1.json build-b/results2.json build-b/results3.json
Tests: 4314
Metric: exec_time
Program                                       exec_time             
                                              lhs       rhs    diff 
MultiSourc...chmarks/Prolangs-C/agrep/agrep     0.00      0.01 44.7%
MultiSourc...rks/McCat/03-testtrie/testtrie     0.01      0.01 31.7%
SingleSour...hmarks/Shootout/Shootout-lists     2.02      2.64 30.6%
SingleSour...ecute/GCC-C-execute-20170419-1     0.00      0.00 14.3%
SingleSour.../execute/GCC-C-execute-pr59101     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20040311-1     0.00      0.00 14.3%
SingleSour.../execute/GCC-C-execute-pr57124     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20031204-1     0.00      0.00 14.3%
SingleSour...xecute/GCC-C-execute-pr57344-3     0.00      0.00 14.3%
SingleSour.../execute/GCC-C-execute-pr57875     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20030811-1     0.00      0.00 14.3%
SingleSour.../execute/GCC-C-execute-pr58640     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20030408-1     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20030323-1     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20030203-1     0.00      0.00 14.3%
                           Geomean difference                   0.1%
           exec_time                            
l/r              lhs            rhs         diff
count  4314.000000    4314.000000    4294.000000
mean   453.919219     454.105532     0.002072   
std    10865.757400   10868.002426   0.043046   
min    0.000000       0.000000      -0.171642   
25%    0.000700       0.000700       0.000000   
50%    0.007400       0.007400       0.000000   
75%    0.047829       0.047950       0.000033   
max    321294.306703  321320.624713  0.447368

On Ryzen9 5950X:

$ ./utils/compare.py build-a/results1.json build-a/results2.json build-a/results3.json vs build-b/results1.json build-b/results2.json build-b/results3.json
Tests: 3326
Metric: exec_time
Program                                       exec_time               
                                              lhs       rhs     diff  
MemFunctio...mCmp&lt;1, GreaterThanZero, None&gt;   1741.26   1885.00 143.74
MemFunctio..._MemCmp&lt;1, LessThanZero, Last&gt;   1759.78   1873.93 114.15
MemFunctio...est:BM_MemCmp&lt;1, EqZero, Last&gt;   1747.19   1847.42 100.22
MemFunctio...Cmp&lt;1, GreaterThanZero, First&gt;   1750.17   1844.57  94.40
MemFunctio...mCmp&lt;1, GreaterThanZero, Last&gt;   1751.05   1844.68  93.63
MemFunctio...emCmp&lt;1, GreaterThanZero, Mid&gt;   1756.49   1849.62  93.13
MemFunctio..._MemCmp&lt;1, LessThanZero, None&gt;   1744.87   1835.22  90.35
MemFunctio...M_MemCmp&lt;1, LessThanZero, Mid&gt;   1757.53   1846.29  88.77
harris/har...est:BENCHMARK_HARRIS/1024/1024   5689.29   5754.88  65.59
MemFunctio...MemCmp&lt;2, LessThanZero, First&gt;   1123.00   1181.63  58.63
MemFunctio...test:BM_MemCmp&lt;1, EqZero, Mid&gt;   2524.93   2582.21  57.28
MemFunctio...est:BM_MemCmp&lt;1, EqZero, None&gt;   2525.97   2582.43  56.46
MemFunctio..._MemCmp&lt;3, LessThanZero, Last&gt;    869.04    924.66  55.62
MemFunctio...test:BM_MemCmp&lt;3, EqZero, Mid&gt;    878.39    932.53  54.14
MemFunctio...MemCmp&lt;1, LessThanZero, First&gt;   2528.37   2582.27  53.90
                                                                                                                                                   exec_time                             
l/r                                                                                                                                                      lhs            rhs          diff
Program                                                                                                                                                                                  
test-suite :: MicroBenchmarks/MemFunctions/MemFunctions.test:BM_MemCmp&lt;1, GreaterThanZero, None&gt;                                               1741.261663    1884.998860    143.737197  
test-suite :: MicroBenchmarks/MemFunctions/MemFunctions.test:BM_MemCmp&lt;1, LessThanZero, Last&gt;                                                  1759.779355    1873.926412    114.147056  
test-suite :: MicroBenchmarks/MemFunctions/MemFunctions.test:BM_MemCmp&lt;1, EqZero, Last&gt;                                                        1747.192734    1847.416650    100.223916  
test-suite :: MicroBenchmarks/MemFunctions/MemFunctions.test:BM_MemCmp&lt;1, GreaterThanZero, First&gt;                                              1750.171003    1844.569735    94.398732   
test-suite :: MicroBenchmarks/MemFunctions/MemFunctions.test:BM_MemCmp&lt;1, GreaterThanZero, Last&gt;                                               1751.049323    1844.682784    93.633461   
...                                                                                                                                                    ...            ...           ...  
test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test:benchVecWithRuntimeChecks4PointersDAfterA/1000                435033.995649  412835.347288 -22198.648362
test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test:benchVecWithRuntimeChecks4PointersAllDisjointDecreasing/1000  435136.829708  412921.450737 -22215.378970
test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test:benchVecWithRuntimeChecks4PointersAllDisjointIncreasing/1000  435136.457427  412908.677876 -22227.779551
test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test:benchVecWithRuntimeChecks4PointersDEqualsA/1000               435088.787446  412769.793042 -22318.994403
test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test:benchVecWithRuntimeChecks4PointersDBeforeA/1000               835721.265233  791510.926471 -44210.338762

[3326 rows x 3 columns]
           exec_time                             
l/r              lhs            rhs          diff
count  3326.000000    3326.000000    3326.000000 
mean   916.350942     873.987972    -42.362970   
std    20951.565020   19865.106212   1087.788132 
min    0.000000       0.000000      -44210.338762
25%    0.000000       0.000000      -0.000400    
50%    0.000400       0.000400       0.000000    
75%    1.774625       1.732975       0.000400    
max    835721.265233  791510.926471  143.737197

I looked into the disassembly of BM_MemCmp&lt;1, GreaterThanZero, None&gt; in MemFunctions.test and it has not changed.


Patch is 226.63 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/127666.diff

51 Files Affected:

  • (modified) llvm/lib/CodeGen/MachineSink.cpp (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll (+23-23)
  • (modified) llvm/test/CodeGen/AArch64/swifterror.ll (+2-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll (+5-7)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll (+5-7)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll (+5-7)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll (+5-7)
  • (modified) llvm/test/CodeGen/AMDGPU/artificial-terminators.mir (+5-9)
  • (modified) llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll (+32-32)
  • (modified) llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll (+20-16)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+64-52)
  • (modified) llvm/test/CodeGen/AMDGPU/no-fold-accvgpr-mov.ll (+24-18)
  • (modified) llvm/test/CodeGen/AMDGPU/optimize-negated-cond.ll (+8-7)
  • (modified) llvm/test/CodeGen/AMDGPU/skip-if-dead.ll (+8-16)
  • (modified) llvm/test/CodeGen/ARM/and-cmp0-sink.ll (+14-14)
  • (modified) llvm/test/CodeGen/Mips/llvm-ir/sdiv-freebsd.ll (+6-3)
  • (modified) llvm/test/CodeGen/PowerPC/common-chain-aix32.ll (+16-17)
  • (modified) llvm/test/CodeGen/PowerPC/common-chain.ll (+80-89)
  • (modified) llvm/test/CodeGen/PowerPC/ifcvt_cr_field.ll (+6-12)
  • (modified) llvm/test/CodeGen/PowerPC/knowCRBitSpill.ll (+1)
  • (modified) llvm/test/CodeGen/PowerPC/loop-instr-form-prepare.ll (+88-114)
  • (modified) llvm/test/CodeGen/PowerPC/loop-instr-prep-non-const-increasement.ll (+16-19)
  • (modified) llvm/test/CodeGen/PowerPC/mma-phi-accs.ll (+6-12)
  • (modified) llvm/test/CodeGen/PowerPC/p10-spill-creq.ll (+28-33)
  • (modified) llvm/test/CodeGen/PowerPC/ppc64-rop-protection-aix.ll (+54-60)
  • (modified) llvm/test/CodeGen/PowerPC/ppc64-rop-protection.ll (+66-81)
  • (modified) llvm/test/CodeGen/PowerPC/shrink-wrap.ll (+12-20)
  • (modified) llvm/test/CodeGen/PowerPC/spe.ll (+2-4)
  • (modified) llvm/test/CodeGen/PowerPC/zext-and-cmp.ll (+16-6)
  • (modified) llvm/test/CodeGen/RISCV/fold-addi-loadstore.ll (+6-12)
  • (modified) llvm/test/CodeGen/Thumb2/mve-pipelineloops.ll (+7-10)
  • (modified) llvm/test/CodeGen/WebAssembly/implicit-def.ll (+23-12)
  • (modified) llvm/test/CodeGen/X86/2007-11-06-InstrSched.ll (+6-9)
  • (modified) llvm/test/CodeGen/X86/2008-04-28-CoalescerBug.ll (+8-11)
  • (modified) llvm/test/CodeGen/X86/atomic-rm-bit-test-64.ll (+61-68)
  • (modified) llvm/test/CodeGen/X86/atomic-rm-bit-test.ll (+342-395)
  • (modified) llvm/test/CodeGen/X86/branchfolding-debugloc.ll (+4-5)
  • (modified) llvm/test/CodeGen/X86/break-false-dep.ll (+28-48)
  • (modified) llvm/test/CodeGen/X86/coalescer-commute4.ll (+6-9)
  • (modified) llvm/test/CodeGen/X86/ctlo.ll (+21-26)
  • (modified) llvm/test/CodeGen/X86/ctlz.ll (+56-72)
  • (modified) llvm/test/CodeGen/X86/cttz.ll (+18-30)
  • (modified) llvm/test/CodeGen/X86/fold-loop-of-urem.ll (+26-32)
  • (modified) llvm/test/CodeGen/X86/lsr-sort.ll (+6-4)
  • (modified) llvm/test/CodeGen/X86/mmx-arith.ll (+7-9)
  • (modified) llvm/test/CodeGen/X86/pr2659.ll (+32-10)
  • (modified) llvm/test/CodeGen/X86/pr38795.ll (+50-53)
  • (modified) llvm/test/CodeGen/X86/probe-stack-eflags.ll (+4-6)
  • (modified) llvm/test/CodeGen/X86/taildup-heapallocsite.ll (+4-9)
  • (modified) llvm/test/CodeGen/X86/testb-je-fusion.ll (+8-10)
  • (modified) llvm/test/CodeGen/X86/x86-shrink-wrapping.ll (+12-16)
diff --git a/llvm/lib/CodeGen/MachineSink.cpp b/llvm/lib/CodeGen/MachineSink.cpp
index 82acb780cfb72..81459cf65d6c2 100644
--- a/llvm/lib/CodeGen/MachineSink.cpp
+++ b/llvm/lib/CodeGen/MachineSink.cpp
@@ -82,7 +82,7 @@ static cl::opt<unsigned> SplitEdgeProbabilityThreshold(
         "If the branch threshold is higher than this threshold, we allow "
         "speculative execution of up to 1 instruction to avoid branching to "
         "splitted critical edge"),
-    cl::init(40), cl::Hidden);
+    cl::init(35), cl::Hidden);
 
 static cl::opt<unsigned> SinkLoadInstsPerBlockThreshold(
     "machine-sink-load-instrs-threshold",
diff --git a/llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll b/llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll
index fb6575cc0ee83..fdc087e9c1991 100644
--- a/llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll
+++ b/llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll
@@ -632,20 +632,18 @@ define i16 @red_mla_dup_ext_u8_s8_s16(ptr noalias nocapture noundef readonly %A,
 ;
 ; CHECK-GI-LABEL: red_mla_dup_ext_u8_s8_s16:
 ; CHECK-GI:       // %bb.0: // %entry
-; CHECK-GI-NEXT:    cbz w2, .LBB5_3
+; CHECK-GI-NEXT:    mov w8, wzr
+; CHECK-GI-NEXT:    cbz w2, .LBB5_9
 ; CHECK-GI-NEXT:  // %bb.1: // %for.body.preheader
 ; CHECK-GI-NEXT:    cmp w2, #16
 ; CHECK-GI-NEXT:    mov w8, w2
-; CHECK-GI-NEXT:    b.hs .LBB5_4
+; CHECK-GI-NEXT:    b.hs .LBB5_3
 ; CHECK-GI-NEXT:  // %bb.2:
 ; CHECK-GI-NEXT:    mov w10, #0 // =0x0
 ; CHECK-GI-NEXT:    mov x9, xzr
 ; CHECK-GI-NEXT:    fmov s0, w10
-; CHECK-GI-NEXT:    b .LBB5_8
-; CHECK-GI-NEXT:  .LBB5_3:
-; CHECK-GI-NEXT:    mov w0, wzr
-; CHECK-GI-NEXT:    ret
-; CHECK-GI-NEXT:  .LBB5_4: // %vector.ph
+; CHECK-GI-NEXT:    b .LBB5_7
+; CHECK-GI-NEXT:  .LBB5_3: // %vector.ph
 ; CHECK-GI-NEXT:    lsl w9, w1, #8
 ; CHECK-GI-NEXT:    movi v0.2d, #0000000000000000
 ; CHECK-GI-NEXT:    movi v1.2d, #0000000000000000
@@ -654,7 +652,7 @@ define i16 @red_mla_dup_ext_u8_s8_s16(ptr noalias nocapture noundef readonly %A,
 ; CHECK-GI-NEXT:    dup v2.8h, w9
 ; CHECK-GI-NEXT:    and x9, x8, #0xfffffff0
 ; CHECK-GI-NEXT:    mov x11, x9
-; CHECK-GI-NEXT:  .LBB5_5: // %vector.body
+; CHECK-GI-NEXT:  .LBB5_4: // %vector.body
 ; CHECK-GI-NEXT:    // =>This Inner Loop Header: Depth=1
 ; CHECK-GI-NEXT:    ldp d3, d4, [x10, #-8]
 ; CHECK-GI-NEXT:    subs x11, x11, #16
@@ -663,29 +661,31 @@ define i16 @red_mla_dup_ext_u8_s8_s16(ptr noalias nocapture noundef readonly %A,
 ; CHECK-GI-NEXT:    ushll v4.8h, v4.8b, #0
 ; CHECK-GI-NEXT:    mla v0.8h, v2.8h, v3.8h
 ; CHECK-GI-NEXT:    mla v1.8h, v2.8h, v4.8h
-; CHECK-GI-NEXT:    b.ne .LBB5_5
-; CHECK-GI-NEXT:  // %bb.6: // %middle.block
+; CHECK-GI-NEXT:    b.ne .LBB5_4
+; CHECK-GI-NEXT:  // %bb.5: // %middle.block
 ; CHECK-GI-NEXT:    add v0.8h, v1.8h, v0.8h
 ; CHECK-GI-NEXT:    cmp x9, x8
 ; CHECK-GI-NEXT:    addv h0, v0.8h
-; CHECK-GI-NEXT:    b.ne .LBB5_8
-; CHECK-GI-NEXT:  // %bb.7:
-; CHECK-GI-NEXT:    fmov w0, s0
+; CHECK-GI-NEXT:    b.ne .LBB5_7
+; CHECK-GI-NEXT:  // %bb.6:
+; CHECK-GI-NEXT:    fmov w8, s0
+; CHECK-GI-NEXT:    mov w0, w8
 ; CHECK-GI-NEXT:    ret
-; CHECK-GI-NEXT:  .LBB5_8: // %for.body.preheader1
+; CHECK-GI-NEXT:  .LBB5_7: // %for.body.preheader1
 ; CHECK-GI-NEXT:    sxtb w10, w1
-; CHECK-GI-NEXT:    sub x8, x8, x9
+; CHECK-GI-NEXT:    sub x11, x8, x9
 ; CHECK-GI-NEXT:    add x9, x0, x9
-; CHECK-GI-NEXT:  .LBB5_9: // %for.body
+; CHECK-GI-NEXT:  .LBB5_8: // %for.body
 ; CHECK-GI-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-GI-NEXT:    ldrb w11, [x9], #1
+; CHECK-GI-NEXT:    ldrb w8, [x9], #1
 ; CHECK-GI-NEXT:    fmov w12, s0
-; CHECK-GI-NEXT:    subs x8, x8, #1
-; CHECK-GI-NEXT:    mul w11, w11, w10
-; CHECK-GI-NEXT:    add w0, w11, w12, uxth
-; CHECK-GI-NEXT:    fmov s0, w0
-; CHECK-GI-NEXT:    b.ne .LBB5_9
-; CHECK-GI-NEXT:  // %bb.10: // %for.cond.cleanup
+; CHECK-GI-NEXT:    subs x11, x11, #1
+; CHECK-GI-NEXT:    mul w8, w8, w10
+; CHECK-GI-NEXT:    add w8, w8, w12, uxth
+; CHECK-GI-NEXT:    fmov s0, w8
+; CHECK-GI-NEXT:    b.ne .LBB5_8
+; CHECK-GI-NEXT:  .LBB5_9: // %for.cond.cleanup
+; CHECK-GI-NEXT:    mov w0, w8
 ; CHECK-GI-NEXT:    ret
 entry:
   %conv2 = sext i8 %B to i16
diff --git a/llvm/test/CodeGen/AArch64/swifterror.ll b/llvm/test/CodeGen/AArch64/swifterror.ll
index 07ee87e880aff..1ca98f6015c11 100644
--- a/llvm/test/CodeGen/AArch64/swifterror.ll
+++ b/llvm/test/CodeGen/AArch64/swifterror.ll
@@ -412,6 +412,7 @@ define float @foo_if(ptr swifterror %error_ptr_ref, i32 %cc) {
 ; CHECK-APPLE-NEXT:    .cfi_def_cfa w29, 16
 ; CHECK-APPLE-NEXT:    .cfi_offset w30, -8
 ; CHECK-APPLE-NEXT:    .cfi_offset w29, -16
+; CHECK-APPLE-NEXT:    movi d0, #0000000000000000
 ; CHECK-APPLE-NEXT:    cbz w0, LBB3_2
 ; CHECK-APPLE-NEXT:  ; %bb.1: ; %gen_error
 ; CHECK-APPLE-NEXT:    mov w0, #16 ; =0x10
@@ -420,10 +421,7 @@ define float @foo_if(ptr swifterror %error_ptr_ref, i32 %cc) {
 ; CHECK-APPLE-NEXT:    fmov s0, #1.00000000
 ; CHECK-APPLE-NEXT:    mov w8, #1 ; =0x1
 ; CHECK-APPLE-NEXT:    strb w8, [x0, #8]
-; CHECK-APPLE-NEXT:    ldp x29, x30, [sp], #16 ; 16-byte Folded Reload
-; CHECK-APPLE-NEXT:    ret
-; CHECK-APPLE-NEXT:  LBB3_2:
-; CHECK-APPLE-NEXT:    movi d0, #0000000000000000
+; CHECK-APPLE-NEXT:  LBB3_2: ; %common.ret
 ; CHECK-APPLE-NEXT:    ldp x29, x30, [sp], #16 ; 16-byte Folded Reload
 ; CHECK-APPLE-NEXT:    ret
 ;
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll
index 0c9ff3eee8231..70caf812ea6c2 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll
@@ -200,6 +200,7 @@ define amdgpu_ps i64 @s_sdiv_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    s_and_b64 s[0:1], s[0:1], s[6:7]
 ; CHECK-NEXT:    v_cmp_ne_u64_e64 vcc, s[0:1], 0
 ; CHECK-NEXT:    s_mov_b32 s0, 1
+; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
 ; CHECK-NEXT:    s_cbranch_vccz .LBB1_2
 ; CHECK-NEXT:  ; %bb.1:
 ; CHECK-NEXT:    s_ashr_i32 s6, s3, 31
@@ -330,15 +331,12 @@ define amdgpu_ps i64 @s_sdiv_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_xor_b32_e32 v0, s6, v0
 ; CHECK-NEXT:    s_mov_b32 s0, 0
 ; CHECK-NEXT:    v_subrev_i32_e32 v0, vcc, s6, v0
-; CHECK-NEXT:    s_branch .LBB1_3
-; CHECK-NEXT:  .LBB1_2:
-; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
-; CHECK-NEXT:  .LBB1_3: ; %Flow
+; CHECK-NEXT:  .LBB1_2: ; %Flow
 ; CHECK-NEXT:    s_xor_b32 s0, s0, 1
 ; CHECK-NEXT:    s_and_b32 s0, s0, 1
 ; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
-; CHECK-NEXT:    s_cbranch_scc1 .LBB1_5
-; CHECK-NEXT:  ; %bb.4:
+; CHECK-NEXT:    s_cbranch_scc1 .LBB1_4
+; CHECK-NEXT:  ; %bb.3:
 ; CHECK-NEXT:    v_cvt_f32_u32_e32 v0, s4
 ; CHECK-NEXT:    s_sub_i32 s0, 0, s4
 ; CHECK-NEXT:    v_rcp_iflag_f32_e32 v0, v0
@@ -358,7 +356,7 @@ define amdgpu_ps i64 @s_sdiv_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_add_i32_e32 v2, vcc, 1, v0
 ; CHECK-NEXT:    v_cmp_le_u32_e32 vcc, s4, v1
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v0, v2, vcc
-; CHECK-NEXT:  .LBB1_5:
+; CHECK-NEXT:  .LBB1_4:
 ; CHECK-NEXT:    v_readfirstlane_b32 s0, v0
 ; CHECK-NEXT:    s_mov_b32 s1, s0
 ; CHECK-NEXT:    ; return to shader part epilog
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll
index df645888626c6..2fcbc41895f03 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll
@@ -194,6 +194,7 @@ define amdgpu_ps i64 @s_srem_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    s_and_b64 s[0:1], s[0:1], s[6:7]
 ; CHECK-NEXT:    v_cmp_ne_u64_e64 vcc, s[0:1], 0
 ; CHECK-NEXT:    s_mov_b32 s7, 1
+; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
 ; CHECK-NEXT:    s_cbranch_vccz .LBB1_2
 ; CHECK-NEXT:  ; %bb.1:
 ; CHECK-NEXT:    s_ashr_i32 s6, s3, 31
@@ -322,15 +323,12 @@ define amdgpu_ps i64 @s_srem_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v0, v1, vcc
 ; CHECK-NEXT:    v_xor_b32_e32 v0, s6, v0
 ; CHECK-NEXT:    v_subrev_i32_e32 v0, vcc, s6, v0
-; CHECK-NEXT:    s_branch .LBB1_3
-; CHECK-NEXT:  .LBB1_2:
-; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
-; CHECK-NEXT:  .LBB1_3: ; %Flow
+; CHECK-NEXT:  .LBB1_2: ; %Flow
 ; CHECK-NEXT:    s_xor_b32 s0, s7, 1
 ; CHECK-NEXT:    s_and_b32 s0, s0, 1
 ; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
-; CHECK-NEXT:    s_cbranch_scc1 .LBB1_5
-; CHECK-NEXT:  ; %bb.4:
+; CHECK-NEXT:    s_cbranch_scc1 .LBB1_4
+; CHECK-NEXT:  ; %bb.3:
 ; CHECK-NEXT:    v_cvt_f32_u32_e32 v0, s4
 ; CHECK-NEXT:    s_sub_i32 s0, 0, s4
 ; CHECK-NEXT:    v_rcp_iflag_f32_e32 v0, v0
@@ -348,7 +346,7 @@ define amdgpu_ps i64 @s_srem_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_subrev_i32_e32 v1, vcc, s4, v0
 ; CHECK-NEXT:    v_cmp_le_u32_e32 vcc, s4, v0
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v0, v1, vcc
-; CHECK-NEXT:  .LBB1_5:
+; CHECK-NEXT:  .LBB1_4:
 ; CHECK-NEXT:    v_readfirstlane_b32 s0, v0
 ; CHECK-NEXT:    s_mov_b32 s1, s0
 ; CHECK-NEXT:    ; return to shader part epilog
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll
index f5a901b024ef5..c9a5a92188256 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll
@@ -193,6 +193,7 @@ define amdgpu_ps i64 @s_udiv_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_cmp_ne_u64_e64 vcc, s[4:5], 0
 ; CHECK-NEXT:    s_mov_b32 s6, 1
 ; CHECK-NEXT:    v_cvt_f32_u32_e32 v2, s2
+; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
 ; CHECK-NEXT:    s_cbranch_vccz .LBB1_2
 ; CHECK-NEXT:  ; %bb.1:
 ; CHECK-NEXT:    v_mov_b32_e32 v0, s3
@@ -318,15 +319,12 @@ define amdgpu_ps i64 @s_udiv_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v9, v5, vcc
 ; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v3
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v1, v0, vcc
-; CHECK-NEXT:    s_branch .LBB1_3
-; CHECK-NEXT:  .LBB1_2:
-; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
-; CHECK-NEXT:  .LBB1_3: ; %Flow
+; CHECK-NEXT:  .LBB1_2: ; %Flow
 ; CHECK-NEXT:    s_xor_b32 s1, s6, 1
 ; CHECK-NEXT:    s_and_b32 s1, s1, 1
 ; CHECK-NEXT:    s_cmp_lg_u32 s1, 0
-; CHECK-NEXT:    s_cbranch_scc1 .LBB1_5
-; CHECK-NEXT:  ; %bb.4:
+; CHECK-NEXT:    s_cbranch_scc1 .LBB1_4
+; CHECK-NEXT:  ; %bb.3:
 ; CHECK-NEXT:    v_rcp_iflag_f32_e32 v0, v2
 ; CHECK-NEXT:    s_sub_i32 s1, 0, s2
 ; CHECK-NEXT:    v_mul_f32_e32 v0, 0x4f7ffffe, v0
@@ -345,7 +343,7 @@ define amdgpu_ps i64 @s_udiv_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_add_i32_e32 v2, vcc, 1, v0
 ; CHECK-NEXT:    v_cmp_le_u32_e32 vcc, s2, v1
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v0, v2, vcc
-; CHECK-NEXT:  .LBB1_5:
+; CHECK-NEXT:  .LBB1_4:
 ; CHECK-NEXT:    v_readfirstlane_b32 s0, v0
 ; CHECK-NEXT:    s_mov_b32 s1, s0
 ; CHECK-NEXT:    ; return to shader part epilog
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll
index 2be4b52198b45..06e51387c8f21 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll
@@ -190,6 +190,7 @@ define amdgpu_ps i64 @s_urem_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_cmp_ne_u64_e64 vcc, s[4:5], 0
 ; CHECK-NEXT:    s_mov_b32 s6, 1
 ; CHECK-NEXT:    v_cvt_f32_u32_e32 v2, s2
+; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
 ; CHECK-NEXT:    s_cbranch_vccz .LBB1_2
 ; CHECK-NEXT:  ; %bb.1:
 ; CHECK-NEXT:    v_mov_b32_e32 v0, s3
@@ -314,15 +315,12 @@ define amdgpu_ps i64 @s_urem_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v3, v6, vcc
 ; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v1
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v4, v0, vcc
-; CHECK-NEXT:    s_branch .LBB1_3
-; CHECK-NEXT:  .LBB1_2:
-; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
-; CHECK-NEXT:  .LBB1_3: ; %Flow
+; CHECK-NEXT:  .LBB1_2: ; %Flow
 ; CHECK-NEXT:    s_xor_b32 s1, s6, 1
 ; CHECK-NEXT:    s_and_b32 s1, s1, 1
 ; CHECK-NEXT:    s_cmp_lg_u32 s1, 0
-; CHECK-NEXT:    s_cbranch_scc1 .LBB1_5
-; CHECK-NEXT:  ; %bb.4:
+; CHECK-NEXT:    s_cbranch_scc1 .LBB1_4
+; CHECK-NEXT:  ; %bb.3:
 ; CHECK-NEXT:    v_rcp_iflag_f32_e32 v0, v2
 ; CHECK-NEXT:    s_sub_i32 s1, 0, s2
 ; CHECK-NEXT:    v_mul_f32_e32 v0, 0x4f7ffffe, v0
@@ -339,7 +337,7 @@ define amdgpu_ps i64 @s_urem_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_subrev_i32_e32 v1, vcc, s2, v0
 ; CHECK-NEXT:    v_cmp_le_u32_e32 vcc, s2, v0
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v0, v1, vcc
-; CHECK-NEXT:  .LBB1_5:
+; CHECK-NEXT:  .LBB1_4:
 ; CHECK-NEXT:    v_readfirstlane_b32 s0, v0
 ; CHECK-NEXT:    s_mov_b32 s1, s0
 ; CHECK-NEXT:    ; return to shader part epilog
diff --git a/llvm/test/CodeGen/AMDGPU/artificial-terminators.mir b/llvm/test/CodeGen/AMDGPU/artificial-terminators.mir
index 1a76cae68f164..9e84d979e8547 100644
--- a/llvm/test/CodeGen/AMDGPU/artificial-terminators.mir
+++ b/llvm/test/CodeGen/AMDGPU/artificial-terminators.mir
@@ -34,18 +34,14 @@ body: |
   ; CHECK-NEXT:   S_BRANCH %bb.1
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.1:
-  ; CHECK-NEXT:   successors: %bb.5(0x30000000), %bb.2(0x50000000)
+  ; CHECK-NEXT:   successors: %bb.4(0x30000000), %bb.2(0x50000000)
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[V_CMP_LT_I32_e64_:%[0-9]+]]:sreg_32 = V_CMP_LT_I32_e64 [[V_ADD_U32_e64_3]], [[S_MOV_B32_1]], implicit $exec
   ; CHECK-NEXT:   [[S_XOR_B32_:%[0-9]+]]:sreg_32 = S_XOR_B32 $exec_lo, [[V_CMP_LT_I32_e64_]], implicit-def $scc
-  ; CHECK-NEXT:   $exec_lo = S_MOV_B32_term [[S_XOR_B32_]]
-  ; CHECK-NEXT:   S_CBRANCH_EXECNZ %bb.2, implicit $exec
-  ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT: bb.5:
-  ; CHECK-NEXT:   successors: %bb.4(0x80000000)
-  ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[COPY3:%[0-9]+]]:sreg_32 = COPY [[V_CMP_LT_I32_e64_]]
-  ; CHECK-NEXT:   S_BRANCH %bb.4
+  ; CHECK-NEXT:   $exec_lo = S_MOV_B32_term [[S_XOR_B32_]]
+  ; CHECK-NEXT:   S_CBRANCH_EXECZ %bb.4, implicit $exec
+  ; CHECK-NEXT:   S_BRANCH %bb.2
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.2:
   ; CHECK-NEXT:   successors: %bb.4(0x40000000), %bb.3(0x40000000)
@@ -64,7 +60,7 @@ body: |
   ; CHECK-NEXT:   S_BRANCH %bb.4
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.4:
-  ; CHECK-NEXT:   [[PHI:%[0-9]+]]:sreg_32 = PHI [[COPY3]], %bb.5, [[S_OR_B32_]], %bb.2, [[S_OR_B32_]], %bb.3
+  ; CHECK-NEXT:   [[PHI:%[0-9]+]]:sreg_32 = PHI [[COPY3]], %bb.1, [[S_OR_B32_]], %bb.2, [[S_OR_B32_]], %bb.3
   ; CHECK-NEXT:   $exec_lo = S_OR_B32 $exec_lo, [[PHI]], implicit-def $scc
   ; CHECK-NEXT:   S_ENDPGM 0
   bb.0:
diff --git a/llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll b/llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll
index f9ffa5ae57f3e..dfbb5f6a64042 100644
--- a/llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll
+++ b/llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll
@@ -9,44 +9,34 @@ define amdgpu_kernel void @blender_no_live_segment_at_def_error(<4 x float> %ext
 ; CHECK-NEXT:    s_addc_u32 s13, s13, 0
 ; CHECK-NEXT:    s_setreg_b32 hwreg(HW_REG_FLAT_SCR_LO), s12
 ; CHECK-NEXT:    s_setreg_b32 hwreg(HW_REG_FLAT_SCR_HI), s13
-; CHECK-NEXT:    s_load_dwordx8 s[36:43], s[8:9], 0x0
+; CHECK-NEXT:    s_load_dwordx8 s[20:27], s[8:9], 0x0
 ; CHECK-NEXT:    s_add_u32 s0, s0, s17
 ; CHECK-NEXT:    s_addc_u32 s1, s1, 0
-; CHECK-NEXT:    s_mov_b32 s12, 0
-; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
-; CHECK-NEXT:    s_cmp_lg_u32 s40, 0
-; CHECK-NEXT:    s_cbranch_scc1 .LBB0_8
-; CHECK-NEXT:  ; %bb.1: ; %if.end13.i.i
-; CHECK-NEXT:    s_cmp_eq_u32 s42, 0
-; CHECK-NEXT:    s_cbranch_scc1 .LBB0_4
-; CHECK-NEXT:  ; %bb.2: ; %if.else251.i.i
-; CHECK-NEXT:    s_cmp_lg_u32 s43, 0
-; CHECK-NEXT:    s_mov_b32 s17, 0
-; CHECK-NEXT:    s_cselect_b32 s12, -1, 0
-; CHECK-NEXT:    s_and_b32 vcc_lo, exec_lo, s12
-; CHECK-NEXT:    s_cbranch_vccz .LBB0_5
-; CHECK-NEXT:  ; %bb.3:
 ; CHECK-NEXT:    s_mov_b32 s36, 0
-; CHECK-NEXT:    s_andn2_b32 vcc_lo, exec_lo, s12
-; CHECK-NEXT:    s_cbranch_vccz .LBB0_6
-; CHECK-NEXT:    s_branch .LBB0_7
-; CHECK-NEXT:  .LBB0_4:
-; CHECK-NEXT:    s_mov_b32 s14, s12
-; CHECK-NEXT:    s_mov_b32 s15, s12
-; CHECK-NEXT:    s_mov_b32 s13, s12
-; CHECK-NEXT:    s_mov_b64 s[38:39], s[14:15]
-; CHECK-NEXT:    s_mov_b64 s[36:37], s[12:13]
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    s_cmp_lg_u32 s24, 0
+; CHECK-NEXT:    s_cbranch_scc0 .LBB0_2
+; CHECK-NEXT:  ; %bb.1:
+; CHECK-NEXT:    s_mov_b64 s[38:39], s[22:23]
+; CHECK-NEXT:    s_mov_b64 s[36:37], s[20:21]
 ; CHECK-NEXT:    s_branch .LBB0_7
-; CHECK-NEXT:  .LBB0_5: ; %if.then263.i.i
-; CHECK-NEXT:    v_cmp_lt_f32_e64 s12, s41, 0
-; CHECK-NEXT:    s_mov_b32 s36, 1.0
-; CHECK-NEXT:    s_mov_b32 s17, 0x7fc00000
+; CHECK-NEXT:  .LBB0_2: ; %if.end13.i.i
 ; CHECK-NEXT:    s_mov_b32 s37, s36
 ; CHECK-NEXT:    s_mov_b32 s38, s36
+; CHECK-NEXT:    s_cmp_eq_u32 s26, 0
 ; CHECK-NEXT:    s_mov_b32 s39, s36
+; CHECK-NEXT:    s_cbranch_scc1 .LBB0_6
+; CHECK-NEXT:  ; %bb.3: ; %if.else251.i.i
+; CHECK-NEXT:    s_cmp_lg_u32 s27, 0
+; CHECK-NEXT:    s_mov_b32 s17, 0
+; CHECK-NEXT:    s_cselect_b32 s12, -1, 0
+; CHECK-NEXT:    s_and_b32 vcc_lo, exec_lo, s12
+; CHECK-NEXT:    s_cbranch_vccz .LBB0_8
+; CHECK-NEXT:  ; %bb.4:
+; CHECK-NEXT:    s_mov_b32 s36, 0
 ; CHECK-NEXT:    s_andn2_b32 vcc_lo, exec_lo, s12
-; CHECK-NEXT:    s_cbranch_vccnz .LBB0_7
-; CHECK-NEXT:  .LBB0_6: ; %if.end273.i.i
+; CHECK-NEXT:    s_cbranch_vccnz .LBB0_6
+; CHECK-NEXT:  .LBB0_5: ; %if.end273.i.i
 ; CHECK-NEXT:    s_add_u32 s12, s8, 40
 ; CHECK-NEXT:    s_addc_u32 s13, s9, 0
 ; CHECK-NEXT:    s_getpc_b64 s[18:19]
@@ -72,13 +62,13 @@ define amdgpu_kernel void @blender_no_live_segment_at_def_error(<4 x float> %ext
 ; CHECK-NEXT:    s_mov_b32 s37, s36
 ; CHECK-NEXT:    s_mov_b32 s38, s36
 ; CHECK-NEXT:    s_mov_b32 s39, s36
-; CHECK-NEXT:  .LBB0_7: ; %if.end294.i.i
+; CHECK-NEXT:  .LBB0_6: ; %if.end294.i.i
 ; CHECK-NEXT:    v_mov_b32_e32 v0, 0
 ; CHECK-NEXT:    buffer_store_dword v0, off, s[0:3], 0 offset:12
 ; CHECK-NEXT:    buffer_store_dword v0, off, s[0:3], 0 offset:8
 ; CHECK-NEXT:    buffer_store_dword v0, off, s[0:3], 0 offset:4
 ; CHECK-NEXT:    buffer_store_dword v0, off, s[0:3], 0
-; CHECK-NEXT:  .LBB0_8: ; %kernel_direct_lighting.exit
+; CHECK-NEXT:  .LBB0_7: ; %kernel_direct_lighting.exit
 ; CHECK-NEXT:    s_load_dwordx2 s[4:5], s[8:9], 0x20
 ; CHECK-NEXT:    v_mov_b32_e32 v0, s36
 ; CHECK-NEXT:    v_mov_b32_e32 v4, 0
@@ -88,6 +78,16 @@ define amdgpu_kernel void @blender_no_live_segment_at_def_error(<4 x float> %ext
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    global_store_dwordx4 v4, v[0:3], s[4:5]
 ; CHECK-NEXT:    s_endpgm
+; CHECK-NEXT:  .LBB0_8: ; %if.then263.i.i
+; CHECK-NEXT:    v_cmp_lt_f32_e64 s12, s25, 0
+; CHECK-NEXT:    s_mov_b32 s36, 1.0
+; CHECK-NEXT:    s_mov_b32 s17, 0x7fc00000
+; CHECK-NEXT:    s_mov_b32 s37, s36
+; CHECK-NEXT:    s_mov_b32 s38, s36
+; CHECK-NEXT:    s_mov_b32 s39, s36
+; CHECK-NEXT:    s_andn2_b32 vcc_lo, exec_lo, s12
+; CHECK-NEXT:    s_cbranch_vccz .LBB0_5
+; CHECK-NEXT:    s_branch .LBB0_6
 entry:
   %cmp5.i.i = icmp eq i32 %cmp5.i.i.arg, 0
   br i1 %cmp5.i.i, label %if.end13.i.i, label %kernel_direct_lighting.exit
diff --git a/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll b/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll
index d61c4b46596c0..ce0b79b0b358c 100644
--- a/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll
+++ b/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll
@@ -848,12 +848,13 @@ define amdgpu_kernel void @test_dynamic_stackalloc_kernel_control_flow(i32 %n, i
 ; GFX9-SDAG-NEXT:    s_load_dwordx2 s[4:5], s[8:9], 0x0
 ; GFX9-SDAG-NEXT:    s_add_u32 s0, s0, s17
 ; GFX9-SDAG-NEXT:    s_addc_u32 s1, s1, 0
+; GFX9-SDAG-NEXT:    s_mov_b64 s[6:7], -1
 ; GFX9-SDAG-NEXT:    s_mov_b32 s33, 0
-; GFX9-SDAG-NEXT:    s_movk_i32 s32, 0x1000
 ; GFX9-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX9-SDAG-NEXT:    s_cmp_lg_u32 s4, 0
 ; GFX9-SDAG-NEXT:    s_mov_b32 s4, 0
-; GFX9-SDAG-NEXT:    s_cbranch_scc0 .LBB7_6
+; GFX9-SDAG-NEXT:    s_movk_i32 s32, 0x1000
+; GFX9-SDAG-NEXT:    s_cbranch_scc0 .LBB7_4
 ; GFX9-SDAG-NEXT:  ; %bb.1: ; %bb.1
 ; GFX9-SDAG-NEXT:    v_lshl_add_u32 v0, v0, 2, 15
 ; GFX9-SDAG-NEXT:    v_and_b32_e32 v0, 0x1ff0, v0
@@ -873,8 +874,11 @@ define amdgpu_kernel void @test_dynamic_stackalloc_kernel_control_flow(i32 %n, i
 ; GFX9-SDAG-NEXT:    v_mov_b32_e32 v0, 1
 ; GFX9-SDAG-NEXT:    buffer_store_dword v0, off, s[0:3], s6
 ; GFX9-SDAG-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-SDAG-NEXT:    s_cbranch_execnz .LB...
[truncated]

@llvmbot
Copy link
Member

llvmbot commented Feb 25, 2025

@llvm/pr-subscribers-llvm-globalisel

Author: Guy David (guy-david)

Changes

Requires #128745.

Lower it slightly below the likeliness of a null-check to be true which is set to 37.5% (see PtrUntakenProb).
Otherwise, it will split the edge and create another basic-block and with an unconditional branch which might make the CFG more complex and with a suboptimal block placement.
Note that if multiple instructions can be sinked from the same edge then a split will occur regardless of this change.

On M4 Pro:

$ ./utils/compare.py build-a/results1.json build-a/results2.json build-a/results3.json vs build-b/results1.json build-b/results2.json build-b/results3.json
Tests: 4314
Metric: exec_time
Program                                       exec_time             
                                              lhs       rhs    diff 
MultiSourc...chmarks/Prolangs-C/agrep/agrep     0.00      0.01 44.7%
MultiSourc...rks/McCat/03-testtrie/testtrie     0.01      0.01 31.7%
SingleSour...hmarks/Shootout/Shootout-lists     2.02      2.64 30.6%
SingleSour...ecute/GCC-C-execute-20170419-1     0.00      0.00 14.3%
SingleSour.../execute/GCC-C-execute-pr59101     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20040311-1     0.00      0.00 14.3%
SingleSour.../execute/GCC-C-execute-pr57124     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20031204-1     0.00      0.00 14.3%
SingleSour...xecute/GCC-C-execute-pr57344-3     0.00      0.00 14.3%
SingleSour.../execute/GCC-C-execute-pr57875     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20030811-1     0.00      0.00 14.3%
SingleSour.../execute/GCC-C-execute-pr58640     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20030408-1     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20030323-1     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20030203-1     0.00      0.00 14.3%
                           Geomean difference                   0.1%
           exec_time                            
l/r              lhs            rhs         diff
count  4314.000000    4314.000000    4294.000000
mean   453.919219     454.105532     0.002072   
std    10865.757400   10868.002426   0.043046   
min    0.000000       0.000000      -0.171642   
25%    0.000700       0.000700       0.000000   
50%    0.007400       0.007400       0.000000   
75%    0.047829       0.047950       0.000033   
max    321294.306703  321320.624713  0.447368

On Ryzen9 5950X:

$ ./utils/compare.py build-a/results1.json build-a/results2.json build-a/results3.json vs build-b/results1.json build-b/results2.json build-b/results3.json
Tests: 3326
Metric: exec_time
Program                                       exec_time               
                                              lhs       rhs     diff  
MemFunctio...mCmp&lt;1, GreaterThanZero, None&gt;   1741.26   1885.00 143.74
MemFunctio..._MemCmp&lt;1, LessThanZero, Last&gt;   1759.78   1873.93 114.15
MemFunctio...est:BM_MemCmp&lt;1, EqZero, Last&gt;   1747.19   1847.42 100.22
MemFunctio...Cmp&lt;1, GreaterThanZero, First&gt;   1750.17   1844.57  94.40
MemFunctio...mCmp&lt;1, GreaterThanZero, Last&gt;   1751.05   1844.68  93.63
MemFunctio...emCmp&lt;1, GreaterThanZero, Mid&gt;   1756.49   1849.62  93.13
MemFunctio..._MemCmp&lt;1, LessThanZero, None&gt;   1744.87   1835.22  90.35
MemFunctio...M_MemCmp&lt;1, LessThanZero, Mid&gt;   1757.53   1846.29  88.77
harris/har...est:BENCHMARK_HARRIS/1024/1024   5689.29   5754.88  65.59
MemFunctio...MemCmp&lt;2, LessThanZero, First&gt;   1123.00   1181.63  58.63
MemFunctio...test:BM_MemCmp&lt;1, EqZero, Mid&gt;   2524.93   2582.21  57.28
MemFunctio...est:BM_MemCmp&lt;1, EqZero, None&gt;   2525.97   2582.43  56.46
MemFunctio..._MemCmp&lt;3, LessThanZero, Last&gt;    869.04    924.66  55.62
MemFunctio...test:BM_MemCmp&lt;3, EqZero, Mid&gt;    878.39    932.53  54.14
MemFunctio...MemCmp&lt;1, LessThanZero, First&gt;   2528.37   2582.27  53.90
                                                                                                                                                   exec_time                             
l/r                                                                                                                                                      lhs            rhs          diff
Program                                                                                                                                                                                  
test-suite :: MicroBenchmarks/MemFunctions/MemFunctions.test:BM_MemCmp&lt;1, GreaterThanZero, None&gt;                                               1741.261663    1884.998860    143.737197  
test-suite :: MicroBenchmarks/MemFunctions/MemFunctions.test:BM_MemCmp&lt;1, LessThanZero, Last&gt;                                                  1759.779355    1873.926412    114.147056  
test-suite :: MicroBenchmarks/MemFunctions/MemFunctions.test:BM_MemCmp&lt;1, EqZero, Last&gt;                                                        1747.192734    1847.416650    100.223916  
test-suite :: MicroBenchmarks/MemFunctions/MemFunctions.test:BM_MemCmp&lt;1, GreaterThanZero, First&gt;                                              1750.171003    1844.569735    94.398732   
test-suite :: MicroBenchmarks/MemFunctions/MemFunctions.test:BM_MemCmp&lt;1, GreaterThanZero, Last&gt;                                               1751.049323    1844.682784    93.633461   
...                                                                                                                                                    ...            ...           ...  
test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test:benchVecWithRuntimeChecks4PointersDAfterA/1000                435033.995649  412835.347288 -22198.648362
test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test:benchVecWithRuntimeChecks4PointersAllDisjointDecreasing/1000  435136.829708  412921.450737 -22215.378970
test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test:benchVecWithRuntimeChecks4PointersAllDisjointIncreasing/1000  435136.457427  412908.677876 -22227.779551
test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test:benchVecWithRuntimeChecks4PointersDEqualsA/1000               435088.787446  412769.793042 -22318.994403
test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test:benchVecWithRuntimeChecks4PointersDBeforeA/1000               835721.265233  791510.926471 -44210.338762

[3326 rows x 3 columns]
           exec_time                             
l/r              lhs            rhs          diff
count  3326.000000    3326.000000    3326.000000 
mean   916.350942     873.987972    -42.362970   
std    20951.565020   19865.106212   1087.788132 
min    0.000000       0.000000      -44210.338762
25%    0.000000       0.000000      -0.000400    
50%    0.000400       0.000400       0.000000    
75%    1.774625       1.732975       0.000400    
max    835721.265233  791510.926471  143.737197

I looked into the disassembly of BM_MemCmp&lt;1, GreaterThanZero, None&gt; in MemFunctions.test and it has not changed.


Patch is 226.63 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/127666.diff

51 Files Affected:

  • (modified) llvm/lib/CodeGen/MachineSink.cpp (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll (+23-23)
  • (modified) llvm/test/CodeGen/AArch64/swifterror.ll (+2-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll (+5-7)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll (+5-7)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll (+5-7)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll (+5-7)
  • (modified) llvm/test/CodeGen/AMDGPU/artificial-terminators.mir (+5-9)
  • (modified) llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll (+32-32)
  • (modified) llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll (+20-16)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+64-52)
  • (modified) llvm/test/CodeGen/AMDGPU/no-fold-accvgpr-mov.ll (+24-18)
  • (modified) llvm/test/CodeGen/AMDGPU/optimize-negated-cond.ll (+8-7)
  • (modified) llvm/test/CodeGen/AMDGPU/skip-if-dead.ll (+8-16)
  • (modified) llvm/test/CodeGen/ARM/and-cmp0-sink.ll (+14-14)
  • (modified) llvm/test/CodeGen/Mips/llvm-ir/sdiv-freebsd.ll (+6-3)
  • (modified) llvm/test/CodeGen/PowerPC/common-chain-aix32.ll (+16-17)
  • (modified) llvm/test/CodeGen/PowerPC/common-chain.ll (+80-89)
  • (modified) llvm/test/CodeGen/PowerPC/ifcvt_cr_field.ll (+6-12)
  • (modified) llvm/test/CodeGen/PowerPC/knowCRBitSpill.ll (+1)
  • (modified) llvm/test/CodeGen/PowerPC/loop-instr-form-prepare.ll (+88-114)
  • (modified) llvm/test/CodeGen/PowerPC/loop-instr-prep-non-const-increasement.ll (+16-19)
  • (modified) llvm/test/CodeGen/PowerPC/mma-phi-accs.ll (+6-12)
  • (modified) llvm/test/CodeGen/PowerPC/p10-spill-creq.ll (+28-33)
  • (modified) llvm/test/CodeGen/PowerPC/ppc64-rop-protection-aix.ll (+54-60)
  • (modified) llvm/test/CodeGen/PowerPC/ppc64-rop-protection.ll (+66-81)
  • (modified) llvm/test/CodeGen/PowerPC/shrink-wrap.ll (+12-20)
  • (modified) llvm/test/CodeGen/PowerPC/spe.ll (+2-4)
  • (modified) llvm/test/CodeGen/PowerPC/zext-and-cmp.ll (+16-6)
  • (modified) llvm/test/CodeGen/RISCV/fold-addi-loadstore.ll (+6-12)
  • (modified) llvm/test/CodeGen/Thumb2/mve-pipelineloops.ll (+7-10)
  • (modified) llvm/test/CodeGen/WebAssembly/implicit-def.ll (+23-12)
  • (modified) llvm/test/CodeGen/X86/2007-11-06-InstrSched.ll (+6-9)
  • (modified) llvm/test/CodeGen/X86/2008-04-28-CoalescerBug.ll (+8-11)
  • (modified) llvm/test/CodeGen/X86/atomic-rm-bit-test-64.ll (+61-68)
  • (modified) llvm/test/CodeGen/X86/atomic-rm-bit-test.ll (+342-395)
  • (modified) llvm/test/CodeGen/X86/branchfolding-debugloc.ll (+4-5)
  • (modified) llvm/test/CodeGen/X86/break-false-dep.ll (+28-48)
  • (modified) llvm/test/CodeGen/X86/coalescer-commute4.ll (+6-9)
  • (modified) llvm/test/CodeGen/X86/ctlo.ll (+21-26)
  • (modified) llvm/test/CodeGen/X86/ctlz.ll (+56-72)
  • (modified) llvm/test/CodeGen/X86/cttz.ll (+18-30)
  • (modified) llvm/test/CodeGen/X86/fold-loop-of-urem.ll (+26-32)
  • (modified) llvm/test/CodeGen/X86/lsr-sort.ll (+6-4)
  • (modified) llvm/test/CodeGen/X86/mmx-arith.ll (+7-9)
  • (modified) llvm/test/CodeGen/X86/pr2659.ll (+32-10)
  • (modified) llvm/test/CodeGen/X86/pr38795.ll (+50-53)
  • (modified) llvm/test/CodeGen/X86/probe-stack-eflags.ll (+4-6)
  • (modified) llvm/test/CodeGen/X86/taildup-heapallocsite.ll (+4-9)
  • (modified) llvm/test/CodeGen/X86/testb-je-fusion.ll (+8-10)
  • (modified) llvm/test/CodeGen/X86/x86-shrink-wrapping.ll (+12-16)
diff --git a/llvm/lib/CodeGen/MachineSink.cpp b/llvm/lib/CodeGen/MachineSink.cpp
index 82acb780cfb72..81459cf65d6c2 100644
--- a/llvm/lib/CodeGen/MachineSink.cpp
+++ b/llvm/lib/CodeGen/MachineSink.cpp
@@ -82,7 +82,7 @@ static cl::opt<unsigned> SplitEdgeProbabilityThreshold(
         "If the branch threshold is higher than this threshold, we allow "
         "speculative execution of up to 1 instruction to avoid branching to "
         "splitted critical edge"),
-    cl::init(40), cl::Hidden);
+    cl::init(35), cl::Hidden);
 
 static cl::opt<unsigned> SinkLoadInstsPerBlockThreshold(
     "machine-sink-load-instrs-threshold",
diff --git a/llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll b/llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll
index fb6575cc0ee83..fdc087e9c1991 100644
--- a/llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll
+++ b/llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll
@@ -632,20 +632,18 @@ define i16 @red_mla_dup_ext_u8_s8_s16(ptr noalias nocapture noundef readonly %A,
 ;
 ; CHECK-GI-LABEL: red_mla_dup_ext_u8_s8_s16:
 ; CHECK-GI:       // %bb.0: // %entry
-; CHECK-GI-NEXT:    cbz w2, .LBB5_3
+; CHECK-GI-NEXT:    mov w8, wzr
+; CHECK-GI-NEXT:    cbz w2, .LBB5_9
 ; CHECK-GI-NEXT:  // %bb.1: // %for.body.preheader
 ; CHECK-GI-NEXT:    cmp w2, #16
 ; CHECK-GI-NEXT:    mov w8, w2
-; CHECK-GI-NEXT:    b.hs .LBB5_4
+; CHECK-GI-NEXT:    b.hs .LBB5_3
 ; CHECK-GI-NEXT:  // %bb.2:
 ; CHECK-GI-NEXT:    mov w10, #0 // =0x0
 ; CHECK-GI-NEXT:    mov x9, xzr
 ; CHECK-GI-NEXT:    fmov s0, w10
-; CHECK-GI-NEXT:    b .LBB5_8
-; CHECK-GI-NEXT:  .LBB5_3:
-; CHECK-GI-NEXT:    mov w0, wzr
-; CHECK-GI-NEXT:    ret
-; CHECK-GI-NEXT:  .LBB5_4: // %vector.ph
+; CHECK-GI-NEXT:    b .LBB5_7
+; CHECK-GI-NEXT:  .LBB5_3: // %vector.ph
 ; CHECK-GI-NEXT:    lsl w9, w1, #8
 ; CHECK-GI-NEXT:    movi v0.2d, #0000000000000000
 ; CHECK-GI-NEXT:    movi v1.2d, #0000000000000000
@@ -654,7 +652,7 @@ define i16 @red_mla_dup_ext_u8_s8_s16(ptr noalias nocapture noundef readonly %A,
 ; CHECK-GI-NEXT:    dup v2.8h, w9
 ; CHECK-GI-NEXT:    and x9, x8, #0xfffffff0
 ; CHECK-GI-NEXT:    mov x11, x9
-; CHECK-GI-NEXT:  .LBB5_5: // %vector.body
+; CHECK-GI-NEXT:  .LBB5_4: // %vector.body
 ; CHECK-GI-NEXT:    // =>This Inner Loop Header: Depth=1
 ; CHECK-GI-NEXT:    ldp d3, d4, [x10, #-8]
 ; CHECK-GI-NEXT:    subs x11, x11, #16
@@ -663,29 +661,31 @@ define i16 @red_mla_dup_ext_u8_s8_s16(ptr noalias nocapture noundef readonly %A,
 ; CHECK-GI-NEXT:    ushll v4.8h, v4.8b, #0
 ; CHECK-GI-NEXT:    mla v0.8h, v2.8h, v3.8h
 ; CHECK-GI-NEXT:    mla v1.8h, v2.8h, v4.8h
-; CHECK-GI-NEXT:    b.ne .LBB5_5
-; CHECK-GI-NEXT:  // %bb.6: // %middle.block
+; CHECK-GI-NEXT:    b.ne .LBB5_4
+; CHECK-GI-NEXT:  // %bb.5: // %middle.block
 ; CHECK-GI-NEXT:    add v0.8h, v1.8h, v0.8h
 ; CHECK-GI-NEXT:    cmp x9, x8
 ; CHECK-GI-NEXT:    addv h0, v0.8h
-; CHECK-GI-NEXT:    b.ne .LBB5_8
-; CHECK-GI-NEXT:  // %bb.7:
-; CHECK-GI-NEXT:    fmov w0, s0
+; CHECK-GI-NEXT:    b.ne .LBB5_7
+; CHECK-GI-NEXT:  // %bb.6:
+; CHECK-GI-NEXT:    fmov w8, s0
+; CHECK-GI-NEXT:    mov w0, w8
 ; CHECK-GI-NEXT:    ret
-; CHECK-GI-NEXT:  .LBB5_8: // %for.body.preheader1
+; CHECK-GI-NEXT:  .LBB5_7: // %for.body.preheader1
 ; CHECK-GI-NEXT:    sxtb w10, w1
-; CHECK-GI-NEXT:    sub x8, x8, x9
+; CHECK-GI-NEXT:    sub x11, x8, x9
 ; CHECK-GI-NEXT:    add x9, x0, x9
-; CHECK-GI-NEXT:  .LBB5_9: // %for.body
+; CHECK-GI-NEXT:  .LBB5_8: // %for.body
 ; CHECK-GI-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-GI-NEXT:    ldrb w11, [x9], #1
+; CHECK-GI-NEXT:    ldrb w8, [x9], #1
 ; CHECK-GI-NEXT:    fmov w12, s0
-; CHECK-GI-NEXT:    subs x8, x8, #1
-; CHECK-GI-NEXT:    mul w11, w11, w10
-; CHECK-GI-NEXT:    add w0, w11, w12, uxth
-; CHECK-GI-NEXT:    fmov s0, w0
-; CHECK-GI-NEXT:    b.ne .LBB5_9
-; CHECK-GI-NEXT:  // %bb.10: // %for.cond.cleanup
+; CHECK-GI-NEXT:    subs x11, x11, #1
+; CHECK-GI-NEXT:    mul w8, w8, w10
+; CHECK-GI-NEXT:    add w8, w8, w12, uxth
+; CHECK-GI-NEXT:    fmov s0, w8
+; CHECK-GI-NEXT:    b.ne .LBB5_8
+; CHECK-GI-NEXT:  .LBB5_9: // %for.cond.cleanup
+; CHECK-GI-NEXT:    mov w0, w8
 ; CHECK-GI-NEXT:    ret
 entry:
   %conv2 = sext i8 %B to i16
diff --git a/llvm/test/CodeGen/AArch64/swifterror.ll b/llvm/test/CodeGen/AArch64/swifterror.ll
index 07ee87e880aff..1ca98f6015c11 100644
--- a/llvm/test/CodeGen/AArch64/swifterror.ll
+++ b/llvm/test/CodeGen/AArch64/swifterror.ll
@@ -412,6 +412,7 @@ define float @foo_if(ptr swifterror %error_ptr_ref, i32 %cc) {
 ; CHECK-APPLE-NEXT:    .cfi_def_cfa w29, 16
 ; CHECK-APPLE-NEXT:    .cfi_offset w30, -8
 ; CHECK-APPLE-NEXT:    .cfi_offset w29, -16
+; CHECK-APPLE-NEXT:    movi d0, #0000000000000000
 ; CHECK-APPLE-NEXT:    cbz w0, LBB3_2
 ; CHECK-APPLE-NEXT:  ; %bb.1: ; %gen_error
 ; CHECK-APPLE-NEXT:    mov w0, #16 ; =0x10
@@ -420,10 +421,7 @@ define float @foo_if(ptr swifterror %error_ptr_ref, i32 %cc) {
 ; CHECK-APPLE-NEXT:    fmov s0, #1.00000000
 ; CHECK-APPLE-NEXT:    mov w8, #1 ; =0x1
 ; CHECK-APPLE-NEXT:    strb w8, [x0, #8]
-; CHECK-APPLE-NEXT:    ldp x29, x30, [sp], #16 ; 16-byte Folded Reload
-; CHECK-APPLE-NEXT:    ret
-; CHECK-APPLE-NEXT:  LBB3_2:
-; CHECK-APPLE-NEXT:    movi d0, #0000000000000000
+; CHECK-APPLE-NEXT:  LBB3_2: ; %common.ret
 ; CHECK-APPLE-NEXT:    ldp x29, x30, [sp], #16 ; 16-byte Folded Reload
 ; CHECK-APPLE-NEXT:    ret
 ;
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll
index 0c9ff3eee8231..70caf812ea6c2 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll
@@ -200,6 +200,7 @@ define amdgpu_ps i64 @s_sdiv_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    s_and_b64 s[0:1], s[0:1], s[6:7]
 ; CHECK-NEXT:    v_cmp_ne_u64_e64 vcc, s[0:1], 0
 ; CHECK-NEXT:    s_mov_b32 s0, 1
+; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
 ; CHECK-NEXT:    s_cbranch_vccz .LBB1_2
 ; CHECK-NEXT:  ; %bb.1:
 ; CHECK-NEXT:    s_ashr_i32 s6, s3, 31
@@ -330,15 +331,12 @@ define amdgpu_ps i64 @s_sdiv_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_xor_b32_e32 v0, s6, v0
 ; CHECK-NEXT:    s_mov_b32 s0, 0
 ; CHECK-NEXT:    v_subrev_i32_e32 v0, vcc, s6, v0
-; CHECK-NEXT:    s_branch .LBB1_3
-; CHECK-NEXT:  .LBB1_2:
-; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
-; CHECK-NEXT:  .LBB1_3: ; %Flow
+; CHECK-NEXT:  .LBB1_2: ; %Flow
 ; CHECK-NEXT:    s_xor_b32 s0, s0, 1
 ; CHECK-NEXT:    s_and_b32 s0, s0, 1
 ; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
-; CHECK-NEXT:    s_cbranch_scc1 .LBB1_5
-; CHECK-NEXT:  ; %bb.4:
+; CHECK-NEXT:    s_cbranch_scc1 .LBB1_4
+; CHECK-NEXT:  ; %bb.3:
 ; CHECK-NEXT:    v_cvt_f32_u32_e32 v0, s4
 ; CHECK-NEXT:    s_sub_i32 s0, 0, s4
 ; CHECK-NEXT:    v_rcp_iflag_f32_e32 v0, v0
@@ -358,7 +356,7 @@ define amdgpu_ps i64 @s_sdiv_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_add_i32_e32 v2, vcc, 1, v0
 ; CHECK-NEXT:    v_cmp_le_u32_e32 vcc, s4, v1
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v0, v2, vcc
-; CHECK-NEXT:  .LBB1_5:
+; CHECK-NEXT:  .LBB1_4:
 ; CHECK-NEXT:    v_readfirstlane_b32 s0, v0
 ; CHECK-NEXT:    s_mov_b32 s1, s0
 ; CHECK-NEXT:    ; return to shader part epilog
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll
index df645888626c6..2fcbc41895f03 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll
@@ -194,6 +194,7 @@ define amdgpu_ps i64 @s_srem_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    s_and_b64 s[0:1], s[0:1], s[6:7]
 ; CHECK-NEXT:    v_cmp_ne_u64_e64 vcc, s[0:1], 0
 ; CHECK-NEXT:    s_mov_b32 s7, 1
+; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
 ; CHECK-NEXT:    s_cbranch_vccz .LBB1_2
 ; CHECK-NEXT:  ; %bb.1:
 ; CHECK-NEXT:    s_ashr_i32 s6, s3, 31
@@ -322,15 +323,12 @@ define amdgpu_ps i64 @s_srem_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v0, v1, vcc
 ; CHECK-NEXT:    v_xor_b32_e32 v0, s6, v0
 ; CHECK-NEXT:    v_subrev_i32_e32 v0, vcc, s6, v0
-; CHECK-NEXT:    s_branch .LBB1_3
-; CHECK-NEXT:  .LBB1_2:
-; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
-; CHECK-NEXT:  .LBB1_3: ; %Flow
+; CHECK-NEXT:  .LBB1_2: ; %Flow
 ; CHECK-NEXT:    s_xor_b32 s0, s7, 1
 ; CHECK-NEXT:    s_and_b32 s0, s0, 1
 ; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
-; CHECK-NEXT:    s_cbranch_scc1 .LBB1_5
-; CHECK-NEXT:  ; %bb.4:
+; CHECK-NEXT:    s_cbranch_scc1 .LBB1_4
+; CHECK-NEXT:  ; %bb.3:
 ; CHECK-NEXT:    v_cvt_f32_u32_e32 v0, s4
 ; CHECK-NEXT:    s_sub_i32 s0, 0, s4
 ; CHECK-NEXT:    v_rcp_iflag_f32_e32 v0, v0
@@ -348,7 +346,7 @@ define amdgpu_ps i64 @s_srem_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_subrev_i32_e32 v1, vcc, s4, v0
 ; CHECK-NEXT:    v_cmp_le_u32_e32 vcc, s4, v0
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v0, v1, vcc
-; CHECK-NEXT:  .LBB1_5:
+; CHECK-NEXT:  .LBB1_4:
 ; CHECK-NEXT:    v_readfirstlane_b32 s0, v0
 ; CHECK-NEXT:    s_mov_b32 s1, s0
 ; CHECK-NEXT:    ; return to shader part epilog
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll
index f5a901b024ef5..c9a5a92188256 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll
@@ -193,6 +193,7 @@ define amdgpu_ps i64 @s_udiv_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_cmp_ne_u64_e64 vcc, s[4:5], 0
 ; CHECK-NEXT:    s_mov_b32 s6, 1
 ; CHECK-NEXT:    v_cvt_f32_u32_e32 v2, s2
+; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
 ; CHECK-NEXT:    s_cbranch_vccz .LBB1_2
 ; CHECK-NEXT:  ; %bb.1:
 ; CHECK-NEXT:    v_mov_b32_e32 v0, s3
@@ -318,15 +319,12 @@ define amdgpu_ps i64 @s_udiv_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v9, v5, vcc
 ; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v3
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v1, v0, vcc
-; CHECK-NEXT:    s_branch .LBB1_3
-; CHECK-NEXT:  .LBB1_2:
-; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
-; CHECK-NEXT:  .LBB1_3: ; %Flow
+; CHECK-NEXT:  .LBB1_2: ; %Flow
 ; CHECK-NEXT:    s_xor_b32 s1, s6, 1
 ; CHECK-NEXT:    s_and_b32 s1, s1, 1
 ; CHECK-NEXT:    s_cmp_lg_u32 s1, 0
-; CHECK-NEXT:    s_cbranch_scc1 .LBB1_5
-; CHECK-NEXT:  ; %bb.4:
+; CHECK-NEXT:    s_cbranch_scc1 .LBB1_4
+; CHECK-NEXT:  ; %bb.3:
 ; CHECK-NEXT:    v_rcp_iflag_f32_e32 v0, v2
 ; CHECK-NEXT:    s_sub_i32 s1, 0, s2
 ; CHECK-NEXT:    v_mul_f32_e32 v0, 0x4f7ffffe, v0
@@ -345,7 +343,7 @@ define amdgpu_ps i64 @s_udiv_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_add_i32_e32 v2, vcc, 1, v0
 ; CHECK-NEXT:    v_cmp_le_u32_e32 vcc, s2, v1
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v0, v2, vcc
-; CHECK-NEXT:  .LBB1_5:
+; CHECK-NEXT:  .LBB1_4:
 ; CHECK-NEXT:    v_readfirstlane_b32 s0, v0
 ; CHECK-NEXT:    s_mov_b32 s1, s0
 ; CHECK-NEXT:    ; return to shader part epilog
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll
index 2be4b52198b45..06e51387c8f21 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll
@@ -190,6 +190,7 @@ define amdgpu_ps i64 @s_urem_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_cmp_ne_u64_e64 vcc, s[4:5], 0
 ; CHECK-NEXT:    s_mov_b32 s6, 1
 ; CHECK-NEXT:    v_cvt_f32_u32_e32 v2, s2
+; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
 ; CHECK-NEXT:    s_cbranch_vccz .LBB1_2
 ; CHECK-NEXT:  ; %bb.1:
 ; CHECK-NEXT:    v_mov_b32_e32 v0, s3
@@ -314,15 +315,12 @@ define amdgpu_ps i64 @s_urem_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v3, v6, vcc
 ; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v1
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v4, v0, vcc
-; CHECK-NEXT:    s_branch .LBB1_3
-; CHECK-NEXT:  .LBB1_2:
-; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
-; CHECK-NEXT:  .LBB1_3: ; %Flow
+; CHECK-NEXT:  .LBB1_2: ; %Flow
 ; CHECK-NEXT:    s_xor_b32 s1, s6, 1
 ; CHECK-NEXT:    s_and_b32 s1, s1, 1
 ; CHECK-NEXT:    s_cmp_lg_u32 s1, 0
-; CHECK-NEXT:    s_cbranch_scc1 .LBB1_5
-; CHECK-NEXT:  ; %bb.4:
+; CHECK-NEXT:    s_cbranch_scc1 .LBB1_4
+; CHECK-NEXT:  ; %bb.3:
 ; CHECK-NEXT:    v_rcp_iflag_f32_e32 v0, v2
 ; CHECK-NEXT:    s_sub_i32 s1, 0, s2
 ; CHECK-NEXT:    v_mul_f32_e32 v0, 0x4f7ffffe, v0
@@ -339,7 +337,7 @@ define amdgpu_ps i64 @s_urem_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_subrev_i32_e32 v1, vcc, s2, v0
 ; CHECK-NEXT:    v_cmp_le_u32_e32 vcc, s2, v0
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v0, v1, vcc
-; CHECK-NEXT:  .LBB1_5:
+; CHECK-NEXT:  .LBB1_4:
 ; CHECK-NEXT:    v_readfirstlane_b32 s0, v0
 ; CHECK-NEXT:    s_mov_b32 s1, s0
 ; CHECK-NEXT:    ; return to shader part epilog
diff --git a/llvm/test/CodeGen/AMDGPU/artificial-terminators.mir b/llvm/test/CodeGen/AMDGPU/artificial-terminators.mir
index 1a76cae68f164..9e84d979e8547 100644
--- a/llvm/test/CodeGen/AMDGPU/artificial-terminators.mir
+++ b/llvm/test/CodeGen/AMDGPU/artificial-terminators.mir
@@ -34,18 +34,14 @@ body: |
   ; CHECK-NEXT:   S_BRANCH %bb.1
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.1:
-  ; CHECK-NEXT:   successors: %bb.5(0x30000000), %bb.2(0x50000000)
+  ; CHECK-NEXT:   successors: %bb.4(0x30000000), %bb.2(0x50000000)
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[V_CMP_LT_I32_e64_:%[0-9]+]]:sreg_32 = V_CMP_LT_I32_e64 [[V_ADD_U32_e64_3]], [[S_MOV_B32_1]], implicit $exec
   ; CHECK-NEXT:   [[S_XOR_B32_:%[0-9]+]]:sreg_32 = S_XOR_B32 $exec_lo, [[V_CMP_LT_I32_e64_]], implicit-def $scc
-  ; CHECK-NEXT:   $exec_lo = S_MOV_B32_term [[S_XOR_B32_]]
-  ; CHECK-NEXT:   S_CBRANCH_EXECNZ %bb.2, implicit $exec
-  ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT: bb.5:
-  ; CHECK-NEXT:   successors: %bb.4(0x80000000)
-  ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[COPY3:%[0-9]+]]:sreg_32 = COPY [[V_CMP_LT_I32_e64_]]
-  ; CHECK-NEXT:   S_BRANCH %bb.4
+  ; CHECK-NEXT:   $exec_lo = S_MOV_B32_term [[S_XOR_B32_]]
+  ; CHECK-NEXT:   S_CBRANCH_EXECZ %bb.4, implicit $exec
+  ; CHECK-NEXT:   S_BRANCH %bb.2
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.2:
   ; CHECK-NEXT:   successors: %bb.4(0x40000000), %bb.3(0x40000000)
@@ -64,7 +60,7 @@ body: |
   ; CHECK-NEXT:   S_BRANCH %bb.4
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.4:
-  ; CHECK-NEXT:   [[PHI:%[0-9]+]]:sreg_32 = PHI [[COPY3]], %bb.5, [[S_OR_B32_]], %bb.2, [[S_OR_B32_]], %bb.3
+  ; CHECK-NEXT:   [[PHI:%[0-9]+]]:sreg_32 = PHI [[COPY3]], %bb.1, [[S_OR_B32_]], %bb.2, [[S_OR_B32_]], %bb.3
   ; CHECK-NEXT:   $exec_lo = S_OR_B32 $exec_lo, [[PHI]], implicit-def $scc
   ; CHECK-NEXT:   S_ENDPGM 0
   bb.0:
diff --git a/llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll b/llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll
index f9ffa5ae57f3e..dfbb5f6a64042 100644
--- a/llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll
+++ b/llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll
@@ -9,44 +9,34 @@ define amdgpu_kernel void @blender_no_live_segment_at_def_error(<4 x float> %ext
 ; CHECK-NEXT:    s_addc_u32 s13, s13, 0
 ; CHECK-NEXT:    s_setreg_b32 hwreg(HW_REG_FLAT_SCR_LO), s12
 ; CHECK-NEXT:    s_setreg_b32 hwreg(HW_REG_FLAT_SCR_HI), s13
-; CHECK-NEXT:    s_load_dwordx8 s[36:43], s[8:9], 0x0
+; CHECK-NEXT:    s_load_dwordx8 s[20:27], s[8:9], 0x0
 ; CHECK-NEXT:    s_add_u32 s0, s0, s17
 ; CHECK-NEXT:    s_addc_u32 s1, s1, 0
-; CHECK-NEXT:    s_mov_b32 s12, 0
-; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
-; CHECK-NEXT:    s_cmp_lg_u32 s40, 0
-; CHECK-NEXT:    s_cbranch_scc1 .LBB0_8
-; CHECK-NEXT:  ; %bb.1: ; %if.end13.i.i
-; CHECK-NEXT:    s_cmp_eq_u32 s42, 0
-; CHECK-NEXT:    s_cbranch_scc1 .LBB0_4
-; CHECK-NEXT:  ; %bb.2: ; %if.else251.i.i
-; CHECK-NEXT:    s_cmp_lg_u32 s43, 0
-; CHECK-NEXT:    s_mov_b32 s17, 0
-; CHECK-NEXT:    s_cselect_b32 s12, -1, 0
-; CHECK-NEXT:    s_and_b32 vcc_lo, exec_lo, s12
-; CHECK-NEXT:    s_cbranch_vccz .LBB0_5
-; CHECK-NEXT:  ; %bb.3:
 ; CHECK-NEXT:    s_mov_b32 s36, 0
-; CHECK-NEXT:    s_andn2_b32 vcc_lo, exec_lo, s12
-; CHECK-NEXT:    s_cbranch_vccz .LBB0_6
-; CHECK-NEXT:    s_branch .LBB0_7
-; CHECK-NEXT:  .LBB0_4:
-; CHECK-NEXT:    s_mov_b32 s14, s12
-; CHECK-NEXT:    s_mov_b32 s15, s12
-; CHECK-NEXT:    s_mov_b32 s13, s12
-; CHECK-NEXT:    s_mov_b64 s[38:39], s[14:15]
-; CHECK-NEXT:    s_mov_b64 s[36:37], s[12:13]
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    s_cmp_lg_u32 s24, 0
+; CHECK-NEXT:    s_cbranch_scc0 .LBB0_2
+; CHECK-NEXT:  ; %bb.1:
+; CHECK-NEXT:    s_mov_b64 s[38:39], s[22:23]
+; CHECK-NEXT:    s_mov_b64 s[36:37], s[20:21]
 ; CHECK-NEXT:    s_branch .LBB0_7
-; CHECK-NEXT:  .LBB0_5: ; %if.then263.i.i
-; CHECK-NEXT:    v_cmp_lt_f32_e64 s12, s41, 0
-; CHECK-NEXT:    s_mov_b32 s36, 1.0
-; CHECK-NEXT:    s_mov_b32 s17, 0x7fc00000
+; CHECK-NEXT:  .LBB0_2: ; %if.end13.i.i
 ; CHECK-NEXT:    s_mov_b32 s37, s36
 ; CHECK-NEXT:    s_mov_b32 s38, s36
+; CHECK-NEXT:    s_cmp_eq_u32 s26, 0
 ; CHECK-NEXT:    s_mov_b32 s39, s36
+; CHECK-NEXT:    s_cbranch_scc1 .LBB0_6
+; CHECK-NEXT:  ; %bb.3: ; %if.else251.i.i
+; CHECK-NEXT:    s_cmp_lg_u32 s27, 0
+; CHECK-NEXT:    s_mov_b32 s17, 0
+; CHECK-NEXT:    s_cselect_b32 s12, -1, 0
+; CHECK-NEXT:    s_and_b32 vcc_lo, exec_lo, s12
+; CHECK-NEXT:    s_cbranch_vccz .LBB0_8
+; CHECK-NEXT:  ; %bb.4:
+; CHECK-NEXT:    s_mov_b32 s36, 0
 ; CHECK-NEXT:    s_andn2_b32 vcc_lo, exec_lo, s12
-; CHECK-NEXT:    s_cbranch_vccnz .LBB0_7
-; CHECK-NEXT:  .LBB0_6: ; %if.end273.i.i
+; CHECK-NEXT:    s_cbranch_vccnz .LBB0_6
+; CHECK-NEXT:  .LBB0_5: ; %if.end273.i.i
 ; CHECK-NEXT:    s_add_u32 s12, s8, 40
 ; CHECK-NEXT:    s_addc_u32 s13, s9, 0
 ; CHECK-NEXT:    s_getpc_b64 s[18:19]
@@ -72,13 +62,13 @@ define amdgpu_kernel void @blender_no_live_segment_at_def_error(<4 x float> %ext
 ; CHECK-NEXT:    s_mov_b32 s37, s36
 ; CHECK-NEXT:    s_mov_b32 s38, s36
 ; CHECK-NEXT:    s_mov_b32 s39, s36
-; CHECK-NEXT:  .LBB0_7: ; %if.end294.i.i
+; CHECK-NEXT:  .LBB0_6: ; %if.end294.i.i
 ; CHECK-NEXT:    v_mov_b32_e32 v0, 0
 ; CHECK-NEXT:    buffer_store_dword v0, off, s[0:3], 0 offset:12
 ; CHECK-NEXT:    buffer_store_dword v0, off, s[0:3], 0 offset:8
 ; CHECK-NEXT:    buffer_store_dword v0, off, s[0:3], 0 offset:4
 ; CHECK-NEXT:    buffer_store_dword v0, off, s[0:3], 0
-; CHECK-NEXT:  .LBB0_8: ; %kernel_direct_lighting.exit
+; CHECK-NEXT:  .LBB0_7: ; %kernel_direct_lighting.exit
 ; CHECK-NEXT:    s_load_dwordx2 s[4:5], s[8:9], 0x20
 ; CHECK-NEXT:    v_mov_b32_e32 v0, s36
 ; CHECK-NEXT:    v_mov_b32_e32 v4, 0
@@ -88,6 +78,16 @@ define amdgpu_kernel void @blender_no_live_segment_at_def_error(<4 x float> %ext
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    global_store_dwordx4 v4, v[0:3], s[4:5]
 ; CHECK-NEXT:    s_endpgm
+; CHECK-NEXT:  .LBB0_8: ; %if.then263.i.i
+; CHECK-NEXT:    v_cmp_lt_f32_e64 s12, s25, 0
+; CHECK-NEXT:    s_mov_b32 s36, 1.0
+; CHECK-NEXT:    s_mov_b32 s17, 0x7fc00000
+; CHECK-NEXT:    s_mov_b32 s37, s36
+; CHECK-NEXT:    s_mov_b32 s38, s36
+; CHECK-NEXT:    s_mov_b32 s39, s36
+; CHECK-NEXT:    s_andn2_b32 vcc_lo, exec_lo, s12
+; CHECK-NEXT:    s_cbranch_vccz .LBB0_5
+; CHECK-NEXT:    s_branch .LBB0_6
 entry:
   %cmp5.i.i = icmp eq i32 %cmp5.i.i.arg, 0
   br i1 %cmp5.i.i, label %if.end13.i.i, label %kernel_direct_lighting.exit
diff --git a/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll b/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll
index d61c4b46596c0..ce0b79b0b358c 100644
--- a/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll
+++ b/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll
@@ -848,12 +848,13 @@ define amdgpu_kernel void @test_dynamic_stackalloc_kernel_control_flow(i32 %n, i
 ; GFX9-SDAG-NEXT:    s_load_dwordx2 s[4:5], s[8:9], 0x0
 ; GFX9-SDAG-NEXT:    s_add_u32 s0, s0, s17
 ; GFX9-SDAG-NEXT:    s_addc_u32 s1, s1, 0
+; GFX9-SDAG-NEXT:    s_mov_b64 s[6:7], -1
 ; GFX9-SDAG-NEXT:    s_mov_b32 s33, 0
-; GFX9-SDAG-NEXT:    s_movk_i32 s32, 0x1000
 ; GFX9-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX9-SDAG-NEXT:    s_cmp_lg_u32 s4, 0
 ; GFX9-SDAG-NEXT:    s_mov_b32 s4, 0
-; GFX9-SDAG-NEXT:    s_cbranch_scc0 .LBB7_6
+; GFX9-SDAG-NEXT:    s_movk_i32 s32, 0x1000
+; GFX9-SDAG-NEXT:    s_cbranch_scc0 .LBB7_4
 ; GFX9-SDAG-NEXT:  ; %bb.1: ; %bb.1
 ; GFX9-SDAG-NEXT:    v_lshl_add_u32 v0, v0, 2, 15
 ; GFX9-SDAG-NEXT:    v_and_b32_e32 v0, 0x1ff0, v0
@@ -873,8 +874,11 @@ define amdgpu_kernel void @test_dynamic_stackalloc_kernel_control_flow(i32 %n, i
 ; GFX9-SDAG-NEXT:    v_mov_b32_e32 v0, 1
 ; GFX9-SDAG-NEXT:    buffer_store_dword v0, off, s[0:3], s6
 ; GFX9-SDAG-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-SDAG-NEXT:    s_cbranch_execnz .LB...
[truncated]

@llvmbot
Copy link
Member

llvmbot commented Feb 25, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Guy David (guy-david)

Changes

Requires #128745.

Lower it slightly below the likeliness of a null-check to be true which is set to 37.5% (see PtrUntakenProb).
Otherwise, it will split the edge and create another basic-block and with an unconditional branch which might make the CFG more complex and with a suboptimal block placement.
Note that if multiple instructions can be sinked from the same edge then a split will occur regardless of this change.

On M4 Pro:

$ ./utils/compare.py build-a/results1.json build-a/results2.json build-a/results3.json vs build-b/results1.json build-b/results2.json build-b/results3.json
Tests: 4314
Metric: exec_time
Program                                       exec_time             
                                              lhs       rhs    diff 
MultiSourc...chmarks/Prolangs-C/agrep/agrep     0.00      0.01 44.7%
MultiSourc...rks/McCat/03-testtrie/testtrie     0.01      0.01 31.7%
SingleSour...hmarks/Shootout/Shootout-lists     2.02      2.64 30.6%
SingleSour...ecute/GCC-C-execute-20170419-1     0.00      0.00 14.3%
SingleSour.../execute/GCC-C-execute-pr59101     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20040311-1     0.00      0.00 14.3%
SingleSour.../execute/GCC-C-execute-pr57124     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20031204-1     0.00      0.00 14.3%
SingleSour...xecute/GCC-C-execute-pr57344-3     0.00      0.00 14.3%
SingleSour.../execute/GCC-C-execute-pr57875     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20030811-1     0.00      0.00 14.3%
SingleSour.../execute/GCC-C-execute-pr58640     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20030408-1     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20030323-1     0.00      0.00 14.3%
SingleSour...ecute/GCC-C-execute-20030203-1     0.00      0.00 14.3%
                           Geomean difference                   0.1%
           exec_time                            
l/r              lhs            rhs         diff
count  4314.000000    4314.000000    4294.000000
mean   453.919219     454.105532     0.002072   
std    10865.757400   10868.002426   0.043046   
min    0.000000       0.000000      -0.171642   
25%    0.000700       0.000700       0.000000   
50%    0.007400       0.007400       0.000000   
75%    0.047829       0.047950       0.000033   
max    321294.306703  321320.624713  0.447368

On Ryzen9 5950X:

$ ./utils/compare.py build-a/results1.json build-a/results2.json build-a/results3.json vs build-b/results1.json build-b/results2.json build-b/results3.json
Tests: 3326
Metric: exec_time
Program                                       exec_time               
                                              lhs       rhs     diff  
MemFunctio...mCmp&lt;1, GreaterThanZero, None&gt;   1741.26   1885.00 143.74
MemFunctio..._MemCmp&lt;1, LessThanZero, Last&gt;   1759.78   1873.93 114.15
MemFunctio...est:BM_MemCmp&lt;1, EqZero, Last&gt;   1747.19   1847.42 100.22
MemFunctio...Cmp&lt;1, GreaterThanZero, First&gt;   1750.17   1844.57  94.40
MemFunctio...mCmp&lt;1, GreaterThanZero, Last&gt;   1751.05   1844.68  93.63
MemFunctio...emCmp&lt;1, GreaterThanZero, Mid&gt;   1756.49   1849.62  93.13
MemFunctio..._MemCmp&lt;1, LessThanZero, None&gt;   1744.87   1835.22  90.35
MemFunctio...M_MemCmp&lt;1, LessThanZero, Mid&gt;   1757.53   1846.29  88.77
harris/har...est:BENCHMARK_HARRIS/1024/1024   5689.29   5754.88  65.59
MemFunctio...MemCmp&lt;2, LessThanZero, First&gt;   1123.00   1181.63  58.63
MemFunctio...test:BM_MemCmp&lt;1, EqZero, Mid&gt;   2524.93   2582.21  57.28
MemFunctio...est:BM_MemCmp&lt;1, EqZero, None&gt;   2525.97   2582.43  56.46
MemFunctio..._MemCmp&lt;3, LessThanZero, Last&gt;    869.04    924.66  55.62
MemFunctio...test:BM_MemCmp&lt;3, EqZero, Mid&gt;    878.39    932.53  54.14
MemFunctio...MemCmp&lt;1, LessThanZero, First&gt;   2528.37   2582.27  53.90
                                                                                                                                                   exec_time                             
l/r                                                                                                                                                      lhs            rhs          diff
Program                                                                                                                                                                                  
test-suite :: MicroBenchmarks/MemFunctions/MemFunctions.test:BM_MemCmp&lt;1, GreaterThanZero, None&gt;                                               1741.261663    1884.998860    143.737197  
test-suite :: MicroBenchmarks/MemFunctions/MemFunctions.test:BM_MemCmp&lt;1, LessThanZero, Last&gt;                                                  1759.779355    1873.926412    114.147056  
test-suite :: MicroBenchmarks/MemFunctions/MemFunctions.test:BM_MemCmp&lt;1, EqZero, Last&gt;                                                        1747.192734    1847.416650    100.223916  
test-suite :: MicroBenchmarks/MemFunctions/MemFunctions.test:BM_MemCmp&lt;1, GreaterThanZero, First&gt;                                              1750.171003    1844.569735    94.398732   
test-suite :: MicroBenchmarks/MemFunctions/MemFunctions.test:BM_MemCmp&lt;1, GreaterThanZero, Last&gt;                                               1751.049323    1844.682784    93.633461   
...                                                                                                                                                    ...            ...           ...  
test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test:benchVecWithRuntimeChecks4PointersDAfterA/1000                435033.995649  412835.347288 -22198.648362
test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test:benchVecWithRuntimeChecks4PointersAllDisjointDecreasing/1000  435136.829708  412921.450737 -22215.378970
test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test:benchVecWithRuntimeChecks4PointersAllDisjointIncreasing/1000  435136.457427  412908.677876 -22227.779551
test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test:benchVecWithRuntimeChecks4PointersDEqualsA/1000               435088.787446  412769.793042 -22318.994403
test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test:benchVecWithRuntimeChecks4PointersDBeforeA/1000               835721.265233  791510.926471 -44210.338762

[3326 rows x 3 columns]
           exec_time                             
l/r              lhs            rhs          diff
count  3326.000000    3326.000000    3326.000000 
mean   916.350942     873.987972    -42.362970   
std    20951.565020   19865.106212   1087.788132 
min    0.000000       0.000000      -44210.338762
25%    0.000000       0.000000      -0.000400    
50%    0.000400       0.000400       0.000000    
75%    1.774625       1.732975       0.000400    
max    835721.265233  791510.926471  143.737197

I looked into the disassembly of BM_MemCmp&lt;1, GreaterThanZero, None&gt; in MemFunctions.test and it has not changed.


Patch is 226.63 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/127666.diff

51 Files Affected:

  • (modified) llvm/lib/CodeGen/MachineSink.cpp (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll (+23-23)
  • (modified) llvm/test/CodeGen/AArch64/swifterror.ll (+2-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll (+5-7)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll (+5-7)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll (+5-7)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll (+5-7)
  • (modified) llvm/test/CodeGen/AMDGPU/artificial-terminators.mir (+5-9)
  • (modified) llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll (+32-32)
  • (modified) llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll (+20-16)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+64-52)
  • (modified) llvm/test/CodeGen/AMDGPU/no-fold-accvgpr-mov.ll (+24-18)
  • (modified) llvm/test/CodeGen/AMDGPU/optimize-negated-cond.ll (+8-7)
  • (modified) llvm/test/CodeGen/AMDGPU/skip-if-dead.ll (+8-16)
  • (modified) llvm/test/CodeGen/ARM/and-cmp0-sink.ll (+14-14)
  • (modified) llvm/test/CodeGen/Mips/llvm-ir/sdiv-freebsd.ll (+6-3)
  • (modified) llvm/test/CodeGen/PowerPC/common-chain-aix32.ll (+16-17)
  • (modified) llvm/test/CodeGen/PowerPC/common-chain.ll (+80-89)
  • (modified) llvm/test/CodeGen/PowerPC/ifcvt_cr_field.ll (+6-12)
  • (modified) llvm/test/CodeGen/PowerPC/knowCRBitSpill.ll (+1)
  • (modified) llvm/test/CodeGen/PowerPC/loop-instr-form-prepare.ll (+88-114)
  • (modified) llvm/test/CodeGen/PowerPC/loop-instr-prep-non-const-increasement.ll (+16-19)
  • (modified) llvm/test/CodeGen/PowerPC/mma-phi-accs.ll (+6-12)
  • (modified) llvm/test/CodeGen/PowerPC/p10-spill-creq.ll (+28-33)
  • (modified) llvm/test/CodeGen/PowerPC/ppc64-rop-protection-aix.ll (+54-60)
  • (modified) llvm/test/CodeGen/PowerPC/ppc64-rop-protection.ll (+66-81)
  • (modified) llvm/test/CodeGen/PowerPC/shrink-wrap.ll (+12-20)
  • (modified) llvm/test/CodeGen/PowerPC/spe.ll (+2-4)
  • (modified) llvm/test/CodeGen/PowerPC/zext-and-cmp.ll (+16-6)
  • (modified) llvm/test/CodeGen/RISCV/fold-addi-loadstore.ll (+6-12)
  • (modified) llvm/test/CodeGen/Thumb2/mve-pipelineloops.ll (+7-10)
  • (modified) llvm/test/CodeGen/WebAssembly/implicit-def.ll (+23-12)
  • (modified) llvm/test/CodeGen/X86/2007-11-06-InstrSched.ll (+6-9)
  • (modified) llvm/test/CodeGen/X86/2008-04-28-CoalescerBug.ll (+8-11)
  • (modified) llvm/test/CodeGen/X86/atomic-rm-bit-test-64.ll (+61-68)
  • (modified) llvm/test/CodeGen/X86/atomic-rm-bit-test.ll (+342-395)
  • (modified) llvm/test/CodeGen/X86/branchfolding-debugloc.ll (+4-5)
  • (modified) llvm/test/CodeGen/X86/break-false-dep.ll (+28-48)
  • (modified) llvm/test/CodeGen/X86/coalescer-commute4.ll (+6-9)
  • (modified) llvm/test/CodeGen/X86/ctlo.ll (+21-26)
  • (modified) llvm/test/CodeGen/X86/ctlz.ll (+56-72)
  • (modified) llvm/test/CodeGen/X86/cttz.ll (+18-30)
  • (modified) llvm/test/CodeGen/X86/fold-loop-of-urem.ll (+26-32)
  • (modified) llvm/test/CodeGen/X86/lsr-sort.ll (+6-4)
  • (modified) llvm/test/CodeGen/X86/mmx-arith.ll (+7-9)
  • (modified) llvm/test/CodeGen/X86/pr2659.ll (+32-10)
  • (modified) llvm/test/CodeGen/X86/pr38795.ll (+50-53)
  • (modified) llvm/test/CodeGen/X86/probe-stack-eflags.ll (+4-6)
  • (modified) llvm/test/CodeGen/X86/taildup-heapallocsite.ll (+4-9)
  • (modified) llvm/test/CodeGen/X86/testb-je-fusion.ll (+8-10)
  • (modified) llvm/test/CodeGen/X86/x86-shrink-wrapping.ll (+12-16)
diff --git a/llvm/lib/CodeGen/MachineSink.cpp b/llvm/lib/CodeGen/MachineSink.cpp
index 82acb780cfb72..81459cf65d6c2 100644
--- a/llvm/lib/CodeGen/MachineSink.cpp
+++ b/llvm/lib/CodeGen/MachineSink.cpp
@@ -82,7 +82,7 @@ static cl::opt<unsigned> SplitEdgeProbabilityThreshold(
         "If the branch threshold is higher than this threshold, we allow "
         "speculative execution of up to 1 instruction to avoid branching to "
         "splitted critical edge"),
-    cl::init(40), cl::Hidden);
+    cl::init(35), cl::Hidden);
 
 static cl::opt<unsigned> SinkLoadInstsPerBlockThreshold(
     "machine-sink-load-instrs-threshold",
diff --git a/llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll b/llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll
index fb6575cc0ee83..fdc087e9c1991 100644
--- a/llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll
+++ b/llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll
@@ -632,20 +632,18 @@ define i16 @red_mla_dup_ext_u8_s8_s16(ptr noalias nocapture noundef readonly %A,
 ;
 ; CHECK-GI-LABEL: red_mla_dup_ext_u8_s8_s16:
 ; CHECK-GI:       // %bb.0: // %entry
-; CHECK-GI-NEXT:    cbz w2, .LBB5_3
+; CHECK-GI-NEXT:    mov w8, wzr
+; CHECK-GI-NEXT:    cbz w2, .LBB5_9
 ; CHECK-GI-NEXT:  // %bb.1: // %for.body.preheader
 ; CHECK-GI-NEXT:    cmp w2, #16
 ; CHECK-GI-NEXT:    mov w8, w2
-; CHECK-GI-NEXT:    b.hs .LBB5_4
+; CHECK-GI-NEXT:    b.hs .LBB5_3
 ; CHECK-GI-NEXT:  // %bb.2:
 ; CHECK-GI-NEXT:    mov w10, #0 // =0x0
 ; CHECK-GI-NEXT:    mov x9, xzr
 ; CHECK-GI-NEXT:    fmov s0, w10
-; CHECK-GI-NEXT:    b .LBB5_8
-; CHECK-GI-NEXT:  .LBB5_3:
-; CHECK-GI-NEXT:    mov w0, wzr
-; CHECK-GI-NEXT:    ret
-; CHECK-GI-NEXT:  .LBB5_4: // %vector.ph
+; CHECK-GI-NEXT:    b .LBB5_7
+; CHECK-GI-NEXT:  .LBB5_3: // %vector.ph
 ; CHECK-GI-NEXT:    lsl w9, w1, #8
 ; CHECK-GI-NEXT:    movi v0.2d, #0000000000000000
 ; CHECK-GI-NEXT:    movi v1.2d, #0000000000000000
@@ -654,7 +652,7 @@ define i16 @red_mla_dup_ext_u8_s8_s16(ptr noalias nocapture noundef readonly %A,
 ; CHECK-GI-NEXT:    dup v2.8h, w9
 ; CHECK-GI-NEXT:    and x9, x8, #0xfffffff0
 ; CHECK-GI-NEXT:    mov x11, x9
-; CHECK-GI-NEXT:  .LBB5_5: // %vector.body
+; CHECK-GI-NEXT:  .LBB5_4: // %vector.body
 ; CHECK-GI-NEXT:    // =>This Inner Loop Header: Depth=1
 ; CHECK-GI-NEXT:    ldp d3, d4, [x10, #-8]
 ; CHECK-GI-NEXT:    subs x11, x11, #16
@@ -663,29 +661,31 @@ define i16 @red_mla_dup_ext_u8_s8_s16(ptr noalias nocapture noundef readonly %A,
 ; CHECK-GI-NEXT:    ushll v4.8h, v4.8b, #0
 ; CHECK-GI-NEXT:    mla v0.8h, v2.8h, v3.8h
 ; CHECK-GI-NEXT:    mla v1.8h, v2.8h, v4.8h
-; CHECK-GI-NEXT:    b.ne .LBB5_5
-; CHECK-GI-NEXT:  // %bb.6: // %middle.block
+; CHECK-GI-NEXT:    b.ne .LBB5_4
+; CHECK-GI-NEXT:  // %bb.5: // %middle.block
 ; CHECK-GI-NEXT:    add v0.8h, v1.8h, v0.8h
 ; CHECK-GI-NEXT:    cmp x9, x8
 ; CHECK-GI-NEXT:    addv h0, v0.8h
-; CHECK-GI-NEXT:    b.ne .LBB5_8
-; CHECK-GI-NEXT:  // %bb.7:
-; CHECK-GI-NEXT:    fmov w0, s0
+; CHECK-GI-NEXT:    b.ne .LBB5_7
+; CHECK-GI-NEXT:  // %bb.6:
+; CHECK-GI-NEXT:    fmov w8, s0
+; CHECK-GI-NEXT:    mov w0, w8
 ; CHECK-GI-NEXT:    ret
-; CHECK-GI-NEXT:  .LBB5_8: // %for.body.preheader1
+; CHECK-GI-NEXT:  .LBB5_7: // %for.body.preheader1
 ; CHECK-GI-NEXT:    sxtb w10, w1
-; CHECK-GI-NEXT:    sub x8, x8, x9
+; CHECK-GI-NEXT:    sub x11, x8, x9
 ; CHECK-GI-NEXT:    add x9, x0, x9
-; CHECK-GI-NEXT:  .LBB5_9: // %for.body
+; CHECK-GI-NEXT:  .LBB5_8: // %for.body
 ; CHECK-GI-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-GI-NEXT:    ldrb w11, [x9], #1
+; CHECK-GI-NEXT:    ldrb w8, [x9], #1
 ; CHECK-GI-NEXT:    fmov w12, s0
-; CHECK-GI-NEXT:    subs x8, x8, #1
-; CHECK-GI-NEXT:    mul w11, w11, w10
-; CHECK-GI-NEXT:    add w0, w11, w12, uxth
-; CHECK-GI-NEXT:    fmov s0, w0
-; CHECK-GI-NEXT:    b.ne .LBB5_9
-; CHECK-GI-NEXT:  // %bb.10: // %for.cond.cleanup
+; CHECK-GI-NEXT:    subs x11, x11, #1
+; CHECK-GI-NEXT:    mul w8, w8, w10
+; CHECK-GI-NEXT:    add w8, w8, w12, uxth
+; CHECK-GI-NEXT:    fmov s0, w8
+; CHECK-GI-NEXT:    b.ne .LBB5_8
+; CHECK-GI-NEXT:  .LBB5_9: // %for.cond.cleanup
+; CHECK-GI-NEXT:    mov w0, w8
 ; CHECK-GI-NEXT:    ret
 entry:
   %conv2 = sext i8 %B to i16
diff --git a/llvm/test/CodeGen/AArch64/swifterror.ll b/llvm/test/CodeGen/AArch64/swifterror.ll
index 07ee87e880aff..1ca98f6015c11 100644
--- a/llvm/test/CodeGen/AArch64/swifterror.ll
+++ b/llvm/test/CodeGen/AArch64/swifterror.ll
@@ -412,6 +412,7 @@ define float @foo_if(ptr swifterror %error_ptr_ref, i32 %cc) {
 ; CHECK-APPLE-NEXT:    .cfi_def_cfa w29, 16
 ; CHECK-APPLE-NEXT:    .cfi_offset w30, -8
 ; CHECK-APPLE-NEXT:    .cfi_offset w29, -16
+; CHECK-APPLE-NEXT:    movi d0, #0000000000000000
 ; CHECK-APPLE-NEXT:    cbz w0, LBB3_2
 ; CHECK-APPLE-NEXT:  ; %bb.1: ; %gen_error
 ; CHECK-APPLE-NEXT:    mov w0, #16 ; =0x10
@@ -420,10 +421,7 @@ define float @foo_if(ptr swifterror %error_ptr_ref, i32 %cc) {
 ; CHECK-APPLE-NEXT:    fmov s0, #1.00000000
 ; CHECK-APPLE-NEXT:    mov w8, #1 ; =0x1
 ; CHECK-APPLE-NEXT:    strb w8, [x0, #8]
-; CHECK-APPLE-NEXT:    ldp x29, x30, [sp], #16 ; 16-byte Folded Reload
-; CHECK-APPLE-NEXT:    ret
-; CHECK-APPLE-NEXT:  LBB3_2:
-; CHECK-APPLE-NEXT:    movi d0, #0000000000000000
+; CHECK-APPLE-NEXT:  LBB3_2: ; %common.ret
 ; CHECK-APPLE-NEXT:    ldp x29, x30, [sp], #16 ; 16-byte Folded Reload
 ; CHECK-APPLE-NEXT:    ret
 ;
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll
index 0c9ff3eee8231..70caf812ea6c2 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll
@@ -200,6 +200,7 @@ define amdgpu_ps i64 @s_sdiv_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    s_and_b64 s[0:1], s[0:1], s[6:7]
 ; CHECK-NEXT:    v_cmp_ne_u64_e64 vcc, s[0:1], 0
 ; CHECK-NEXT:    s_mov_b32 s0, 1
+; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
 ; CHECK-NEXT:    s_cbranch_vccz .LBB1_2
 ; CHECK-NEXT:  ; %bb.1:
 ; CHECK-NEXT:    s_ashr_i32 s6, s3, 31
@@ -330,15 +331,12 @@ define amdgpu_ps i64 @s_sdiv_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_xor_b32_e32 v0, s6, v0
 ; CHECK-NEXT:    s_mov_b32 s0, 0
 ; CHECK-NEXT:    v_subrev_i32_e32 v0, vcc, s6, v0
-; CHECK-NEXT:    s_branch .LBB1_3
-; CHECK-NEXT:  .LBB1_2:
-; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
-; CHECK-NEXT:  .LBB1_3: ; %Flow
+; CHECK-NEXT:  .LBB1_2: ; %Flow
 ; CHECK-NEXT:    s_xor_b32 s0, s0, 1
 ; CHECK-NEXT:    s_and_b32 s0, s0, 1
 ; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
-; CHECK-NEXT:    s_cbranch_scc1 .LBB1_5
-; CHECK-NEXT:  ; %bb.4:
+; CHECK-NEXT:    s_cbranch_scc1 .LBB1_4
+; CHECK-NEXT:  ; %bb.3:
 ; CHECK-NEXT:    v_cvt_f32_u32_e32 v0, s4
 ; CHECK-NEXT:    s_sub_i32 s0, 0, s4
 ; CHECK-NEXT:    v_rcp_iflag_f32_e32 v0, v0
@@ -358,7 +356,7 @@ define amdgpu_ps i64 @s_sdiv_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_add_i32_e32 v2, vcc, 1, v0
 ; CHECK-NEXT:    v_cmp_le_u32_e32 vcc, s4, v1
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v0, v2, vcc
-; CHECK-NEXT:  .LBB1_5:
+; CHECK-NEXT:  .LBB1_4:
 ; CHECK-NEXT:    v_readfirstlane_b32 s0, v0
 ; CHECK-NEXT:    s_mov_b32 s1, s0
 ; CHECK-NEXT:    ; return to shader part epilog
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll
index df645888626c6..2fcbc41895f03 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll
@@ -194,6 +194,7 @@ define amdgpu_ps i64 @s_srem_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    s_and_b64 s[0:1], s[0:1], s[6:7]
 ; CHECK-NEXT:    v_cmp_ne_u64_e64 vcc, s[0:1], 0
 ; CHECK-NEXT:    s_mov_b32 s7, 1
+; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
 ; CHECK-NEXT:    s_cbranch_vccz .LBB1_2
 ; CHECK-NEXT:  ; %bb.1:
 ; CHECK-NEXT:    s_ashr_i32 s6, s3, 31
@@ -322,15 +323,12 @@ define amdgpu_ps i64 @s_srem_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v0, v1, vcc
 ; CHECK-NEXT:    v_xor_b32_e32 v0, s6, v0
 ; CHECK-NEXT:    v_subrev_i32_e32 v0, vcc, s6, v0
-; CHECK-NEXT:    s_branch .LBB1_3
-; CHECK-NEXT:  .LBB1_2:
-; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
-; CHECK-NEXT:  .LBB1_3: ; %Flow
+; CHECK-NEXT:  .LBB1_2: ; %Flow
 ; CHECK-NEXT:    s_xor_b32 s0, s7, 1
 ; CHECK-NEXT:    s_and_b32 s0, s0, 1
 ; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
-; CHECK-NEXT:    s_cbranch_scc1 .LBB1_5
-; CHECK-NEXT:  ; %bb.4:
+; CHECK-NEXT:    s_cbranch_scc1 .LBB1_4
+; CHECK-NEXT:  ; %bb.3:
 ; CHECK-NEXT:    v_cvt_f32_u32_e32 v0, s4
 ; CHECK-NEXT:    s_sub_i32 s0, 0, s4
 ; CHECK-NEXT:    v_rcp_iflag_f32_e32 v0, v0
@@ -348,7 +346,7 @@ define amdgpu_ps i64 @s_srem_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_subrev_i32_e32 v1, vcc, s4, v0
 ; CHECK-NEXT:    v_cmp_le_u32_e32 vcc, s4, v0
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v0, v1, vcc
-; CHECK-NEXT:  .LBB1_5:
+; CHECK-NEXT:  .LBB1_4:
 ; CHECK-NEXT:    v_readfirstlane_b32 s0, v0
 ; CHECK-NEXT:    s_mov_b32 s1, s0
 ; CHECK-NEXT:    ; return to shader part epilog
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll
index f5a901b024ef5..c9a5a92188256 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll
@@ -193,6 +193,7 @@ define amdgpu_ps i64 @s_udiv_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_cmp_ne_u64_e64 vcc, s[4:5], 0
 ; CHECK-NEXT:    s_mov_b32 s6, 1
 ; CHECK-NEXT:    v_cvt_f32_u32_e32 v2, s2
+; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
 ; CHECK-NEXT:    s_cbranch_vccz .LBB1_2
 ; CHECK-NEXT:  ; %bb.1:
 ; CHECK-NEXT:    v_mov_b32_e32 v0, s3
@@ -318,15 +319,12 @@ define amdgpu_ps i64 @s_udiv_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v9, v5, vcc
 ; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v3
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v1, v0, vcc
-; CHECK-NEXT:    s_branch .LBB1_3
-; CHECK-NEXT:  .LBB1_2:
-; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
-; CHECK-NEXT:  .LBB1_3: ; %Flow
+; CHECK-NEXT:  .LBB1_2: ; %Flow
 ; CHECK-NEXT:    s_xor_b32 s1, s6, 1
 ; CHECK-NEXT:    s_and_b32 s1, s1, 1
 ; CHECK-NEXT:    s_cmp_lg_u32 s1, 0
-; CHECK-NEXT:    s_cbranch_scc1 .LBB1_5
-; CHECK-NEXT:  ; %bb.4:
+; CHECK-NEXT:    s_cbranch_scc1 .LBB1_4
+; CHECK-NEXT:  ; %bb.3:
 ; CHECK-NEXT:    v_rcp_iflag_f32_e32 v0, v2
 ; CHECK-NEXT:    s_sub_i32 s1, 0, s2
 ; CHECK-NEXT:    v_mul_f32_e32 v0, 0x4f7ffffe, v0
@@ -345,7 +343,7 @@ define amdgpu_ps i64 @s_udiv_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_add_i32_e32 v2, vcc, 1, v0
 ; CHECK-NEXT:    v_cmp_le_u32_e32 vcc, s2, v1
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v0, v2, vcc
-; CHECK-NEXT:  .LBB1_5:
+; CHECK-NEXT:  .LBB1_4:
 ; CHECK-NEXT:    v_readfirstlane_b32 s0, v0
 ; CHECK-NEXT:    s_mov_b32 s1, s0
 ; CHECK-NEXT:    ; return to shader part epilog
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll
index 2be4b52198b45..06e51387c8f21 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll
@@ -190,6 +190,7 @@ define amdgpu_ps i64 @s_urem_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_cmp_ne_u64_e64 vcc, s[4:5], 0
 ; CHECK-NEXT:    s_mov_b32 s6, 1
 ; CHECK-NEXT:    v_cvt_f32_u32_e32 v2, s2
+; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
 ; CHECK-NEXT:    s_cbranch_vccz .LBB1_2
 ; CHECK-NEXT:  ; %bb.1:
 ; CHECK-NEXT:    v_mov_b32_e32 v0, s3
@@ -314,15 +315,12 @@ define amdgpu_ps i64 @s_urem_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v3, v6, vcc
 ; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v1
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v4, v0, vcc
-; CHECK-NEXT:    s_branch .LBB1_3
-; CHECK-NEXT:  .LBB1_2:
-; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
-; CHECK-NEXT:  .LBB1_3: ; %Flow
+; CHECK-NEXT:  .LBB1_2: ; %Flow
 ; CHECK-NEXT:    s_xor_b32 s1, s6, 1
 ; CHECK-NEXT:    s_and_b32 s1, s1, 1
 ; CHECK-NEXT:    s_cmp_lg_u32 s1, 0
-; CHECK-NEXT:    s_cbranch_scc1 .LBB1_5
-; CHECK-NEXT:  ; %bb.4:
+; CHECK-NEXT:    s_cbranch_scc1 .LBB1_4
+; CHECK-NEXT:  ; %bb.3:
 ; CHECK-NEXT:    v_rcp_iflag_f32_e32 v0, v2
 ; CHECK-NEXT:    s_sub_i32 s1, 0, s2
 ; CHECK-NEXT:    v_mul_f32_e32 v0, 0x4f7ffffe, v0
@@ -339,7 +337,7 @@ define amdgpu_ps i64 @s_urem_i64(i64 inreg %num, i64 inreg %den) {
 ; CHECK-NEXT:    v_subrev_i32_e32 v1, vcc, s2, v0
 ; CHECK-NEXT:    v_cmp_le_u32_e32 vcc, s2, v0
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v0, v1, vcc
-; CHECK-NEXT:  .LBB1_5:
+; CHECK-NEXT:  .LBB1_4:
 ; CHECK-NEXT:    v_readfirstlane_b32 s0, v0
 ; CHECK-NEXT:    s_mov_b32 s1, s0
 ; CHECK-NEXT:    ; return to shader part epilog
diff --git a/llvm/test/CodeGen/AMDGPU/artificial-terminators.mir b/llvm/test/CodeGen/AMDGPU/artificial-terminators.mir
index 1a76cae68f164..9e84d979e8547 100644
--- a/llvm/test/CodeGen/AMDGPU/artificial-terminators.mir
+++ b/llvm/test/CodeGen/AMDGPU/artificial-terminators.mir
@@ -34,18 +34,14 @@ body: |
   ; CHECK-NEXT:   S_BRANCH %bb.1
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.1:
-  ; CHECK-NEXT:   successors: %bb.5(0x30000000), %bb.2(0x50000000)
+  ; CHECK-NEXT:   successors: %bb.4(0x30000000), %bb.2(0x50000000)
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[V_CMP_LT_I32_e64_:%[0-9]+]]:sreg_32 = V_CMP_LT_I32_e64 [[V_ADD_U32_e64_3]], [[S_MOV_B32_1]], implicit $exec
   ; CHECK-NEXT:   [[S_XOR_B32_:%[0-9]+]]:sreg_32 = S_XOR_B32 $exec_lo, [[V_CMP_LT_I32_e64_]], implicit-def $scc
-  ; CHECK-NEXT:   $exec_lo = S_MOV_B32_term [[S_XOR_B32_]]
-  ; CHECK-NEXT:   S_CBRANCH_EXECNZ %bb.2, implicit $exec
-  ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT: bb.5:
-  ; CHECK-NEXT:   successors: %bb.4(0x80000000)
-  ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[COPY3:%[0-9]+]]:sreg_32 = COPY [[V_CMP_LT_I32_e64_]]
-  ; CHECK-NEXT:   S_BRANCH %bb.4
+  ; CHECK-NEXT:   $exec_lo = S_MOV_B32_term [[S_XOR_B32_]]
+  ; CHECK-NEXT:   S_CBRANCH_EXECZ %bb.4, implicit $exec
+  ; CHECK-NEXT:   S_BRANCH %bb.2
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.2:
   ; CHECK-NEXT:   successors: %bb.4(0x40000000), %bb.3(0x40000000)
@@ -64,7 +60,7 @@ body: |
   ; CHECK-NEXT:   S_BRANCH %bb.4
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.4:
-  ; CHECK-NEXT:   [[PHI:%[0-9]+]]:sreg_32 = PHI [[COPY3]], %bb.5, [[S_OR_B32_]], %bb.2, [[S_OR_B32_]], %bb.3
+  ; CHECK-NEXT:   [[PHI:%[0-9]+]]:sreg_32 = PHI [[COPY3]], %bb.1, [[S_OR_B32_]], %bb.2, [[S_OR_B32_]], %bb.3
   ; CHECK-NEXT:   $exec_lo = S_OR_B32 $exec_lo, [[PHI]], implicit-def $scc
   ; CHECK-NEXT:   S_ENDPGM 0
   bb.0:
diff --git a/llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll b/llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll
index f9ffa5ae57f3e..dfbb5f6a64042 100644
--- a/llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll
+++ b/llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll
@@ -9,44 +9,34 @@ define amdgpu_kernel void @blender_no_live_segment_at_def_error(<4 x float> %ext
 ; CHECK-NEXT:    s_addc_u32 s13, s13, 0
 ; CHECK-NEXT:    s_setreg_b32 hwreg(HW_REG_FLAT_SCR_LO), s12
 ; CHECK-NEXT:    s_setreg_b32 hwreg(HW_REG_FLAT_SCR_HI), s13
-; CHECK-NEXT:    s_load_dwordx8 s[36:43], s[8:9], 0x0
+; CHECK-NEXT:    s_load_dwordx8 s[20:27], s[8:9], 0x0
 ; CHECK-NEXT:    s_add_u32 s0, s0, s17
 ; CHECK-NEXT:    s_addc_u32 s1, s1, 0
-; CHECK-NEXT:    s_mov_b32 s12, 0
-; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
-; CHECK-NEXT:    s_cmp_lg_u32 s40, 0
-; CHECK-NEXT:    s_cbranch_scc1 .LBB0_8
-; CHECK-NEXT:  ; %bb.1: ; %if.end13.i.i
-; CHECK-NEXT:    s_cmp_eq_u32 s42, 0
-; CHECK-NEXT:    s_cbranch_scc1 .LBB0_4
-; CHECK-NEXT:  ; %bb.2: ; %if.else251.i.i
-; CHECK-NEXT:    s_cmp_lg_u32 s43, 0
-; CHECK-NEXT:    s_mov_b32 s17, 0
-; CHECK-NEXT:    s_cselect_b32 s12, -1, 0
-; CHECK-NEXT:    s_and_b32 vcc_lo, exec_lo, s12
-; CHECK-NEXT:    s_cbranch_vccz .LBB0_5
-; CHECK-NEXT:  ; %bb.3:
 ; CHECK-NEXT:    s_mov_b32 s36, 0
-; CHECK-NEXT:    s_andn2_b32 vcc_lo, exec_lo, s12
-; CHECK-NEXT:    s_cbranch_vccz .LBB0_6
-; CHECK-NEXT:    s_branch .LBB0_7
-; CHECK-NEXT:  .LBB0_4:
-; CHECK-NEXT:    s_mov_b32 s14, s12
-; CHECK-NEXT:    s_mov_b32 s15, s12
-; CHECK-NEXT:    s_mov_b32 s13, s12
-; CHECK-NEXT:    s_mov_b64 s[38:39], s[14:15]
-; CHECK-NEXT:    s_mov_b64 s[36:37], s[12:13]
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    s_cmp_lg_u32 s24, 0
+; CHECK-NEXT:    s_cbranch_scc0 .LBB0_2
+; CHECK-NEXT:  ; %bb.1:
+; CHECK-NEXT:    s_mov_b64 s[38:39], s[22:23]
+; CHECK-NEXT:    s_mov_b64 s[36:37], s[20:21]
 ; CHECK-NEXT:    s_branch .LBB0_7
-; CHECK-NEXT:  .LBB0_5: ; %if.then263.i.i
-; CHECK-NEXT:    v_cmp_lt_f32_e64 s12, s41, 0
-; CHECK-NEXT:    s_mov_b32 s36, 1.0
-; CHECK-NEXT:    s_mov_b32 s17, 0x7fc00000
+; CHECK-NEXT:  .LBB0_2: ; %if.end13.i.i
 ; CHECK-NEXT:    s_mov_b32 s37, s36
 ; CHECK-NEXT:    s_mov_b32 s38, s36
+; CHECK-NEXT:    s_cmp_eq_u32 s26, 0
 ; CHECK-NEXT:    s_mov_b32 s39, s36
+; CHECK-NEXT:    s_cbranch_scc1 .LBB0_6
+; CHECK-NEXT:  ; %bb.3: ; %if.else251.i.i
+; CHECK-NEXT:    s_cmp_lg_u32 s27, 0
+; CHECK-NEXT:    s_mov_b32 s17, 0
+; CHECK-NEXT:    s_cselect_b32 s12, -1, 0
+; CHECK-NEXT:    s_and_b32 vcc_lo, exec_lo, s12
+; CHECK-NEXT:    s_cbranch_vccz .LBB0_8
+; CHECK-NEXT:  ; %bb.4:
+; CHECK-NEXT:    s_mov_b32 s36, 0
 ; CHECK-NEXT:    s_andn2_b32 vcc_lo, exec_lo, s12
-; CHECK-NEXT:    s_cbranch_vccnz .LBB0_7
-; CHECK-NEXT:  .LBB0_6: ; %if.end273.i.i
+; CHECK-NEXT:    s_cbranch_vccnz .LBB0_6
+; CHECK-NEXT:  .LBB0_5: ; %if.end273.i.i
 ; CHECK-NEXT:    s_add_u32 s12, s8, 40
 ; CHECK-NEXT:    s_addc_u32 s13, s9, 0
 ; CHECK-NEXT:    s_getpc_b64 s[18:19]
@@ -72,13 +62,13 @@ define amdgpu_kernel void @blender_no_live_segment_at_def_error(<4 x float> %ext
 ; CHECK-NEXT:    s_mov_b32 s37, s36
 ; CHECK-NEXT:    s_mov_b32 s38, s36
 ; CHECK-NEXT:    s_mov_b32 s39, s36
-; CHECK-NEXT:  .LBB0_7: ; %if.end294.i.i
+; CHECK-NEXT:  .LBB0_6: ; %if.end294.i.i
 ; CHECK-NEXT:    v_mov_b32_e32 v0, 0
 ; CHECK-NEXT:    buffer_store_dword v0, off, s[0:3], 0 offset:12
 ; CHECK-NEXT:    buffer_store_dword v0, off, s[0:3], 0 offset:8
 ; CHECK-NEXT:    buffer_store_dword v0, off, s[0:3], 0 offset:4
 ; CHECK-NEXT:    buffer_store_dword v0, off, s[0:3], 0
-; CHECK-NEXT:  .LBB0_8: ; %kernel_direct_lighting.exit
+; CHECK-NEXT:  .LBB0_7: ; %kernel_direct_lighting.exit
 ; CHECK-NEXT:    s_load_dwordx2 s[4:5], s[8:9], 0x20
 ; CHECK-NEXT:    v_mov_b32_e32 v0, s36
 ; CHECK-NEXT:    v_mov_b32_e32 v4, 0
@@ -88,6 +78,16 @@ define amdgpu_kernel void @blender_no_live_segment_at_def_error(<4 x float> %ext
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    global_store_dwordx4 v4, v[0:3], s[4:5]
 ; CHECK-NEXT:    s_endpgm
+; CHECK-NEXT:  .LBB0_8: ; %if.then263.i.i
+; CHECK-NEXT:    v_cmp_lt_f32_e64 s12, s25, 0
+; CHECK-NEXT:    s_mov_b32 s36, 1.0
+; CHECK-NEXT:    s_mov_b32 s17, 0x7fc00000
+; CHECK-NEXT:    s_mov_b32 s37, s36
+; CHECK-NEXT:    s_mov_b32 s38, s36
+; CHECK-NEXT:    s_mov_b32 s39, s36
+; CHECK-NEXT:    s_andn2_b32 vcc_lo, exec_lo, s12
+; CHECK-NEXT:    s_cbranch_vccz .LBB0_5
+; CHECK-NEXT:    s_branch .LBB0_6
 entry:
   %cmp5.i.i = icmp eq i32 %cmp5.i.i.arg, 0
   br i1 %cmp5.i.i, label %if.end13.i.i, label %kernel_direct_lighting.exit
diff --git a/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll b/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll
index d61c4b46596c0..ce0b79b0b358c 100644
--- a/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll
+++ b/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll
@@ -848,12 +848,13 @@ define amdgpu_kernel void @test_dynamic_stackalloc_kernel_control_flow(i32 %n, i
 ; GFX9-SDAG-NEXT:    s_load_dwordx2 s[4:5], s[8:9], 0x0
 ; GFX9-SDAG-NEXT:    s_add_u32 s0, s0, s17
 ; GFX9-SDAG-NEXT:    s_addc_u32 s1, s1, 0
+; GFX9-SDAG-NEXT:    s_mov_b64 s[6:7], -1
 ; GFX9-SDAG-NEXT:    s_mov_b32 s33, 0
-; GFX9-SDAG-NEXT:    s_movk_i32 s32, 0x1000
 ; GFX9-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX9-SDAG-NEXT:    s_cmp_lg_u32 s4, 0
 ; GFX9-SDAG-NEXT:    s_mov_b32 s4, 0
-; GFX9-SDAG-NEXT:    s_cbranch_scc0 .LBB7_6
+; GFX9-SDAG-NEXT:    s_movk_i32 s32, 0x1000
+; GFX9-SDAG-NEXT:    s_cbranch_scc0 .LBB7_4
 ; GFX9-SDAG-NEXT:  ; %bb.1: ; %bb.1
 ; GFX9-SDAG-NEXT:    v_lshl_add_u32 v0, v0, 2, 15
 ; GFX9-SDAG-NEXT:    v_and_b32_e32 v0, 0x1ff0, v0
@@ -873,8 +874,11 @@ define amdgpu_kernel void @test_dynamic_stackalloc_kernel_control_flow(i32 %n, i
 ; GFX9-SDAG-NEXT:    v_mov_b32_e32 v0, 1
 ; GFX9-SDAG-NEXT:    buffer_store_dword v0, off, s[0:3], s6
 ; GFX9-SDAG-NEXT:    s_waitcnt vmcnt(0)
-; GFX9-SDAG-NEXT:    s_cbranch_execnz .LB...
[truncated]

@guy-david guy-david force-pushed the users/guy-david/machine-sink branch from 725de8f to 0846b5c Compare February 25, 2025 19:38
@guy-david guy-david force-pushed the users/guy-david/machine-sink-powerpc branch from 2001146 to fdcf246 Compare February 25, 2025 22:37
@guy-david guy-david force-pushed the users/guy-david/machine-sink branch from 0846b5c to f95740d Compare February 25, 2025 22:38
@guy-david guy-david force-pushed the users/guy-david/machine-sink-powerpc branch from fdcf246 to 40b923a Compare February 26, 2025 17:20
@guy-david guy-david force-pushed the users/guy-david/machine-sink branch from f95740d to 955685d Compare February 26, 2025 17:21
@guy-david guy-david force-pushed the users/guy-david/machine-sink-powerpc branch from 40b923a to cebcc2f Compare February 27, 2025 00:00
@guy-david guy-david force-pushed the users/guy-david/machine-sink branch from 955685d to ddb2ff2 Compare February 27, 2025 00:00
@guy-david guy-david force-pushed the users/guy-david/machine-sink-powerpc branch from cebcc2f to 95af9b3 Compare February 27, 2025 00:01
Lower it slightly below the likeliness of a null-check to be true which
is set to 37.5% (see PtrUntakenProb).
Otherwise, it will split the edge and create another basic-block and
with an unconditional branch which might make the CFG more complex and
with a suboptimal block placement.
Note that if multiple instructions can be sinked from the same edge then
a split will occur regardless of this change.
@guy-david guy-david force-pushed the users/guy-david/machine-sink branch from ddb2ff2 to f2418cc Compare February 27, 2025 00:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants