
[MachineCopyPropagation, Scheduler] Detect and fix suboptimal instruction order to enable optimizations #98087

Conversation

spaits
Contributor

@spaits spaits commented Jul 8, 2024

The issue

In the Coremark benchmark, the following code can be found in core_state.c's core_bench_state function (https://github.com/eembc/coremark/blob/d5fad6bd094899101a4e5fd53af7298160ced6ab/core_state.c#L61):

for (i = 0; i < NUM_CORE_STATES; i++)
{
    crc = crcu32(final_counts[i], crc);
    crc = crcu32(track_counts[i], crc);
}

Here is some standalone code that also reproduces the issue:

void init_var(int *v);
int chain(int c, int n);

void start() {
    int a, b, c;

    init_var(&a);
    init_var(&b);
    init_var(&c);

    int r = chain(b, a);
    r = chain(c, r);
}

https://godbolt.org/z/7h8E9nG5q

Clang produces the following assembly for the section between the two function calls (on aarch64/arm64):

bl      _Z5chainii
ldr     w8, [sp, #4]
mov     w1, w0
mov     w0, w8
bl      _Z5chainii

While GCC produces this assembly:

bl      _Z5chainii
mov     w1, w0
ldr     w0, [sp, 28]
bl      _Z5chainii

I think we shouldn't move these values around so much.

As you can see, GCC does not "shuffle" the values from register to register, but handles the "switching" with only two instructions.

This problem is also present on RISC-V, where it is even worse.
Clang generates:

call    _Z5chainii
lw      a1, 0(sp)
mv      a2, a0
mv      a0, a1
mv      a1, a2
call    _Z5chainii

GCC generates:

call    _Z5chainii
mv      a1,a0
lw      a0,12(sp)
call    _Z5chainii

Also see on godbolt: https://godbolt.org/z/77rncrb3b

The disassembly of Coremark also contains this suboptimal pattern:

jal	0x8000f596 <crcu32>
lw	a1, 0x24(sp)
mv	a2, a0
mv	a0, a1
mv	a1, a2
jal	0x8000f596 <crcu32>

The cause

The reason for the suboptimal code is that copy propagation is unable to find optimizations due to a suboptimal order of instructions. This suboptimal order introduces unnecessary data dependencies between instructions. (Let's say there is a data dependency between MI A and MI B if there exists a register unit that is used or defined by both A and B.)
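For illustration, this is essentially the check the patch performs (a simplified version of the twoMIsHaveMutualOperandRegisters helper from the diff below):

static bool haveMutualOperandRegisters(const MachineInstr &A,
                                       const MachineInstr &B,
                                       const TargetRegisterInfo *TRI) {
  // A and B have a data dependency in the above sense if any register
  // operand of A overlaps any register operand of B.
  for (const MachineOperand &AOp : A.operands())
    for (const MachineOperand &BOp : B.operands())
      if (AOp.isReg() && BOp.isReg() &&
          TRI->regsOverlap(AOp.getReg(), BOp.getReg()))
        return true;
  return false;
}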

The data dependency in the examples above arises because the scheduler places loads as early as possible.
This is usually a good thing; see https://discourse.llvm.org/t/how-to-copy-propagate-physical-register-introduced-before-ra/74828/4.

After creating the current patch and checking the regression tests, I found that the issue is more general than the scheduler prioritizing loads. Looking at the test cases, the current state of the patch also enables optimizations that have nothing to do with loads, so there are other cases where unnecessary data dependencies block optimizations. (See the changes in llvm/test/CodeGen/RISCV/double-fcmp-strict.ll, llvm/test/CodeGen/AArch64/sve-vector-interleave.ll, and many more improvements not related to loads.)

The current solution

The issue is more general than suboptimal data dependencies introduced by prioritizing loads, so I think adjusting the scheduler is not enough. It also affects almost all targets (see the tests).

I think the proper way to solve this issue is to make machine copy propagation
"smarter": in the cases where unnecessary data dependencies block an optimization,
they are recognized and resolved by moving the instructions without changing the semantics.

I have created a new struct that tracks the data dependencies of instructions in a tree (sketched below),
and added logic that uses this tree to find unnecessary data dependencies.
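Simplified from the patch, one node of that tree looks like this:

struct Dependency {
  MachineInstr *MI = nullptr;
  // When the MI appears; used to keep the MIs' relative order to each other.
  int MIPosition = -1;
  // Instructions that must precede MI because they work with at least one
  // register in common with it.
  SmallVector<MachineInstr *> MustPrecede;
  bool AnythingMustPrecede = false;
};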

For example, consider the following MIR:

0. r1 = LOAD sp
1. r2 = COPY r5
2. r3 = ADD r1 r4
3. r1 = COPY r2

Let's see what the dependencies are:
Instruction 3 has a dependency on instruction 2, since instruction 2 uses the value that is in r1 and r1 is redefined by instruction 3. This means that instruction 2 must precede instruction 3. (This is called an anti-dependency.)

Instruction 3 also has a dependency on instruction 0: both of these instructions define r1, so instruction 0 must precede instruction 3. (This is called an output dependency.)

And finally, instruction 2 has a dependency on instruction 0, since the value that is in r1 at the time instruction 2 executes was put into r1 by instruction 0. (This is called a true dependency.)

From the above we can deduce that instruction 0 must precede instruction 2, and instruction 2 must precede instruction 3.

We can also observe that instruction 3 has a dependency on instruction 1 (a true dependency). Basically, in instructions 1 and 3 we are just moving values around. Could we eliminate this shuffling and directly do r1 = COPY r5? We can't just do that, since, as seen before, instruction 2 must precede any redefinition of r1. So in the current case this optimization is blocked by instruction 2.

What if we switched instruction 2 and instruction 1? They have no dependency on each other. Then we get this:

0. r1 = LOAD sp
2. r3 = ADD r1 r4
1. r2 = COPY r5 ; instruction 1 and 2 switched places
3. r1 = COPY r2

As we can see, the required order of instructions 0, 2, and 3 is still maintained, and the semantics of the program stay the same.
The previously blocked optimization has also become viable, so the machine copy propagation pass can legally make the following modification:

0. r1 = LOAD sp
2. r3 = ADD r1 r4
1. r1 = COPY r5
3. r1 = COPY r2

Later on it can be recognized that one instruction (instruction 3) is unnecessary and erased, so we end up with more efficient code like this:

0. r1 = LOAD sp
2. r3 = ADD r1 r4
1. r1 = COPY r5

Why the current solution is the best

A suboptimal order of instructions that causes copy propagation optimizations to be missed can appear for many different reasons on almost all targets.
For this reason, I think the best way to deal with these cases is a general, target-independent analysis of the data dependencies between the instructions that are relevant to copy propagation.

Limitations

Dependency trees can be complicated. An instruction can have multiple dependencies, each of which can have multiple dependencies of its own, and so on.
These dependency trees can also intersect at different points. The current patch only handles trees that are one level deep; extending it further might be unnecessary, since I don't think there are many cases with such complex data dependencies in MIR. (I have only found one, in llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleaved-access-zve32x.ll, when checking the MIR during debugging.)

If we want to extend the PR to handle arbitrary trees, the correct order of the instructions in the dependency tree can be calculated with
an in-order traversal of the dependency graph (plus some merging based on the MI positions in the basic block), as sketched below.
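A rough sketch of that traversal (hypothetical code, not part of the current patch): a post-order DFS over the MustPrecede edges emits every instruction after everything that must precede it, and independent instructions are then merged by their original position.

static void orderDependencies(Dependency *N,
                              DenseMap<MachineInstr *, Dependency> &Deps,
                              SmallPtrSetImpl<Dependency *> &Visited,
                              SmallVectorImpl<MachineInstr *> &Order) {
  if (!Visited.insert(N).second)
    return; // Already placed (the trees may intersect).
  // Emit everything that must precede N first.
  for (MachineInstr *P : N->MustPrecede)
    orderDependencies(&Deps[P], Deps, Visited, Order);
  Order.push_back(N->MI);
}
// Independent siblings are then merged by sorting on MIPosition, so the
// original relative order in the basic block is preserved.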

The current state of the PR

I have tried to check all the tests, but I could not decide whether the code is actually correct everywhere. I think it is correct in most places. I would be really glad if you could also check the tests.

There are some regression tests that are failing. These are hand-written tests whose results cannot be regenerated by the update_llc_test_checks or update_mir_test_checks scripts. If you see a chance that this PR may get approved and merged, I will fix those tests.

Conclusion

I would like to ask your opinion on this topic. Is it a good direction?
I would be really happy to finish this work, since I enjoy creating stuff like this, if you think it is a good direction and could be merged once it is in a consistent state.

Do you have another solution in mind to enable optimization in those cases where unnecessary data dependencies are blocking it?

If you think that this approach is good and I should continue, I will write tests for it and also fix the hand-written tests.

Thank you for checking the PR and giving me feedback.

@llvmbot
Member

llvmbot commented Jul 8, 2024

@llvm/pr-subscribers-backend-systemz
@llvm/pr-subscribers-llvm-globalisel
@llvm/pr-subscribers-backend-x86
@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-backend-aarch64
@llvm/pr-subscribers-backend-powerpc

@llvm/pr-subscribers-llvm-regalloc

Author: Gábor Spaits (spaits)


Patch is 250.89 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/98087.diff

69 Files Affected:

  • (modified) llvm/lib/CodeGen/MachineCopyPropagation.cpp (+224-14)
  • (modified) llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll (+1-2)
  • (modified) llvm/test/CodeGen/AArch64/GlobalISel/merge-stores-truncating.ll (+1-2)
  • (modified) llvm/test/CodeGen/AArch64/aarch64-wide-mul.ll (+4-6)
  • (modified) llvm/test/CodeGen/AArch64/addp-shuffle.ll (+1-2)
  • (modified) llvm/test/CodeGen/AArch64/cgp-usubo.ll (+5-10)
  • (added) llvm/test/CodeGen/AArch64/machine-cp.mir (+215)
  • (modified) llvm/test/CodeGen/AArch64/neon-extmul.ll (+4-6)
  • (modified) llvm/test/CodeGen/AArch64/shufflevector.ll (+5-12)
  • (modified) llvm/test/CodeGen/AArch64/streaming-compatible-memory-ops.ll (+2-3)
  • (modified) llvm/test/CodeGen/AArch64/sve-vector-deinterleave.ll (+6-10)
  • (modified) llvm/test/CodeGen/AArch64/sve-vector-interleave.ll (+2-4)
  • (modified) llvm/test/CodeGen/AArch64/vec_umulo.ll (+4-7)
  • (modified) llvm/test/CodeGen/AArch64/vselect-ext.ll (+5-7)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_pixelshader.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll (+85-70)
  • (modified) llvm/test/CodeGen/LoongArch/tail-calls.ll (+2-3)
  • (modified) llvm/test/CodeGen/LoongArch/vector-fp-imm.ll (+3-6)
  • (modified) llvm/test/CodeGen/Mips/llvm-ir/or.ll (+42-60)
  • (modified) llvm/test/CodeGen/Mips/llvm-ir/xor.ll (+13-18)
  • (modified) llvm/test/CodeGen/PowerPC/BreakableToken-reduced.ll (+2-3)
  • (modified) llvm/test/CodeGen/PowerPC/atomics-i128-ldst.ll (+3-5)
  • (modified) llvm/test/CodeGen/PowerPC/legalize-invert-br_cc.ll (+1-2)
  • (modified) llvm/test/CodeGen/PowerPC/lower-intrinsics-afn-mass_notail.ll (+2-4)
  • (modified) llvm/test/CodeGen/PowerPC/lower-intrinsics-fast-mass_notail.ll (+2-4)
  • (modified) llvm/test/CodeGen/PowerPC/machine-backward-cp.mir (+65-56)
  • (modified) llvm/test/CodeGen/PowerPC/stack-restore-with-setjmp.ll (+1-2)
  • (modified) llvm/test/CodeGen/RISCV/alu64.ll (+1-2)
  • (modified) llvm/test/CodeGen/RISCV/condops.ll (+4-8)
  • (modified) llvm/test/CodeGen/RISCV/double-fcmp-strict.ll (+12-24)
  • (modified) llvm/test/CodeGen/RISCV/float-fcmp-strict.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/half-fcmp-strict.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/llvm.frexp.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/neg-abs.ll (+4-6)
  • (modified) llvm/test/CodeGen/RISCV/rv64-statepoint-call-lowering.ll (+1-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/constant-folding-crash.ll (+2-4)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmfeq.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmfge.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmfgt.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmfle.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmflt.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmfne.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmseq.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsge.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsgeu.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsgt.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsgtu.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsle.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsleu.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmslt.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsltu.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsne.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vsetvli-regression.ll (+1-2)
  • (modified) llvm/test/CodeGen/RISCV/shifts.ll (+2-4)
  • (modified) llvm/test/CodeGen/RISCV/srem-vector-lkk.ll (+11-20)
  • (modified) llvm/test/CodeGen/RISCV/tail-calls.ll (+6-8)
  • (modified) llvm/test/CodeGen/RISCV/urem-vector-lkk.ll (+13-24)
  • (modified) llvm/test/CodeGen/RISCV/wide-scalar-shift-by-byte-multiple-legalization.ll (+2-4)
  • (modified) llvm/test/CodeGen/RISCV/wide-scalar-shift-legalization.ll (+2-4)
  • (modified) llvm/test/CodeGen/SPARC/tailcall.ll (+4-6)
  • (modified) llvm/test/CodeGen/SystemZ/vector-constrained-fp-intrinsics.ll (+116-214)
  • (modified) llvm/test/CodeGen/Thumb2/mve-pred-ext.ll (+2-4)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vmull-splat.ll (-4)
  • (modified) llvm/test/CodeGen/X86/avx512-intrinsics.ll (+2-4)
  • (modified) llvm/test/CodeGen/X86/matrix-multiply.ll (+2-3)
  • (modified) llvm/test/CodeGen/X86/vec_saddo.ll (+5-9)
  • (modified) llvm/test/CodeGen/X86/vec_ssubo.ll (+2-3)
  • (modified) llvm/test/CodeGen/X86/vector-shuffle-combining-avx.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/xmulo.ll (+42-84)
diff --git a/llvm/lib/CodeGen/MachineCopyPropagation.cpp b/llvm/lib/CodeGen/MachineCopyPropagation.cpp
index bdc17e99d1fb07..de62f33c67cc4d 100644
--- a/llvm/lib/CodeGen/MachineCopyPropagation.cpp
+++ b/llvm/lib/CodeGen/MachineCopyPropagation.cpp
@@ -49,29 +49,36 @@
 //===----------------------------------------------------------------------===//
 
 #include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/DenseSet.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SetVector.h"
 #include "llvm/ADT/SmallSet.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/Statistic.h"
 #include "llvm/ADT/iterator_range.h"
+#include "llvm/Analysis/BlockFrequencyInfo.h"
 #include "llvm/CodeGen/MachineBasicBlock.h"
 #include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
 #include "llvm/CodeGen/MachineInstr.h"
 #include "llvm/CodeGen/MachineOperand.h"
 #include "llvm/CodeGen/MachineRegisterInfo.h"
+#include "llvm/CodeGen/Register.h"
 #include "llvm/CodeGen/TargetInstrInfo.h"
 #include "llvm/CodeGen/TargetRegisterInfo.h"
 #include "llvm/CodeGen/TargetSubtargetInfo.h"
 #include "llvm/InitializePasses.h"
+#include "llvm/MC/MCRegister.h"
 #include "llvm/MC/MCRegisterInfo.h"
 #include "llvm/Pass.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/DebugCounter.h"
+#include "llvm/Support/ErrorHandling.h"
 #include "llvm/Support/raw_ostream.h"
+#include <algorithm>
 #include <cassert>
 #include <iterator>
+#include <utility>
 
 using namespace llvm;
 
@@ -105,7 +112,51 @@ static std::optional<DestSourcePair> isCopyInstr(const MachineInstr &MI,
   return std::nullopt;
 }
 
+static bool hasOverlaps(const MachineInstr &MI, Register &Reg,
+                        const TargetRegisterInfo *TRI) {
+  for (const MachineOperand &MI : MI.operands())
+    if (MI.isReg() && TRI->regsOverlap(Reg, MI.getReg()))
+      return true;
+  return false;
+}
+
+static bool twoMIsHaveMutualOperandRegisters(const MachineInstr &MI1,
+                                             const MachineInstr &MI2,
+                                             const TargetRegisterInfo *TRI) {
+  for (auto MI1OP : MI1.operands()) {
+    for (auto MI2OP : MI2.operands()) {
+      if (MI1OP.isReg() && MI2OP.isReg() &&
+          TRI->regsOverlap(MI1OP.getReg(), MI2OP.getReg())) {
+        return true;
+      }
+    }
+  }
+  return false;
+}
+
 class CopyTracker {
+  // When conducting backward copy propagation we may need to move instructions
+  // that are in the dependency tree in this case the relative order of the
+  // moved instructions to each other must not change. This variable is used to
+  // see where we are in the basic block and based.
+  int Time = 0;
+
+  // A tree representing dependencies between instructions.
+  struct Dependency {
+    Dependency() = default;
+    Dependency(MachineInstr *MI) : MI(MI) {}
+    MachineInstr *MI;
+
+    // When the MI appears. It is used to keep the MIs relative order to
+    // each other.
+    int MIPosition = -1;
+
+    // The children in the dependency tree. These are the instructions
+    // that work with at least one common register with the current instruction.
+    llvm::SmallVector<MachineInstr *> MustPrecede;
+    bool AnythingMustPrecede = false;
+  };
+
   struct CopyInfo {
     MachineInstr *MI, *LastSeenUseInCopy;
     SmallVector<MCRegister, 4> DefRegs;
@@ -113,9 +164,10 @@ class CopyTracker {
   };
 
   DenseMap<MCRegister, CopyInfo> Copies;
+  DenseMap<MachineInstr *, Dependency> Dependencies;
 
 public:
-  /// Mark all of the given registers and their subregisters as unavailable for
+  /// Mark all of the given registers and their sub registers as unavailable for
   /// copying.
   void markRegsUnavailable(ArrayRef<MCRegister> Regs,
                            const TargetRegisterInfo &TRI) {
@@ -129,6 +181,27 @@ class CopyTracker {
     }
   }
 
+  void moveBAfterA(MachineInstr *A, MachineInstr *B) {
+    llvm::MachineBasicBlock *MBB = A->getParent();
+    assert(MBB == B->getParent() &&
+           "Both instructions must be in the same MachineBasicBlock");
+    assert(Dependencies.contains(B) &&
+           "Shall contain the instruction that blocks the optimization.");
+    Dependencies[B].MIPosition = Time++;
+    MBB->remove(B);
+    MBB->insertAfter(--A->getIterator(), B);
+  }
+
+  // Add a new Dependency to the already existing ones.
+  void addDependency(Dependency B) {
+    if (Dependencies.contains(B.MI))
+      return;
+
+    B.MIPosition = Time++;
+    Dependencies.insert({B.MI, B});
+  }
+
+  /// Only called for backward propagation
   /// Remove register from copy maps.
   void invalidateRegister(MCRegister Reg, const TargetRegisterInfo &TRI,
                           const TargetInstrInfo &TII, bool UseCopyInstr) {
@@ -222,6 +295,95 @@ class CopyTracker {
     }
   }
 
+  void setDependenciesForMI(MachineInstr *MI, const TargetRegisterInfo &TRI,
+                            const TargetInstrInfo &TII, bool UseCopyInstr) {
+    bool Blocks = !Dependencies.contains(MI);
+    Dependency b{MI};
+    Dependency *CurrentBlocker = nullptr;
+    if (!Blocks) {
+      CurrentBlocker = &Dependencies[MI];
+    } else {
+      CurrentBlocker = &b;
+    }
+
+    for (const MachineOperand &Operand : MI->operands()) {
+      if (!Operand.isReg())
+        continue;
+
+      Register OpReg = Operand.getReg();
+      MCRegister OpMCReg = OpReg.asMCReg();
+      if (!OpMCReg)
+        continue;
+
+      // Invalidate those copies that are affected by the definition or usage of
+      // the current MI.
+      for (MCRegUnit UsedOPMcRegUnit : TRI.regunits(OpMCReg)) {
+        auto CopyThatDependsOnIt = Copies.find(UsedOPMcRegUnit);
+        // Do not take debug usages into account.
+        if (CopyThatDependsOnIt != Copies.end() &&
+            !(Operand.isUse() && Operand.isDebug())) {
+          Copies.erase(CopyThatDependsOnIt);
+        }
+      }
+
+      for (std::pair<MachineInstr *, Dependency> &Dep : Dependencies) {
+        assert(
+            Dep.first == Dep.second.MI &&
+            "Inconsistent state: The key and MI of a blocker do not match\n");
+        MachineInstr *DepMI = Dep.first;
+        if (DepMI == MI)
+          continue;
+        if (Operand.isUse() && Operand.isDebug())
+          continue;
+        if (!hasOverlaps(*DepMI, OpReg, &TRI))
+          continue;
+
+        // The current instruction precedes the instruction in the dependency
+        // tree.
+        if (CurrentBlocker->MIPosition == -1 ||
+            Dep.second.MIPosition < CurrentBlocker->MIPosition) {
+          Dep.second.MustPrecede.push_back(MI);
+          Dep.second.AnythingMustPrecede = true;
+          continue;
+        }
+
+        // The current instruction is preceeded the instruction in the
+        // dependency tree. This can happen when other instruction are moved
+        // before the currently analyzed one to optimize data dependencies.
+        if (CurrentBlocker->MIPosition != -1 &&
+            Dep.second.MIPosition > CurrentBlocker->MIPosition) {
+          CurrentBlocker->MustPrecede.push_back(DepMI);
+          CurrentBlocker->AnythingMustPrecede = true;
+          continue;
+        }
+      }
+    }
+
+    if (Blocks) {
+      addDependency(*CurrentBlocker);
+    }
+  }
+
+  std::optional<llvm::SmallVector<MachineInstr *>>
+  getFirstPreviousDependencies(MachineInstr *MI,
+                               const TargetRegisterInfo &TRI) {
+    if (Dependencies.contains(MI)) {
+      auto PrevDefs = Dependencies[MI].MustPrecede;
+      if (std::all_of(PrevDefs.begin(), PrevDefs.end(), [&](auto OneUse) {
+            if (!OneUse) {
+              return false;
+            }
+            return !(Dependencies[OneUse].AnythingMustPrecede) &&
+                   (Dependencies[OneUse].MustPrecede.size() == 0);
+          })) {
+
+        return Dependencies[MI].MustPrecede;
+      }
+      return {};
+    }
+    return {{}};
+  }
+
   /// Add this copy's registers into the tracker's copy maps.
   void trackCopy(MachineInstr *MI, const TargetRegisterInfo &TRI,
                  const TargetInstrInfo &TII, bool UseCopyInstr) {
@@ -245,6 +407,9 @@ class CopyTracker {
         Copy.DefRegs.push_back(Def);
       Copy.LastSeenUseInCopy = MI;
     }
+    
+    Dependency b{MI};
+    addDependency(b);
   }
 
   bool hasAnyCopies() {
@@ -263,12 +428,16 @@ class CopyTracker {
   }
 
   MachineInstr *findCopyDefViaUnit(MCRegister RegUnit,
-                                   const TargetRegisterInfo &TRI) {
+                                   const TargetRegisterInfo &TRI,
+                                   bool CanUseLastSeenInCopy = false) {
     auto CI = Copies.find(RegUnit);
     if (CI == Copies.end())
       return nullptr;
     if (CI->second.DefRegs.size() != 1)
       return nullptr;
+    if (CanUseLastSeenInCopy)
+      return CI->second.LastSeenUseInCopy;
+
     MCRegUnit RU = *TRI.regunits(CI->second.DefRegs[0]).begin();
     return findCopyForUnit(RU, TRI, true);
   }
@@ -278,7 +447,7 @@ class CopyTracker {
                                       const TargetInstrInfo &TII,
                                       bool UseCopyInstr) {
     MCRegUnit RU = *TRI.regunits(Reg).begin();
-    MachineInstr *AvailCopy = findCopyDefViaUnit(RU, TRI);
+    MachineInstr *AvailCopy = findCopyDefViaUnit(RU, TRI, true);
 
     if (!AvailCopy)
       return nullptr;
@@ -376,6 +545,7 @@ class CopyTracker {
 
   void clear() {
     Copies.clear();
+    Dependencies.clear();
   }
 };
 
@@ -1029,6 +1199,50 @@ void MachineCopyPropagation::propagateDefs(MachineInstr &MI) {
     if (hasOverlappingMultipleDef(MI, MODef, Def))
       continue;
 
+    LLVM_DEBUG(dbgs() << "Backward copy was found\n");
+    LLVM_DEBUG(Copy->dump());
+
+    // Let's see if we have any kind of previous dependencies for the copy,
+    // that have no other dependencies. (So the dependency tree is one level
+    // deep)
+    auto PreviousDependencies =
+        Tracker.getFirstPreviousDependencies(Copy, *TRI);
+
+    if (!PreviousDependencies)
+      // The dependency tree is more than one level deep.
+      continue;
+
+    LLVM_DEBUG(dbgs() << "Number of dependencies of the copy: "
+                      << PreviousDependencies->size() << "\n");
+    
+    // Check whether the current MI has a mutual registers with the optimization
+    // blocking instructions.
+    bool NoDependencyWithMI =
+        std::all_of(PreviousDependencies->begin(), PreviousDependencies->end(),
+                    [&](auto *MI1) {
+                      return !twoMIsHaveMutualOperandRegisters(*MI1, MI, TRI);
+                    });
+    if (!NoDependencyWithMI)
+      continue;
+    
+    // If there isn't any relationship between the blockers and the current MI
+    // then we can start moving them.
+
+    // Add the new instruction to the list of dependencies.
+    Tracker.addDependency({&MI});
+
+    // Then move the previous instruction before the current.
+    // Later the dependency analysis will run for the current MI. There
+    // the relationship between these moved instructions and the current
+    // MI is handled.
+    for (llvm::MachineInstr *I : llvm::reverse(*PreviousDependencies)) {
+      LLVM_DEBUG(dbgs() << "Moving ");
+      LLVM_DEBUG(I->dump());
+      LLVM_DEBUG(dbgs() << "Before ");
+      LLVM_DEBUG(MI.dump());
+      Tracker.moveBAfterA(&MI, I);
+    }
+
     LLVM_DEBUG(dbgs() << "MCP: Replacing " << printReg(MODef.getReg(), TRI)
                       << "\n     with " << printReg(Def, TRI) << "\n     in "
                       << MI << "     from " << *Copy);
@@ -1036,6 +1250,10 @@ void MachineCopyPropagation::propagateDefs(MachineInstr &MI) {
     MODef.setReg(Def);
     MODef.setIsRenamable(CopyOperands->Destination->isRenamable());
 
+    Tracker.invalidateRegister(MODef.getReg().asMCReg(), *TRI, *TII,
+                               UseCopyInstr);
+    Tracker.invalidateRegister(Def, *TRI, *TII, UseCopyInstr);
+
     LLVM_DEBUG(dbgs() << "MCP: After replacement: " << MI << "\n");
     MaybeDeadCopies.insert(Copy);
     Changed = true;
@@ -1060,10 +1278,7 @@ void MachineCopyPropagation::BackwardCopyPropagateBlock(
         // Unlike forward cp, we don't invoke propagateDefs here,
         // just let forward cp do COPY-to-COPY propagation.
         if (isBackwardPropagatableCopy(*CopyOperands, *MRI)) {
-          Tracker.invalidateRegister(SrcReg.asMCReg(), *TRI, *TII,
-                                     UseCopyInstr);
-          Tracker.invalidateRegister(DefReg.asMCReg(), *TRI, *TII,
-                                     UseCopyInstr);
+          Tracker.setDependenciesForMI(&MI, *TRI, *TII, UseCopyInstr); // Maybe it is bad to call it here
           Tracker.trackCopy(&MI, *TRI, *TII, UseCopyInstr);
           continue;
         }
@@ -1080,6 +1295,8 @@ void MachineCopyPropagation::BackwardCopyPropagateBlock(
       }
 
     propagateDefs(MI);
+    Tracker.setDependenciesForMI(&MI, *TRI, *TII, UseCopyInstr);
+
     for (const MachineOperand &MO : MI.operands()) {
       if (!MO.isReg())
         continue;
@@ -1087,10 +1304,6 @@ void MachineCopyPropagation::BackwardCopyPropagateBlock(
       if (!MO.getReg())
         continue;
 
-      if (MO.isDef())
-        Tracker.invalidateRegister(MO.getReg().asMCReg(), *TRI, *TII,
-                                   UseCopyInstr);
-
       if (MO.readsReg()) {
         if (MO.isDebug()) {
           //  Check if the register in the debug instruction is utilized
@@ -1101,9 +1314,6 @@ void MachineCopyPropagation::BackwardCopyPropagateBlock(
               CopyDbgUsers[Copy].insert(&MI);
             }
           }
-        } else {
-          Tracker.invalidateRegister(MO.getReg().asMCReg(), *TRI, *TII,
-                                     UseCopyInstr);
         }
       }
     }
diff --git a/llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll b/llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll
index b619aac709d985..21cc2e0e570776 100644
--- a/llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll
+++ b/llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll
@@ -106,11 +106,10 @@ define i32 @val_compare_and_swap_from_load(ptr %p, i32 %cmp, ptr %pnew) #0 {
 ; CHECK-OUTLINE-O1-LABEL: val_compare_and_swap_from_load:
 ; CHECK-OUTLINE-O1:       ; %bb.0:
 ; CHECK-OUTLINE-O1-NEXT:    stp x29, x30, [sp, #-16]! ; 16-byte Folded Spill
-; CHECK-OUTLINE-O1-NEXT:    ldr w8, [x2]
 ; CHECK-OUTLINE-O1-NEXT:    mov x3, x0
 ; CHECK-OUTLINE-O1-NEXT:    mov w0, w1
+; CHECK-OUTLINE-O1-NEXT:    ldr w1, [x2]
 ; CHECK-OUTLINE-O1-NEXT:    mov x2, x3
-; CHECK-OUTLINE-O1-NEXT:    mov w1, w8
 ; CHECK-OUTLINE-O1-NEXT:    bl ___aarch64_cas4_acq
 ; CHECK-OUTLINE-O1-NEXT:    ldp x29, x30, [sp], #16 ; 16-byte Folded Reload
 ; CHECK-OUTLINE-O1-NEXT:    ret
diff --git a/llvm/test/CodeGen/AArch64/GlobalISel/merge-stores-truncating.ll b/llvm/test/CodeGen/AArch64/GlobalISel/merge-stores-truncating.ll
index 7fd71b26fa1ba7..c8f8361e5ef885 100644
--- a/llvm/test/CodeGen/AArch64/GlobalISel/merge-stores-truncating.ll
+++ b/llvm/test/CodeGen/AArch64/GlobalISel/merge-stores-truncating.ll
@@ -256,9 +256,8 @@ define dso_local i32 @load_between_stores(i32 %x, ptr %p, ptr %ptr) {
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    strh w0, [x1]
 ; CHECK-NEXT:    lsr w9, w0, #16
-; CHECK-NEXT:    ldr w8, [x2]
+; CHECK-NEXT:    ldr w0, [x2]
 ; CHECK-NEXT:    strh w9, [x1, #2]
-; CHECK-NEXT:    mov w0, w8
 ; CHECK-NEXT:    ret
   %t1 = trunc i32 %x to i16
   %sh = lshr i32 %x, 16
diff --git a/llvm/test/CodeGen/AArch64/aarch64-wide-mul.ll b/llvm/test/CodeGen/AArch64/aarch64-wide-mul.ll
index 410c2d9021d6d5..a150a0f6ee40a2 100644
--- a/llvm/test/CodeGen/AArch64/aarch64-wide-mul.ll
+++ b/llvm/test/CodeGen/AArch64/aarch64-wide-mul.ll
@@ -131,13 +131,12 @@ entry:
 define <16 x i32> @mla_i32(<16 x i8> %a, <16 x i8> %b, <16 x i32> %c) {
 ; CHECK-SD-LABEL: mla_i32:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    umull2 v7.8h, v0.16b, v1.16b
 ; CHECK-SD-NEXT:    umull v6.8h, v0.8b, v1.8b
-; CHECK-SD-NEXT:    uaddw2 v5.4s, v5.4s, v7.8h
+; CHECK-SD-NEXT:    umull2 v7.8h, v0.16b, v1.16b
 ; CHECK-SD-NEXT:    uaddw v0.4s, v2.4s, v6.4h
 ; CHECK-SD-NEXT:    uaddw2 v1.4s, v3.4s, v6.8h
+; CHECK-SD-NEXT:    uaddw2 v3.4s, v5.4s, v7.8h
 ; CHECK-SD-NEXT:    uaddw v2.4s, v4.4s, v7.4h
-; CHECK-SD-NEXT:    mov v3.16b, v5.16b
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: mla_i32:
@@ -170,18 +169,17 @@ define <16 x i64> @mla_i64(<16 x i8> %a, <16 x i8> %b, <16 x i64> %c) {
 ; CHECK-SD-NEXT:    umull2 v0.8h, v0.16b, v1.16b
 ; CHECK-SD-NEXT:    ldp q20, q21, [sp]
 ; CHECK-SD-NEXT:    ushll v17.4s, v16.4h, #0
+; CHECK-SD-NEXT:    ushll v18.4s, v0.4h, #0
 ; CHECK-SD-NEXT:    ushll2 v16.4s, v16.8h, #0
 ; CHECK-SD-NEXT:    ushll2 v19.4s, v0.8h, #0
-; CHECK-SD-NEXT:    ushll v18.4s, v0.4h, #0
 ; CHECK-SD-NEXT:    uaddw2 v1.2d, v3.2d, v17.4s
 ; CHECK-SD-NEXT:    uaddw v0.2d, v2.2d, v17.2s
 ; CHECK-SD-NEXT:    uaddw2 v3.2d, v5.2d, v16.4s
 ; CHECK-SD-NEXT:    uaddw v2.2d, v4.2d, v16.2s
-; CHECK-SD-NEXT:    uaddw2 v16.2d, v21.2d, v19.4s
 ; CHECK-SD-NEXT:    uaddw v4.2d, v6.2d, v18.2s
 ; CHECK-SD-NEXT:    uaddw2 v5.2d, v7.2d, v18.4s
+; CHECK-SD-NEXT:    uaddw2 v7.2d, v21.2d, v19.4s
 ; CHECK-SD-NEXT:    uaddw v6.2d, v20.2d, v19.2s
-; CHECK-SD-NEXT:    mov v7.16b, v16.16b
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: mla_i64:
diff --git a/llvm/test/CodeGen/AArch64/addp-shuffle.ll b/llvm/test/CodeGen/AArch64/addp-shuffle.ll
index fb96d11acc275a..3dd6068ea3c227 100644
--- a/llvm/test/CodeGen/AArch64/addp-shuffle.ll
+++ b/llvm/test/CodeGen/AArch64/addp-shuffle.ll
@@ -63,9 +63,8 @@ define <16 x i8> @deinterleave_shuffle_v32i8(<32 x i8> %a) {
 define <4 x i64> @deinterleave_shuffle_v8i64(<8 x i64> %a) {
 ; CHECK-LABEL: deinterleave_shuffle_v8i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    addp v2.2d, v2.2d, v3.2d
 ; CHECK-NEXT:    addp v0.2d, v0.2d, v1.2d
-; CHECK-NEXT:    mov v1.16b, v2.16b
+; CHECK-NEXT:    addp v1.2d, v2.2d, v3.2d
 ; CHECK-NEXT:    ret
   %r0 = shufflevector <8 x i64> %a, <8 x i64> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
   %r1 = shufflevector <8 x i64> %a, <8 x i64> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
diff --git a/llvm/test/CodeGen/AArch64/cgp-usubo.ll b/llvm/test/CodeGen/AArch64/cgp-usubo.ll
index d307107fc07ee6..c8b73625086644 100644
--- a/llvm/test/CodeGen/AArch64/cgp-usubo.ll
+++ b/llvm/test/CodeGen/AArch64/cgp-usubo.ll
@@ -40,9 +40,8 @@ define i1 @usubo_ugt_constant_op0_i8(i8 %x, ptr %p) nounwind {
 ; CHECK-NEXT:    mov w9, #42 // =0x2a
 ; CHECK-NEXT:    cmp w8, #42
 ; CHECK-NEXT:    sub w9, w9, w0
-; CHECK-NEXT:    cset w8, hi
+; CHECK-NEXT:    cset w0, hi
 ; CHECK-NEXT:    strb w9, [x1]
-; CHECK-NEXT:    mov w0, w8
 ; CHECK-NEXT:    ret
   %s = sub i8 42, %x
   %ov = icmp ugt i8 %x, 42
@@ -59,9 +58,8 @@ define i1 @usubo_ult_constant_op0_i16(i16 %x, ptr %p) nounwind {
 ; CHECK-NEXT:    mov w9, #43 // =0x2b
 ; CHECK-NEXT:    cmp w8, #43
 ; CHECK-NEXT:    sub w9, w9, w0
-; CHECK-NEXT:    cset w8, hi
+; CHECK-NEXT:    cset w0, hi
 ; CHECK-NEXT:    strh w9, [x1]
-; CHECK-NEXT:    mov w0, w8
 ; CHECK-NEXT:    ret
   %s = sub i16 43, %x
   %ov = icmp ult i16 43, %x
@@ -78,8 +76,7 @@ define i1 @usubo_ult_constant_op1_i16(i16 %x, ptr %p) nounwind {
 ; CHECK-NEXT:    sub w9, w0, #44
 ; CHECK-NEXT:    cmp w8, #44
 ; CHECK-NEXT:    strh w9, [x1]
-; CHECK-NEXT:    cset w8, lo
-; CHECK-NEXT:    mov w0, w8
+; CHECK-NEXT:    cset w0, lo
 ; CHECK-NEXT:    ret
   %s = add i16 %x, -44
   %ov = icmp ult i16 %x, 44
@@ -94,8 +91,7 @@ define i1 @usubo_ugt_constant_op1_i8(i8 %x, ptr %p) nounwind {
 ; CHECK-NEXT:    sub w9, w0, #45
 ; CHECK-NEXT:    cmp w8, #45
 ; CHECK-NEXT:    strb w9, [x1]
-; CHECK-NEXT:    cset w8, lo
-; CHECK-NEXT:    mov w0, w8
+; CHECK-NEXT:    cset w0, lo
 ; CHECK-NEXT:    ret
   %ov = icmp ugt i8 45, %x
   %s = add i8 %x, -45
@@ -110,9 +106,8 @@ define i1 @usubo_eq_constant1_op1_i32(i32 %x, ptr %p) nounwind {
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    cmp w0, #0
 ; CHECK-NEXT:    sub w9, w0, #1
-; CHECK-NEXT:    cset w8, eq
+; CHECK-NEXT:    cset w0, eq
 ; CHECK-NEXT:    str w9, [x1]
-; CHECK-NEXT:    mov w0, w8
 ; CHECK-NEXT:    ret
   %s = add i32 %x, -1
   %ov = icmp eq i32 %x, 0
diff --git a/llvm/test/CodeGen/AArch64/machine-cp.mir b/llvm/test/CodeGen/AArch64/machine-cp.mir
new file mode 100644
index 00000000000000..de57627d82c57c
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/machine-cp.mir
@@ -0,0 +1,215 @@
+# NOTE: Assertions have been autogene...
[truncated]


github-actions bot commented Jul 8, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 01ba3fa37b22828abca81011b0cb2e4cbbb8cfda 795bcd3d35efd638622996536e5673c2092d5d33 -- llvm/lib/CodeGen/MachineCopyPropagation.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/CodeGen/MachineCopyPropagation.cpp b/llvm/lib/CodeGen/MachineCopyPropagation.cpp
index de62f33c67..2f5e6e8b64 100644
--- a/llvm/lib/CodeGen/MachineCopyPropagation.cpp
+++ b/llvm/lib/CodeGen/MachineCopyPropagation.cpp
@@ -407,7 +407,7 @@ public:
         Copy.DefRegs.push_back(Def);
       Copy.LastSeenUseInCopy = MI;
     }
-    
+
     Dependency b{MI};
     addDependency(b);
   }
@@ -1214,7 +1214,7 @@ void MachineCopyPropagation::propagateDefs(MachineInstr &MI) {
 
     LLVM_DEBUG(dbgs() << "Number of dependencies of the copy: "
                       << PreviousDependencies->size() << "\n");
-    
+
     // Check whether the current MI has a mutual registers with the optimization
     // blocking instructions.
     bool NoDependencyWithMI =
@@ -1224,7 +1224,7 @@ void MachineCopyPropagation::propagateDefs(MachineInstr &MI) {
                     });
     if (!NoDependencyWithMI)
       continue;
-    
+
     // If there isn't any relationship between the blockers and the current MI
     // then we can start moving them.
 
@@ -1278,7 +1278,8 @@ void MachineCopyPropagation::BackwardCopyPropagateBlock(
         // Unlike forward cp, we don't invoke propagateDefs here,
         // just let forward cp do COPY-to-COPY propagation.
         if (isBackwardPropagatableCopy(*CopyOperands, *MRI)) {
-          Tracker.setDependenciesForMI(&MI, *TRI, *TII, UseCopyInstr); // Maybe it is bad to call it here
+          Tracker.setDependenciesForMI(
+              &MI, *TRI, *TII, UseCopyInstr); // Maybe it is bad to call it here
           Tracker.trackCopy(&MI, *TRI, *TII, UseCopyInstr);
           continue;
         }

@spaits spaits requested review from qcolombet, LWenH and davemgreen July 8, 2024 22:26
Collaborator

@qcolombet qcolombet left a comment


I think this is the kind of optimization that belongs to postRA scheduling.

I feel this is too complicated for MachineCopyPropagation. Or put differently, until now, MCP was not moving things around so I'm not sure this is the right place to do this.

@spaits
Contributor Author

spaits commented Jul 9, 2024

@qcolombet Thank you for your response.

I put this "feature" here because it relies on the already implemented functionality of the pass. This functionality is the recognition of possible optimization in the backward propagation phase. I have just implemented a dependency representation and handling mechanism.

I am open to moving this logic elsewhere, but then the full functionality of backward propagation must also be copied there to recognize the cases where this moving is needed (except for the register renaming and the deletion of redundant instructions).

I was also wondering whether it is a good thing to move instructions around in MCP. My justification is that this pass already renames registers and deletes instructions, and these moves only happen to enable that. So there cannot be a case where the MIR coming out of MCP merely has instructions in a different order: whenever moving happens, renaming and deletion also happen.

If others also think that this logic should be moved to post-RA scheduling, then I will move the dependency handling logic there and add some backward-propagation-like analysis to identify the cases that need instruction moving.

@spaits spaits requested a review from s-barannikov July 9, 2024 10:54
@s-barannikov s-barannikov requested a review from jayfoad July 9, 2024 11:09
@s-barannikov
Contributor

I agree with @qcolombet.
This looks like a job for the dependency breaker, which I believe the "new" machine scheduler still doesn't use.

@spaits
Contributor Author

spaits commented Jul 9, 2024

So basically this dependency breaker already implements exactly what I wrote here, but it isn't used yet?

Or is it rather that there is generic dependency-breaking logic that has to be extended to also find and resolve copy-propagation-related dependencies?

@s-barannikov
Contributor

So basically this dependency breaker already implements exactly what I wrote here, but it isn't used yet?

I'm not quite sure about either part of that statement. We did use the dependency breaker with the new machine scheduler a few years ago in a downstream target, and it did eliminate some copies.

Or is it rather that there is generic dependency-breaking logic that has to be extended to also find and resolve copy-propagation-related dependencies?

From my recollection, it wasn't too difficult to attach the dependency breaker to the new scheduler, so it may be worth a try; it might already be powerful enough.

@spaits
Contributor Author

spaits commented Jul 9, 2024

@qcolombet @s-barannikov I have tried out post-RA scheduling as you suggested. I could get the pass to work on X86 with the following flags for llc: -post-RA-scheduler -break-anti-dependencies=all.

I used smuloi8 in llvm/test/CodeGen/X86/xmulo.ll for testing; it is one of the places where this patch enables an optimization:
https://github.com/llvm/llvm-project/pull/98087/files#diff-4e4004fcba25c450462eaec7e0624e6fb03322ae07c833916cd1196e68c97ddfL86 (Also explained at the bottom of this message.)

Here is the behavior of post-RA-scheduler on that function on Godbolt (left: with post-RA scheduling, right: without it): https://godbolt.org/z/K9z1TrK54

You can also see that this time the data dependency was not introduced by the scheduler.

Here post-RA-scheduler just renames cl to dl (see Godbolt or the bottom of this message); it does not resolve the dependency to enable copy propagation.
I think the goal of this pass is to reorder machine instructions to optimize their execution on the target architecture by considering instruction latencies and the dependencies between them, not to consider how we could do copy propagation more effectively. So to use this pass for copy propagation we would need extra code, and based on my observations that extra code would be a poor fit for post-RA scheduling.

I think this optimization would be pretty useful: looking at the tests, it brings improvements on many different targets in many different contexts. There were also other attempts to do similar things; I didn't check them deeply, but they did not move instructions around: #74239

I would still prefer this optimization to live in machine-cp. Copy propagation already does some very simple dataflow-analysis-like work; what I propose just makes this analysis a bit more advanced. If I extended the responsibilities of post-RA-scheduler instead, it would also have to recognize possible copy propagation opportunities.

Can I ask why you think this optimization belongs in post-RA-scheduler (I am still willing to move it there)?

(Sorry for the long messages; I am just trying to describe my thought process exactly, and my first language isn't English :) )


Explaining the behavior of machine-cp with and without my patch:
Here is the assembly generated without my patch and without post-RA scheduling:

smuloi8:
# %bb.0:
  movl %ecx, %eax
  imulb %dl
  seto %cl
  movb %al, (%r8)
  movl %ecx, %eax
  retq

With my patch, machine-cp recognizes that by swapping movb %al, (%r8) and seto %cl we can do a copy propagation:

smuloi8:
%bb.0:
  movl %ecx, %eax
  imulb %dl
  movb %al, (%r8)
  seto %al
  retq

What post-RA-scheduler with break-anti-dependencies=all gives:

smuloi8:                                # @smuloi8
        mov     eax, ecx
        imul    dl
        seto    dl
        mov     byte ptr [r8], al
        mov     eax, edx
        ret

It just does a renaming; the dependencies that make copy propagation unavailable remain.

(I don't know why there is a difference between the assembly I got from Godbolt and the one I got on my machine from the test generation; the semantics match.)
In all cases the -disable-peephole -mtriple=x86_64-pc-win32 flags were passed to llc.
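(For reference, the full invocation was along these lines; the input file name is illustrative:)

llc -mtriple=x86_64-pc-win32 -disable-peephole -post-RA-scheduler \
    -break-anti-dependencies=all xmulo.ll -o -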

@qcolombet
Collaborator

I still think it should be part of a scheduling-related implementation, because I don't want to recreate the dependency graph in MCP. My worry is that we won't track the right thing and will start breaking things, or slowly but surely re-implement the whole DDG in MCP.

I'm surprised that the postRA scheduler doesn't try to get rid of the copies, but maybe this is what you should aim at fixing.

@spaits
Contributor Author

spaits commented Jul 10, 2024

@qcolombet Yes, I agree that ending up with the data dependency analysis code duplicated is a real possibility, and that wouldn't be good.

My last concern is the following; if we resolve it, then I will move the logic:

  • post-RA-scheduler seems somewhat unstable to me: no major target uses it at any optimization level (x86, arm, riscv). It outright doesn't run on arm and riscv even if I explicitly ask for the pass with the flag that should enable it. I would like this optimization to be available everywhere after the PR is merged, since it is simple and useful.

Is post-RA-scheduler stable? How much work would it be to enable it for all targets at O3 and above? Is there any historical reason, that I don't know of and couldn't find, why it is not used anywhere? Is it not finished yet? Does it generate wrong or suboptimal code? (In my case on X86 it was interesting to me that it just renamed some registers.)

Also, MCP's backward propagation would have to be copied into post-RA-scheduler if we move the logic there.
Shouldn't we instead create a really simple dataflow analysis library for MIR (one that only operates on a basic block)? I would check what exactly happens in post-RA-scheduler and create a library that can be used by post-RA-scheduler, machine-cp, and any other passes that need dataflow-analysis-like functionality.

This way we could avoid duplicating both the dataflow analysis and the copy-propagation-related code. Also, if post-RA-scheduler and the dependency breakers are experimental (which I deduced from what I experienced) and will probably stay that way for some time, I wouldn't want this work to be tied to them, since I think the only relation between my patch's functionality and post-RA-scheduler is the need for dataflow analysis.

@s-barannikov
Contributor

s-barannikov commented Jul 10, 2024

Is post-RA-scheduler stable? How much work would it be to enable it for all targets at O3 and above? Is there any historical reason, that I don't know of and couldn't find, why it is not used anywhere? Is it not finished yet? Does it generate wrong or suboptimal code? (In my case on X86 it was interesting to me that it just renamed some registers.)

post-RA-scheduler is the legacy scheduler. The "modern" scheduler is machine-scheduler. The legacy scheduler is still used (1) for historical reasons and (2) because it is (arguably?) better suited for DSPs (more accurate resource counting at the cost of compile time). The new scheduler's lack of a dependency breaker might be a third reason, but I don't think anyone considered it a serious disadvantage.

...

I once needed the tracking functionality used by MCP for a different reason, so in my opinion it is worthwhile to make the tracker publicly available (regardless of the direction this patch goes).

@qcolombet
Collaborator

Is post-RA-scheduler stable? How much work would it be to enable it for all targets at O3 and above? Is there any historical reason, that I don't know of and couldn't find, why it is not used anywhere? Is it not finished yet? Does it generate wrong or suboptimal code? (In my case on X86 it was interesting to me that it just renamed some registers.)

post-RA-scheduler is the legacy scheduler. The "modern" scheduler is machine-scheduler. The legacy scheduler is still used (1) for historical reasons and (2) because it is (arguably?) better suited for DSPs (more accurate resource counting at the cost of compile time). The new scheduler's lack of a dependency breaker might be a third reason, but I don't think anyone considered it a serious disadvantage.

Sorry, yes, when I said post-RA scheduler, I meant the new one, but running after RA :).

...

I once needed the tracking functionality used by MCP for a different reason, so in my opinion it is worthwhile to make the tracker publicly available (regardless of the direction this patch goes).

Sounds like a plan to me.

@spaits
Contributor Author

spaits commented Jul 10, 2024

@s-barannikov @qcolombet I would like to ask some questions before I start implementing.

So I have looked at the two existing anti-dependency breakers. I found that currently neither of them moves instructions around to get rid of anti-dependencies.

In our case we must move instructions; renaming alone is not enough. The reason is that in some scenarios neither the final place of the data nor its origin can be changed. For example, if values come from calls and are passed to calls, the calling convention determines where that data must be.

For example, the anti-dependency in the MIR below cannot be resolved by renaming alone.
The value returned from chain, let's call it $1, must be in w1 for the next call.
The value loaded from the stack must be in w0. So the desired state can be achieved in two instructions, but to do that we must swap the load and the copy.

BL @chain
renamable $w8 = LDRWui %stack.0.c, 0 # Here we can't load straight into w0 because it still holds needed data.
renamable $w1 = COPY $w0 # Putting $1 into its final place for the call.
$w0 = COPY killed renamable $w8
BL @chain

BL @chain
renamable $w1 = COPY $w0 # Here $1 is in its final place.
renamable $w8 = LDRWui %stack.0.c, 0 # No anti-dependency; now machine-cp can optimize.
$w0 = COPY killed renamable $w8
BL @chain

(1) Is it OK if, from now on, it also moves instructions around?

This dependency-breaking machinery is pretty large. (2) Wouldn't it deserve its own pass?

Also, post-RA scheduling currently only happens after machine-cp on arm64 and x86, and doesn't happen at all on riscv. (3) Should I add it for riscv? (4) Is it OK if I add another machine-cp run after post-RA scheduling?

@spaits
Contributor Author

spaits commented Jul 16, 2024

Gentle ping @s-barannikov @qcolombet @arsenm. Sorry to ping you; I just want clear feedback on my plan, and to know whether the new approach detailed above is correct. (I should have done this before opening this PR, but instead I just went along with what I thought would be the correct approach after seeing https://discourse.llvm.org/t/how-to-copy-propagate-physical-register-introduced-before-ra/74828 ).

The reasons why I want an early opinion from you are:

  • It may be a bigger piece of work to integrate the existing anti-dependency breaker into the new scheduler. I also want it to be turned on by default at O3. Is it okay to use an anti-dependency breaker by default at some optimization level? Which one at which level? (There are two: Aggressive and Critical.) I would prefer Aggressive at O3.
  • Currently neither of the anti-dependency breakers is capable of resolving the dependencies I am complaining about, so I must extend one of the two implementations. Which one should I extend, Aggressive or Critical? Aggressive was suggested to me.
  • The extension I am planning would involve moving instructions around. Is it fine if the anti-dependency breaker does that? Currently it just renames.
  • I want this to work on RISC-V. Currently RISC-V doesn't have the new post-RA scheduler enabled. What should I work on to get it enabled? (Maybe @topperc could give some input on this, please?)

Sorry for the long message again. I just want to know if what I am planning to do makes sense or is reasonable before I start doing it.

I have also created a discourse post: https://discourse.llvm.org/t/anti-dependency-breaking-and-the-new-sceduler/80109.

Thank you for your time and effort!

@topperc
Collaborator

topperc commented Jul 16, 2024

I want this to work on RISCV. Currently RISCV doesn't have the new post RA scheduler enabled. What should I work on for it to be enabled? (Maybe @topperc could you give any input on this please?)

It should already be enabled when compiling for any CPU in RISCVProcessors.td that has FeaturePostRAScheduler.

@spaits spaits requested a review from michaelmaitland July 18, 2024 12:23
@spaits spaits changed the title [MachineCopyPropagation] Detect and fix suboptimal instruction order to enable optimizations [MachineCopyPropagation, Scheduler] Detect and fix suboptimal instruction order to enable optimizations Jul 18, 2024
@michaelmaitland
Contributor

michaelmaitland commented Jul 18, 2024

Note: In this comment, when I say post-RA-scheduler, I mean the PostMachineScheduler.

in the cases where an unnecessary data dependency (or dependencies) is blocking an optimization, it is recognized and resolved by moving the instruction(s) without changing the semantics

Can I ask why you think that this optimization would belong in the post-RA-scheduler?

In the first statement, you describe the problem as needing to move instructions but being unable to due to unnecessary data dependencies. If that is the problem, then the scheduler is the right pass to do this, since it is responsible for reordering instructions subject to data dependencies.

But as I read on, I started to question whether this really has to do only with scheduling. Your end goal is to eliminate unnecessary copies. Coming into the scheduler, there are copies that could have been removed because they could have been propagated.

If I understand correctly, I think people are saying that the post-RA-scheduler should be responsible for moving instructions so that the MachineCopyPropagation pass can recognize that the copy can be propagated. What you are proposing is that MachineCopyPropagation moves the instructions so that it can do the best copy propagation it can.

I can see both sides. The post-RA-scheduler is balancing many things. To name a few: (1) avoiding stalls due to latency (i.e. a dependency not being ready), (2) avoiding stalls due to resource consumption (resources not available), and (3) reducing the number of spills/reloads. With the proposal that the post-ra scheduler also order instructions to enable copy propagation, we give it yet another thing it needs to balance. There are two concerns I have with using the post-ra scheduler to enable copy propagation:

  1. Scheduling for latency, resource consumption, and reducing spills/reloads may not be effective if scheduling for MCP takes precedence.
  2. Scheduling for MCP may not be effective if scheduling for latency, resource consumption, and reducing spills/reloads takes precedence.

So by reordering inside of MCP, we can be sure that we reorder to have the best MCP and allow the post-ra scheduler to reorder for the best latency/resource consumption/spills/reloads.

On the other hand, I agree with others when they raise the worry about recreating the dependency graph in MCP. We'd really like to avoid that. Instead of recreating that infrastructure, I wonder about adapting or reusing the ScheduleDAG in MCP. With this approach, I thought about the problem of rebuilding the graph multiple times. I'm not sure how expensive it is, but if we decided that it was too expensive, I wonder if it would be possible to reuse the same ScheduleDAG across multiple passes (i.e. build once and reuse). I think we are building the graph multiple times today: once for pre-RA scheduling and once for post-RA scheduling. So that makes me believe that, at least initially, it might be okay to rebuild it a third time.
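
To make the reuse idea concrete, here is a minimal sketch, not code from this patch, of how MCP could instantiate the existing ScheduleDAGInstrs machinery purely to query dependence edges. MCPDepGraph is a hypothetical name, and the usage below assumes one scheduling region per basic block.

#include "llvm/CodeGen/ScheduleDAGInstrs.h"

// Hypothetical subclass: we only want the dependence edges computed by
// buildSchedGraph(); no scheduling is performed.
class MCPDepGraph : public llvm::ScheduleDAGInstrs {
public:
  MCPDepGraph(llvm::MachineFunction &MF, const llvm::MachineLoopInfo *MLI)
      : ScheduleDAGInstrs(MF, MLI) {}
  void schedule() override {} // the DAG is queried, never scheduled
};

// Usage sketch for one region of MBB:
//   MCPDepGraph DAG(MF, &MLI);
//   DAG.startBlock(&MBB);
//   DAG.enterRegion(&MBB, MBB.begin(), MBB.getFirstTerminator(), NumInstrs);
//   DAG.buildSchedGraph(/*AA=*/nullptr);
//   for (llvm::SUnit &SU : DAG.SUnits)
//     for (const llvm::SDep &Edge : SU.Succs)
//       if (Edge.getKind() == llvm::SDep::Anti)
//         ; // candidate edge for the reordering discussed in this thread
//   DAG.exitRegion();
//   DAG.finishBlock();

The SDep edge kinds (Data, Anti, Output, Order) already distinguish the anti-dependencies discussed here from true data dependencies, so MCP would not need its own dependency classification.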

It sounds like the dependency breaker might be able to solve the problem of reordering in the scheduler to enable MCP, if we decide to take the route that reordering should be done in the scheduler. I wonder how balancing reordering in scheduler for MCP + other things will impact existing scheduling. I think I'd like to keep the reordering of instructions to enable MCP inside the scheduler as long as it does not have trouble balancing all its tasks. If it does have trouble doing that, then if we could reuse the ScheduleDAG infrastructure in MCP, I have no strong objections to allowing MCP to reorder instructions as it wishes. It sounds like we would need consensus around this idea though.

Also, I found a small nit in your explanation:

Instruction 3 has a dependency on instruction 2 since instruction 2 uses a value that is in r1 and r1 is re-defined by 3. This means that instruction 2 must precede instruction 3.
instruction 2 must precede instruction 4.

Could you please fix your numbering of the instructions? I think your explanation sometimes refers to the instructions as 1,2,3,4 and sometimes as 0,2,3,4, so the same dependency ends up being described under two different numberings. You also refer to instruction 4 when there is no instruction 4 in the 0,1,2,3-based list.

@spaits
Contributor Author

spaits commented Jul 18, 2024

@michaelmaitland Thank you very much for your response. I will address your review in more detail tonight or tomorrow morning (Central European Time).

I have looked at the explanation. I fixed the nonexistent instruction number. I have also taken another, deeper look at the explanation and added the type of dependency (true, anti, or output) to each sentence (at least I hope each sentence :) ) in which I talk about dependencies.

I am sorry for the spelling in my description; I have fixed it.

@spaits spaits requested a review from kparzysz July 18, 2024 18:14
@spaits
Contributor Author

spaits commented Jul 18, 2024

There are two concerns I have with using the post-ra scheduler to enable copy propagation:

1. Scheduling for latency, resource consumption, and reducing spills/reloads may not be effective if scheduling for MCP takes precedence.

2. Scheduling for MCP may not be effective if scheduling for latency, resource consumption, and reducing spills/reloads takes precedence.

Yes. In the case of the C code I show in the issue description, the instruction scheduler would basically "fight" the mechanism that does the re-ordering for copy propagation:

  • The scheduler would want to prioritize the load.
  • The pre-copy-propagation re-ordering mechanism I am implementing would want to keep the load in its place, because that enables the optimization later.

On the other hand, I agree with others when they raise the worry about recreating the dependency graph in MCP.

In my current implementation I have basically created a dumber scheduler graph that only has what is needed to establish "must precede" relations between instructions. It doesn't care what kind of dependency causes the relation. I know this is not good practice, and I agree that, regardless of the direction, the scheduler DAG would be the best thing to reuse. I think the ScheduleDAG has everything I need for this optimization, and more. (Basically I need two things: recognizing data dependencies, and the graph being able to update itself when I move instructions around and their order changes.)

It sounds like the dependency breaker might be able to solve the problem of reordering in the scheduler to enable MCP, if we decide to take the route that reordering should be done in the scheduler.

@s-barannikov you mentioned the anti-dependency breaker first. Thinking about this more deeply, I figured that even if we decide not to do this in MCP, it may not be that trivial to find a new place for it in the scheduler:

We are not breaking any anti-dependency here. The anti-dependencies stay; they must stay, since in some cases they are not preventable. Anti-dependencies are unavoidable when we have a value that must be in a specific register (for example, because the calling convention says so) or must come from a specific register (the calling convention specifies the register holding the return value).

For example, on aarch64 even the optimal code must contain an anti-dependency:

bl      _Z5chainii
mov     w1, w0
ldr     w0, [sp, 28]
bl      _Z5chainii

For MCP to work optimally, we must move instructions that are in an anti-dependency relation with the "destination" of a copy propagation, but that have no dependency on any of the instructions between themselves and the "source" of the propagated copy.

So basically we just move the anti-dependency around so that it won't spoil a possible copy propagation.
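
To make that criterion concrete, here is a minimal sketch, assuming only generic CodeGen APIs and not taken from this patch, of the legality check such a reordering needs. canMoveAcross is a hypothetical helper; the blanket bail-out on calls and memory operations is deliberately conservative.

#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/TargetRegisterInfo.h"

using namespace llvm;

// Hypothetical helper: MI may be moved across the half-open range
// [Begin, End) iff no instruction in the range touches a register of MI
// in a conflicting way (use-def, def-use, or def-def).
static bool canMoveAcross(const MachineInstr &MI,
                          MachineBasicBlock::const_iterator Begin,
                          MachineBasicBlock::const_iterator End,
                          const TargetRegisterInfo &TRI) {
  for (auto It = Begin; It != End; ++It) {
    for (const MachineOperand &MO : MI.operands()) {
      if (!MO.isReg() || !MO.getReg())
        continue;
      // A redefinition of any register MI touches, or a read of a
      // register MI defines, forces an ordering between the two.
      if (It->modifiesRegister(MO.getReg(), &TRI))
        return false;
      if (MO.isDef() && It->readsRegister(MO.getReg(), &TRI))
        return false;
    }
    // Stay conservative around calls and memory; a real implementation
    // would consult alias analysis instead of giving up here.
    if (It->isCall() || It->mayLoadOrStore() || It->hasUnmodeledSideEffects())
      return false;
  }
  return true;
}

In the motivating aarch64 pattern the only instruction between the load and its desired position is a COPY, so a check like this would succeed, while the remaining anti-dependency in the optimal code above (mov w1, w0 before ldr w0, ...) is exactly the one we keep.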

It sounds like we would need consensus around this idea though.

I totally agree.

@michaelmaitland thank you very much for checking out this PR/discussion.

@spaits
Contributor Author

spaits commented Aug 2, 2024

I think the logic I want to introduce here is orthogonal to the logic already present in the anti-dependency breaker; we couldn't reuse anything from it. The anti-dependencies are found and categorized by the ScheduleDAG. The only thing the two have in common is that both check whether a dependency is an anti-dependency:

We are not breaking any anti-dependency here. The anti-dependencies stay; they must stay, since in some cases they are not preventable. Anti-dependencies are unavoidable when we have a value that must be in a specific register (for example, because the calling convention says so) or must come from a specific register (the calling convention specifies the register holding the return value).

I think the best possible solution would be to use the existing scheduler DAG in the MCP. As @michaelmaitland has suggested:

I wonder about adapting or reusing the ScheduleDAG in MCP. With this approach, I thought about the problem of rebuilding the graph multiple times. I'm not sure how expensive it is, but if we decided that it was too expensive, I wonder if it would be possible to reuse the same ScheduleDAG across multiple passes (i.e. build once and reuse). I think we are building the graph multiple times today: once for pre-RA scheduling and once for post-RA scheduling. So that makes me believe that, at least initially, it might be okay to rebuild it a third time.

Also, I don't think that building the schedule DAG would cost that much; it may already be built twice in the legacy post-RA scheduler.

What I would do first is remove the dependency graph I created.

I would use a ScheduleDAG instead, so the schedule graph would not be re-implemented.
By this I would address this concern of @qcolombet:

My worry is that we don't track the right thing and we start breaking things or slowly but surely re-implement the whole DDG in MCP.

I will do more in-depth research to see whether any current variant of ScheduleDAG can handle modification of the underlying MIR. If not, I may create a new subclass that does just that, or I will rebuild the ScheduleDAG after each instruction reordering.

Regardless of this, I am still going to work on bringing the anti-dependency breaker to the new scheduler. (I have already played around with that, with less success.)

@s-barannikov @qcolombet @michaelmaitland So what do you think of an implementation in MCP using the existing ScheduleDAG? To me it would be the most logical approach, and we would get the most code reuse this way.

Thank you very much for your attention and response in advance.

@qcolombet
Collaborator

So what do you think of an implementation in MCP using the existing ScheduleDAG?

That may work, but be careful about the compile-time aspect of such an approach. The impact needs to be negligible unless we show a significant performance boost across the board (which I doubt we'll achieve).

@spaits
Contributor Author

spaits commented Aug 5, 2024

A patch similar to this one produced a 0.2% code-size reduction for coremark on riscv32: #73924. I don't know how big an impact that is. Maybe the speed improvement would be more important, but we have not yet measured that on a real board.

@michaelmaitland
Contributor

So what do you think of an implementation in MCP using the existing ScheduleDAG?

I think this makes sense. Once you have an implementation, I am happy to look into the compile-time vs. performance impact on SPEC using SiFive hardware.

The impact needs to be negligible unless we show a significant performance boost across the board (which I doubt we'll achieve)

@qcolombet is your main concern the compile time impact or that there will be a lack of performance impact?

@qcolombet
Collaborator

@qcolombet is your main concern the compile time impact or that there will be a lack of performance impact?

Compile time. Or, more precisely: it is okay to burn compile time if we get a significant performance impact; otherwise the compile-time impact needs to be kept to a minimum.
