
DAG: Handle load in SimplifyDemandedVectorElts #122671

Open
arsenm wants to merge 1 commit into main from users/arsenm/dag/simplify-demanded-vector-elts-load

Conversation

arsenm (Contributor) commented Jan 13, 2025

This improves some AMDGPU cases and avoids future regressions.
The combiner likes to form shuffles for cases where an extract_vector_elt
would do perfectly well, and this recovers some of the regressions from
losing load narrowing.

AMDGPU, AArch64 and RISCV test changes look broadly better. Other targets have
some improvements, but mostly regressions. In particular X86 looks much
worse. I'm guessing this is because its shouldReduceLoadWidth implementation is wrong.

I mostly just regenerated the checks. I assume some set of them should
switch to use volatile loads to defeat the optimization.

arsenm added the llvm:SelectionDAG label Jan 13, 2025 (via Graphite App)
arsenm marked this pull request as ready for review January 13, 2025 07:10
llvmbot (Member) commented Jan 13, 2025

@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-backend-powerpc
@llvm/pr-subscribers-backend-arm
@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-llvm-selectiondag

Author: Matt Arsenault (arsenm)

Changes

This improves some AMDGPU cases and avoids future regressions.
The combiner likes to form shuffles for cases where an extract_vector_elt
would do perfectly well, and this recovers some of the regressions from
losing load narrowing.

AMDGPU, AArch64 and RISCV test changes look broadly better. Other targets have
some improvements, but mostly regressions. In particular X86 looks much
worse. I'm guessing this is because its shouldReduceLoadWidth implementation is wrong.

I mostly just regenerated the checks. I assume some set of them should
switch to use volatile loads to defeat the optimization.


Patch is 773.93 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/122671.diff

167 Files Affected:

  • (modified) llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp (+32)
  • (modified) llvm/test/CodeGen/AArch64/arm64-big-endian-bitconverts.ll (+27-27)
  • (modified) llvm/test/CodeGen/AArch64/dag-ReplaceAllUsesOfValuesWith.ll (+1-4)
  • (modified) llvm/test/CodeGen/AArch64/fcmp.ll (+21-22)
  • (modified) llvm/test/CodeGen/AArch64/fmlal-loreg.ll (+4-4)
  • (modified) llvm/test/CodeGen/AArch64/icmp.ll (+8-9)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-extract-vector-elt.ll (+24-90)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-masked-gather.ll (+1-2)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-masked-scatter.ll (+5-5)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-extract-vector-elt.ll (+6-24)
  • (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f32.ll (+19-20)
  • (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f64.ll (+19-19)
  • (modified) llvm/test/CodeGen/AMDGPU/greedy-reverse-local-assignment.ll (+10-11)
  • (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/implicit-kernarg-backend-usage.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/shader-addr64-nonuniform.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/trunc.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/vector_rebroadcast.ll (+1329-1328)
  • (modified) llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll (+102-148)
  • (modified) llvm/test/CodeGen/ARM/crash-on-pow2-shufflevector.ll (+3-2)
  • (modified) llvm/test/CodeGen/ARM/vector-promotion.ll (+15-15)
  • (modified) llvm/test/CodeGen/ARM/vext.ll (+4-3)
  • (modified) llvm/test/CodeGen/ARM/vuzp.ll (+7-7)
  • (modified) llvm/test/CodeGen/LoongArch/vector-fp-imm.ll (+1-2)
  • (modified) llvm/test/CodeGen/Mips/cconv/vector.ll (+43-22)
  • (modified) llvm/test/CodeGen/Mips/msa/basic_operations.ll (+76-20)
  • (modified) llvm/test/CodeGen/NVPTX/i128.ll (+5-5)
  • (modified) llvm/test/CodeGen/NVPTX/i8x4-instructions.ll (+24-24)
  • (modified) llvm/test/CodeGen/NVPTX/store-undef.ll (+2-2)
  • (modified) llvm/test/CodeGen/PowerPC/aix-vector-byval-callee.ll (+2-2)
  • (modified) llvm/test/CodeGen/PowerPC/canonical-merge-shuffles.ll (+2-7)
  • (modified) llvm/test/CodeGen/PowerPC/const-stov.ll (+8-7)
  • (modified) llvm/test/CodeGen/PowerPC/pr27078.ll (+12-10)
  • (modified) llvm/test/CodeGen/PowerPC/pre-inc-disable.ll (+16-12)
  • (modified) llvm/test/CodeGen/PowerPC/v16i8_scalar_to_vector_shuffle.ll (+78-42)
  • (modified) llvm/test/CodeGen/PowerPC/v2i64_scalar_to_vector_shuffle.ll (+111-101)
  • (modified) llvm/test/CodeGen/PowerPC/v8i16_scalar_to_vector_shuffle.ll (+4-2)
  • (modified) llvm/test/CodeGen/PowerPC/vsx_shuffle_le.ll (+44-60)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-extract.ll (+69-132)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-shuffles.ll (+17-10)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp.ll (+3-9)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-int.ll (+108-81)
  • (modified) llvm/test/CodeGen/Thumb2/mve-extractstore.ll (+10-14)
  • (modified) llvm/test/CodeGen/Thumb2/mve-insertshuffleload.ll (+28-20)
  • (modified) llvm/test/CodeGen/X86/SwizzleShuff.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/avx-vbroadcast.ll (+3-3)
  • (modified) llvm/test/CodeGen/X86/avx.ll (+4-2)
  • (modified) llvm/test/CodeGen/X86/avx1-logical-load-folding.ll (+16-12)
  • (modified) llvm/test/CodeGen/X86/avx512-arith.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/avx512-broadcast-arith.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/avx512-broadcast-unfold.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/avx512-calling-conv.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/avx512-cmp.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/avx512-ext.ll (+9-9)
  • (modified) llvm/test/CodeGen/X86/avx512-extract-subvector-load-store.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/avx512-intrinsics-fast-isel.ll (+37-24)
  • (modified) llvm/test/CodeGen/X86/avx512-intrinsics.ll (+10-10)
  • (modified) llvm/test/CodeGen/X86/avx512-load-store.ll (+16-16)
  • (modified) llvm/test/CodeGen/X86/avx512-logic.ll (+7-7)
  • (modified) llvm/test/CodeGen/X86/avx512-select.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/avx512-shuffles/partial_permute.ll (+407-312)
  • (modified) llvm/test/CodeGen/X86/avx512-shuffles/shuffle-interleave.ll (+18-13)
  • (modified) llvm/test/CodeGen/X86/avx512-shuffles/unpack.ll (+22-18)
  • (modified) llvm/test/CodeGen/X86/avx512fp16-mov.ll (+5-3)
  • (modified) llvm/test/CodeGen/X86/bitreverse.ll (+6-4)
  • (modified) llvm/test/CodeGen/X86/buildvec-insertvec.ll (+6-10)
  • (modified) llvm/test/CodeGen/X86/combine-fabs.ll (+6-3)
  • (modified) llvm/test/CodeGen/X86/combine-sdiv.ll (+8-4)
  • (modified) llvm/test/CodeGen/X86/combine-udiv.ll (+6-2)
  • (modified) llvm/test/CodeGen/X86/commute-blend-avx2.ll (+4-2)
  • (modified) llvm/test/CodeGen/X86/commute-blend-sse41.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/copysign-constant-magnitude.ll (+16-8)
  • (modified) llvm/test/CodeGen/X86/extract-concat.ll (+6-4)
  • (modified) llvm/test/CodeGen/X86/extractelement-fp.ll (+4-3)
  • (modified) llvm/test/CodeGen/X86/extractelement-load.ll (+40-19)
  • (modified) llvm/test/CodeGen/X86/fabs.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/fast-isel-fneg.ll (+3-2)
  • (modified) llvm/test/CodeGen/X86/fma-signed-zero.ll (+4-2)
  • (modified) llvm/test/CodeGen/X86/fp-fold.ll (+15-12)
  • (modified) llvm/test/CodeGen/X86/fp-intrinsics-fma.ll (+28-14)
  • (modified) llvm/test/CodeGen/X86/fp-logic.ll (+8-4)
  • (modified) llvm/test/CodeGen/X86/fp-round.ll (+30-27)
  • (modified) llvm/test/CodeGen/X86/fp128-cast.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/fp16-libcalls.ll (+8-4)
  • (modified) llvm/test/CodeGen/X86/freeze-vector.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/gfni-funnel-shifts.ll (+32-30)
  • (modified) llvm/test/CodeGen/X86/half.ll (+9-8)
  • (modified) llvm/test/CodeGen/X86/insert-into-constant-vector.ll (+26-26)
  • (modified) llvm/test/CodeGen/X86/insertps-combine.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/insertps-from-constantpool.ll (+4-2)
  • (modified) llvm/test/CodeGen/X86/insertps-unfold-load-bug.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/is_fpclass.ll (+6-4)
  • (modified) llvm/test/CodeGen/X86/isel-blendi-gettargetconstant.ll (+5-2)
  • (modified) llvm/test/CodeGen/X86/load-partial.ll (-4)
  • (modified) llvm/test/CodeGen/X86/masked_load.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/masked_store.ll (+7-8)
  • (modified) llvm/test/CodeGen/X86/mmx-arith.ll (+4-7)
  • (modified) llvm/test/CodeGen/X86/neg_fp.ll (+3-2)
  • (modified) llvm/test/CodeGen/X86/negative-sin.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/packus.ll (+38-22)
  • (modified) llvm/test/CodeGen/X86/peephole-fold-movsd.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/pr14161.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/pr30511.ll (+3-2)
  • (modified) llvm/test/CodeGen/X86/pr31956.ll (+5-4)
  • (modified) llvm/test/CodeGen/X86/pr34592.ll (+18-18)
  • (modified) llvm/test/CodeGen/X86/pr36553.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/pr40811.ll (+5-4)
  • (modified) llvm/test/CodeGen/X86/pr63091.ll (+4-3)
  • (modified) llvm/test/CodeGen/X86/sar_fold64.ll (+10-6)
  • (modified) llvm/test/CodeGen/X86/setcc-combine.ll (+3-3)
  • (modified) llvm/test/CodeGen/X86/setcc-non-simple-type.ll (+11-15)
  • (modified) llvm/test/CodeGen/X86/shrink_vmul.ll (+16-12)
  • (modified) llvm/test/CodeGen/X86/shuffle-vs-trunc-512.ll (+14-10)
  • (modified) llvm/test/CodeGen/X86/splat-for-size.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/sqrt-fastmath-tune.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/sqrt-fastmath-tunecpu-attr.ll (+4-2)
  • (modified) llvm/test/CodeGen/X86/sqrt-fastmath.ll (+6-4)
  • (modified) llvm/test/CodeGen/X86/srem-seteq-vec-nonsplat.ll (+76-62)
  • (modified) llvm/test/CodeGen/X86/sse-align-12.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/sse2.ll (+48-37)
  • (modified) llvm/test/CodeGen/X86/sse3.ll (+14-12)
  • (modified) llvm/test/CodeGen/X86/sse41.ll (+84-100)
  • (modified) llvm/test/CodeGen/X86/strict-fsub-combines.ll (+17-11)
  • (modified) llvm/test/CodeGen/X86/subvector-broadcast.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/test-shrink-bug.ll (+5-2)
  • (modified) llvm/test/CodeGen/X86/tuning-shuffle-unpckpd-avx512.ll (+43-146)
  • (modified) llvm/test/CodeGen/X86/tuning-shuffle-unpckpd.ll (+14-38)
  • (modified) llvm/test/CodeGen/X86/urem-seteq-vec-tautological.ll (+5-3)
  • (modified) llvm/test/CodeGen/X86/vec_insert-5.ll (+20-39)
  • (modified) llvm/test/CodeGen/X86/vec_int_to_fp.ll (+60-60)
  • (modified) llvm/test/CodeGen/X86/vec_shift5.ll (+10-18)
  • (modified) llvm/test/CodeGen/X86/vec_umulo.ll (+8-4)
  • (modified) llvm/test/CodeGen/X86/vector-bitreverse.ll (+3-2)
  • (modified) llvm/test/CodeGen/X86/vector-constrained-fp-intrinsics-flags.ll (+3-25)
  • (modified) llvm/test/CodeGen/X86/vector-constrained-fp-intrinsics.ll (+9-12)
  • (modified) llvm/test/CodeGen/X86/vector-fshl-256.ll (+24-20)
  • (modified) llvm/test/CodeGen/X86/vector-fshl-512.ll (+18-11)
  • (modified) llvm/test/CodeGen/X86/vector-fshr-256.ll (+4)
  • (modified) llvm/test/CodeGen/X86/vector-fshr-512.ll (+18-11)
  • (modified) llvm/test/CodeGen/X86/vector-idiv-sdiv-512.ll (+3-3)
  • (modified) llvm/test/CodeGen/X86/vector-reduce-fmax-nnan.ll (+9-19)
  • (modified) llvm/test/CodeGen/X86/vector-reduce-fmin.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/vector-rotate-128.ll (+16-26)
  • (modified) llvm/test/CodeGen/X86/vector-rotate-256.ll (+13-13)
  • (modified) llvm/test/CodeGen/X86/vector-rotate-512.ll (+26-26)
  • (modified) llvm/test/CodeGen/X86/vector-shift-ashr-128.ll (+33-21)
  • (modified) llvm/test/CodeGen/X86/vector-shift-ashr-256.ll (+20-11)
  • (modified) llvm/test/CodeGen/X86/vector-shift-ashr-512.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/vector-shift-lshr-128.ll (+25-15)
  • (modified) llvm/test/CodeGen/X86/vector-shift-lshr-256.ll (+20-11)
  • (modified) llvm/test/CodeGen/X86/vector-shift-lshr-512.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/vector-shift-shl-128.ll (+25-15)
  • (modified) llvm/test/CodeGen/X86/vector-shift-shl-256.ll (+21-12)
  • (modified) llvm/test/CodeGen/X86/vector-shift-shl-512.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/vector-shuffle-128-v2.ll (+7-23)
  • (modified) llvm/test/CodeGen/X86/vector-shuffle-128-v4.ll (+39-32)
  • (modified) llvm/test/CodeGen/X86/vector-shuffle-combining-avx2.ll (+9-10)
  • (modified) llvm/test/CodeGen/X86/vector-shuffle-combining-ssse3.ll (+8-3)
  • (modified) llvm/test/CodeGen/X86/vector-shuffle-combining.ll (+80-52)
  • (modified) llvm/test/CodeGen/X86/vector-shuffle-v1.ll (+3-3)
  • (modified) llvm/test/CodeGen/X86/vector-shuffle-v192.ll (+8-8)
  • (modified) llvm/test/CodeGen/X86/vector-shuffle-v48.ll (+3-3)
  • (modified) llvm/test/CodeGen/X86/vselect.ll (+21-9)
  • (modified) llvm/test/CodeGen/X86/widened-broadcast.ll (+25-37)
  • (modified) llvm/test/CodeGen/X86/x86-interleaved-access.ll (+19-16)
  • (modified) llvm/test/CodeGen/X86/xop-shifts.ll (+5-2)
  • (modified) llvm/test/CodeGen/X86/xor.ll (+10-4)
diff --git a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
index b1fb4947fb9451..0e6be878d38cb8 100644
--- a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -3478,6 +3478,38 @@ bool TargetLowering::SimplifyDemandedVectorElts(
 
     break;
   }
+  case ISD::LOAD: {
+    auto *Ld = cast<LoadSDNode>(Op);
+    if (!ISD::isNormalLoad(Ld) || !Ld->isSimple())
+      break;
+
+    // TODO: Handle arbitrary vector extract for isMask
+    if (DemandedElts.popcount() != 1)
+      break;
+
+    EVT VT = Ld->getValueType(0);
+    if (TLO.LegalOperations() &&
+        !isOperationLegalOrCustom(ISD::INSERT_VECTOR_ELT,
+                                  VT /*, IsAfterLegalization*/))
+      break;
+
+    EVT EltVT = VT.getVectorElementType();
+    SDLoc DL(Ld);
+
+    unsigned Idx = DemandedElts.countTrailingZeros();
+
+    SDValue IdxVal = TLO.DAG.getVectorIdxConstant(Idx, DL);
+    SDValue Scalarized =
+        scalarizeExtractedVectorLoad(EltVT, DL, VT, IdxVal, Ld, TLO.DAG);
+    if (!Scalarized)
+      break;
+
+    TLO.DAG.ReplaceAllUsesOfValueWith(SDValue(Ld, 1), Scalarized.getValue(1));
+
+    SDValue Insert = TLO.DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, VT,
+                                     TLO.DAG.getUNDEF(VT), Scalarized, IdxVal);
+    return TLO.CombineTo(Op, Insert);
+  }
   case ISD::VECTOR_SHUFFLE: {
     SDValue LHS = Op.getOperand(0);
     SDValue RHS = Op.getOperand(1);
diff --git a/llvm/test/CodeGen/AArch64/arm64-big-endian-bitconverts.ll b/llvm/test/CodeGen/AArch64/arm64-big-endian-bitconverts.ll
index f5aa4c666a5681..e9a4a83a406838 100644
--- a/llvm/test/CodeGen/AArch64/arm64-big-endian-bitconverts.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-big-endian-bitconverts.ll
@@ -30,7 +30,7 @@ define void @test_i64_v2f32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: rev64 v{{[0-9]+}}.2s
 ; CHECK: str
-    %1 = load <2 x float>, ptr %p
+    %1 = load volatile <2 x float>, ptr %p
     %2 = fadd <2 x float> %1, %1
     %3 = bitcast <2 x float> %2 to i64
     %4 = add i64 %3, %3
@@ -43,7 +43,7 @@ define void @test_i64_v2i32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: rev64 v{{[0-9]+}}.2s
 ; CHECK: str
-    %1 = load <2 x i32>, ptr %p
+    %1 = load volatile <2 x i32>, ptr %p
     %2 = add <2 x i32> %1, %1
     %3 = bitcast <2 x i32> %2 to i64
     %4 = add i64 %3, %3
@@ -121,7 +121,7 @@ define void @test_f64_v2f32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: rev64 v{{[0-9]+}}.2s
 ; CHECK: str
-    %1 = load <2 x float>, ptr %p
+    %1 = load volatile <2 x float>, ptr %p
     %2 = fadd <2 x float> %1, %1
     %3 = bitcast <2 x float> %2 to double
     %4 = fadd double %3, %3
@@ -134,7 +134,7 @@ define void @test_f64_v2i32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: rev64 v{{[0-9]+}}.2s
 ; CHECK: str
-    %1 = load <2 x i32>, ptr %p
+    %1 = load volatile <2 x i32>, ptr %p
     %2 = add <2 x i32> %1, %1
     %3 = bitcast <2 x i32> %2 to double
     %4 = fadd double %3, %3
@@ -213,7 +213,7 @@ define void @test_v1i64_v2f32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: rev64 v{{[0-9]+}}.2s
 ; CHECK: str
-    %1 = load <2 x float>, ptr %p
+    %1 = load volatile <2 x float>, ptr %p
     %2 = fadd <2 x float> %1, %1
     %3 = bitcast <2 x float> %2 to <1 x i64>
     %4 = add <1 x i64> %3, %3
@@ -226,7 +226,7 @@ define void @test_v1i64_v2i32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: rev64 v{{[0-9]+}}.2s
 ; CHECK: str
-    %1 = load <2 x i32>, ptr %p
+    %1 = load volatile <2 x i32>, ptr %p
     %2 = add <2 x i32> %1, %1
     %3 = bitcast <2 x i32> %2 to <1 x i64>
     %4 = add <1 x i64> %3, %3
@@ -318,7 +318,7 @@ define void @test_v2f32_v1i64(ptr %p, ptr %q) {
 define void @test_v2f32_v2i32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: st1 { v{{[0-9]+}}.2s }
-    %1 = load <2 x i32>, ptr %p
+    %1 = load volatile <2 x i32>, ptr %p
     %2 = add <2 x i32> %1, %1
     %3 = bitcast <2 x i32> %2 to <2 x float>
     %4 = fadd <2 x float> %3, %3
@@ -410,7 +410,7 @@ define void @test_v2i32_v1i64(ptr %p, ptr %q) {
 define void @test_v2i32_v2f32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: st1 { v{{[0-9]+}}.2s }
-    %1 = load <2 x float>, ptr %p
+    %1 = load volatile <2 x float>, ptr %p
     %2 = fadd <2 x float> %1, %1
     %3 = bitcast <2 x float> %2 to <2 x i32>
     %4 = add <2 x i32> %3, %3
@@ -488,7 +488,7 @@ define void @test_v4i16_v2f32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: rev32 v{{[0-9]+}}.4h
 ; CHECK: st1 { v{{[0-9]+}}.4h }
-    %1 = load <2 x float>, ptr %p
+    %1 = load volatile <2 x float>, ptr %p
     %2 = fadd <2 x float> %1, %1
     %3 = bitcast <2 x float> %2 to <4 x i16>
     %4 = add <4 x i16> %3, %3
@@ -501,7 +501,7 @@ define void @test_v4i16_v2i32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: rev32 v{{[0-9]+}}.4h
 ; CHECK: st1 { v{{[0-9]+}}.4h }
-    %1 = load <2 x i32>, ptr %p
+    %1 = load volatile <2 x i32>, ptr %p
     %2 = add <2 x i32> %1, %1
     %3 = bitcast <2 x i32> %2 to <4 x i16>
     %4 = add <4 x i16> %3, %3
@@ -587,7 +587,7 @@ define void @test_v4f16_v2f32(ptr %p, ptr %q) {
 ; CHECK: fadd
 ; CHECK-NOT: rev
 ; CHECK: st1 { v{{[0-9]+}}.4h }
-    %1 = load <2 x float>, ptr %p
+    %1 = load volatile <2 x float>, ptr %p
     %2 = fadd <2 x float> %1, %1
     %3 = bitcast <2 x float> %2 to <4 x half>
     %4 = fadd <4 x half> %3, %3
@@ -602,7 +602,7 @@ define void @test_v4f16_v2i32(ptr %p, ptr %q) {
 ; CHECK: fadd
 ; CHECK-NOT: rev
 ; CHECK: st1 { v{{[0-9]+}}.4h }
-    %1 = load <2 x i32>, ptr %p
+    %1 = load volatile <2 x i32>, ptr %p
     %2 = add <2 x i32> %1, %1
     %3 = bitcast <2 x i32> %2 to <4 x half>
     %4 = fadd <4 x half> %3, %3
@@ -682,7 +682,7 @@ define void @test_v8i8_v2f32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: rev32 v{{[0-9]+}}.8b
 ; CHECK: st1 { v{{[0-9]+}}.8b }
-    %1 = load <2 x float>, ptr %p
+    %1 = load volatile <2 x float>, ptr %p
     %2 = fadd <2 x float> %1, %1
     %3 = bitcast <2 x float> %2 to <8 x i8>
     %4 = add <8 x i8> %3, %3
@@ -695,7 +695,7 @@ define void @test_v8i8_v2i32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: rev32 v{{[0-9]+}}.8b
 ; CHECK: st1 { v{{[0-9]+}}.8b }
-    %1 = load <2 x i32>, ptr %p
+    %1 = load volatile <2 x i32>, ptr %p
     %2 = add <2 x i32> %1, %1
     %3 = bitcast <2 x i32> %2 to <8 x i8>
     %4 = add <8 x i8> %3, %3
@@ -721,7 +721,7 @@ define void @test_f128_v2f64(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2d }
 ; CHECK: ext
 ; CHECK: str
-    %1 = load <2 x double>, ptr %p
+    %1 = load volatile <2 x double>, ptr %p
     %2 = fadd <2 x double> %1, %1
     %3 = bitcast <2 x double> %2 to fp128
     %4 = fadd fp128 %3, %3
@@ -734,7 +734,7 @@ define void @test_f128_v2i64(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2d }
 ; CHECK: ext
 ; CHECK: str
-    %1 = load <2 x i64>, ptr %p
+    %1 = load volatile <2 x i64>, ptr %p
     %2 = add <2 x i64> %1, %1
     %3 = bitcast <2 x i64> %2 to fp128
     %4 = fadd fp128 %3, %3
@@ -816,7 +816,7 @@ define void @test_v2f64_f128(ptr %p, ptr %q) {
 define void @test_v2f64_v2i64(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2d }
 ; CHECK: st1 { v{{[0-9]+}}.2d }
-    %1 = load <2 x i64>, ptr %p
+    %1 = load volatile <2 x i64>, ptr %p
     %2 = add <2 x i64> %1, %1
     %3 = bitcast <2 x i64> %2 to <2 x double>
     %4 = fadd <2 x double> %3, %3
@@ -895,7 +895,7 @@ define void @test_v2i64_f128(ptr %p, ptr %q) {
 define void @test_v2i64_v2f64(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2d }
 ; CHECK: st1 { v{{[0-9]+}}.2d }
-    %1 = load <2 x double>, ptr %p
+    %1 = load volatile <2 x double>, ptr %p
     %2 = fadd <2 x double> %1, %1
     %3 = bitcast <2 x double> %2 to <2 x i64>
     %4 = add <2 x i64> %3, %3
@@ -979,7 +979,7 @@ define void @test_v4f32_v2f64(ptr %p, ptr %q) {
 ; CHECK: rev64 v{{[0-9]+}}.4s
 ; CHECK-NOT: rev
 ; CHECK: st1 { v{{[0-9]+}}.4s }
-    %1 = load <2 x double>, ptr %p
+    %1 = load volatile <2 x double>, ptr %p
     %2 = fadd <2 x double> %1, %1
     %3 = bitcast <2 x double> %2 to <4 x float>
     %4 = fadd <4 x float> %3, %3
@@ -994,7 +994,7 @@ define void @test_v4f32_v2i64(ptr %p, ptr %q) {
 ; CHECK: fadd
 ; CHECK-NOT: rev
 ; CHECK: st1 { v{{[0-9]+}}.4s }
-    %1 = load <2 x i64>, ptr %p
+    %1 = load volatile <2 x i64>, ptr %p
     %2 = add <2 x i64> %1, %1
     %3 = bitcast <2 x i64> %2 to <4 x float>
     %4 = fadd <4 x float> %3, %3
@@ -1062,7 +1062,7 @@ define void @test_v4i32_v2f64(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2d }
 ; CHECK: rev64 v{{[0-9]+}}.4s
 ; CHECK: st1 { v{{[0-9]+}}.4s }
-    %1 = load <2 x double>, ptr %p
+    %1 = load volatile <2 x double>, ptr %p
     %2 = fadd <2 x double> %1, %1
     %3 = bitcast <2 x double> %2 to <4 x i32>
     %4 = add <4 x i32> %3, %3
@@ -1075,7 +1075,7 @@ define void @test_v4i32_v2i64(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2d }
 ; CHECK: rev64 v{{[0-9]+}}.4s
 ; CHECK: st1 { v{{[0-9]+}}.4s }
-    %1 = load <2 x i64>, ptr %p
+    %1 = load volatile <2 x i64>, ptr %p
     %2 = add <2 x i64> %1, %1
     %3 = bitcast <2 x i64> %2 to <4 x i32>
     %4 = add <4 x i32> %3, %3
@@ -1141,7 +1141,7 @@ define void @test_v8i16_v2f64(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2d }
 ; CHECK: rev64 v{{[0-9]+}}.8h
 ; CHECK: st1 { v{{[0-9]+}}.8h }
-    %1 = load <2 x double>, ptr %p
+    %1 = load volatile <2 x double>, ptr %p
     %2 = fadd <2 x double> %1, %1
     %3 = bitcast <2 x double> %2 to <8 x i16>
     %4 = add <8 x i16> %3, %3
@@ -1154,7 +1154,7 @@ define void @test_v8i16_v2i64(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2d }
 ; CHECK: rev64 v{{[0-9]+}}.8h
 ; CHECK: st1 { v{{[0-9]+}}.8h }
-    %1 = load <2 x i64>, ptr %p
+    %1 = load volatile <2 x i64>, ptr %p
     %2 = add <2 x i64> %1, %1
     %3 = bitcast <2 x i64> %2 to <8 x i16>
     %4 = add <8 x i16> %3, %3
@@ -1234,7 +1234,7 @@ define void @test_v16i8_v2f64(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2d }
 ; CHECK: rev64 v{{[0-9]+}}.16b
 ; CHECK: st1 { v{{[0-9]+}}.16b }
-    %1 = load <2 x double>, ptr %p
+    %1 = load volatile <2 x double>, ptr %p
     %2 = fadd <2 x double> %1, %1
     %3 = bitcast <2 x double> %2 to <16 x i8>
     %4 = add <16 x i8> %3, %3
@@ -1247,7 +1247,7 @@ define void @test_v16i8_v2i64(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2d }
 ; CHECK: rev64 v{{[0-9]+}}.16b
 ; CHECK: st1 { v{{[0-9]+}}.16b }
-    %1 = load <2 x i64>, ptr %p
+    %1 = load volatile <2 x i64>, ptr %p
     %2 = add <2 x i64> %1, %1
     %3 = bitcast <2 x i64> %2 to <16 x i8>
     %4 = add <16 x i8> %3, %3
@@ -1315,7 +1315,7 @@ define %struct.struct1 @test_v4f16_struct(ptr %ret) {
 entry:
 ; CHECK: ld1 { {{v[0-9]+}}.4h }
 ; CHECK-NOT: rev
-  %0 = load <4 x half>, ptr %ret, align 2
+  %0 = load volatile <4 x half>, ptr %ret, align 2
   %1 = extractelement <4 x half> %0, i32 0
   %.fca.0.insert = insertvalue %struct.struct1 undef, half %1, 0
   ret %struct.struct1 %.fca.0.insert
diff --git a/llvm/test/CodeGen/AArch64/dag-ReplaceAllUsesOfValuesWith.ll b/llvm/test/CodeGen/AArch64/dag-ReplaceAllUsesOfValuesWith.ll
index d76e817e62a495..ce657aa1f0b5bc 100644
--- a/llvm/test/CodeGen/AArch64/dag-ReplaceAllUsesOfValuesWith.ll
+++ b/llvm/test/CodeGen/AArch64/dag-ReplaceAllUsesOfValuesWith.ll
@@ -27,10 +27,7 @@
 define i64 @g(ptr %p) {
 ; CHECK-LABEL: g:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldr x8, [x0, #8]
-; CHECK-NEXT:    add x9, x8, x8
-; CHECK-NEXT:    add x8, x9, x8
-; CHECK-NEXT:    sub x0, x8, x8
+; CHECK-NEXT:    mov x0, xzr
 ; CHECK-NEXT:    ret
   %vec = load <2 x i64>, ptr %p, align 1
   %elt = extractelement <2 x i64> %vec, i32 1
diff --git a/llvm/test/CodeGen/AArch64/fcmp.ll b/llvm/test/CodeGen/AArch64/fcmp.ll
index 66f26fc9d85973..d39e537edb7861 100644
--- a/llvm/test/CodeGen/AArch64/fcmp.ll
+++ b/llvm/test/CodeGen/AArch64/fcmp.ll
@@ -679,28 +679,27 @@ define <3 x double> @v3f128_double(<3 x fp128> %a, <3 x fp128> %b, <3 x double>
 ; CHECK-SD-NEXT:    .cfi_def_cfa_offset 160
 ; CHECK-SD-NEXT:    .cfi_offset w30, -16
 ; CHECK-SD-NEXT:    stp q2, q5, [sp, #112] // 32-byte Folded Spill
+; CHECK-SD-NEXT:    add x8, sp, #176
 ; CHECK-SD-NEXT:    // kill: def $d6 killed $d6 def $q6
 ; CHECK-SD-NEXT:    // kill: def $d7 killed $d7 def $q7
-; CHECK-SD-NEXT:    ldr d5, [sp, #184]
-; CHECK-SD-NEXT:    str q3, [sp, #64] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    ldp d3, d2, [sp, #168]
+; CHECK-SD-NEXT:    str q3, [sp, #32] // 16-byte Folded Spill
+; CHECK-SD-NEXT:    ldp d3, d2, [sp, #160]
 ; CHECK-SD-NEXT:    mov v6.d[1], v7.d[0]
 ; CHECK-SD-NEXT:    str q0, [sp, #16] // 16-byte Folded Spill
 ; CHECK-SD-NEXT:    mov v0.16b, v1.16b
 ; CHECK-SD-NEXT:    mov v1.16b, v4.16b
-; CHECK-SD-NEXT:    str q5, [sp, #96] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    ldr d5, [sp, #160]
-; CHECK-SD-NEXT:    mov v3.d[1], v2.d[0]
-; CHECK-SD-NEXT:    str q5, [sp, #80] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    stp q6, q3, [sp, #32] // 32-byte Folded Spill
+; CHECK-SD-NEXT:    ld1 { v2.d }[1], [x8]
+; CHECK-SD-NEXT:    stp q6, q3, [sp, #80] // 32-byte Folded Spill
+; CHECK-SD-NEXT:    str q2, [sp, #48] // 16-byte Folded Spill
+; CHECK-SD-NEXT:    ldr d2, [sp, #184]
+; CHECK-SD-NEXT:    str q2, [sp, #64] // 16-byte Folded Spill
 ; CHECK-SD-NEXT:    bl __lttf2
 ; CHECK-SD-NEXT:    cmp w0, #0
-; CHECK-SD-NEXT:    ldr q1, [sp, #64] // 16-byte Folded Reload
 ; CHECK-SD-NEXT:    cset w8, lt
 ; CHECK-SD-NEXT:    sbfx x8, x8, #0, #1
 ; CHECK-SD-NEXT:    fmov d0, x8
 ; CHECK-SD-NEXT:    str q0, [sp] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    ldr q0, [sp, #16] // 16-byte Folded Reload
+; CHECK-SD-NEXT:    ldp q0, q1, [sp, #16] // 32-byte Folded Reload
 ; CHECK-SD-NEXT:    bl __lttf2
 ; CHECK-SD-NEXT:    cmp w0, #0
 ; CHECK-SD-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
@@ -708,19 +707,19 @@ define <3 x double> @v3f128_double(<3 x fp128> %a, <3 x fp128> %b, <3 x double>
 ; CHECK-SD-NEXT:    sbfx x8, x8, #0, #1
 ; CHECK-SD-NEXT:    fmov d1, x8
 ; CHECK-SD-NEXT:    mov v1.d[1], v0.d[0]
-; CHECK-SD-NEXT:    str q1, [sp, #64] // 16-byte Folded Spill
+; CHECK-SD-NEXT:    str q1, [sp, #32] // 16-byte Folded Spill
 ; CHECK-SD-NEXT:    ldp q0, q1, [sp, #112] // 32-byte Folded Reload
 ; CHECK-SD-NEXT:    bl __lttf2
-; CHECK-SD-NEXT:    ldp q1, q0, [sp, #32] // 32-byte Folded Reload
+; CHECK-SD-NEXT:    ldp q0, q3, [sp, #80] // 32-byte Folded Reload
 ; CHECK-SD-NEXT:    cmp w0, #0
-; CHECK-SD-NEXT:    ldp q2, q4, [sp, #64] // 32-byte Folded Reload
+; CHECK-SD-NEXT:    ldp q2, q1, [sp, #32] // 32-byte Folded Reload
 ; CHECK-SD-NEXT:    cset w8, lt
 ; CHECK-SD-NEXT:    sbfx x8, x8, #0, #1
-; CHECK-SD-NEXT:    ldr q3, [sp, #96] // 16-byte Folded Reload
+; CHECK-SD-NEXT:    ldr q4, [sp, #64] // 16-byte Folded Reload
 ; CHECK-SD-NEXT:    ldr x30, [sp, #144] // 8-byte Folded Reload
-; CHECK-SD-NEXT:    bit v0.16b, v1.16b, v2.16b
+; CHECK-SD-NEXT:    bif v0.16b, v1.16b, v2.16b
 ; CHECK-SD-NEXT:    fmov d2, x8
-; CHECK-SD-NEXT:    bsl v2.16b, v4.16b, v3.16b
+; CHECK-SD-NEXT:    bsl v2.16b, v3.16b, v4.16b
 ; CHECK-SD-NEXT:    ext v1.16b, v0.16b, v0.16b, #8
 ; CHECK-SD-NEXT:    // kill: def $d0 killed $d0 killed $q0
 ; CHECK-SD-NEXT:    // kill: def $d2 killed $d2 killed $q2
@@ -815,20 +814,20 @@ define <3 x double> @v3f64_double(<3 x double> %a, <3 x double> %b, <3 x double>
 ; CHECK-SD-NEXT:    // kill: def $d1 killed $d1 def $q1
 ; CHECK-SD-NEXT:    // kill: def $d6 killed $d6 def $q6
 ; CHECK-SD-NEXT:    // kill: def $d7 killed $d7 def $q7
+; CHECK-SD-NEXT:    add x8, sp, #16
 ; CHECK-SD-NEXT:    // kill: def $d2 killed $d2 def $q2
 ; CHECK-SD-NEXT:    // kill: def $d5 killed $d5 def $q5
-; CHECK-SD-NEXT:    ldr d16, [sp, #24]
-; CHECK-SD-NEXT:    ldr d17, [sp]
 ; CHECK-SD-NEXT:    mov v3.d[1], v4.d[0]
 ; CHECK-SD-NEXT:    mov v0.d[1], v1.d[0]
 ; CHECK-SD-NEXT:    mov v6.d[1], v7.d[0]
-; CHECK-SD-NEXT:    ldp d1, d4, [sp, #8]
 ; CHECK-SD-NEXT:    fcmgt v2.2d, v5.2d, v2.2d
-; CHECK-SD-NEXT:    mov v1.d[1], v4.d[0]
 ; CHECK-SD-NEXT:    fcmgt v0.2d, v3.2d, v0.2d
-; CHECK-SD-NEXT:    bsl v2.16b, v17.16b, v16.16b
-; CHECK-SD-NEXT:    // kill: def $d2 killed $d2 killed $q2
+; CHECK-SD-NEXT:    ldp d3, d1, [sp]
+; CHECK-SD-NEXT:    ld1 { v1.d }[1], [x8]
 ; CHECK-SD-NEXT:    bsl v0.16b, v6.16b, v1.16b
+; CHECK-SD-NEXT:    ldr d1, [sp, #24]
+; CHECK-SD-NEXT:    bsl v2.16b, v3.16b, v1.16b
+; CHECK-SD-NEXT:    // kill: def $d2 killed $d2 killed $q2
 ; CHECK-SD-NEXT:    ext v1.16b, v0.16b, v0.16b, #8
 ; CHECK-SD-NEXT:    // kill: def $d0 killed $d0 killed $q0
 ; CHECK-SD-NEXT:    // kill: def $d1 killed $d1 killed $q1
diff --git a/llvm/test/CodeGen/AArch64/fmlal-loreg.ll b/llvm/test/CodeGen/AArch64/fmlal-loreg.ll
index 31ead890ba8ac7..ed22243eeef45f 100644
--- a/llvm/test/CodeGen/AArch64/fmlal-loreg.ll
+++ b/llvm/test/CodeGen/AArch64/fmlal-loreg.ll
@@ -45,11 +45,11 @@ define void @loop(ptr %out_tile, ptr %lhs_panel, ptr %rhs_panel, i32 noundef %K,
 ; CHECK-NEXT:    mov w8, w3
 ; CHECK-NEXT:  .LBB1_1: // %for.body
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    ldr q2, [x1], #2
+; CHECK-NEXT:    ldr q2, [x2], #2
 ; CHECK-NEXT:    subs x8, x8, #1
-; CHECK-NEXT:    ldr q3, [x2], #2
-; CHECK-NEXT:    fmlal v0.4s, v3.4h, v2.h[0]
-; CHECK-NEXT:    fmlal2 v1.4s, v3.4h, v2.h[0]
+; CHECK-NEXT:    ld1r { v3.8h }, [x1], #2
+; CHECK-NEXT:    fmlal v0.4s, v2.4h, v3.4h
+; CHECK-NEXT:    fmlal2 v1.4s, v2.4h, v3.4h
 ; CHECK-NEXT:    b.ne .LBB1_1
 ; CHECK-NEXT:  // %bb.2: // %for.cond.cleanup
 ; CHECK-NEXT:    stp q0, q1, [x0]
diff --git a/llvm/test/CodeGen/AArch64/icmp.ll b/llvm/test/CodeGen/AArch64/icmp.ll
index e284795760c5ca..f586647439d255 100644
--- a/llvm/test/CodeGen/AArch64/icmp.ll
+++ b/llvm/test/CodeGen/AArch64/icmp.ll
@@ -1123,30 +1123,29 @@ entry:
 define <3 x i64> @v3i64_i64(<3 x i64> %a, <3 x i64> %b, <3 x i64> %d, <3 x i64> %e) {
 ; CHECK-SD-LABEL: v3i64_i64:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    // kill: def $d4 killed $d4 def $q4
 ; CHECK-SD-NEXT:    // kill: def $d3 killed $d3 def $q3
-; CHECK-SD-NEXT:    // kill: def $d1 killed $d1 def $q1
 ; CHECK-SD-NEXT:    // kill: def $d0 killed $d0 def $q0
+; CHECK-SD-NEXT:    // kill: def $d4 killed $d4 def $q4
+; CHECK-SD-NEXT:    // kill: def $d1 killed $d1 def $q1
 ; CHECK-SD-NEXT:    // kill: def $d6 killed $d6 def $q6
 ; CHECK-SD-NEXT:    // kill: def $d7 killed $d7 def $q7
+; CHECK-SD-NEXT:    add x8, sp, #16
 ; CHECK-SD-NEXT:    // kill: def $d5 killed $d5 def $q5
 ; CHECK-SD-NEXT:    // kill: def $d2 killed $d2 def $q2
-; CHECK-SD-NEXT:    ldr d16, [sp, #24]
-; CHECK-SD-NEXT:    ldr d17, [sp]
 ; CHECK-SD-NEXT:    mov v3.d[1], v4.d[0]
 ; CHECK-SD-NEXT:    mov v0.d[1], v1.d[0]
 ; CHECK-SD-NEXT:    mov v6.d[1], v7.d[0]
-; CHECK-SD-NEXT:    ldp d1, d4, [sp, #8]
-; CHECK-SD-NEXT:    mov v1.d[1], v4.d[0]
+; CHECK-SD-NEXT:    ldp d4, d1, [sp]
+; CHECK-SD-NEXT:    ld1 { v1.d }[1], [x8]
 ; CHECK-SD-NEXT:    cmgt v0.2d, v3.2d, v0.2d
 ; CHECK-SD-NEXT:    bsl v0.16b, v6.16b, v1.16b
 ; CHECK-SD-NEXT:    cmgt v1.2d, v5.2d, v2.2d
-; CHECK-SD-NEXT:    mov v2.16b, v1.16b
+; CHECK-SD-NEXT:    ldr d2, [sp, #24]
+; CHECK-SD-NEXT:    bit v2.16b, v4.16b, v1.16b
 ; CHECK-SD-NEXT:    ext v1.16b, v0.16b, v0.16b, #8
 ; CHECK-SD-NEXT:    // kill: def $d0 killed $d0 killed $q0
-; CHECK-SD-NEXT:    // kill: def $d1 killed $d1 killed $q1
-; CHECK-SD-NEXT:    bsl v2.16b, v17.16b, v16.16b
 ; CHECK-SD-NEXT:    // kill: def $d2 killed $d2 killed $q2
+; CHECK-SD-NEXT:    // kill: def $d1 killed $d1 killed $q1
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: v3i64_i64:
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-vector-elt.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-vector-elt.ll
index ad4efeaf39247a..1e6427c4cd4956 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-vector-elt.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-vector-elt.ll
@@ -33,10 +33,7 @@ define half @extractelement_v8f16(<8 x half> %op1) vscale_range(2,0) #0 {
 define half @extractelement_v16f16(ptr %a) vscale_range(2,0) #0 {
 ; CHECK-LABEL: extractelement_v16f16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.h, vl16
-; CHECK-NEXT:    ld1h { z0.h }, p0/z, [x0]
-; CHECK-NEXT:    mov z0.h, z0.h[15]
-; CHECK-NEXT:    // kill: def $h0 killed $h0 killed $z0
+; CHECK-NEXT:    ldr h0, [x0, #30]
 ; CHECK-NEXT:    ret
     %op1 = load <16 x half>, ptr %a
  ...
[truncated]

RKSimon (Collaborator) commented Jan 13, 2025

X86TargetLowering::shouldReduceLoadWidth is a mess, resulting in a lot of duplicate aliased loads that make very little sense - we're seeing something similar on #122485, but it might take some time to unravel.
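For context on the hook being blamed here: shouldReduceLoadWidth is the TargetLowering callback DAGCombiner consults before replacing a wide (typically vector) load with a narrower one. The sketch below is a hypothetical target override, not the actual X86 implementation; it assumes the hook signature as it stood around the time of this PR, and the class name and heuristic are illustrative only.

```cpp
#include "llvm/CodeGen/SelectionDAGNodes.h"
#include "llvm/CodeGen/TargetLowering.h"

using namespace llvm;

// Hypothetical target override, for illustration only; the real
// X86TargetLowering::shouldReduceLoadWidth has far more subtarget- and
// type-specific logic than this.
class ExampleTargetLowering : public TargetLowering {
public:
  using TargetLowering::TargetLowering;

  bool shouldReduceLoadWidth(SDNode *N, ISD::LoadExtType ExtTy,
                             EVT NewVT) const override {
    // If other nodes still consume the wide loaded value, narrowing here
    // just introduces a second, aliased load of the same address.
    if (auto *Ld = dyn_cast<LoadSDNode>(N))
      if (!Ld->hasNUsesOfValue(1, /*Value=*/0))
        return false;
    // Otherwise defer to the generic default.
    return TargetLowering::shouldReduceLoadWidth(N, ExtTy, NewVT);
  }
};
```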

topperc added a commit to topperc/llvm-project that referenced this pull request Jan 13, 2025
These test cases weren't trying to test load+extract. I believe
they only used loads because fixed vector arguments weren't supported
when they were written or they were copied from the structure of
other tests that pre-date fixed vector argument support.

Reduces diff from llvm#122671.
; CHECK-NEXT: vslidedown.vi v8, v8, 2
; CHECK-NEXT: vmv.x.s a0, v8
; CHECK-NEXT: ret
; RV32-LABEL: extractelt_v4i32:
Collaborator commented:

I think most of these tests were only using loads because fixed vector arguments weren't supported when the test was written.

; CHECK-NEXT: vle32.v v8, (a0)
; CHECK-NEXT: vmv.x.s a0, v8
; CHECK-NEXT: ret
; RV32-LABEL: vreduce_add_v1i32:
Collaborator commented:

Same story as fixed-vectors-extract.ll. This test wasn't interested in loads.

; CHECK-NEXT: vsetivli zero, 1, e16, mf4, ta, ma
; CHECK-NEXT: vle16.v v8, (a0)
; CHECK-NEXT: vfmv.f.s fa5, v8
; CHECK-NEXT: flh fa5, 0(a0)
Collaborator commented:

Same story as fixed-vectors-extract.ll. This test wasn't interested in loads.

topperc added a commit to topperc/llvm-project that referenced this pull request Jan 13, 2025
…. NFC

These tests weren't interested in the loads. Removing them reduces the
diffs from llvm#122671.
topperc added a commit to topperc/llvm-project that referenced this pull request Jan 13, 2025
These test cases weren't trying to test load+extract. I believe
they only used loads because fixed vector arguments weren't supported
when they were written or they were copied from the structure of
other tests that pre-date fixed vector argument support.

Reduces diff from llvm#122671.
topperc added a commit that referenced this pull request Jan 14, 2025
These test cases weren't trying to test load+extract. I believe they
only used loads because fixed vector arguments weren't supported when
they were written or they were copied from the structure of other
tests that pre-date fixed vector argument support.

Reduces diff from #122671.
topperc added a commit that referenced this pull request Jan 14, 2025
…. NFC (#122808)

These tests weren't interested in the loads. Removing them reduces the
diffs from #122671.
arsenm force-pushed the users/arsenm/dag-move-scalarize-extracted-vector-load-to-tli branch from 9bedb14 to 65e9c1b on January 29, 2025 16:33
arsenm force-pushed the users/arsenm/dag/simplify-demanded-vector-elts-load branch from 68ca84a to 8f15ec9 on January 29, 2025 16:33
arsenm force-pushed the users/arsenm/dag/simplify-demanded-vector-elts-load branch from 8f15ec9 to b745947 on February 4, 2025 09:32
arsenm force-pushed the users/arsenm/dag-move-scalarize-extracted-vector-load-to-tli branch from 65e9c1b to b7d320b on February 4, 2025 09:32
Base automatically changed from users/arsenm/dag-move-scalarize-extracted-vector-load-to-tli to main on February 4, 2025 10:37
RKSimon (Collaborator) commented Feb 4, 2025

@arsenm Please can you rebase this and then I'll see what I can do to help with the x86 regressions

This improves some AMDGPU cases and avoids future regressions.
The combiner likes to form shuffles for cases where an extract_vector_elt
would do perfectly well, and this recovers some of the regressions from
losing load narrowing.

AMDGPU, AArch64 and RISCV test changes look broadly better. Other targets have
some improvements, but mostly regressions. In particular X86 looks much
worse. I'm guessing this is because its shouldReduceLoadWidth implementation is wrong.

I mostly just regenerated the checks. I assume some set of them should
switch to use volatile loads to defeat the optimization.
arsenm force-pushed the users/arsenm/dag/simplify-demanded-vector-elts-load branch from b745947 to 1cae79c on February 6, 2025 16:45
RKSimon (Collaborator) commented Feb 12, 2025

I'm still looking at the x86 mess - but something I've hit is that the hasOneUse() checks on the shouldReduceLoadWidth callback are often getting confused by extra uses of the load node's chain - is there anything we can do to clean that up? (See also #126764)

arsenm (Contributor, Author) commented Feb 12, 2025

I'm still looking at the x86 mess - but something I've hit is that the hasOneUse() checks on the shouldReduceLoadWidth callback are often getting confused by extra uses of the load node's chain - is there anything we can do to clean that up? (See also #126764)

SDNode::hasOneUse checks are rarely the correct thing over SDValue::hasOneUse. We should probably rename the SDNode version; it's too easy to mix up N->hasOneUse vs. N.hasOneUse
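A minimal sketch of the distinction being made here, using the existing SelectionDAG use-count queries (the helper function name is made up for illustration):

```cpp
#include "llvm/CodeGen/SelectionDAGNodes.h"

using namespace llvm;

// Illustrative helper: three ways to ask "does this load have one use?",
// which answer two different questions.
static void loadUseCountQueries(LoadSDNode *Ld) {
  // SDNode-level: counts users of *any* result of the node, so a user of
  // the chain (result 1) makes this false even when the loaded value
  // itself has exactly one user.
  bool NodeHasOneUse = Ld->hasOneUse();

  // SDValue-level: counts users of result 0 only, i.e. the loaded value.
  bool ValueHasOneUse = SDValue(Ld, 0).hasOneUse();

  // The same value-level question, spelled through the node API.
  bool ValueHasOneUseAlt = Ld->hasNUsesOfValue(1, /*Value=*/0);

  (void)NodeHasOneUse;
  (void)ValueHasOneUse;
  (void)ValueHasOneUseAlt;
}
```

The #128167 change referenced below is essentially switching the guard in shouldReduceLoadWidth from the first, node-level form to one of the value-level forms.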

RKSimon added a commit to RKSimon/llvm-project that referenced this pull request Feb 21, 2025
…ded value, not the chain etc.

The hasOneUse check was failing in any case where the load was part of a chain - we should only be checking if the loaded value has one use, and any updates to the chain should be handled by the fold calling shouldReduceLoadWidth.

I've updated the x86 implementation to match, although it has no effect here yet (I'm still looking at how to improve the x86 implementation) as the inner for loop was discarding chain uses anyway.

By using hasNUsesOfValue instead, this patch exposes a missing dependency on the LLVMSelectionDAG library in a lot of tools + unittests; we can either update the CMakeLists.txt dependencies or make SDNode::hasNUsesOfValue inline - no strong opinions on this tbh.

Noticed while fighting the x86 regressions in llvm#122671
RKSimon added a commit to RKSimon/llvm-project that referenced this pull request Feb 24, 2025
…ded value, not the chain etc.

The hasOneUse check was failing in any case where the load was part of a chain - we should only be checking if the loaded value has one use, and any updates to the chain should be handled by the fold calling shouldReduceLoadWidth.

I've updated the x86 implementation to match, although it has no effect here yet (I'm still looking at how to improve the x86 implementation) as the inner for loop was discarding chain uses anyway.

By using SDValue::hasOneUse instead this patch exposes a missing dependency on the LLVMSelectionDAG library in a lot of tools + unittests, which resulted in having to make SDNode::hasNUsesOfValue inline.

Noticed while fighting the x86 regressions in llvm#122671
RKSimon added a commit that referenced this pull request Feb 24, 2025
…value - not the chain (#128167)

The hasOneUse check was failing in any case where the load was part of a chain - we should only be checking if the loaded value has one use, and any updates to the chain should be handled by the fold calling shouldReduceLoadWidth.

I've updated the x86 implementation to match, although it has no effect here yet (I'm still looking at how to improve the x86 implementation) as the inner for loop was discarding chain uses anyway.

By using SDValue::hasOneUse instead this patch exposes a missing dependency on the LLVMSelectionDAG library in a lot of tools + unittests, which resulted in having to make SDNode::hasNUsesOfValue inline.

Noticed while fighting the x86 regressions in #122671
RKSimon added a commit to RKSimon/llvm-project that referenced this pull request Feb 26, 2025
…st(p0) if either load is oneuse

This fold is currently limited to cases where the load_subv(p0) has oneuse, but it's beneficial if either load has oneuse and will be replaced.

Yet another yak shave for llvm#122671
RKSimon added a commit that referenced this pull request Feb 27, 2025
…st(p0) if either load is oneuse (#128857)

This fold is currently limited to cases where the load_subv(p0) has oneuse, but it's beneficial if either load has oneuse and will be replaced.

Yet another yak shave for #122671
joaosaffran pushed a commit to joaosaffran/llvm-project that referenced this pull request Mar 3, 2025
…st(p0) if either load is oneuse (llvm#128857)

This fold is currently limited to cases where the load_subv(p0) has oneuse, but it's beneficial if either load has oneuse and will be replaced.

Yet another yak shave for llvm#122671