
DAG: Handle load in SimplifyDemandedVectorElts #122671

Open
arsenm wants to merge 1 commit into main from users/arsenm/dag/simplify-demanded-vector-elts-load

Conversation

arsenm (Contributor) commented Jan 13, 2025

This improves some AMDGPU cases and avoids future regressions.
The combiner likes to form shuffles for cases where an extract_vector_elt
would do perfectly well, and this recovers some of the regressions from
losing load narrowing.

AMDGPU, AArch64 and RISCV test changes look broadly better. Other targets have
some improvements, but mostly regressions. In particular X86 looks much
worse. I'm guessing this is because its shouldReduceLoadWidth implementation is wrong.

I mostly just regenerated the checks. I assume some set of them should
switch to use volatile loads to defeat the optimization.

arsenm added the llvm:SelectionDAG label Jan 13, 2025 (via Graphite App)
arsenm marked this pull request as ready for review January 13, 2025 07:10
llvmbot (Member) commented Jan 13, 2025

@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-backend-powerpc
@llvm/pr-subscribers-backend-arm
@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-llvm-selectiondag

Author: Matt Arsenault (arsenm)

Changes

This improves some AMDGPU cases and avoids future regressions.
The combiner likes to form shuffles for cases where an extract_vector_elt
would do perfectly well, and this recovers some of the regressions from
losing load narrowing.

AMDGPU, AArch64 and RISCV test changes look broadly better. Other targets have
some improvements, but mostly regressions. In particular X86 looks much
worse. I'm guessing this is because its shouldReduceLoadWidth implementation is wrong.

I mostly just regenerated the checks. I assume some set of them should
switch to use volatile loads to defeat the optimization.


Patch is 773.93 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/122671.diff

167 Files Affected:

  • (modified) llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp (+32)
  • (modified) llvm/test/CodeGen/AArch64/arm64-big-endian-bitconverts.ll (+27-27)
  • (modified) llvm/test/CodeGen/AArch64/dag-ReplaceAllUsesOfValuesWith.ll (+1-4)
  • (modified) llvm/test/CodeGen/AArch64/fcmp.ll (+21-22)
  • (modified) llvm/test/CodeGen/AArch64/fmlal-loreg.ll (+4-4)
  • (modified) llvm/test/CodeGen/AArch64/icmp.ll (+8-9)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-extract-vector-elt.ll (+24-90)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-masked-gather.ll (+1-2)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-masked-scatter.ll (+5-5)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-extract-vector-elt.ll (+6-24)
  • (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f32.ll (+19-20)
  • (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f64.ll (+19-19)
  • (modified) llvm/test/CodeGen/AMDGPU/greedy-reverse-local-assignment.ll (+10-11)
  • (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/implicit-kernarg-backend-usage.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/shader-addr64-nonuniform.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/trunc.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/vector_rebroadcast.ll (+1329-1328)
  • (modified) llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll (+102-148)
  • (modified) llvm/test/CodeGen/ARM/crash-on-pow2-shufflevector.ll (+3-2)
  • (modified) llvm/test/CodeGen/ARM/vector-promotion.ll (+15-15)
  • (modified) llvm/test/CodeGen/ARM/vext.ll (+4-3)
  • (modified) llvm/test/CodeGen/ARM/vuzp.ll (+7-7)
  • (modified) llvm/test/CodeGen/LoongArch/vector-fp-imm.ll (+1-2)
  • (modified) llvm/test/CodeGen/Mips/cconv/vector.ll (+43-22)
  • (modified) llvm/test/CodeGen/Mips/msa/basic_operations.ll (+76-20)
  • (modified) llvm/test/CodeGen/NVPTX/i128.ll (+5-5)
  • (modified) llvm/test/CodeGen/NVPTX/i8x4-instructions.ll (+24-24)
  • (modified) llvm/test/CodeGen/NVPTX/store-undef.ll (+2-2)
  • (modified) llvm/test/CodeGen/PowerPC/aix-vector-byval-callee.ll (+2-2)
  • (modified) llvm/test/CodeGen/PowerPC/canonical-merge-shuffles.ll (+2-7)
  • (modified) llvm/test/CodeGen/PowerPC/const-stov.ll (+8-7)
  • (modified) llvm/test/CodeGen/PowerPC/pr27078.ll (+12-10)
  • (modified) llvm/test/CodeGen/PowerPC/pre-inc-disable.ll (+16-12)
  • (modified) llvm/test/CodeGen/PowerPC/v16i8_scalar_to_vector_shuffle.ll (+78-42)
  • (modified) llvm/test/CodeGen/PowerPC/v2i64_scalar_to_vector_shuffle.ll (+111-101)
  • (modified) llvm/test/CodeGen/PowerPC/v8i16_scalar_to_vector_shuffle.ll (+4-2)
  • (modified) llvm/test/CodeGen/PowerPC/vsx_shuffle_le.ll (+44-60)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-extract.ll (+69-132)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-shuffles.ll (+17-10)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp.ll (+3-9)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-int.ll (+108-81)
  • (modified) llvm/test/CodeGen/Thumb2/mve-extractstore.ll (+10-14)
  • (modified) llvm/test/CodeGen/Thumb2/mve-insertshuffleload.ll (+28-20)
  • (modified) llvm/test/CodeGen/X86/SwizzleShuff.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/avx-vbroadcast.ll (+3-3)
  • (modified) llvm/test/CodeGen/X86/avx.ll (+4-2)
  • (modified) llvm/test/CodeGen/X86/avx1-logical-load-folding.ll (+16-12)
  • (modified) llvm/test/CodeGen/X86/avx512-arith.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/avx512-broadcast-arith.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/avx512-broadcast-unfold.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/avx512-calling-conv.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/avx512-cmp.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/avx512-ext.ll (+9-9)
  • (modified) llvm/test/CodeGen/X86/avx512-extract-subvector-load-store.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/avx512-intrinsics-fast-isel.ll (+37-24)
  • (modified) llvm/test/CodeGen/X86/avx512-intrinsics.ll (+10-10)
  • (modified) llvm/test/CodeGen/X86/avx512-load-store.ll (+16-16)
  • (modified) llvm/test/CodeGen/X86/avx512-logic.ll (+7-7)
  • (modified) llvm/test/CodeGen/X86/avx512-select.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/avx512-shuffles/partial_permute.ll (+407-312)
  • (modified) llvm/test/CodeGen/X86/avx512-shuffles/shuffle-interleave.ll (+18-13)
  • (modified) llvm/test/CodeGen/X86/avx512-shuffles/unpack.ll (+22-18)
  • (modified) llvm/test/CodeGen/X86/avx512fp16-mov.ll (+5-3)
  • (modified) llvm/test/CodeGen/X86/bitreverse.ll (+6-4)
  • (modified) llvm/test/CodeGen/X86/buildvec-insertvec.ll (+6-10)
  • (modified) llvm/test/CodeGen/X86/combine-fabs.ll (+6-3)
  • (modified) llvm/test/CodeGen/X86/combine-sdiv.ll (+8-4)
  • (modified) llvm/test/CodeGen/X86/combine-udiv.ll (+6-2)
  • (modified) llvm/test/CodeGen/X86/commute-blend-avx2.ll (+4-2)
  • (modified) llvm/test/CodeGen/X86/commute-blend-sse41.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/copysign-constant-magnitude.ll (+16-8)
  • (modified) llvm/test/CodeGen/X86/extract-concat.ll (+6-4)
  • (modified) llvm/test/CodeGen/X86/extractelement-fp.ll (+4-3)
  • (modified) llvm/test/CodeGen/X86/extractelement-load.ll (+40-19)
  • (modified) llvm/test/CodeGen/X86/fabs.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/fast-isel-fneg.ll (+3-2)
  • (modified) llvm/test/CodeGen/X86/fma-signed-zero.ll (+4-2)
  • (modified) llvm/test/CodeGen/X86/fp-fold.ll (+15-12)
  • (modified) llvm/test/CodeGen/X86/fp-intrinsics-fma.ll (+28-14)
  • (modified) llvm/test/CodeGen/X86/fp-logic.ll (+8-4)
  • (modified) llvm/test/CodeGen/X86/fp-round.ll (+30-27)
  • (modified) llvm/test/CodeGen/X86/fp128-cast.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/fp16-libcalls.ll (+8-4)
  • (modified) llvm/test/CodeGen/X86/freeze-vector.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/gfni-funnel-shifts.ll (+32-30)
  • (modified) llvm/test/CodeGen/X86/half.ll (+9-8)
  • (modified) llvm/test/CodeGen/X86/insert-into-constant-vector.ll (+26-26)
  • (modified) llvm/test/CodeGen/X86/insertps-combine.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/insertps-from-constantpool.ll (+4-2)
  • (modified) llvm/test/CodeGen/X86/insertps-unfold-load-bug.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/is_fpclass.ll (+6-4)
  • (modified) llvm/test/CodeGen/X86/isel-blendi-gettargetconstant.ll (+5-2)
  • (modified) llvm/test/CodeGen/X86/load-partial.ll (-4)
  • (modified) llvm/test/CodeGen/X86/masked_load.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/masked_store.ll (+7-8)
  • (modified) llvm/test/CodeGen/X86/mmx-arith.ll (+4-7)
  • (modified) llvm/test/CodeGen/X86/neg_fp.ll (+3-2)
  • (modified) llvm/test/CodeGen/X86/negative-sin.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/packus.ll (+38-22)
  • (modified) llvm/test/CodeGen/X86/peephole-fold-movsd.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/pr14161.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/pr30511.ll (+3-2)
  • (modified) llvm/test/CodeGen/X86/pr31956.ll (+5-4)
  • (modified) llvm/test/CodeGen/X86/pr34592.ll (+18-18)
  • (modified) llvm/test/CodeGen/X86/pr36553.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/pr40811.ll (+5-4)
  • (modified) llvm/test/CodeGen/X86/pr63091.ll (+4-3)
  • (modified) llvm/test/CodeGen/X86/sar_fold64.ll (+10-6)
  • (modified) llvm/test/CodeGen/X86/setcc-combine.ll (+3-3)
  • (modified) llvm/test/CodeGen/X86/setcc-non-simple-type.ll (+11-15)
  • (modified) llvm/test/CodeGen/X86/shrink_vmul.ll (+16-12)
  • (modified) llvm/test/CodeGen/X86/shuffle-vs-trunc-512.ll (+14-10)
  • (modified) llvm/test/CodeGen/X86/splat-for-size.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/sqrt-fastmath-tune.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/sqrt-fastmath-tunecpu-attr.ll (+4-2)
  • (modified) llvm/test/CodeGen/X86/sqrt-fastmath.ll (+6-4)
  • (modified) llvm/test/CodeGen/X86/srem-seteq-vec-nonsplat.ll (+76-62)
  • (modified) llvm/test/CodeGen/X86/sse-align-12.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/sse2.ll (+48-37)
  • (modified) llvm/test/CodeGen/X86/sse3.ll (+14-12)
  • (modified) llvm/test/CodeGen/X86/sse41.ll (+84-100)
  • (modified) llvm/test/CodeGen/X86/strict-fsub-combines.ll (+17-11)
  • (modified) llvm/test/CodeGen/X86/subvector-broadcast.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/test-shrink-bug.ll (+5-2)
  • (modified) llvm/test/CodeGen/X86/tuning-shuffle-unpckpd-avx512.ll (+43-146)
  • (modified) llvm/test/CodeGen/X86/tuning-shuffle-unpckpd.ll (+14-38)
  • (modified) llvm/test/CodeGen/X86/urem-seteq-vec-tautological.ll (+5-3)
  • (modified) llvm/test/CodeGen/X86/vec_insert-5.ll (+20-39)
  • (modified) llvm/test/CodeGen/X86/vec_int_to_fp.ll (+60-60)
  • (modified) llvm/test/CodeGen/X86/vec_shift5.ll (+10-18)
  • (modified) llvm/test/CodeGen/X86/vec_umulo.ll (+8-4)
  • (modified) llvm/test/CodeGen/X86/vector-bitreverse.ll (+3-2)
  • (modified) llvm/test/CodeGen/X86/vector-constrained-fp-intrinsics-flags.ll (+3-25)
  • (modified) llvm/test/CodeGen/X86/vector-constrained-fp-intrinsics.ll (+9-12)
  • (modified) llvm/test/CodeGen/X86/vector-fshl-256.ll (+24-20)
  • (modified) llvm/test/CodeGen/X86/vector-fshl-512.ll (+18-11)
  • (modified) llvm/test/CodeGen/X86/vector-fshr-256.ll (+4)
  • (modified) llvm/test/CodeGen/X86/vector-fshr-512.ll (+18-11)
  • (modified) llvm/test/CodeGen/X86/vector-idiv-sdiv-512.ll (+3-3)
  • (modified) llvm/test/CodeGen/X86/vector-reduce-fmax-nnan.ll (+9-19)
  • (modified) llvm/test/CodeGen/X86/vector-reduce-fmin.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/vector-rotate-128.ll (+16-26)
  • (modified) llvm/test/CodeGen/X86/vector-rotate-256.ll (+13-13)
  • (modified) llvm/test/CodeGen/X86/vector-rotate-512.ll (+26-26)
  • (modified) llvm/test/CodeGen/X86/vector-shift-ashr-128.ll (+33-21)
  • (modified) llvm/test/CodeGen/X86/vector-shift-ashr-256.ll (+20-11)
  • (modified) llvm/test/CodeGen/X86/vector-shift-ashr-512.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/vector-shift-lshr-128.ll (+25-15)
  • (modified) llvm/test/CodeGen/X86/vector-shift-lshr-256.ll (+20-11)
  • (modified) llvm/test/CodeGen/X86/vector-shift-lshr-512.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/vector-shift-shl-128.ll (+25-15)
  • (modified) llvm/test/CodeGen/X86/vector-shift-shl-256.ll (+21-12)
  • (modified) llvm/test/CodeGen/X86/vector-shift-shl-512.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/vector-shuffle-128-v2.ll (+7-23)
  • (modified) llvm/test/CodeGen/X86/vector-shuffle-128-v4.ll (+39-32)
  • (modified) llvm/test/CodeGen/X86/vector-shuffle-combining-avx2.ll (+9-10)
  • (modified) llvm/test/CodeGen/X86/vector-shuffle-combining-ssse3.ll (+8-3)
  • (modified) llvm/test/CodeGen/X86/vector-shuffle-combining.ll (+80-52)
  • (modified) llvm/test/CodeGen/X86/vector-shuffle-v1.ll (+3-3)
  • (modified) llvm/test/CodeGen/X86/vector-shuffle-v192.ll (+8-8)
  • (modified) llvm/test/CodeGen/X86/vector-shuffle-v48.ll (+3-3)
  • (modified) llvm/test/CodeGen/X86/vselect.ll (+21-9)
  • (modified) llvm/test/CodeGen/X86/widened-broadcast.ll (+25-37)
  • (modified) llvm/test/CodeGen/X86/x86-interleaved-access.ll (+19-16)
  • (modified) llvm/test/CodeGen/X86/xop-shifts.ll (+5-2)
  • (modified) llvm/test/CodeGen/X86/xor.ll (+10-4)
diff --git a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
index b1fb4947fb9451..0e6be878d38cb8 100644
--- a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -3478,6 +3478,38 @@ bool TargetLowering::SimplifyDemandedVectorElts(
 
     break;
   }
+  case ISD::LOAD: {
+    auto *Ld = cast<LoadSDNode>(Op);
+    if (!ISD::isNormalLoad(Ld) || !Ld->isSimple())
+      break;
+
+    // TODO: Handle arbitrary vector extract for isMask
+    if (DemandedElts.popcount() != 1)
+      break;
+
+    EVT VT = Ld->getValueType(0);
+    if (TLO.LegalOperations() &&
+        !isOperationLegalOrCustom(ISD::INSERT_VECTOR_ELT,
+                                  VT /*, IsAfterLegalization*/))
+      break;
+
+    EVT EltVT = VT.getVectorElementType();
+    SDLoc DL(Ld);
+
+    unsigned Idx = DemandedElts.countTrailingZeros();
+
+    SDValue IdxVal = TLO.DAG.getVectorIdxConstant(Idx, DL);
+    SDValue Scalarized =
+        scalarizeExtractedVectorLoad(EltVT, DL, VT, IdxVal, Ld, TLO.DAG);
+    if (!Scalarized)
+      break;
+
+    TLO.DAG.ReplaceAllUsesOfValueWith(SDValue(Ld, 1), Scalarized.getValue(1));
+
+    SDValue Insert = TLO.DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, VT,
+                                     TLO.DAG.getUNDEF(VT), Scalarized, IdxVal);
+    return TLO.CombineTo(Op, Insert);
+  }
   case ISD::VECTOR_SHUFFLE: {
     SDValue LHS = Op.getOperand(0);
     SDValue RHS = Op.getOperand(1);
diff --git a/llvm/test/CodeGen/AArch64/arm64-big-endian-bitconverts.ll b/llvm/test/CodeGen/AArch64/arm64-big-endian-bitconverts.ll
index f5aa4c666a5681..e9a4a83a406838 100644
--- a/llvm/test/CodeGen/AArch64/arm64-big-endian-bitconverts.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-big-endian-bitconverts.ll
@@ -30,7 +30,7 @@ define void @test_i64_v2f32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: rev64 v{{[0-9]+}}.2s
 ; CHECK: str
-    %1 = load <2 x float>, ptr %p
+    %1 = load volatile <2 x float>, ptr %p
     %2 = fadd <2 x float> %1, %1
     %3 = bitcast <2 x float> %2 to i64
     %4 = add i64 %3, %3
@@ -43,7 +43,7 @@ define void @test_i64_v2i32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: rev64 v{{[0-9]+}}.2s
 ; CHECK: str
-    %1 = load <2 x i32>, ptr %p
+    %1 = load volatile <2 x i32>, ptr %p
     %2 = add <2 x i32> %1, %1
     %3 = bitcast <2 x i32> %2 to i64
     %4 = add i64 %3, %3
@@ -121,7 +121,7 @@ define void @test_f64_v2f32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: rev64 v{{[0-9]+}}.2s
 ; CHECK: str
-    %1 = load <2 x float>, ptr %p
+    %1 = load volatile <2 x float>, ptr %p
     %2 = fadd <2 x float> %1, %1
     %3 = bitcast <2 x float> %2 to double
     %4 = fadd double %3, %3
@@ -134,7 +134,7 @@ define void @test_f64_v2i32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: rev64 v{{[0-9]+}}.2s
 ; CHECK: str
-    %1 = load <2 x i32>, ptr %p
+    %1 = load volatile <2 x i32>, ptr %p
     %2 = add <2 x i32> %1, %1
     %3 = bitcast <2 x i32> %2 to double
     %4 = fadd double %3, %3
@@ -213,7 +213,7 @@ define void @test_v1i64_v2f32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: rev64 v{{[0-9]+}}.2s
 ; CHECK: str
-    %1 = load <2 x float>, ptr %p
+    %1 = load volatile <2 x float>, ptr %p
     %2 = fadd <2 x float> %1, %1
     %3 = bitcast <2 x float> %2 to <1 x i64>
     %4 = add <1 x i64> %3, %3
@@ -226,7 +226,7 @@ define void @test_v1i64_v2i32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: rev64 v{{[0-9]+}}.2s
 ; CHECK: str
-    %1 = load <2 x i32>, ptr %p
+    %1 = load volatile <2 x i32>, ptr %p
     %2 = add <2 x i32> %1, %1
     %3 = bitcast <2 x i32> %2 to <1 x i64>
     %4 = add <1 x i64> %3, %3
@@ -318,7 +318,7 @@ define void @test_v2f32_v1i64(ptr %p, ptr %q) {
 define void @test_v2f32_v2i32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: st1 { v{{[0-9]+}}.2s }
-    %1 = load <2 x i32>, ptr %p
+    %1 = load volatile <2 x i32>, ptr %p
     %2 = add <2 x i32> %1, %1
     %3 = bitcast <2 x i32> %2 to <2 x float>
     %4 = fadd <2 x float> %3, %3
@@ -410,7 +410,7 @@ define void @test_v2i32_v1i64(ptr %p, ptr %q) {
 define void @test_v2i32_v2f32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: st1 { v{{[0-9]+}}.2s }
-    %1 = load <2 x float>, ptr %p
+    %1 = load volatile <2 x float>, ptr %p
     %2 = fadd <2 x float> %1, %1
     %3 = bitcast <2 x float> %2 to <2 x i32>
     %4 = add <2 x i32> %3, %3
@@ -488,7 +488,7 @@ define void @test_v4i16_v2f32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: rev32 v{{[0-9]+}}.4h
 ; CHECK: st1 { v{{[0-9]+}}.4h }
-    %1 = load <2 x float>, ptr %p
+    %1 = load volatile <2 x float>, ptr %p
     %2 = fadd <2 x float> %1, %1
     %3 = bitcast <2 x float> %2 to <4 x i16>
     %4 = add <4 x i16> %3, %3
@@ -501,7 +501,7 @@ define void @test_v4i16_v2i32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: rev32 v{{[0-9]+}}.4h
 ; CHECK: st1 { v{{[0-9]+}}.4h }
-    %1 = load <2 x i32>, ptr %p
+    %1 = load volatile <2 x i32>, ptr %p
     %2 = add <2 x i32> %1, %1
     %3 = bitcast <2 x i32> %2 to <4 x i16>
     %4 = add <4 x i16> %3, %3
@@ -587,7 +587,7 @@ define void @test_v4f16_v2f32(ptr %p, ptr %q) {
 ; CHECK: fadd
 ; CHECK-NOT: rev
 ; CHECK: st1 { v{{[0-9]+}}.4h }
-    %1 = load <2 x float>, ptr %p
+    %1 = load volatile <2 x float>, ptr %p
     %2 = fadd <2 x float> %1, %1
     %3 = bitcast <2 x float> %2 to <4 x half>
     %4 = fadd <4 x half> %3, %3
@@ -602,7 +602,7 @@ define void @test_v4f16_v2i32(ptr %p, ptr %q) {
 ; CHECK: fadd
 ; CHECK-NOT: rev
 ; CHECK: st1 { v{{[0-9]+}}.4h }
-    %1 = load <2 x i32>, ptr %p
+    %1 = load volatile <2 x i32>, ptr %p
     %2 = add <2 x i32> %1, %1
     %3 = bitcast <2 x i32> %2 to <4 x half>
     %4 = fadd <4 x half> %3, %3
@@ -682,7 +682,7 @@ define void @test_v8i8_v2f32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: rev32 v{{[0-9]+}}.8b
 ; CHECK: st1 { v{{[0-9]+}}.8b }
-    %1 = load <2 x float>, ptr %p
+    %1 = load volatile <2 x float>, ptr %p
     %2 = fadd <2 x float> %1, %1
     %3 = bitcast <2 x float> %2 to <8 x i8>
     %4 = add <8 x i8> %3, %3
@@ -695,7 +695,7 @@ define void @test_v8i8_v2i32(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2s }
 ; CHECK: rev32 v{{[0-9]+}}.8b
 ; CHECK: st1 { v{{[0-9]+}}.8b }
-    %1 = load <2 x i32>, ptr %p
+    %1 = load volatile <2 x i32>, ptr %p
     %2 = add <2 x i32> %1, %1
     %3 = bitcast <2 x i32> %2 to <8 x i8>
     %4 = add <8 x i8> %3, %3
@@ -721,7 +721,7 @@ define void @test_f128_v2f64(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2d }
 ; CHECK: ext
 ; CHECK: str
-    %1 = load <2 x double>, ptr %p
+    %1 = load volatile <2 x double>, ptr %p
     %2 = fadd <2 x double> %1, %1
     %3 = bitcast <2 x double> %2 to fp128
     %4 = fadd fp128 %3, %3
@@ -734,7 +734,7 @@ define void @test_f128_v2i64(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2d }
 ; CHECK: ext
 ; CHECK: str
-    %1 = load <2 x i64>, ptr %p
+    %1 = load volatile <2 x i64>, ptr %p
     %2 = add <2 x i64> %1, %1
     %3 = bitcast <2 x i64> %2 to fp128
     %4 = fadd fp128 %3, %3
@@ -816,7 +816,7 @@ define void @test_v2f64_f128(ptr %p, ptr %q) {
 define void @test_v2f64_v2i64(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2d }
 ; CHECK: st1 { v{{[0-9]+}}.2d }
-    %1 = load <2 x i64>, ptr %p
+    %1 = load volatile <2 x i64>, ptr %p
     %2 = add <2 x i64> %1, %1
     %3 = bitcast <2 x i64> %2 to <2 x double>
     %4 = fadd <2 x double> %3, %3
@@ -895,7 +895,7 @@ define void @test_v2i64_f128(ptr %p, ptr %q) {
 define void @test_v2i64_v2f64(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2d }
 ; CHECK: st1 { v{{[0-9]+}}.2d }
-    %1 = load <2 x double>, ptr %p
+    %1 = load volatile <2 x double>, ptr %p
     %2 = fadd <2 x double> %1, %1
     %3 = bitcast <2 x double> %2 to <2 x i64>
     %4 = add <2 x i64> %3, %3
@@ -979,7 +979,7 @@ define void @test_v4f32_v2f64(ptr %p, ptr %q) {
 ; CHECK: rev64 v{{[0-9]+}}.4s
 ; CHECK-NOT: rev
 ; CHECK: st1 { v{{[0-9]+}}.4s }
-    %1 = load <2 x double>, ptr %p
+    %1 = load volatile <2 x double>, ptr %p
     %2 = fadd <2 x double> %1, %1
     %3 = bitcast <2 x double> %2 to <4 x float>
     %4 = fadd <4 x float> %3, %3
@@ -994,7 +994,7 @@ define void @test_v4f32_v2i64(ptr %p, ptr %q) {
 ; CHECK: fadd
 ; CHECK-NOT: rev
 ; CHECK: st1 { v{{[0-9]+}}.4s }
-    %1 = load <2 x i64>, ptr %p
+    %1 = load volatile <2 x i64>, ptr %p
     %2 = add <2 x i64> %1, %1
     %3 = bitcast <2 x i64> %2 to <4 x float>
     %4 = fadd <4 x float> %3, %3
@@ -1062,7 +1062,7 @@ define void @test_v4i32_v2f64(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2d }
 ; CHECK: rev64 v{{[0-9]+}}.4s
 ; CHECK: st1 { v{{[0-9]+}}.4s }
-    %1 = load <2 x double>, ptr %p
+    %1 = load volatile <2 x double>, ptr %p
     %2 = fadd <2 x double> %1, %1
     %3 = bitcast <2 x double> %2 to <4 x i32>
     %4 = add <4 x i32> %3, %3
@@ -1075,7 +1075,7 @@ define void @test_v4i32_v2i64(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2d }
 ; CHECK: rev64 v{{[0-9]+}}.4s
 ; CHECK: st1 { v{{[0-9]+}}.4s }
-    %1 = load <2 x i64>, ptr %p
+    %1 = load volatile <2 x i64>, ptr %p
     %2 = add <2 x i64> %1, %1
     %3 = bitcast <2 x i64> %2 to <4 x i32>
     %4 = add <4 x i32> %3, %3
@@ -1141,7 +1141,7 @@ define void @test_v8i16_v2f64(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2d }
 ; CHECK: rev64 v{{[0-9]+}}.8h
 ; CHECK: st1 { v{{[0-9]+}}.8h }
-    %1 = load <2 x double>, ptr %p
+    %1 = load volatile <2 x double>, ptr %p
     %2 = fadd <2 x double> %1, %1
     %3 = bitcast <2 x double> %2 to <8 x i16>
     %4 = add <8 x i16> %3, %3
@@ -1154,7 +1154,7 @@ define void @test_v8i16_v2i64(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2d }
 ; CHECK: rev64 v{{[0-9]+}}.8h
 ; CHECK: st1 { v{{[0-9]+}}.8h }
-    %1 = load <2 x i64>, ptr %p
+    %1 = load volatile <2 x i64>, ptr %p
     %2 = add <2 x i64> %1, %1
     %3 = bitcast <2 x i64> %2 to <8 x i16>
     %4 = add <8 x i16> %3, %3
@@ -1234,7 +1234,7 @@ define void @test_v16i8_v2f64(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2d }
 ; CHECK: rev64 v{{[0-9]+}}.16b
 ; CHECK: st1 { v{{[0-9]+}}.16b }
-    %1 = load <2 x double>, ptr %p
+    %1 = load volatile <2 x double>, ptr %p
     %2 = fadd <2 x double> %1, %1
     %3 = bitcast <2 x double> %2 to <16 x i8>
     %4 = add <16 x i8> %3, %3
@@ -1247,7 +1247,7 @@ define void @test_v16i8_v2i64(ptr %p, ptr %q) {
 ; CHECK: ld1 { v{{[0-9]+}}.2d }
 ; CHECK: rev64 v{{[0-9]+}}.16b
 ; CHECK: st1 { v{{[0-9]+}}.16b }
-    %1 = load <2 x i64>, ptr %p
+    %1 = load volatile <2 x i64>, ptr %p
     %2 = add <2 x i64> %1, %1
     %3 = bitcast <2 x i64> %2 to <16 x i8>
     %4 = add <16 x i8> %3, %3
@@ -1315,7 +1315,7 @@ define %struct.struct1 @test_v4f16_struct(ptr %ret) {
 entry:
 ; CHECK: ld1 { {{v[0-9]+}}.4h }
 ; CHECK-NOT: rev
-  %0 = load <4 x half>, ptr %ret, align 2
+  %0 = load volatile <4 x half>, ptr %ret, align 2
   %1 = extractelement <4 x half> %0, i32 0
   %.fca.0.insert = insertvalue %struct.struct1 undef, half %1, 0
   ret %struct.struct1 %.fca.0.insert
diff --git a/llvm/test/CodeGen/AArch64/dag-ReplaceAllUsesOfValuesWith.ll b/llvm/test/CodeGen/AArch64/dag-ReplaceAllUsesOfValuesWith.ll
index d76e817e62a495..ce657aa1f0b5bc 100644
--- a/llvm/test/CodeGen/AArch64/dag-ReplaceAllUsesOfValuesWith.ll
+++ b/llvm/test/CodeGen/AArch64/dag-ReplaceAllUsesOfValuesWith.ll
@@ -27,10 +27,7 @@
 define i64 @g(ptr %p) {
 ; CHECK-LABEL: g:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldr x8, [x0, #8]
-; CHECK-NEXT:    add x9, x8, x8
-; CHECK-NEXT:    add x8, x9, x8
-; CHECK-NEXT:    sub x0, x8, x8
+; CHECK-NEXT:    mov x0, xzr
 ; CHECK-NEXT:    ret
   %vec = load <2 x i64>, ptr %p, align 1
   %elt = extractelement <2 x i64> %vec, i32 1
diff --git a/llvm/test/CodeGen/AArch64/fcmp.ll b/llvm/test/CodeGen/AArch64/fcmp.ll
index 66f26fc9d85973..d39e537edb7861 100644
--- a/llvm/test/CodeGen/AArch64/fcmp.ll
+++ b/llvm/test/CodeGen/AArch64/fcmp.ll
@@ -679,28 +679,27 @@ define <3 x double> @v3f128_double(<3 x fp128> %a, <3 x fp128> %b, <3 x double>
 ; CHECK-SD-NEXT:    .cfi_def_cfa_offset 160
 ; CHECK-SD-NEXT:    .cfi_offset w30, -16
 ; CHECK-SD-NEXT:    stp q2, q5, [sp, #112] // 32-byte Folded Spill
+; CHECK-SD-NEXT:    add x8, sp, #176
 ; CHECK-SD-NEXT:    // kill: def $d6 killed $d6 def $q6
 ; CHECK-SD-NEXT:    // kill: def $d7 killed $d7 def $q7
-; CHECK-SD-NEXT:    ldr d5, [sp, #184]
-; CHECK-SD-NEXT:    str q3, [sp, #64] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    ldp d3, d2, [sp, #168]
+; CHECK-SD-NEXT:    str q3, [sp, #32] // 16-byte Folded Spill
+; CHECK-SD-NEXT:    ldp d3, d2, [sp, #160]
 ; CHECK-SD-NEXT:    mov v6.d[1], v7.d[0]
 ; CHECK-SD-NEXT:    str q0, [sp, #16] // 16-byte Folded Spill
 ; CHECK-SD-NEXT:    mov v0.16b, v1.16b
 ; CHECK-SD-NEXT:    mov v1.16b, v4.16b
-; CHECK-SD-NEXT:    str q5, [sp, #96] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    ldr d5, [sp, #160]
-; CHECK-SD-NEXT:    mov v3.d[1], v2.d[0]
-; CHECK-SD-NEXT:    str q5, [sp, #80] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    stp q6, q3, [sp, #32] // 32-byte Folded Spill
+; CHECK-SD-NEXT:    ld1 { v2.d }[1], [x8]
+; CHECK-SD-NEXT:    stp q6, q3, [sp, #80] // 32-byte Folded Spill
+; CHECK-SD-NEXT:    str q2, [sp, #48] // 16-byte Folded Spill
+; CHECK-SD-NEXT:    ldr d2, [sp, #184]
+; CHECK-SD-NEXT:    str q2, [sp, #64] // 16-byte Folded Spill
 ; CHECK-SD-NEXT:    bl __lttf2
 ; CHECK-SD-NEXT:    cmp w0, #0
-; CHECK-SD-NEXT:    ldr q1, [sp, #64] // 16-byte Folded Reload
 ; CHECK-SD-NEXT:    cset w8, lt
 ; CHECK-SD-NEXT:    sbfx x8, x8, #0, #1
 ; CHECK-SD-NEXT:    fmov d0, x8
 ; CHECK-SD-NEXT:    str q0, [sp] // 16-byte Folded Spill
-; CHECK-SD-NEXT:    ldr q0, [sp, #16] // 16-byte Folded Reload
+; CHECK-SD-NEXT:    ldp q0, q1, [sp, #16] // 32-byte Folded Reload
 ; CHECK-SD-NEXT:    bl __lttf2
 ; CHECK-SD-NEXT:    cmp w0, #0
 ; CHECK-SD-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
@@ -708,19 +707,19 @@ define <3 x double> @v3f128_double(<3 x fp128> %a, <3 x fp128> %b, <3 x double>
 ; CHECK-SD-NEXT:    sbfx x8, x8, #0, #1
 ; CHECK-SD-NEXT:    fmov d1, x8
 ; CHECK-SD-NEXT:    mov v1.d[1], v0.d[0]
-; CHECK-SD-NEXT:    str q1, [sp, #64] // 16-byte Folded Spill
+; CHECK-SD-NEXT:    str q1, [sp, #32] // 16-byte Folded Spill
 ; CHECK-SD-NEXT:    ldp q0, q1, [sp, #112] // 32-byte Folded Reload
 ; CHECK-SD-NEXT:    bl __lttf2
-; CHECK-SD-NEXT:    ldp q1, q0, [sp, #32] // 32-byte Folded Reload
+; CHECK-SD-NEXT:    ldp q0, q3, [sp, #80] // 32-byte Folded Reload
 ; CHECK-SD-NEXT:    cmp w0, #0
-; CHECK-SD-NEXT:    ldp q2, q4, [sp, #64] // 32-byte Folded Reload
+; CHECK-SD-NEXT:    ldp q2, q1, [sp, #32] // 32-byte Folded Reload
 ; CHECK-SD-NEXT:    cset w8, lt
 ; CHECK-SD-NEXT:    sbfx x8, x8, #0, #1
-; CHECK-SD-NEXT:    ldr q3, [sp, #96] // 16-byte Folded Reload
+; CHECK-SD-NEXT:    ldr q4, [sp, #64] // 16-byte Folded Reload
 ; CHECK-SD-NEXT:    ldr x30, [sp, #144] // 8-byte Folded Reload
-; CHECK-SD-NEXT:    bit v0.16b, v1.16b, v2.16b
+; CHECK-SD-NEXT:    bif v0.16b, v1.16b, v2.16b
 ; CHECK-SD-NEXT:    fmov d2, x8
-; CHECK-SD-NEXT:    bsl v2.16b, v4.16b, v3.16b
+; CHECK-SD-NEXT:    bsl v2.16b, v3.16b, v4.16b
 ; CHECK-SD-NEXT:    ext v1.16b, v0.16b, v0.16b, #8
 ; CHECK-SD-NEXT:    // kill: def $d0 killed $d0 killed $q0
 ; CHECK-SD-NEXT:    // kill: def $d2 killed $d2 killed $q2
@@ -815,20 +814,20 @@ define <3 x double> @v3f64_double(<3 x double> %a, <3 x double> %b, <3 x double>
 ; CHECK-SD-NEXT:    // kill: def $d1 killed $d1 def $q1
 ; CHECK-SD-NEXT:    // kill: def $d6 killed $d6 def $q6
 ; CHECK-SD-NEXT:    // kill: def $d7 killed $d7 def $q7
+; CHECK-SD-NEXT:    add x8, sp, #16
 ; CHECK-SD-NEXT:    // kill: def $d2 killed $d2 def $q2
 ; CHECK-SD-NEXT:    // kill: def $d5 killed $d5 def $q5
-; CHECK-SD-NEXT:    ldr d16, [sp, #24]
-; CHECK-SD-NEXT:    ldr d17, [sp]
 ; CHECK-SD-NEXT:    mov v3.d[1], v4.d[0]
 ; CHECK-SD-NEXT:    mov v0.d[1], v1.d[0]
 ; CHECK-SD-NEXT:    mov v6.d[1], v7.d[0]
-; CHECK-SD-NEXT:    ldp d1, d4, [sp, #8]
 ; CHECK-SD-NEXT:    fcmgt v2.2d, v5.2d, v2.2d
-; CHECK-SD-NEXT:    mov v1.d[1], v4.d[0]
 ; CHECK-SD-NEXT:    fcmgt v0.2d, v3.2d, v0.2d
-; CHECK-SD-NEXT:    bsl v2.16b, v17.16b, v16.16b
-; CHECK-SD-NEXT:    // kill: def $d2 killed $d2 killed $q2
+; CHECK-SD-NEXT:    ldp d3, d1, [sp]
+; CHECK-SD-NEXT:    ld1 { v1.d }[1], [x8]
 ; CHECK-SD-NEXT:    bsl v0.16b, v6.16b, v1.16b
+; CHECK-SD-NEXT:    ldr d1, [sp, #24]
+; CHECK-SD-NEXT:    bsl v2.16b, v3.16b, v1.16b
+; CHECK-SD-NEXT:    // kill: def $d2 killed $d2 killed $q2
 ; CHECK-SD-NEXT:    ext v1.16b, v0.16b, v0.16b, #8
 ; CHECK-SD-NEXT:    // kill: def $d0 killed $d0 killed $q0
 ; CHECK-SD-NEXT:    // kill: def $d1 killed $d1 killed $q1
diff --git a/llvm/test/CodeGen/AArch64/fmlal-loreg.ll b/llvm/test/CodeGen/AArch64/fmlal-loreg.ll
index 31ead890ba8ac7..ed22243eeef45f 100644
--- a/llvm/test/CodeGen/AArch64/fmlal-loreg.ll
+++ b/llvm/test/CodeGen/AArch64/fmlal-loreg.ll
@@ -45,11 +45,11 @@ define void @loop(ptr %out_tile, ptr %lhs_panel, ptr %rhs_panel, i32 noundef %K,
 ; CHECK-NEXT:    mov w8, w3
 ; CHECK-NEXT:  .LBB1_1: // %for.body
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    ldr q2, [x1], #2
+; CHECK-NEXT:    ldr q2, [x2], #2
 ; CHECK-NEXT:    subs x8, x8, #1
-; CHECK-NEXT:    ldr q3, [x2], #2
-; CHECK-NEXT:    fmlal v0.4s, v3.4h, v2.h[0]
-; CHECK-NEXT:    fmlal2 v1.4s, v3.4h, v2.h[0]
+; CHECK-NEXT:    ld1r { v3.8h }, [x1], #2
+; CHECK-NEXT:    fmlal v0.4s, v2.4h, v3.4h
+; CHECK-NEXT:    fmlal2 v1.4s, v2.4h, v3.4h
 ; CHECK-NEXT:    b.ne .LBB1_1
 ; CHECK-NEXT:  // %bb.2: // %for.cond.cleanup
 ; CHECK-NEXT:    stp q0, q1, [x0]
diff --git a/llvm/test/CodeGen/AArch64/icmp.ll b/llvm/test/CodeGen/AArch64/icmp.ll
index e284795760c5ca..f586647439d255 100644
--- a/llvm/test/CodeGen/AArch64/icmp.ll
+++ b/llvm/test/CodeGen/AArch64/icmp.ll
@@ -1123,30 +1123,29 @@ entry:
 define <3 x i64> @v3i64_i64(<3 x i64> %a, <3 x i64> %b, <3 x i64> %d, <3 x i64> %e) {
 ; CHECK-SD-LABEL: v3i64_i64:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    // kill: def $d4 killed $d4 def $q4
 ; CHECK-SD-NEXT:    // kill: def $d3 killed $d3 def $q3
-; CHECK-SD-NEXT:    // kill: def $d1 killed $d1 def $q1
 ; CHECK-SD-NEXT:    // kill: def $d0 killed $d0 def $q0
+; CHECK-SD-NEXT:    // kill: def $d4 killed $d4 def $q4
+; CHECK-SD-NEXT:    // kill: def $d1 killed $d1 def $q1
 ; CHECK-SD-NEXT:    // kill: def $d6 killed $d6 def $q6
 ; CHECK-SD-NEXT:    // kill: def $d7 killed $d7 def $q7
+; CHECK-SD-NEXT:    add x8, sp, #16
 ; CHECK-SD-NEXT:    // kill: def $d5 killed $d5 def $q5
 ; CHECK-SD-NEXT:    // kill: def $d2 killed $d2 def $q2
-; CHECK-SD-NEXT:    ldr d16, [sp, #24]
-; CHECK-SD-NEXT:    ldr d17, [sp]
 ; CHECK-SD-NEXT:    mov v3.d[1], v4.d[0]
 ; CHECK-SD-NEXT:    mov v0.d[1], v1.d[0]
 ; CHECK-SD-NEXT:    mov v6.d[1], v7.d[0]
-; CHECK-SD-NEXT:    ldp d1, d4, [sp, #8]
-; CHECK-SD-NEXT:    mov v1.d[1], v4.d[0]
+; CHECK-SD-NEXT:    ldp d4, d1, [sp]
+; CHECK-SD-NEXT:    ld1 { v1.d }[1], [x8]
 ; CHECK-SD-NEXT:    cmgt v0.2d, v3.2d, v0.2d
 ; CHECK-SD-NEXT:    bsl v0.16b, v6.16b, v1.16b
 ; CHECK-SD-NEXT:    cmgt v1.2d, v5.2d, v2.2d
-; CHECK-SD-NEXT:    mov v2.16b, v1.16b
+; CHECK-SD-NEXT:    ldr d2, [sp, #24]
+; CHECK-SD-NEXT:    bit v2.16b, v4.16b, v1.16b
 ; CHECK-SD-NEXT:    ext v1.16b, v0.16b, v0.16b, #8
 ; CHECK-SD-NEXT:    // kill: def $d0 killed $d0 killed $q0
-; CHECK-SD-NEXT:    // kill: def $d1 killed $d1 killed $q1
-; CHECK-SD-NEXT:    bsl v2.16b, v17.16b, v16.16b
 ; CHECK-SD-NEXT:    // kill: def $d2 killed $d2 killed $q2
+; CHECK-SD-NEXT:    // kill: def $d1 killed $d1 killed $q1
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: v3i64_i64:
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-vector-elt.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-vector-elt.ll
index ad4efeaf39247a..1e6427c4cd4956 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-vector-elt.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-vector-elt.ll
@@ -33,10 +33,7 @@ define half @extractelement_v8f16(<8 x half> %op1) vscale_range(2,0) #0 {
 define half @extractelement_v16f16(ptr %a) vscale_range(2,0) #0 {
 ; CHECK-LABEL: extractelement_v16f16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.h, vl16
-; CHECK-NEXT:    ld1h { z0.h }, p0/z, [x0]
-; CHECK-NEXT:    mov z0.h, z0.h[15]
-; CHECK-NEXT:    // kill: def $h0 killed $h0 killed $z0
+; CHECK-NEXT:    ldr h0, [x0, #30]
 ; CHECK-NEXT:    ret
     %op1 = load <16 x half>, ptr %a
  ...
[truncated]

RKSimon (Collaborator) commented Jan 13, 2025

X86TargetLowering::shouldReduceLoadWidth is a mess, resulting in a lot of duplicate aliased loads that make very little sense - we're seeing something similar on #122485, but it might take some time to unravel.
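For context on the hook being blamed here: shouldReduceLoadWidth is the TargetLowering callback DAGCombiner consults before replacing a wide (typically vector) load with a narrower one. The sketch below is a hypothetical target override, not the actual X86 implementation; it assumes the hook signature as it stood around the time of this PR, and the class name and heuristic are illustrative only.

```cpp
#include "llvm/CodeGen/SelectionDAGNodes.h"
#include "llvm/CodeGen/TargetLowering.h"

using namespace llvm;

// Hypothetical target override, for illustration only; the real
// X86TargetLowering::shouldReduceLoadWidth has far more subtarget- and
// type-specific logic than this.
class ExampleTargetLowering : public TargetLowering {
public:
  using TargetLowering::TargetLowering;

  bool shouldReduceLoadWidth(SDNode *N, ISD::LoadExtType ExtTy,
                             EVT NewVT) const override {
    // If other nodes still consume the wide loaded value, narrowing here
    // just introduces a second, aliased load of the same address.
    if (auto *Ld = dyn_cast<LoadSDNode>(N))
      if (!Ld->hasNUsesOfValue(1, /*Value=*/0))
        return false;
    // Otherwise defer to the generic default.
    return TargetLowering::shouldReduceLoadWidth(N, ExtTy, NewVT);
  }
};
```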

topperc added a commit to topperc/llvm-project that referenced this pull request Jan 13, 2025
These test cases weren't trying to test load+extract. I believe
they only used loads because fixed vector arguments weren't supported
when they were written or they were copied from the structure of
other tests that pre-date fixed vector argument support.

Reduces diff from llvm#122671.
; CHECK-NEXT: vslidedown.vi v8, v8, 2
; CHECK-NEXT: vmv.x.s a0, v8
; CHECK-NEXT: ret
; RV32-LABEL: extractelt_v4i32:
Collaborator commented:

I think most of these tests were only using loads because fixed vector arguments weren't supported when the test was written.

; CHECK-NEXT: vle32.v v8, (a0)
; CHECK-NEXT: vmv.x.s a0, v8
; CHECK-NEXT: ret
; RV32-LABEL: vreduce_add_v1i32:
Collaborator commented:

Same story as fixed-vectors-extract.ll. This test wasn't interested in loads.

; CHECK-NEXT: vsetivli zero, 1, e16, mf4, ta, ma
; CHECK-NEXT: vle16.v v8, (a0)
; CHECK-NEXT: vfmv.f.s fa5, v8
; CHECK-NEXT: flh fa5, 0(a0)
Collaborator commented:

Same story as fixed-vectors-extract.ll. This test wasn't interested in loads.

topperc added a commit to topperc/llvm-project that referenced this pull request Jan 13, 2025
…. NFC

These tests weren't interested in the loads. Removing them reduces the
diffs from llvm#122671.
topperc added a commit to topperc/llvm-project that referenced this pull request Jan 13, 2025
These test cases weren't trying to test load+extract. I believe
they only used loads because fixed vector arguments weren't supported
when they were written or they were copied from the structure of
other tests that pre-date fixed vector argument support.

Reduces diff from llvm#122671.
topperc added a commit that referenced this pull request Jan 14, 2025
These test cases weren't trying to test load+extract. I believe they
only used loads because fixed vector arguments weren't supported when
they were written or they were copied from the structure of other
tests that pre-date fixed vector argument support.

Reduces diff from #122671.
topperc added a commit that referenced this pull request Jan 14, 2025
…. NFC (#122808)

These tests weren't interested in the loads. Removing them reduces the
diffs from #122671.
arsenm force-pushed the users/arsenm/dag-move-scalarize-extracted-vector-load-to-tli branch from 9bedb14 to 65e9c1b on January 29, 2025 16:33
arsenm force-pushed the users/arsenm/dag/simplify-demanded-vector-elts-load branch from 68ca84a to 8f15ec9 on January 29, 2025 16:33
arsenm force-pushed the users/arsenm/dag/simplify-demanded-vector-elts-load branch from 8f15ec9 to b745947 on February 4, 2025 09:32
arsenm force-pushed the users/arsenm/dag-move-scalarize-extracted-vector-load-to-tli branch from 65e9c1b to b7d320b on February 4, 2025 09:32
Base automatically changed from users/arsenm/dag-move-scalarize-extracted-vector-load-to-tli to main on February 4, 2025 10:37
RKSimon (Collaborator) commented Feb 4, 2025

@arsenm Please can you rebase this and then I'll see what I can do to help with the x86 regressions

This improves some AMDGPU cases and avoids future regressions.
The combiner likes to form shuffles for cases where an extract_vector_elt
would do perfectly well, and this recovers some of the regressions from
losing load narrowing.

AMDGPU, AArch64 and RISCV test changes look broadly better. Other targets have
some improvements, but mostly regressions. In particular X86 looks much
worse. I'm guessing this is because its shouldReduceLoadWidth implementation is wrong.

I mostly just regenerated the checks. I assume some set of them should
switch to use volatile loads to defeat the optimization.
arsenm force-pushed the users/arsenm/dag/simplify-demanded-vector-elts-load branch from b745947 to 1cae79c on February 6, 2025 16:45
RKSimon (Collaborator) commented Feb 12, 2025

I'm still looking at the x86 mess - but something I've hit is that the hasOneUse() checks on the shouldReduceLoadWidth callback are often getting confused by extra uses of the load node's chain - is there anything we can do to clean that up? (See also #126764)

arsenm (Contributor, Author) commented Feb 12, 2025

I'm still looking at the x86 mess - but something I've hit is that the hasOneUse() checks on the shouldReduceLoadWidth callback are often getting confused by extra uses of the load node's chain - is there anything we can do to clean that up? (See also #126764)

SDNode::hasOneUse checks are rarely the correct thing over SDValue::hasOneUse. We should probably rename the SDNode version; it's too easy to mix up N->hasOneUse vs. N.hasOneUse
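A minimal sketch of the distinction being made here, using the existing SelectionDAG use-count queries (the helper function name is made up for illustration):

```cpp
#include "llvm/CodeGen/SelectionDAGNodes.h"

using namespace llvm;

// Illustrative helper: three ways to ask "does this load have one use?",
// which answer two different questions.
static void loadUseCountQueries(LoadSDNode *Ld) {
  // SDNode-level: counts users of *any* result of the node, so a user of
  // the chain (result 1) makes this false even when the loaded value
  // itself has exactly one user.
  bool NodeHasOneUse = Ld->hasOneUse();

  // SDValue-level: counts users of result 0 only, i.e. the loaded value.
  bool ValueHasOneUse = SDValue(Ld, 0).hasOneUse();

  // The same value-level question, spelled through the node API.
  bool ValueHasOneUseAlt = Ld->hasNUsesOfValue(1, /*Value=*/0);

  (void)NodeHasOneUse;
  (void)ValueHasOneUse;
  (void)ValueHasOneUseAlt;
}
```

The #128167 change referenced below is essentially switching the guard in shouldReduceLoadWidth from the first, node-level form to one of the value-level forms.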

RKSimon added a commit to RKSimon/llvm-project that referenced this pull request Feb 21, 2025
…ded value, not the chain etc.

The hasOneUse check was failing in any case where the load was part of a chain - we should only be checking if the loaded value has one use, and any updates to the chain should be handled by the fold calling shouldReduceLoadWidth.

I've updated the x86 implementation to match, although it has no effect here yet (I'm still looking at how to improve the x86 implementation) as the inner for loop was discarding chain uses anyway.

By using hasNUsesOfValue instead, this patch exposes a missing dependency on the LLVMSelectionDAG library in a lot of tools + unittests; we can either update the CMakeLists.txt dependencies or make SDNode::hasNUsesOfValue inline - no strong opinions on this tbh.

Noticed while fighting the x86 regressions in llvm#122671
RKSimon added a commit to RKSimon/llvm-project that referenced this pull request Feb 24, 2025
…ded value, not the chain etc.

The hasOneUse check was failing in any case where the load was part of a chain - we should only be checking if the loaded value has one use, and any updates to the chain should be handled by the fold calling shouldReduceLoadWidth.

I've updated the x86 implementation to match, although it has no effect here yet (I'm still looking at how to improve the x86 implementation) as the inner for loop was discarding chain uses anyway.

By using SDValue::hasOneUse instead this patch exposes a missing dependency on the LLVMSelectionDAG library in a lot of tools + unittests, which resulted in having to make SDNode::hasNUsesOfValue inline.

Noticed while fighting the x86 regressions in llvm#122671
RKSimon added a commit that referenced this pull request Feb 24, 2025
…value - not the chain (#128167)

The hasOneUse check was failing in any case where the load was part of a chain - we should only be checking if the loaded value has one use, and any updates to the chain should be handled by the fold calling shouldReduceLoadWidth.

I've updated the x86 implementation to match, although it has no effect here yet (I'm still looking at how to improve the x86 implementation) as the inner for loop was discarding chain uses anyway.

By using SDValue::hasOneUse instead this patch exposes a missing dependency on the LLVMSelectionDAG library in a lot of tools + unittests, which resulted in having to make SDNode::hasNUsesOfValue inline.

Noticed while fighting the x86 regressions in #122671
RKSimon added a commit to RKSimon/llvm-project that referenced this pull request Feb 26, 2025
…st(p0) if either load is oneuse

This fold is currently limited to cases where the load_subv(p0) has oneuse, but it's beneficial if either load has oneuse and will be replaced.

Yet another yak shave for llvm#122671
RKSimon added a commit that referenced this pull request Feb 27, 2025
…st(p0) if either load is oneuse (#128857)

This fold is currently limited to cases where the load_subv(p0) has oneuse, but it's beneficial if either load has oneuse and will be replaced.

Yet another yak shave for #122671
joaosaffran pushed a commit to joaosaffran/llvm-project that referenced this pull request Mar 3, 2025
…st(p0) if either load is oneuse (llvm#128857)

This fold is currently limited to cases where the load_subv(p0) has oneuse, but it's beneficial if either load has oneuse and will be replaced.

Yet another yak shave for llvm#122671