[TTI][WebAssembly] Pairwise reduction expansion #93948

sparker-arm · 2024-05-31T10:48:34Z

WebAssembly doesn't support horizontal operations nor does it have a way of expressing fast-math or reassoc flags, so runtimes are currently unable to use pairwise operations when generating code from the existing shuffle patterns.

This patch allows the backend to select which, arbitary, shuffle pattern to be used per reduction intrinsic. The default behaviour is the same as the existing, which is by splitting the vector into a top and bottom half. The other pattern introduced is for a pairwise shuffle.

WebAssembly enables pairwise reductions for int/fp add/sub.

llvmbot · 2024-05-31T10:49:01Z

@llvm/pr-subscribers-backend-webassembly
@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-llvm-transforms

Author: Sam Parker (sparker-arm)

Changes

WebAssembly doesn't support horizontal operations nor does it have a way of expressing fast-math or reassoc flags, so runtimes are currently unable to use pairwise operations when generating code from the existing shuffle patterns.

This patch allows the backend to select which, arbitary, shuffle pattern to be used per reduction intrinsic. The default behaviour is the same as the existing, which is by splitting the vector into a top and bottom half. The other pattern introduced is for a pairwise shuffle.

WebAssembly enables pairwise reductions for int/fp add/sub.

Full diff: https://github.com/llvm/llvm-project/pull/93948.diff

9 Files Affected:

(modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+17)
(modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+5)
(modified) llvm/include/llvm/Transforms/Utils/LoopUtils.h (+2)
(modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+6)
(modified) llvm/lib/CodeGen/ExpandReductions.cpp (+6-4)
(modified) llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.cpp (+13)
(modified) llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.h (+2)
(modified) llvm/lib/Transforms/Utils/LoopUtils.cpp (+29-12)
(added) llvm/test/CodeGen/WebAssembly/vector-reduce.ll (+130)

diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index cefce93f9e25c..00658428e96d1 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1696,6 +1696,16 @@ class TargetTransformInfo {
   /// into a shuffle sequence.
   bool shouldExpandReduction(const IntrinsicInst *II) const;
 
+  enum struct ReductionShuffle {
+    SplitHalf,
+    Pairwise
+  };
+
+  /// \returns The shuffle sequence pattern used to expand the given reduction
+  /// intrinsic.
+  ReductionShuffle getPreferredExpandedReductionShuffle(
+      const IntrinsicInst *II) const;
+
   /// \returns the size cost of rematerializing a GlobalValue address relative
   /// to a stack reload.
   unsigned getGISelRematGlobalCost() const;
@@ -2145,6 +2155,8 @@ class TargetTransformInfo::Concept {
   virtual bool preferEpilogueVectorization() const = 0;
 
   virtual bool shouldExpandReduction(const IntrinsicInst *II) const = 0;
+  virtual ReductionShuffle
+    getPreferredExpandedReductionShuffle(const IntrinsicInst *II) const = 0;
   virtual unsigned getGISelRematGlobalCost() const = 0;
   virtual unsigned getMinTripCountTailFoldingThreshold() const = 0;
   virtual bool enableScalableVectorization() const = 0;
@@ -2881,6 +2893,11 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
     return Impl.shouldExpandReduction(II);
   }
 
+  ReductionShuffle
+  getPreferredExpandedReductionShuffle(const IntrinsicInst *II) const override {
+    return Impl.getPreferredExpandedReductionShuffle(II);
+  }
+
   unsigned getGISelRematGlobalCost() const override {
     return Impl.getGISelRematGlobalCost();
   }
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 9a57331d281db..ab83f7b88b6b1 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -927,6 +927,11 @@ class TargetTransformInfoImplBase {
 
   bool shouldExpandReduction(const IntrinsicInst *II) const { return true; }
 
+  TTI::ReductionShuffle
+  getPreferredExpandedReductionShuffle(const IntrinsicInst *II) const {
+      return TTI::ReductionShuffle::SplitHalf;
+  }
+
   unsigned getGISelRematGlobalCost() const { return 1; }
 
   unsigned getMinTripCountTailFoldingThreshold() const { return 0; }
diff --git a/llvm/include/llvm/Transforms/Utils/LoopUtils.h b/llvm/include/llvm/Transforms/Utils/LoopUtils.h
index 345e09dce0b2b..9b26ad0b2fc8c 100644
--- a/llvm/include/llvm/Transforms/Utils/LoopUtils.h
+++ b/llvm/include/llvm/Transforms/Utils/LoopUtils.h
@@ -15,6 +15,7 @@
 
 #include "llvm/Analysis/IVDescriptors.h"
 #include "llvm/Analysis/LoopAccessAnalysis.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
 #include "llvm/Transforms/Utils/ValueMapper.h"
 
 namespace llvm {
@@ -384,6 +385,7 @@ Value *getOrderedReduction(IRBuilderBase &Builder, Value *Acc, Value *Src,
 /// Generates a vector reduction using shufflevectors to reduce the value.
 /// Fast-math-flags are propagated using the IRBuilder's setting.
 Value *getShuffleReduction(IRBuilderBase &Builder, Value *Src, unsigned Op,
+                           TargetTransformInfo::ReductionShuffle RS,
                            RecurKind MinMaxKind = RecurKind::None);
 
 /// Create a target reduction of the given vector. The reduction operation
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 82b6d7e7c4833..7d37b222d9f00 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1309,6 +1309,12 @@ bool TargetTransformInfo::shouldExpandReduction(const IntrinsicInst *II) const {
   return TTIImpl->shouldExpandReduction(II);
 }
 
+TargetTransformInfo::ReductionShuffle
+TargetTransformInfo::getPreferredExpandedReductionShuffle(
+    const IntrinsicInst *II) const {
+  return TTIImpl->getPreferredExpandedReductionShuffle(II);
+}
+
 unsigned TargetTransformInfo::getGISelRematGlobalCost() const {
   return TTIImpl->getGISelRematGlobalCost();
 }
diff --git a/llvm/lib/CodeGen/ExpandReductions.cpp b/llvm/lib/CodeGen/ExpandReductions.cpp
index 0b1504e51b1bb..cb7e73b04f812 100644
--- a/llvm/lib/CodeGen/ExpandReductions.cpp
+++ b/llvm/lib/CodeGen/ExpandReductions.cpp
@@ -59,6 +59,8 @@ bool expandReductions(Function &F, const TargetTransformInfo *TTI) {
         isa<FPMathOperator>(II) ? II->getFastMathFlags() : FastMathFlags{};
     Intrinsic::ID ID = II->getIntrinsicID();
     RecurKind RK = getMinMaxReductionRecurKind(ID);
+    TargetTransformInfo::ReductionShuffle RS =
+      TTI->getPreferredExpandedReductionShuffle(II);
 
     Value *Rdx = nullptr;
     IRBuilder<> Builder(II);
@@ -79,7 +81,7 @@ bool expandReductions(Function &F, const TargetTransformInfo *TTI) {
         if (!isPowerOf2_32(
                 cast<FixedVectorType>(Vec->getType())->getNumElements()))
           continue;
-        Rdx = getShuffleReduction(Builder, Vec, RdxOpcode, RK);
+        Rdx = getShuffleReduction(Builder, Vec, RdxOpcode, RS, RK);
         Rdx = Builder.CreateBinOp((Instruction::BinaryOps)RdxOpcode, Acc, Rdx,
                                   "bin.rdx");
       }
@@ -112,7 +114,7 @@ bool expandReductions(Function &F, const TargetTransformInfo *TTI) {
         break;
       }
       unsigned RdxOpcode = getArithmeticReductionInstruction(ID);
-      Rdx = getShuffleReduction(Builder, Vec, RdxOpcode, RK);
+      Rdx = getShuffleReduction(Builder, Vec, RdxOpcode, RS, RK);
       break;
     }
     case Intrinsic::vector_reduce_add:
@@ -127,7 +129,7 @@ bool expandReductions(Function &F, const TargetTransformInfo *TTI) {
               cast<FixedVectorType>(Vec->getType())->getNumElements()))
         continue;
       unsigned RdxOpcode = getArithmeticReductionInstruction(ID);
-      Rdx = getShuffleReduction(Builder, Vec, RdxOpcode, RK);
+      Rdx = getShuffleReduction(Builder, Vec, RdxOpcode, RS, RK);
       break;
     }
     case Intrinsic::vector_reduce_fmax:
@@ -140,7 +142,7 @@ bool expandReductions(Function &F, const TargetTransformInfo *TTI) {
           !FMF.noNaNs())
         continue;
       unsigned RdxOpcode = getArithmeticReductionInstruction(ID);
-      Rdx = getShuffleReduction(Builder, Vec, RdxOpcode, RK);
+      Rdx = getShuffleReduction(Builder, Vec, RdxOpcode, RS, RK);
       break;
     }
     }
diff --git a/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.cpp b/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.cpp
index 9a434d9b1db54..a286dfc9d4b62 100644
--- a/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.cpp
+++ b/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.cpp
@@ -94,6 +94,19 @@ WebAssemblyTTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
   return Cost;
 }
 
+TTI::ReductionShuffle WebAssemblyTTIImpl::getPreferredExpandedReductionShuffle(
+    const IntrinsicInst *II) const {
+
+  switch (II->getIntrinsicID()) {
+  default:
+    break;
+  case Intrinsic::vector_reduce_add:
+  case Intrinsic::vector_reduce_fadd:
+    return TTI::ReductionShuffle::Pairwise;
+  }
+  return TTI::ReductionShuffle::SplitHalf;
+}
+
 bool WebAssemblyTTIImpl::areInlineCompatible(const Function *Caller,
                                              const Function *Callee) const {
   // Allow inlining only when the Callee has a subset of the Caller's
diff --git a/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.h b/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.h
index 801f905d377ed..32e9af00f35ce 100644
--- a/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.h
+++ b/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.h
@@ -70,6 +70,8 @@ class WebAssemblyTTIImpl final : public BasicTTIImplBase<WebAssemblyTTIImpl> {
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index, Value *Op0, Value *Op1);
 
+  TTI::ReductionShuffle
+  getPreferredExpandedReductionShuffle(const IntrinsicInst *II) const;
   /// @}
 
   bool areInlineCompatible(const Function *Caller,
diff --git a/llvm/lib/Transforms/Utils/LoopUtils.cpp b/llvm/lib/Transforms/Utils/LoopUtils.cpp
index cc883a7dc2927..03e687679eb91 100644
--- a/llvm/lib/Transforms/Utils/LoopUtils.cpp
+++ b/llvm/lib/Transforms/Utils/LoopUtils.cpp
@@ -1077,7 +1077,9 @@ Value *llvm::getOrderedReduction(IRBuilderBase &Builder, Value *Acc, Value *Src,
 
 // Helper to generate a log2 shuffle reduction.
 Value *llvm::getShuffleReduction(IRBuilderBase &Builder, Value *Src,
-                                 unsigned Op, RecurKind RdxKind) {
+                                 unsigned Op,
+                                 TargetTransformInfo::ReductionShuffle RS,
+                                 RecurKind RdxKind) {
   unsigned VF = cast<FixedVectorType>(Src->getType())->getNumElements();
   // VF is a power of 2 so we can emit the reduction using log2(VF) shuffles
   // and vector ops, reducing the set of values being computed by half each
@@ -1091,18 +1093,9 @@ Value *llvm::getShuffleReduction(IRBuilderBase &Builder, Value *Src,
   // will never be relevant here.  Note that it would be generally unsound to
   // propagate these from an intrinsic call to the expansion anyways as we/
   // change the order of operations.
-  Value *TmpVec = Src;
-  SmallVector<int, 32> ShuffleMask(VF);
-  for (unsigned i = VF; i != 1; i >>= 1) {
-    // Move the upper half of the vector to the lower half.
-    for (unsigned j = 0; j != i / 2; ++j)
-      ShuffleMask[j] = i / 2 + j;
-
-    // Fill the rest of the mask with undef.
-    std::fill(&ShuffleMask[i / 2], ShuffleMask.end(), -1);
-
+  auto BuildShuffledOp = [&Builder, &Op, &RdxKind](
+      SmallVectorImpl<int> &ShuffleMask, Value*& TmpVec) -> void {
     Value *Shuf = Builder.CreateShuffleVector(TmpVec, ShuffleMask, "rdx.shuf");
-
     if (Op != Instruction::ICmp && Op != Instruction::FCmp) {
       TmpVec = Builder.CreateBinOp((Instruction::BinaryOps)Op, TmpVec, Shuf,
                                    "bin.rdx");
@@ -1111,6 +1104,30 @@ Value *llvm::getShuffleReduction(IRBuilderBase &Builder, Value *Src,
              "Invalid min/max");
       TmpVec = createMinMaxOp(Builder, RdxKind, TmpVec, Shuf);
     }
+  };
+
+  Value *TmpVec = Src;
+  if (TargetTransformInfo::ReductionShuffle::Pairwise == RS) {
+    SmallVector<int, 32> ShuffleMask(VF);
+    for (unsigned stride = 1; stride < VF; stride <<= 1) {
+      // Initialise the mask with undef.
+      std::fill(ShuffleMask.begin(), ShuffleMask.end(), -1);
+      for (unsigned j = 0; j < VF; j += stride << 1) {
+        ShuffleMask[j] = j + stride;
+      }
+      BuildShuffledOp(ShuffleMask, TmpVec);
+    }
+  } else {
+    SmallVector<int, 32> ShuffleMask(VF);
+    for (unsigned i = VF; i != 1; i >>= 1) {
+      // Move the upper half of the vector to the lower half.
+      for (unsigned j = 0; j != i / 2; ++j)
+        ShuffleMask[j] = i / 2 + j;
+
+      // Fill the rest of the mask with undef.
+      std::fill(&ShuffleMask[i / 2], ShuffleMask.end(), -1);
+      BuildShuffledOp(ShuffleMask, TmpVec);
+    }
   }
   // The result is in the first element of the vector.
   return Builder.CreateExtractElement(TmpVec, Builder.getInt32(0));
diff --git a/llvm/test/CodeGen/WebAssembly/vector-reduce.ll b/llvm/test/CodeGen/WebAssembly/vector-reduce.ll
new file mode 100644
index 0000000000000..620e11cda7792
--- /dev/null
+++ b/llvm/test/CodeGen/WebAssembly/vector-reduce.ll
@@ -0,0 +1,130 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc < %s -mtriple=wasm32 -verify-machineinstrs -disable-wasm-fallthrough-return-opt -wasm-disable-explicit-locals -wasm-keep-registers -mattr=+simd128 | FileCheck %s --check-prefix=SIMD128
+
+define i64 @pairwise_add_v2i64(<2 x i64> %arg) {
+; SIMD128-LABEL: pairwise_add_v2i64:
+; SIMD128:         .functype pairwise_add_v2i64 (v128) -> (i64)
+; SIMD128-NEXT:  # %bb.0:
+; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7
+; SIMD128-NEXT:    i64x2.add $push1=, $0, $pop0
+; SIMD128-NEXT:    i64x2.extract_lane $push2=, $pop1, 0
+; SIMD128-NEXT:    return $pop2
+  %res = tail call i64 @llvm.vector.reduce.add.i64.v4i64(<2 x i64> %arg)
+  ret i64 %res
+}
+
+define i32 @pairwise_add_v4i32(<4 x i32> %arg) {
+; SIMD128-LABEL: pairwise_add_v4i32:
+; SIMD128:         .functype pairwise_add_v4i32 (v128) -> (i32)
+; SIMD128-NEXT:  # %bb.0:
+; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 1, 2, 3, 12, 13, 14, 15, 0, 1, 2, 3
+; SIMD128-NEXT:    i32x4.add $push5=, $0, $pop0
+; SIMD128-NEXT:    local.tee $push4=, $0=, $pop5
+; SIMD128-NEXT:    i8x16.shuffle $push1=, $0, $0, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3
+; SIMD128-NEXT:    i32x4.add $push2=, $pop4, $pop1
+; SIMD128-NEXT:    i32x4.extract_lane $push3=, $pop2, 0
+; SIMD128-NEXT:    return $pop3
+  %res = tail call i32 @llvm.vector.reduce.add.i32.v4f32(<4 x i32> %arg)
+  ret i32 %res
+}
+
+define i16 @pairwise_add_v8i16(<8 x i16> %arg) {
+; SIMD128-LABEL: pairwise_add_v8i16:
+; SIMD128:         .functype pairwise_add_v8i16 (v128) -> (i32)
+; SIMD128-NEXT:  # %bb.0:
+; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 2, 3, 0, 1, 6, 7, 0, 1, 10, 11, 0, 1, 14, 15, 0, 1
+; SIMD128-NEXT:    i16x8.add $push8=, $0, $pop0
+; SIMD128-NEXT:    local.tee $push7=, $0=, $pop8
+; SIMD128-NEXT:    i8x16.shuffle $push1=, $0, $0, 4, 5, 0, 1, 0, 1, 0, 1, 12, 13, 0, 1, 0, 1, 0, 1
+; SIMD128-NEXT:    i16x8.add $push6=, $pop7, $pop1
+; SIMD128-NEXT:    local.tee $push5=, $0=, $pop6
+; SIMD128-NEXT:    i8x16.shuffle $push2=, $0, $0, 8, 9, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; SIMD128-NEXT:    i16x8.add $push3=, $pop5, $pop2
+; SIMD128-NEXT:    i16x8.extract_lane_u $push4=, $pop3, 0
+; SIMD128-NEXT:    return $pop4
+  %res = tail call i16 @llvm.vector.reduce.add.i16.v8i16(<8 x i16> %arg)
+  ret i16 %res
+}
+
+define i8 @pairwise_add_v16i8(<16 x i8> %arg) {
+; SIMD128-LABEL: pairwise_add_v16i8:
+; SIMD128:         .functype pairwise_add_v16i8 (v128) -> (i32)
+; SIMD128-NEXT:  # %bb.0:
+; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 1, 0, 3, 0, 5, 0, 7, 0, 9, 0, 11, 0, 13, 0, 15, 0
+; SIMD128-NEXT:    i8x16.add $push11=, $0, $pop0
+; SIMD128-NEXT:    local.tee $push10=, $0=, $pop11
+; SIMD128-NEXT:    i8x16.shuffle $push1=, $0, $0, 2, 0, 0, 0, 6, 0, 0, 0, 10, 0, 0, 0, 14, 0, 0, 0
+; SIMD128-NEXT:    i8x16.add $push9=, $pop10, $pop1
+; SIMD128-NEXT:    local.tee $push8=, $0=, $pop9
+; SIMD128-NEXT:    i8x16.shuffle $push2=, $0, $0, 4, 0, 0, 0, 0, 0, 0, 0, 12, 0, 0, 0, 0, 0, 0, 0
+; SIMD128-NEXT:    i8x16.add $push7=, $pop8, $pop2
+; SIMD128-NEXT:    local.tee $push6=, $0=, $pop7
+; SIMD128-NEXT:    i8x16.shuffle $push3=, $0, $0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+; SIMD128-NEXT:    i8x16.add $push4=, $pop6, $pop3
+; SIMD128-NEXT:    i8x16.extract_lane_u $push5=, $pop4, 0
+; SIMD128-NEXT:    return $pop5
+  %res = tail call i8 @llvm.vector.reduce.add.i8.v16i8(<16 x i8> %arg)
+  ret i8 %res
+}
+
+define double @pairwise_add_v2f64(<2 x double> %arg) {
+; SIMD128-LABEL: pairwise_add_v2f64:
+; SIMD128:         .functype pairwise_add_v2f64 (v128) -> (f64)
+; SIMD128-NEXT:  # %bb.0:
+; SIMD128-NEXT:    f64x2.extract_lane $push1=, $0, 0
+; SIMD128-NEXT:    f64x2.extract_lane $push0=, $0, 1
+; SIMD128-NEXT:    f64.add $push2=, $pop1, $pop0
+; SIMD128-NEXT:    return $pop2
+  %res = tail call double @llvm.vector.reduce.fadd.f64.v2f64(double -0.0, <2 x double> %arg)
+  ret double%res
+}
+
+define double @pairwise_add_v2f64_fast(<2 x double> %arg) {
+; SIMD128-LABEL: pairwise_add_v2f64_fast:
+; SIMD128:         .functype pairwise_add_v2f64_fast (v128) -> (f64)
+; SIMD128-NEXT:  # %bb.0:
+; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7
+; SIMD128-NEXT:    f64x2.add $push1=, $0, $pop0
+; SIMD128-NEXT:    f64x2.extract_lane $push2=, $pop1, 0
+; SIMD128-NEXT:    return $pop2
+  %res = tail call fast double @llvm.vector.reduce.fadd.f64.v2f64(double -0.0, <2 x double> %arg)
+  ret double%res
+}
+
+define float @pairwise_add_v4f32(<4 x float> %arg) {
+; SIMD128-LABEL: pairwise_add_v4f32:
+; SIMD128:         .functype pairwise_add_v4f32 (v128) -> (f32)
+; SIMD128-NEXT:  # %bb.0:
+; SIMD128-NEXT:    f32x4.extract_lane $push1=, $0, 0
+; SIMD128-NEXT:    f32x4.extract_lane $push0=, $0, 1
+; SIMD128-NEXT:    f32.add $push2=, $pop1, $pop0
+; SIMD128-NEXT:    f32x4.extract_lane $push3=, $0, 2
+; SIMD128-NEXT:    f32.add $push4=, $pop2, $pop3
+; SIMD128-NEXT:    f32x4.extract_lane $push5=, $0, 3
+; SIMD128-NEXT:    f32.add $push6=, $pop4, $pop5
+; SIMD128-NEXT:    return $pop6
+  %res = tail call float @llvm.vector.reduce.fadd.f32.v4f32(float -0.0, <4 x float> %arg)
+  ret float %res
+}
+
+define float @pairwise_add_v4f32_fast(<4 x float> %arg) {
+; SIMD128-LABEL: pairwise_add_v4f32_fast:
+; SIMD128:         .functype pairwise_add_v4f32_fast (v128) -> (f32)
+; SIMD128-NEXT:  # %bb.0:
+; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 1, 2, 3, 12, 13, 14, 15, 0, 1, 2, 3
+; SIMD128-NEXT:    f32x4.add $push5=, $0, $pop0
+; SIMD128-NEXT:    local.tee $push4=, $0=, $pop5
+; SIMD128-NEXT:    i8x16.shuffle $push1=, $0, $0, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3
+; SIMD128-NEXT:    f32x4.add $push2=, $pop4, $pop1
+; SIMD128-NEXT:    f32x4.extract_lane $push3=, $pop2, 0
+; SIMD128-NEXT:    return $pop3
+  %res = tail call fast float @llvm.vector.reduce.fadd.f32.v4f32(float -0.0, <4 x float> %arg)
+  ret float %res
+}
+
+declare i64 @llvm.vector.reduce.add.i64.v4i64(<2 x i64>)
+declare i32 @llvm.vector.reduce.add.i32.v4i32(<4 x i32>)
+declare i16 @llvm.vector.reduce.add.i16.v8i16(<8 x i16>)
+declare i8 @llvm.vector.reduce.add.i8.v16i8(<16 x i8>)
+declare double @llvm.vector.reduce.fadd.f64.v2f64(double, <2 x double>)
+declare float @llvm.vector.reduce.fadd.f32.v4f32(float, <4 x float>)

github-actions · 2024-05-31T10:51:35Z

✅ With the latest revision this PR passed the C/C++ code formatter.

ppenzin · 2024-06-04T17:12:05Z

Is there a runtime that is expected to consume this already and how are you planning to disseminate this knowledge? Would it be a better approach to introduce a standard way of doing this?

sparker-arm · 2024-06-05T08:58:47Z

Is there a runtime that is expected to consume this already and how are you planning to disseminate this knowledge?

I'm currently working on a patch for AArch64 in V8, I don't think there's been any similar work for other targets there. I guess there could be a risk of breaking some matching for cranelift's iadd_pairwise operation, but I haven't looked into it... I will ping those folks.

Would it be a better approach to introduce a standard way of doing this?

There's a wasm spec post-SIMD-MVP request for FP pairwise adds, which is what I would like.

RKSimon

Do we need to check for ordered float reductions?

sparker-arm · 2024-06-26T09:59:01Z

Do we need to check for ordered float reductions?

If I understand the question correctly, we already do. ExpandReductions will only generate a shuffle for operations that allow reassociation and this is when we can choose our arbitrary shuffle pattern(s).

sparker-arm · 2024-07-12T08:00:00Z

I've changed the scope so that we're now only affecting FADD. This requires that code generators need to implement more pattern matchers, but it also means that, if they don't, the shuffles produced for i8x16 and i16x8 aren't awful.

For the FP workloads that I've been looking at, this change can be worth 3-7% performance uplift.

tlively · 2024-07-15T16:52:35Z

Is the idea here that the Wasm engine can recognize the special shuffle pattern and generate different code for it? If so, is that code the engine is expected to generate correct with respect to the Wasm spec, or is this a method of providing unsafe optimization hints to the engine?

sparker-arm · 2024-07-16T09:29:21Z

is that code the engine is expected to generate correct with respect to the Wasm spec, or is this a method of providing unsafe optimization hints to the engine?

This allows the engine to generate more optimised code while respecting the Wasm spec. The key here is that when compiling with Ofast, LLVM can choose an arbitrary shuffle pattern but once this is lowered to Wasm, it is no longer arbitrary - the engine has to perform the adds in the order specified. By changing the shuffle pattern here, it allows a Wasm engine to select pairwise add vector instructions, such as faddp on AArch64 or haddps on X86.

tlively

Thanks, this LGTM.

WebAssembly doesn't support horizontal operations nor does it have a way of expressing fast-math or reassoc flags, so runtimes are currently unable to use pairwise operations when generating code from the existing shuffle patterns. This patch allows the backend to select which, arbitary, shuffle pattern to be used per reduction intrinsic. The default behaviour is the same as the existing, which is by splitting the vector into a top and bottom half. The other pattern introduced is for a pairwise shuffle. WebAssembly enables pairwise reductions for fp add.

llvm-ci · 2024-07-17T08:47:09Z

LLVM Buildbot has detected a new failure on builder arc-builder running on arc-worker while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/3/builds/1593

Here is the relevant piece of the build log for the reference:

Step 6 (test-build-unified-tree-check-all) failure: 1200 seconds without output running [b'ninja', b'check-all'], attempting to kill
0.836 [82/18/1] Linking CXX executable tools/clang/unittests/Basic/BasicTests
1.381 [81/18/2] Linking CXX executable tools/clang/unittests/Format/FormatTests
5.957 [80/18/3] Linking CXX executable tools/clang/unittests/Lex/LexTests
6.399 [79/18/4] Linking CXX executable tools/clang/unittests/libclang/libclangTests
6.499 [78/18/5] Linking CXX executable tools/clang/unittests/Rewrite/RewriteTests
7.045 [77/18/6] Linking CXX executable tools/clang/unittests/CrossTU/CrossTUTests
command timed out: 1200 seconds without output running [b'ninja', b'check-all'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1207.740788

llvm-ci · 2024-07-17T09:41:40Z

LLVM Buildbot has detected a new failure on builder sanitizer-x86_64-linux running on sanitizer-buildbot1 while building llvm at step 2 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/66/builds/1725

Here is the relevant piece of the build log for the reference:

Step 2 (annotate) failure: 'python ../sanitizer_buildbot/sanitizers/zorg/buildbot/builders/sanitizers/buildbot_selector.py' (failure)
...
llvm-lit: /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/fuzzer/lit.cfg.py:60: note: linux feature available
llvm-lit: /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/fuzzer/lit.cfg.py:36: note: lsan feature available
llvm-lit: /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/fuzzer/lit.cfg.py:45: note: msan feature unavailable
llvm-lit: /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/fuzzer/lit.cfg.py:60: note: linux feature available
llvm-lit: /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/fuzzer/lit.cfg.py:36: note: lsan feature available
llvm-lit: /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/fuzzer/lit.cfg.py:48: note: msan feature available
llvm-lit: /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/fuzzer/lit.cfg.py:60: note: linux feature available
llvm-lit: /b/sanitizer-x86_64-linux/build/llvm-project/llvm/utils/lit/lit/main.py:72: note: The test suite configuration requested an individual test timeout of 0 seconds but a timeout of 900 seconds was requested on the command line. Forcing timeout to be 900 seconds.
-- Testing: 10012 tests, 88 workers --
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60..
FAIL: SanitizerCommon-lsan-i386-Linux :: Linux/soft_rss_limit_mb_test.cpp (6995 of 10012)
******************** TEST 'SanitizerCommon-lsan-i386-Linux :: Linux/soft_rss_limit_mb_test.cpp' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
RUN: at line 2: /b/sanitizer-x86_64-linux/build/build_default/./bin/clang  --driver-mode=g++ -gline-tables-only -fsanitize=leak  -m32 -funwind-tables  -I/b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test -ldl -O2 /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/sanitizer_common/TestCases/Linux/soft_rss_limit_mb_test.cpp -o /b/sanitizer-x86_64-linux/build/build_default/runtimes/runtimes-bins/compiler-rt/test/sanitizer_common/lsan-i386-Linux/Linux/Output/soft_rss_limit_mb_test.cpp.tmp
+ /b/sanitizer-x86_64-linux/build/build_default/./bin/clang --driver-mode=g++ -gline-tables-only -fsanitize=leak -m32 -funwind-tables -I/b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test -ldl -O2 /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/sanitizer_common/TestCases/Linux/soft_rss_limit_mb_test.cpp -o /b/sanitizer-x86_64-linux/build/build_default/runtimes/runtimes-bins/compiler-rt/test/sanitizer_common/lsan-i386-Linux/Linux/Output/soft_rss_limit_mb_test.cpp.tmp
RUN: at line 5: env LSAN_OPTIONS=soft_rss_limit_mb=220:quarantine_size=1:allocator_may_return_null=1      /b/sanitizer-x86_64-linux/build/build_default/runtimes/runtimes-bins/compiler-rt/test/sanitizer_common/lsan-i386-Linux/Linux/Output/soft_rss_limit_mb_test.cpp.tmp 2>&1 | FileCheck /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/sanitizer_common/TestCases/Linux/soft_rss_limit_mb_test.cpp -check-prefix=CHECK_MAY_RETURN_1
+ env LSAN_OPTIONS=soft_rss_limit_mb=220:quarantine_size=1:allocator_may_return_null=1 /b/sanitizer-x86_64-linux/build/build_default/runtimes/runtimes-bins/compiler-rt/test/sanitizer_common/lsan-i386-Linux/Linux/Output/soft_rss_limit_mb_test.cpp.tmp
+ FileCheck /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/sanitizer_common/TestCases/Linux/soft_rss_limit_mb_test.cpp -check-prefix=CHECK_MAY_RETURN_1
/b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/sanitizer_common/TestCases/Linux/soft_rss_limit_mb_test.cpp:68:24: error: CHECK_MAY_RETURN_1: expected string not found in input
// CHECK_MAY_RETURN_1: allocating 512 times
                       ^
<stdin>:52:44: note: scanning from here
Some of the malloc calls returned non-null: 256
                                           ^
<stdin>:52:45: note: possible intended match here
Some of the malloc calls returned non-null: 256
                                            ^

Input file: <stdin>
Check file: /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/sanitizer_common/TestCases/Linux/soft_rss_limit_mb_test.cpp

-dump-input=help explains the following input dump.

Input was:
<<<<<<
            .
            .
            .
           47:  [256] 
           48:  [320] 
           49:  [384] 
           50:  [448] 
           51: Some of the malloc calls returned null: 256 
           52: Some of the malloc calls returned non-null: 256 
check:68'0                                                X~~~~ error: no match found
check:68'1                                                 ?    possible intended match
Step 15 (test compiler-rt default) failure: test compiler-rt default (failure)
...
llvm-lit: /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/fuzzer/lit.cfg.py:60: note: linux feature available
llvm-lit: /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/fuzzer/lit.cfg.py:36: note: lsan feature available
llvm-lit: /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/fuzzer/lit.cfg.py:45: note: msan feature unavailable
llvm-lit: /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/fuzzer/lit.cfg.py:60: note: linux feature available
llvm-lit: /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/fuzzer/lit.cfg.py:36: note: lsan feature available
llvm-lit: /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/fuzzer/lit.cfg.py:48: note: msan feature available
llvm-lit: /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/fuzzer/lit.cfg.py:60: note: linux feature available
llvm-lit: /b/sanitizer-x86_64-linux/build/llvm-project/llvm/utils/lit/lit/main.py:72: note: The test suite configuration requested an individual test timeout of 0 seconds but a timeout of 900 seconds was requested on the command line. Forcing timeout to be 900 seconds.
-- Testing: 10012 tests, 88 workers --
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60..
FAIL: SanitizerCommon-lsan-i386-Linux :: Linux/soft_rss_limit_mb_test.cpp (6995 of 10012)
******************** TEST 'SanitizerCommon-lsan-i386-Linux :: Linux/soft_rss_limit_mb_test.cpp' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
RUN: at line 2: /b/sanitizer-x86_64-linux/build/build_default/./bin/clang  --driver-mode=g++ -gline-tables-only -fsanitize=leak  -m32 -funwind-tables  -I/b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test -ldl -O2 /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/sanitizer_common/TestCases/Linux/soft_rss_limit_mb_test.cpp -o /b/sanitizer-x86_64-linux/build/build_default/runtimes/runtimes-bins/compiler-rt/test/sanitizer_common/lsan-i386-Linux/Linux/Output/soft_rss_limit_mb_test.cpp.tmp
+ /b/sanitizer-x86_64-linux/build/build_default/./bin/clang --driver-mode=g++ -gline-tables-only -fsanitize=leak -m32 -funwind-tables -I/b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test -ldl -O2 /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/sanitizer_common/TestCases/Linux/soft_rss_limit_mb_test.cpp -o /b/sanitizer-x86_64-linux/build/build_default/runtimes/runtimes-bins/compiler-rt/test/sanitizer_common/lsan-i386-Linux/Linux/Output/soft_rss_limit_mb_test.cpp.tmp
RUN: at line 5: env LSAN_OPTIONS=soft_rss_limit_mb=220:quarantine_size=1:allocator_may_return_null=1      /b/sanitizer-x86_64-linux/build/build_default/runtimes/runtimes-bins/compiler-rt/test/sanitizer_common/lsan-i386-Linux/Linux/Output/soft_rss_limit_mb_test.cpp.tmp 2>&1 | FileCheck /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/sanitizer_common/TestCases/Linux/soft_rss_limit_mb_test.cpp -check-prefix=CHECK_MAY_RETURN_1
+ env LSAN_OPTIONS=soft_rss_limit_mb=220:quarantine_size=1:allocator_may_return_null=1 /b/sanitizer-x86_64-linux/build/build_default/runtimes/runtimes-bins/compiler-rt/test/sanitizer_common/lsan-i386-Linux/Linux/Output/soft_rss_limit_mb_test.cpp.tmp
+ FileCheck /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/sanitizer_common/TestCases/Linux/soft_rss_limit_mb_test.cpp -check-prefix=CHECK_MAY_RETURN_1
/b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/sanitizer_common/TestCases/Linux/soft_rss_limit_mb_test.cpp:68:24: error: CHECK_MAY_RETURN_1: expected string not found in input
// CHECK_MAY_RETURN_1: allocating 512 times
                       ^
<stdin>:52:44: note: scanning from here
Some of the malloc calls returned non-null: 256
                                           ^
<stdin>:52:45: note: possible intended match here
Some of the malloc calls returned non-null: 256
                                            ^

Input file: <stdin>
Check file: /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/sanitizer_common/TestCases/Linux/soft_rss_limit_mb_test.cpp

-dump-input=help explains the following input dump.

Input was:
<<<<<<
            .
            .
            .
           47:  [256] 
           48:  [320] 
           49:  [384] 
           50:  [448] 
           51: Some of the malloc calls returned null: 256 
           52: Some of the malloc calls returned non-null: 256 
check:68'0                                                X~~~~ error: no match found
check:68'1                                                 ?    possible intended match

ppenzin · 2024-07-19T19:26:22Z

I was hoping we can discuss this in Wasm SIMD subgroup before merging.

@yolanda15 @jing-bao you might want to check if this change would have any effect on x86 patterns in V8.

Summary: WebAssembly doesn't support horizontal operations nor does it have a way of expressing fast-math or reassoc flags, so runtimes are currently unable to use pairwise operations when generating code from the existing shuffle patterns. This patch allows the backend to select which, arbitary, shuffle pattern to be used per reduction intrinsic. The default behaviour is the same as the existing, which is by splitting the vector into a top and bottom half. The other pattern introduced is for a pairwise shuffle. WebAssembly enables pairwise reductions for int/fp add/sub. Test Plan: Reviewers: Subscribers: Tasks: Tags: Differential Revision: https://phabricator.intern.facebook.com/D60250991

llvmbot added backend:WebAssembly llvm:analysis llvm:transforms labels May 31, 2024

sparker-arm requested review from RKSimon and aheejin May 31, 2024 10:49

sparker-arm force-pushed the pairwise-reduction-expansion branch from 9d7ee1d to bb62ec2 Compare May 31, 2024 10:56

This was referenced Jun 4, 2024

SIMD meeting on 2024-06-21 WebAssembly/flexible-vectors#64

Closed

SIMD horizontal adds WebAssembly/flexible-vectors#65

Open

RKSimon reviewed Jun 10, 2024

View reviewed changes

sparker-arm force-pushed the pairwise-reduction-expansion branch from bb62ec2 to bdfcb50 Compare June 26, 2024 09:54

sparker-arm force-pushed the pairwise-reduction-expansion branch from bdfcb50 to ddef2e5 Compare July 12, 2024 07:47

sparker-arm requested review from sbc100 and tlively July 12, 2024 07:50

tlively approved these changes Jul 16, 2024

View reviewed changes

sparker-arm force-pushed the pairwise-reduction-expansion branch from ddef2e5 to 9732777 Compare July 17, 2024 08:00

sparker-arm merged commit d28ed29 into llvm:main Jul 17, 2024
3 of 5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[TTI][WebAssembly] Pairwise reduction expansion #93948

[TTI][WebAssembly] Pairwise reduction expansion #93948

sparker-arm commented May 31, 2024

llvmbot commented May 31, 2024 •

edited

Loading

github-actions bot commented May 31, 2024 •

edited

Loading

ppenzin commented Jun 4, 2024

sparker-arm commented Jun 5, 2024

RKSimon left a comment

sparker-arm commented Jun 26, 2024

sparker-arm commented Jul 12, 2024

tlively commented Jul 15, 2024

sparker-arm commented Jul 16, 2024

tlively left a comment

llvm-ci commented Jul 17, 2024

llvm-ci commented Jul 17, 2024

ppenzin commented Jul 19, 2024

[TTI][WebAssembly] Pairwise reduction expansion #93948

[TTI][WebAssembly] Pairwise reduction expansion #93948

Conversation

sparker-arm commented May 31, 2024

llvmbot commented May 31, 2024 • edited Loading

github-actions bot commented May 31, 2024 • edited Loading

ppenzin commented Jun 4, 2024

sparker-arm commented Jun 5, 2024

RKSimon left a comment

Choose a reason for hiding this comment

sparker-arm commented Jun 26, 2024

sparker-arm commented Jul 12, 2024

tlively commented Jul 15, 2024

sparker-arm commented Jul 16, 2024

tlively left a comment

Choose a reason for hiding this comment

llvm-ci commented Jul 17, 2024

llvm-ci commented Jul 17, 2024

ppenzin commented Jul 19, 2024

llvmbot commented May 31, 2024 •

edited

Loading

github-actions bot commented May 31, 2024 •

edited

Loading