Optimize count leading ones if promoted type #99591

v01dXYZ · 2024-07-19T00:24:58Z

First suggestions to solve #96455.

tested with:

scalar X86
AArch64
RISCV (and VP_)

For SelectionDAG with AMDGPU, I tried to replace the custom legaliser for CTLZ (which is necessary to make convert it to i32 instead of i16 if no custom legaliser), but I didn't manage to get it work when there is a ANYEXTEND user. That's why the commit is reverted. GlobalISel works fine though.

llvmbot · 2024-07-23T11:24:28Z

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-backend-loongarch
@llvm/pr-subscribers-llvm-selectiondag

@llvm/pr-subscribers-llvm-globalisel

Author: None (v01dXYZ)

Changes

First suggestions to solve #96455.

Only tested with scalar X86 and AArch64, RISCV (and VP_) is not tested yet.

For SelectionDAG with AMDGPU, I tried to replace the custom legaliser for CTLZ (which is necessary to make convert it to i32 instead of i16 if no custom legaliser), but I didn't manage to get it work when there is a ANYEXTEND user. That's why the commit is reverted. GlobalISel works fine though.

Full diff: https://github.com/llvm/llvm-project/pull/99591.diff

8 Files Affected:

(modified) llvm/lib/CodeGen/CodeGenPrepare.cpp (+22-1)
(modified) llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp (+36)
(modified) llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp (+45)
(modified) llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp (+49)
(added) llvm/test/CodeGen/AArch64/ctlo.ll (+100)
(modified) llvm/test/CodeGen/LoongArch/ctlz-cttz-ctpop.ll (+8-16)
(modified) llvm/test/CodeGen/RISCV/rvv/ctlz-vp.ll (+52)
(modified) llvm/test/CodeGen/X86/ctlo.ll (+11-15)

diff --git a/llvm/lib/CodeGen/CodeGenPrepare.cpp b/llvm/lib/CodeGen/CodeGenPrepare.cpp
index 22d0708f54786..37ba7d7b0b23e 100644
--- a/llvm/lib/CodeGen/CodeGenPrepare.cpp
+++ b/llvm/lib/CodeGen/CodeGenPrepare.cpp
@@ -2298,8 +2298,29 @@ static bool despeculateCountZeros(IntrinsicInst *CountZeros,
   if (match(CountZeros->getOperand(1), m_One()))
     return false;
 
-  // If it's cheap to speculate, there's nothing to do.
   Type *Ty = CountZeros->getType();
+  EVT VTy = TLI->getValueType(*DL, Ty);
+
+  // do not despeculate if we have (ctlz (xor op -1)) if the operand is
+  // promoted as legalisation would later transform to:
+  //
+  // (ctlz (lshift (xor (extend op) -1)
+  //               lshiftamount))
+  //
+  // Despeculation is not only useless but also not wanted with SelectionDAG
+  // as XOR and CTLZ would be in different basic blocks.
+  ConstantInt *C;
+  Value *Op0;
+  Value *Op1;
+  if ((TLI->getTypeAction(CountZeros->getContext(), VTy) ==
+           TargetLowering::TypePromoteInteger ||
+       TLI->getOperationAction(ISD::CTLZ, VTy) == TargetLowering::Promote) &&
+      match(CountZeros->getOperand(0), m_Xor(m_Value(Op0), m_Value(Op1))) &&
+      ((C = dyn_cast<ConstantInt>(Op0)) || (C = dyn_cast<ConstantInt>(Op1))) &&
+      C->isMinusOne())
+    return false;
+
+  // If it's cheap to speculate, there's nothing to do.
   auto IntrinsicID = CountZeros->getIntrinsicID();
   if ((IntrinsicID == Intrinsic::cttz && TLI->isCheapToSpeculateCttz(Ty)) ||
       (IntrinsicID == Intrinsic::ctlz && TLI->isCheapToSpeculateCtlz(Ty)))
diff --git a/llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp b/llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
index 3f1094e0ac703..a334b14e0cc8b 100644
--- a/llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
+++ b/llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
@@ -2356,6 +2356,35 @@ LegalizerHelper::widenScalarMulo(MachineInstr &MI, unsigned TypeIdx,
   return Legalized;
 }
 
+static bool extendCtlzNot(const MachineInstr &MI, MachineIRBuilder &MIRBuilder,
+                          MachineRegisterInfo &MRI, LLT WideTy) {
+  Register XorSrc;
+  Register CstReg;
+  if (!mi_match(MI.getOperand(1).getReg(), MRI,
+                m_GXor(m_Reg(XorSrc), m_Reg(CstReg))))
+    return false;
+
+  auto OptCst = getIConstantVRegValWithLookThrough(CstReg, MRI);
+  APInt Cst = OptCst->Value;
+
+  if (!Cst.isAllOnes())
+    return false;
+
+  auto AllOnes = MIRBuilder.buildConstant(
+      WideTy, APInt::getAllOnes(WideTy.getSizeInBits()));
+  auto Res = MIRBuilder.buildAnyExt(WideTy, XorSrc);
+
+  Register SrcReg = MI.getOperand(1).getReg();
+  LLT CurTy = MRI.getType(SrcReg);
+  unsigned SizeDiff = WideTy.getSizeInBits() - CurTy.getSizeInBits();
+  Res = MIRBuilder.buildShl(WideTy, Res,
+                            MIRBuilder.buildConstant(WideTy, SizeDiff));
+  Res = MIRBuilder.buildXor(WideTy, Res, AllOnes);
+  Res = MIRBuilder.buildCTLZ_ZERO_UNDEF(MI.getOperand(0), Res);
+
+  return true;
+}
+
 LegalizerHelper::LegalizeResult
 LegalizerHelper::widenScalar(MachineInstr &MI, unsigned TypeIdx, LLT WideTy) {
   switch (MI.getOpcode()) {
@@ -2449,6 +2478,13 @@ LegalizerHelper::widenScalar(MachineInstr &MI, unsigned TypeIdx, LLT WideTy) {
     auto MIBSrc = MIRBuilder.buildInstr(ExtOpc, {WideTy}, {SrcReg});
     LLT CurTy = MRI.getType(SrcReg);
     unsigned NewOpc = MI.getOpcode();
+
+    if ((MI.getOpcode() == TargetOpcode::G_CTLZ ||
+         MI.getOpcode() == TargetOpcode::G_CTLZ_ZERO_UNDEF) &&
+        extendCtlzNot(MI, MIRBuilder, MRI, WideTy)) {
+      MI.eraseFromParent();
+      return Legalized;
+    }
     if (NewOpc == TargetOpcode::G_CTTZ) {
       // The count is the same in the larger type except if the original
       // value was zero.  This can be handled by setting the bit just off
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
index d6a0dd9ae9b20..f062ad2543dd9 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
@@ -5049,6 +5049,40 @@ static MVT getPromotedVectorElementType(const TargetLowering &TLI,
   return MidVT;
 }
 
+// (CTLZ (XOR Op -1)) --> (TRUNCATE (CTLZ_ZERO_UNDEF
+//                                    (XOR (SHIFT (ANYEXTEND Op1)
+//                                                ShiftAmount)
+//                                         -1)))
+static bool ExtendCtlzNot(SDNode *Node, SDValue &Result, SDLoc &dl, MVT OVT,
+                          MVT NVT, SelectionDAG &DAG) {
+  SDValue NotOp = Node->getOperand(0);
+  if (NotOp.getOpcode() != ISD::XOR)
+    return false;
+
+  SDValue SrcOp = NotOp->getOperand(0);
+  SDValue CstOp = NotOp->getOperand(1);
+
+  ConstantSDNode *Cst = dyn_cast<ConstantSDNode>(CstOp);
+
+  if (!Cst || !Cst->isAllOnes())
+    return false;
+
+  auto ExtSrc = DAG.getNode(ISD::ANY_EXTEND, dl, NVT, SrcOp);
+  unsigned SHLAmount = NVT.getScalarSizeInBits() - OVT.getScalarSizeInBits();
+  auto ShiftConst =
+      DAG.getShiftAmountConstant(SHLAmount, ExtSrc.getValueType(), dl);
+  SDValue NSrcOp = DAG.getNode(ISD::SHL, dl, NVT, ExtSrc, ShiftConst);
+
+  SDValue NCstOp =
+      DAG.getConstant(APInt::getAllOnes(NVT.getScalarSizeInBits()), dl, NVT);
+
+  Result = DAG.getNode(NotOp->getOpcode(), dl, NVT, NSrcOp, NCstOp,
+                       NotOp->getFlags());
+  Result = DAG.getNode(ISD::CTLZ_ZERO_UNDEF, dl, NVT, Result);
+  Result = DAG.getNode(ISD::TRUNCATE, dl, OVT, Result);
+  return true;
+}
+
 void SelectionDAGLegalize::PromoteNode(SDNode *Node) {
   LLVM_DEBUG(dbgs() << "Trying to promote node\n");
   SmallVector<SDValue, 8> Results;
@@ -5084,6 +5118,13 @@ void SelectionDAGLegalize::PromoteNode(SDNode *Node) {
   case ISD::CTTZ_ZERO_UNDEF:
   case ISD::CTLZ:
   case ISD::CTPOP: {
+    // If the operand of CTLZ is NOT, push the extend in the NOT.
+    if (Node->getOpcode() == ISD::CTLZ &&
+        ExtendCtlzNot(Node, Tmp1, dl, OVT, NVT, DAG)) {
+      Results.push_back(Tmp1);
+      break;
+    }
+
     // Zero extend the argument unless its cttz, then use any_extend.
     if (Node->getOpcode() == ISD::CTTZ ||
         Node->getOpcode() == ISD::CTTZ_ZERO_UNDEF)
@@ -5115,6 +5156,10 @@ void SelectionDAGLegalize::PromoteNode(SDNode *Node) {
     break;
   }
   case ISD::CTLZ_ZERO_UNDEF: {
+    if (ExtendCtlzNot(Node, Tmp1, dl, OVT, NVT, DAG)) {
+      Results.push_back(Tmp1);
+      break;
+    }
     // We know that the argument is unlikely to be zero, hence we can take a
     // different approach as compared to ISD::CTLZ
 
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
index fed5ebcc3c903..e55fed038f576 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
@@ -638,6 +638,47 @@ SDValue DAGTypeLegalizer::PromoteIntRes_Constant(SDNode *N) {
   return Result;
 }
 
+// (CTLZ (XOR Op -1)) --> (CTLZ_ZERO_UNDEF (XOR (SHIFT (ANYEXTEND Op1)
+//                                                     ShiftAmount)
+//                                               -1))
+static bool ExtendCtlzNot(SDNode *Node, SDValue &Result, SDLoc &dl, EVT OVT,
+                          EVT NVT, SelectionDAG &DAG) {
+  SDValue NotOp = Node->getOperand(0);
+  if (NotOp.getOpcode() != ISD::XOR)
+    return false;
+
+  SDValue SrcOp = NotOp->getOperand(0);
+  SDValue CstOp = NotOp->getOperand(1);
+
+  if (!isAllOnesOrAllOnesSplat(CstOp))
+    return false;
+
+  auto ExtSrc = DAG.getNode(ISD::ANY_EXTEND, dl, NVT, SrcOp);
+  unsigned SHLAmount = NVT.getScalarSizeInBits() - OVT.getScalarSizeInBits();
+  auto ShiftConst =
+      DAG.getShiftAmountConstant(SHLAmount, ExtSrc.getValueType(), dl);
+
+  SDValue NCstOp =
+      DAG.getConstant(APInt::getAllOnes(NVT.getScalarSizeInBits()), dl, NVT);
+  if (!Node->isVPOpcode()) {
+    SDValue NSrcOp = DAG.getNode(ISD::SHL, dl, NVT, ExtSrc, ShiftConst);
+
+    Result = DAG.getNode(ISD::XOR, dl, NVT, NSrcOp, NCstOp);
+    Result = DAG.getNode(ISD::CTLZ_ZERO_UNDEF, dl, NVT, Result);
+  } else {
+    SDValue Mask = Node->getOperand(1);
+    SDValue EVL = Node->getOperand(2);
+
+    SDValue NSrcOp =
+        DAG.getNode(ISD::VP_SHL, dl, NVT, ExtSrc, ShiftConst, Mask, EVL);
+
+    Result = DAG.getNode(ISD::VP_XOR, dl, NVT, NSrcOp, NCstOp, Mask, EVL);
+    Result = DAG.getNode(ISD::VP_CTLZ_ZERO_UNDEF, dl, NVT, Result, Mask, EVL);
+  }
+
+  return true;
+}
+
 SDValue DAGTypeLegalizer::PromoteIntRes_CTLZ(SDNode *N) {
   EVT OVT = N->getValueType(0);
   EVT NVT = TLI.getTypeToTransformTo(*DAG.getContext(), OVT);
@@ -656,6 +697,14 @@ SDValue DAGTypeLegalizer::PromoteIntRes_CTLZ(SDNode *N) {
   }
 
   unsigned CtlzOpcode = N->getOpcode();
+  // If the operand of CTLZ is NOT, push the extend in the NOT.
+  if (SDValue Res;
+      (CtlzOpcode == ISD::CTLZ || CtlzOpcode == ISD::CTLZ_ZERO_UNDEF ||
+       CtlzOpcode == ISD::VP_CTLZ || CtlzOpcode == ISD::VP_CTLZ_ZERO_UNDEF) &&
+      ExtendCtlzNot(N, Res, dl, OVT, NVT, DAG)) {
+    return Res;
+  }
+
   if (CtlzOpcode == ISD::CTLZ || CtlzOpcode == ISD::VP_CTLZ) {
     // Subtract off the extra leading bits in the bigger type.
     SDValue ExtractLeadingBits = DAG.getConstant(
diff --git a/llvm/test/CodeGen/AArch64/ctlo.ll b/llvm/test/CodeGen/AArch64/ctlo.ll
new file mode 100644
index 0000000000000..5f15f540f458d
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/ctlo.ll
@@ -0,0 +1,100 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s --mtriple=aarch64 -verify-machineinstrs | FileCheck %s
+; RUN: llc < %s --mtriple=aarch64 -global-isel -verify-machineinstrs | FileCheck %s
+
+declare i8 @llvm.ctlz.i8(i8, i1)
+declare i16 @llvm.ctlz.i16(i16, i1)
+declare i32 @llvm.ctlz.i32(i32, i1)
+declare i64 @llvm.ctlz.i64(i64, i1)
+
+define i8 @ctlo_i8(i8 %x) {
+; CHECK-LABEL: ctlo_i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #-1 // =0xffffffff
+; CHECK-NEXT:    eor w8, w8, w0, lsl #24
+; CHECK-NEXT:    clz w0, w8
+; CHECK-NEXT:    ret
+  %tmp1 = xor i8 %x, -1
+  %tmp2 = call i8 @llvm.ctlz.i8( i8 %tmp1, i1 false )
+  ret i8 %tmp2
+}
+
+define i8 @ctlo_i8_undef(i8 %x) {
+; CHECK-LABEL: ctlo_i8_undef:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #-1 // =0xffffffff
+; CHECK-NEXT:    eor w8, w8, w0, lsl #24
+; CHECK-NEXT:    clz w0, w8
+; CHECK-NEXT:    ret
+  %tmp1 = xor i8 %x, -1
+  %tmp2 = call i8 @llvm.ctlz.i8( i8 %tmp1, i1 true )
+  ret i8 %tmp2
+}
+
+define i16 @ctlo_i16(i16 %x) {
+; CHECK-LABEL: ctlo_i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #-1 // =0xffffffff
+; CHECK-NEXT:    eor w8, w8, w0, lsl #16
+; CHECK-NEXT:    clz w0, w8
+; CHECK-NEXT:    ret
+  %tmp1 = xor i16 %x, -1
+  %tmp2 = call i16 @llvm.ctlz.i16( i16 %tmp1, i1 false )
+  ret i16 %tmp2
+}
+
+define i16 @ctlo_i16_undef(i16 %x) {
+; CHECK-LABEL: ctlo_i16_undef:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #-1 // =0xffffffff
+; CHECK-NEXT:    eor w8, w8, w0, lsl #16
+; CHECK-NEXT:    clz w0, w8
+; CHECK-NEXT:    ret
+  %tmp1 = xor i16 %x, -1
+  %tmp2 = call i16 @llvm.ctlz.i16( i16 %tmp1, i1 true )
+  ret i16 %tmp2
+}
+
+define i32 @ctlo_i32(i32 %x) {
+; CHECK-LABEL: ctlo_i32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mvn w8, w0
+; CHECK-NEXT:    clz w0, w8
+; CHECK-NEXT:    ret
+  %tmp1 = xor i32 %x, -1
+  %tmp2 = call i32 @llvm.ctlz.i32( i32 %tmp1, i1 false )
+  ret i32 %tmp2
+}
+
+define i32 @ctlo_i32_undef(i32 %x) {
+; CHECK-LABEL: ctlo_i32_undef:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mvn w8, w0
+; CHECK-NEXT:    clz w0, w8
+; CHECK-NEXT:    ret
+  %tmp1 = xor i32 %x, -1
+  %tmp2 = call i32 @llvm.ctlz.i32( i32 %tmp1, i1 true )
+  ret i32 %tmp2
+}
+
+define i64 @ctlo_i64(i64 %x) {
+; CHECK-LABEL: ctlo_i64:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mvn x8, x0
+; CHECK-NEXT:    clz x0, x8
+; CHECK-NEXT:    ret
+  %tmp1 = xor i64 %x, -1
+  %tmp2 = call i64 @llvm.ctlz.i64( i64 %tmp1, i1 false )
+  ret i64 %tmp2
+}
+
+define i64 @ctlo_i64_undef(i64 %x) {
+; CHECK-LABEL: ctlo_i64_undef:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mvn x8, x0
+; CHECK-NEXT:    clz x0, x8
+; CHECK-NEXT:    ret
+  %tmp1 = xor i64 %x, -1
+  %tmp2 = call i64 @llvm.ctlz.i64( i64 %tmp1, i1 true )
+  ret i64 %tmp2
+}
diff --git a/llvm/test/CodeGen/LoongArch/ctlz-cttz-ctpop.ll b/llvm/test/CodeGen/LoongArch/ctlz-cttz-ctpop.ll
index f17cec231f323..e993ecfcdf3b8 100644
--- a/llvm/test/CodeGen/LoongArch/ctlz-cttz-ctpop.ll
+++ b/llvm/test/CodeGen/LoongArch/ctlz-cttz-ctpop.ll
@@ -89,18 +89,14 @@ define i64 @test_ctlz_i64(i64 %a) nounwind {
 define i8 @test_not_ctlz_i8(i8 %a) nounwind {
 ; LA32-LABEL: test_not_ctlz_i8:
 ; LA32:       # %bb.0:
-; LA32-NEXT:    ori $a1, $zero, 255
-; LA32-NEXT:    andn $a0, $a1, $a0
-; LA32-NEXT:    clz.w $a0, $a0
-; LA32-NEXT:    addi.w $a0, $a0, -24
+; LA32-NEXT:    slli.w $a0, $a0, 24
+; LA32-NEXT:    clo.w $a0, $a0
 ; LA32-NEXT:    ret
 ;
 ; LA64-LABEL: test_not_ctlz_i8:
 ; LA64:       # %bb.0:
-; LA64-NEXT:    ori $a1, $zero, 255
-; LA64-NEXT:    andn $a0, $a1, $a0
-; LA64-NEXT:    clz.d $a0, $a0
-; LA64-NEXT:    addi.d $a0, $a0, -56
+; LA64-NEXT:    slli.d $a0, $a0, 56
+; LA64-NEXT:    clo.d $a0, $a0
 ; LA64-NEXT:    ret
   %neg = xor i8 %a, -1
   %tmp = call i8 @llvm.ctlz.i8(i8 %neg, i1 false)
@@ -110,18 +106,14 @@ define i8 @test_not_ctlz_i8(i8 %a) nounwind {
 define i16 @test_not_ctlz_i16(i16 %a) nounwind {
 ; LA32-LABEL: test_not_ctlz_i16:
 ; LA32:       # %bb.0:
-; LA32-NEXT:    nor $a0, $a0, $zero
-; LA32-NEXT:    bstrpick.w $a0, $a0, 15, 0
-; LA32-NEXT:    clz.w $a0, $a0
-; LA32-NEXT:    addi.w $a0, $a0, -16
+; LA32-NEXT:    slli.w $a0, $a0, 16
+; LA32-NEXT:    clo.w $a0, $a0
 ; LA32-NEXT:    ret
 ;
 ; LA64-LABEL: test_not_ctlz_i16:
 ; LA64:       # %bb.0:
-; LA64-NEXT:    nor $a0, $a0, $zero
-; LA64-NEXT:    bstrpick.d $a0, $a0, 15, 0
-; LA64-NEXT:    clz.d $a0, $a0
-; LA64-NEXT:    addi.d $a0, $a0, -48
+; LA64-NEXT:    slli.d $a0, $a0, 48
+; LA64-NEXT:    clo.d $a0, $a0
 ; LA64-NEXT:    ret
   %neg = xor i16 %a, -1
   %tmp = call i16 @llvm.ctlz.i16(i16 %neg, i1 false)
diff --git a/llvm/test/CodeGen/RISCV/rvv/ctlz-vp.ll b/llvm/test/CodeGen/RISCV/rvv/ctlz-vp.ll
index 58882525e55c4..6f89489bb39d6 100644
--- a/llvm/test/CodeGen/RISCV/rvv/ctlz-vp.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/ctlz-vp.ll
@@ -2624,6 +2624,58 @@ define <vscale x 1 x i9> @vp_ctlz_zero_undef_nxv1i9(<vscale x 1 x i9> %va, <vsca
   %v = call <vscale x 1 x i9> @llvm.vp.ctlz.nxv1i9(<vscale x 1 x i9> %va, i1 true, <vscale x 1 x i1> %m, i32 %evl)
   ret <vscale x 1 x i9> %v
 }
+define <vscale x 1 x i9> @vp_ctlo_nxv1i9(<vscale x 1 x i9> %va, <vscale x 1 x i1> %m, i32 zeroext %evl) {
+; CHECK-LABEL: vp_ctlo_nxv1i9:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetvli zero, a0, e16, mf4, ta, ma
+; CHECK-NEXT:    vsll.vi v8, v8, 7, v0.t
+; CHECK-NEXT:    vnot.v v8, v8, v0.t
+; CHECK-NEXT:    vfwcvt.f.xu.v v9, v8, v0.t
+; CHECK-NEXT:    vsetvli zero, zero, e32, mf2, ta, ma
+; CHECK-NEXT:    vsrl.vi v8, v9, 23, v0.t
+; CHECK-NEXT:    vsetvli zero, zero, e16, mf4, ta, ma
+; CHECK-NEXT:    vnsrl.wi v8, v8, 0, v0.t
+; CHECK-NEXT:    li a0, 142
+; CHECK-NEXT:    vrsub.vx v8, v8, a0, v0.t
+; CHECK-NEXT:    ret
+;
+; CHECK-ZVBB-LABEL: vp_ctlo_nxv1i9:
+; CHECK-ZVBB:       # %bb.0:
+; CHECK-ZVBB-NEXT:    vsetvli zero, a0, e16, mf4, ta, ma
+; CHECK-ZVBB-NEXT:    vsll.vi v8, v8, 7, v0.t
+; CHECK-ZVBB-NEXT:    vnot.v v8, v8, v0.t
+; CHECK-ZVBB-NEXT:    vclz.v v8, v8, v0.t
+; CHECK-ZVBB-NEXT:    ret
+  %va.not = xor <vscale x 1 x i9> %va, splat (i9 -1)
+  %v = call <vscale x 1 x i9> @llvm.vp.ctlz.nxv1i9(<vscale x 1 x i9> %va.not, i1 false, <vscale x 1 x i1> %m, i32 %evl)
+  ret <vscale x 1 x i9> %v
+}
+define <vscale x 1 x i9> @vp_ctlo_zero_undef_nxv1i9(<vscale x 1 x i9> %va, <vscale x 1 x i1> %m, i32 zeroext %evl) {
+; CHECK-LABEL: vp_ctlo_zero_undef_nxv1i9:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetvli zero, a0, e16, mf4, ta, ma
+; CHECK-NEXT:    vsll.vi v8, v8, 7, v0.t
+; CHECK-NEXT:    vnot.v v8, v8, v0.t
+; CHECK-NEXT:    vfwcvt.f.xu.v v9, v8, v0.t
+; CHECK-NEXT:    vsetvli zero, zero, e32, mf2, ta, ma
+; CHECK-NEXT:    vsrl.vi v8, v9, 23, v0.t
+; CHECK-NEXT:    vsetvli zero, zero, e16, mf4, ta, ma
+; CHECK-NEXT:    vnsrl.wi v8, v8, 0, v0.t
+; CHECK-NEXT:    li a0, 142
+; CHECK-NEXT:    vrsub.vx v8, v8, a0, v0.t
+; CHECK-NEXT:    ret
+;
+; CHECK-ZVBB-LABEL: vp_ctlo_zero_undef_nxv1i9:
+; CHECK-ZVBB:       # %bb.0:
+; CHECK-ZVBB-NEXT:    vsetvli zero, a0, e16, mf4, ta, ma
+; CHECK-ZVBB-NEXT:    vsll.vi v8, v8, 7, v0.t
+; CHECK-ZVBB-NEXT:    vnot.v v8, v8, v0.t
+; CHECK-ZVBB-NEXT:    vclz.v v8, v8, v0.t
+; CHECK-ZVBB-NEXT:    ret
+  %va.not = xor <vscale x 1 x i9> %va, splat (i9 -1)
+  %v = call <vscale x 1 x i9> @llvm.vp.ctlz.nxv1i9(<vscale x 1 x i9> %va.not, i1 true, <vscale x 1 x i1> %m, i32 %evl)
+  ret <vscale x 1 x i9> %v
+}
 ;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
 ; RV32: {{.*}}
 ; RV64: {{.*}}
diff --git a/llvm/test/CodeGen/X86/ctlo.ll b/llvm/test/CodeGen/X86/ctlo.ll
index 7431f94f0fdf2..020d6d1b80136 100644
--- a/llvm/test/CodeGen/X86/ctlo.ll
+++ b/llvm/test/CodeGen/X86/ctlo.ll
@@ -46,20 +46,18 @@ define i8 @ctlo_i8(i8 %x) {
 ;
 ; X86-CLZ-LABEL: ctlo_i8:
 ; X86-CLZ:       # %bb.0:
-; X86-CLZ-NEXT:    movzbl {{[0-9]+}}(%esp), %eax
-; X86-CLZ-NEXT:    notb %al
-; X86-CLZ-NEXT:    movzbl %al, %eax
+; X86-CLZ-NEXT:    movl {{[0-9]+}}(%esp), %eax
+; X86-CLZ-NEXT:    shll $24, %eax
+; X86-CLZ-NEXT:    notl %eax
 ; X86-CLZ-NEXT:    lzcntl %eax, %eax
-; X86-CLZ-NEXT:    addl $-24, %eax
 ; X86-CLZ-NEXT:    # kill: def $al killed $al killed $eax
 ; X86-CLZ-NEXT:    retl
 ;
 ; X64-CLZ-LABEL: ctlo_i8:
 ; X64-CLZ:       # %bb.0:
-; X64-CLZ-NEXT:    notb %dil
-; X64-CLZ-NEXT:    movzbl %dil, %eax
-; X64-CLZ-NEXT:    lzcntl %eax, %eax
-; X64-CLZ-NEXT:    addl $-24, %eax
+; X64-CLZ-NEXT:    shll $24, %edi
+; X64-CLZ-NEXT:    notl %edi
+; X64-CLZ-NEXT:    lzcntl %edi, %eax
 ; X64-CLZ-NEXT:    # kill: def $al killed $al killed $eax
 ; X64-CLZ-NEXT:    retq
   %tmp1 = xor i8 %x, -1
@@ -89,20 +87,18 @@ define i8 @ctlo_i8_undef(i8 %x) {
 ;
 ; X86-CLZ-LABEL: ctlo_i8_undef:
 ; X86-CLZ:       # %bb.0:
-; X86-CLZ-NEXT:    movzbl {{[0-9]+}}(%esp), %eax
-; X86-CLZ-NEXT:    notb %al
-; X86-CLZ-NEXT:    movzbl %al, %eax
+; X86-CLZ-NEXT:    movl {{[0-9]+}}(%esp), %eax
 ; X86-CLZ-NEXT:    shll $24, %eax
+; X86-CLZ-NEXT:    notl %eax
 ; X86-CLZ-NEXT:    lzcntl %eax, %eax
 ; X86-CLZ-NEXT:    # kill: def $al killed $al killed $eax
 ; X86-CLZ-NEXT:    retl
 ;
 ; X64-CLZ-LABEL: ctlo_i8_undef:
 ; X64-CLZ:       # %bb.0:
-; X64-CLZ-NEXT:    notb %dil
-; X64-CLZ-NEXT:    movzbl %dil, %eax
-; X64-CLZ-NEXT:    shll $24, %eax
-; X64-CLZ-NEXT:    lzcntl %eax, %eax
+; X64-CLZ-NEXT:    shll $24, %edi
+; X64-CLZ-NEXT:    notl %edi
+; X64-CLZ-NEXT:    lzcntl %edi, %eax
 ; X64-CLZ-NEXT:    # kill: def $al killed $al killed $eax
 ; X64-CLZ-NEXT:    retq
   %tmp1 = xor i8 %x, -1

llvm/lib/CodeGen/CodeGenPrepare.cpp

llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp

llvm/lib/CodeGen/CodeGenPrepare.cpp

dtcxzyw · 2024-07-23T11:45:32Z

It is a bit weird that this patch contains both CGP and SDAG changes. Can we split it into multiple patches?

llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp

llvm/lib/CodeGen/CodeGenPrepare.cpp

v01dXYZ · 2024-07-23T14:34:37Z

Thanks for your review.

Done:

CGP and SD into two separate commits.
only uses the m/mi/sd_match API. Prefer to use getNOT/buildNoT instead of manually creating AllOnes and XOR (this one is not available for VP though in SD).

The remaining task is the test for CGP (especially for vector types).

v01dXYZ · 2024-07-24T12:38:45Z

~~I've got some problems when rebasing my work (Some unrelated tests are failing). That's why I won't fix today the conflict.~~ Fixed (AMDGPU broke its cmake conf e6c20e1).

…omotion For count leading/trailing ones, ie (CTLZ/CTTZ (XOR Op -1)), legalisation should be able to optimise this case when a promotion is necessary. Despeculation should not be applied in this case as it will separate XOR and CTLZ/CTTZ in two different basic blocks. This is particularly problematic with SelectionDAG.

v01dXYZ · 2024-07-25T11:29:19Z

I added some tests for CGP for two different cases:

X86: DAG legalisation is the one doing the promotion
RISCV: type legalisation is the one doing the promotion.

There is still a remaining issue with X86 when LZCNT is not available as DAG legalisation is custom for i8. I tried to mark addOperationPromotedToType(CTLZ, i8, i32) but the code is still not optimal with a spurious XOR 31.

Note also that there is no optimisation for counting trailing ones here, only for counting leading ones, although CGP takes the two cases into account in this PR. Tell me if you think it is relevant to add it.

arsenm · 2024-07-25T16:14:36Z

llvm/lib/CodeGen/CodeGenPrepare.cpp

+  // Despeculation is not only useless but also not wanted with SelectionDAG
+  // as XOR and CTLZ/CTTZ would be in different basic blocks.
+  EVT VTy = TLI->getValueType(*DL, Ty);
+  int ISDOpcode = IntrinsicID == Intrinsic::ctlz ? ISD::CTLZ : ISD::CTTZ;


Do you need to account for the zero_or_undef operand?

No because of the check above early exiting if Operand[1] ~= One.

llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp

(CTLZ (XOR Op -1)) --> (CTLZ_ZERO_UNDEF (XOR (SHIFT (ANYEXTEND Op) ShiftAmount) -1)) The optimisation also applies for CTLZ_ZERO_UNDEF, VP_CTLZ, VP_CTLZ_ZERO_UNDEF. Fixes llvm#96455

v01dXYZ · 2024-07-30T11:04:45Z

Ping

topperc · 2024-08-01T23:50:06Z

llvm/lib/CodeGen/CodeGenPrepare.cpp

+  // Despeculation is not only useless but also not wanted with SelectionDAG
+  // as XOR and CTLZ/CTTZ would be in different basic blocks.
+  EVT VTy = TLI->getValueType(*DL, Ty);
+  int ISDOpcode = IntrinsicID == Intrinsic::ctlz ? ISD::CTLZ : ISD::CTTZ;


Why int instead of unsigned?

In this file, we have in other parts int ISDOpcode = ..., that's why I use int. But it seems there is no usage of setting a variable with an ISD::<Variant> with an int in the code base.

topperc · 2024-08-01T23:55:02Z

llvm/test/CodeGen/RISCV/rvv/ctlz-vp.ll

+; CHECK-ZVBB-NEXT:    vnot.v v8, v8, v0.t
+; CHECK-ZVBB-NEXT:    vclz.v v8, v8, v0.t
+; CHECK-ZVBB-NEXT:    ret
+  %va.not = xor <vscale x 1 x i9> %va, splat (i9 -1)


It would be very unusual to see a vp.ctlz and a regular xor. It be more likely to be a vp.xor and a vp.ctlz.

I added support for vp.xor when it has the same mask and vector length as vp.ctlz. ~~But I let a bug get through (not taking into account the possible case of CTLZ with VP_XOR.)~~ (fixed)

Maybe you want me to not try to match VP_CTLZ + XOR and only keep VP_CTLZ + VP_XOR ?

Maybe you want me to not try to match VP_CTLZ + XOR and only keep VP_CTLZ + VP_XOR ?

Yeah not much point in matching a mixing of VP and non-VP. It won't scale well if we try to do that with every optimization.

topperc · 2024-08-07T20:04:38Z

llvm/test/CodeGen/AArch64/ctlo.ll

+define i8 @ctlo_i8_undef(i8 %x) {
+; CHECK-LABEL: ctlo_i8_undef:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #-1 // =0xffffffff


This looks worse than the current codegen. Or at least not an obvious improvement.

ctlo_i8_undef: // @ctlo_i8_undef .cfi_startproc // %bb.0: mvn w8, w0 lsl w8, w8, #24 clz w0, w8 ret

topperc · 2024-08-07T20:19:02Z

Have you investigated doing this as a DAGCombine after type legalization? Its very unusual to change type legalization behavior based on surrounding instructions.

v01dXYZ · 2024-08-08T03:17:06Z

@topperc Currently, the type legaliser not only extends the operands but needs to add a substract node too. Using a DAGCombiner would mean detect this pattern and reverse it instead of trying to prevent to create it (as it is in this PR). I understand your point though as this PR adds another exception to Type Legalisation. I'm not experienced enough to have a worthy opinion about which one is better.

I'll try to implement it with a DAGCombiner in another PR.

Do not optimise for CTLZ_ZERO_UNDEF

Do not optimise for VP_CTLZ + XOR. That way, the non-VP and VP cases are more clearly separated which is easier to follow.

@topperc

… promotion (#102877) This PR is related to #99591. In this PR, instead of modifying how the legalisation occurs depending on surrounding instructions, we refine after legalisation. This PR has two parts: * `SDPatternMatch/MatchContext`: Modify a little bit the code to match Operands (used by `m_Node(...)`) and Unary/Binary/Ternary Patterns to make it compatible with `VPMatchContext`, instead of only `m_Opc` supported. Some tests were added to ensure no regressions. * `DAGCombiner`: Add a `foldSubCtlzNot` which detect and rewrite the patterns using matching context. Remaining Tasks: - [ ] GlobalISel - [ ] Currently the pattern matching will occur even before legalisation. Should I restrict it to specific stages instead ? - [ ] Style: Add a visitVP_SUB ?? Move `foldSubCtlzNot` in another location for style consistency purpose ? @topperc --------- Co-authored-by: v01dxyz <[email protected]>

v01dXYZ force-pushed the 96455-opt-count-leading-ones-if-promoted-type branch from 14b59c7 to 83b525e Compare July 23, 2024 11:23

v01dXYZ marked this pull request as ready for review July 23, 2024 11:23

llvmbot added backend:AArch64 backend:X86 llvm:globalisel backend:loongarch llvm:SelectionDAG SelectionDAGISel as well labels Jul 23, 2024

dtcxzyw requested review from RKSimon, arsenm, jayfoad and topperc July 23, 2024 11:44

arsenm reviewed Jul 23, 2024

View reviewed changes

llvm/lib/CodeGen/CodeGenPrepare.cpp Outdated Show resolved Hide resolved

llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp Outdated Show resolved Hide resolved

llvm/lib/CodeGen/CodeGenPrepare.cpp Outdated Show resolved Hide resolved

RKSimon reviewed Jul 23, 2024

View reviewed changes

llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp Outdated Show resolved Hide resolved

v01dXYZ force-pushed the 96455-opt-count-leading-ones-if-promoted-type branch from 83b525e to f6f2161 Compare July 23, 2024 14:24

arsenm reviewed Jul 23, 2024

View reviewed changes

llvm/lib/CodeGen/CodeGenPrepare.cpp Outdated Show resolved Hide resolved

v01dXYZ force-pushed the 96455-opt-count-leading-ones-if-promoted-type branch 2 times, most recently from 6b7bb51 to b6e18db Compare July 24, 2024 10:15

v01dXYZ force-pushed the 96455-opt-count-leading-ones-if-promoted-type branch from b6e18db to 7c23984 Compare July 24, 2024 13:26

v01dXYZ force-pushed the 96455-opt-count-leading-ones-if-promoted-type branch from 7c23984 to 4800bb4 Compare July 25, 2024 10:09

llvmbot added the llvm:transforms label Jul 25, 2024

arsenm reviewed Jul 25, 2024

View reviewed changes

topperc reviewed Jul 25, 2024

View reviewed changes

llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp Outdated Show resolved Hide resolved

[CodeGen] Legalisation with promotion: optimise count leading ones

8b95ef8

(CTLZ (XOR Op -1)) --> (CTLZ_ZERO_UNDEF (XOR (SHIFT (ANYEXTEND Op) ShiftAmount) -1)) The optimisation also applies for CTLZ_ZERO_UNDEF, VP_CTLZ, VP_CTLZ_ZERO_UNDEF. Fixes llvm#96455

v01dXYZ force-pushed the 96455-opt-count-leading-ones-if-promoted-type branch from 4800bb4 to 8b95ef8 Compare July 26, 2024 08:41

topperc reviewed Aug 1, 2024

View reviewed changes

v01dXYZ force-pushed the 96455-opt-count-leading-ones-if-promoted-type branch from fefcd7a to 3adc473 Compare August 5, 2024 09:07

Count leading ones promoted type optimisation: Support VP_XOR

2654579

v01dXYZ force-pushed the 96455-opt-count-leading-ones-if-promoted-type branch from 3adc473 to 2654579 Compare August 5, 2024 10:07

topperc reviewed Aug 7, 2024

View reviewed changes

v01dxyz added 2 commits August 8, 2024 09:12

Counting leading ones: Optimise only for CTLZ

35551b7

Do not optimise for CTLZ_ZERO_UNDEF

Counting leading ones: Optimise VP_CTLZ only with VP_XOR

915c34e

Do not optimise for VP_CTLZ + XOR. That way, the non-VP and VP cases are more clearly separated which is easier to follow.

v01dXYZ mentioned this pull request Aug 12, 2024

[DAGCombine] Count leading ones: refine post DAG/Type Legalisation if promotion #102877

Merged

3 tasks

Optimize count leading ones if promoted type #99591

Are you sure you want to change the base?

Optimize count leading ones if promoted type #99591

Uh oh!

Conversation

v01dXYZ commented Jul 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Jul 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dtcxzyw commented Jul 23, 2024

Uh oh!

Uh oh!

Uh oh!

v01dXYZ commented Jul 23, 2024

Uh oh!

v01dXYZ commented Jul 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

v01dXYZ commented Jul 25, 2024

Uh oh!

arsenm Jul 25, 2024

Choose a reason for hiding this comment

Uh oh!

v01dXYZ Jul 25, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

v01dXYZ commented Jul 30, 2024

Uh oh!

topperc Aug 1, 2024

Choose a reason for hiding this comment

Uh oh!

v01dXYZ Aug 5, 2024

Choose a reason for hiding this comment

Uh oh!

topperc Aug 1, 2024

Choose a reason for hiding this comment

Uh oh!

v01dXYZ Aug 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

v01dXYZ Aug 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

topperc Aug 7, 2024

Choose a reason for hiding this comment

Uh oh!

topperc Aug 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

topperc commented Aug 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

v01dXYZ commented Aug 8, 2024

Uh oh!

Uh oh!

v01dXYZ commented Jul 19, 2024 •

edited

Loading

llvmbot commented Jul 23, 2024 •

edited

Loading

v01dXYZ commented Jul 24, 2024 •

edited

Loading

v01dXYZ Aug 5, 2024 •

edited

Loading

v01dXYZ Aug 5, 2024 •

edited

Loading

topperc Aug 7, 2024 •

edited

Loading

topperc commented Aug 7, 2024 •

edited

Loading