Skip to content

Commit 0b34e6b

Browse files
raulcdpitrou
andauthored
apacheGH-46777: [C++] Use SimplifyIsIn only when the value_set of the expression is lower than a threshold (apache#46859)
### Rationale for this change Using `SimplifyIsIn` when the value set is large has a substantial performance penalty. ### What changes are included in this PR? Ensure we do not use the simplification when the value_set on the expression is higher than a threshold (50). ### Are these changes tested? I've tested locally that the reproducer goes back to pre change levels. ``` $ python read.py === PYARROW VERSION 20 === Retrieved 10,000,000 rows in 3.08 seconds. ``` I have added a test for large sets and validate the expression is not being modified. ### Are there any user-facing changes? No * GitHub Issue: apache#46777 Lead-authored-by: Raúl Cumplido <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Raúl Cumplido <[email protected]>
1 parent 1e01c52 commit 0b34e6b

File tree

2 files changed

+29
-0
lines changed

2 files changed

+29
-0
lines changed

cpp/src/arrow/compute/expression.cc

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1267,6 +1267,15 @@ struct Inequality {
12671267

12681268
auto options = checked_pointer_cast<SetLookupOptions>(is_in_call->options);
12691269

1270+
// The maximum number of values in the is_in expression set of values
1271+
// in order to use the simplification.
1272+
// If the set is large there are performance implications, see:
1273+
// https://github.com/apache/arrow/issues/46777
1274+
constexpr int16_t kIsInSimplificationMaxValueSet = 50;
1275+
if (options->value_set.length() > kIsInSimplificationMaxValueSet) {
1276+
return std::nullopt;
1277+
}
1278+
12701279
const auto& lhs = Comparison::StripOrderPreservingCasts(is_in_call->arguments[0]);
12711280
if (!lhs.field_ref()) return std::nullopt;
12721281
if (*lhs.field_ref() != guarantee.target) return std::nullopt;

cpp/src/arrow/compute/expression_test.cc

Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -80,6 +80,16 @@ Expression add(Expression l, Expression r) {
8080
return call("add", {std::move(l), std::move(r)});
8181
}
8282

83+
std::string make_range_json(int start, int end) {
84+
std::string result = "[";
85+
for (int i = start; i <= end; ++i) {
86+
if (i > start) result += ",";
87+
result += std::to_string(i);
88+
}
89+
result += "]";
90+
return result;
91+
}
92+
8393
const auto no_change = std::nullopt;
8494

8595
TEST(ExpressionUtils, Comparison) {
@@ -1679,6 +1689,16 @@ TEST(Expression, SimplifyIsIn) {
16791689
Simplify{is_in(field_ref("u32"), int64(), "[1,3,5,7,9]", null_matching)}
16801690
.WithGuarantee(greater(field_ref("u32"), literal(3)))
16811691
.Expect(is_in(field_ref("u32"), int64(), "[5,7,9]", null_matching));
1692+
1693+
Simplify{is_in(field_ref("u32"), int64(), make_range_json(1, 40), null_matching)}
1694+
.WithGuarantee(greater(field_ref("u32"), literal(10)))
1695+
.Expect(is_in(field_ref("u32"), int64(), make_range_json(11, 40), null_matching));
1696+
1697+
// For large ranges we don't do any simplification, see
1698+
// `kIsInSimplificationMaxValueSet` in expression.cc.
1699+
Simplify{is_in(field_ref("u32"), int64(), make_range_json(1, 100), null_matching)}
1700+
.WithGuarantee(greater(field_ref("u32"), literal(3)))
1701+
.ExpectUnchanged();
16821702
}
16831703

16841704
Simplify{

0 commit comments

Comments
 (0)