
Query Builder regex filter support #2385


Open — wants to merge 11 commits into master

Conversation


@phoebusm phoebusm commented Jun 4, 2025

Reference Issues/PRs

https://man312219.monday.com/boards/7852509418/pulses/9238018056

What does this implement or fix?

This adds regex support to QueryBuilder filtering. The new API regex_match only supports filtering str columns.
This PR includes the implementation, docstrings and corresponding tests.
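A quick illustration of the intended matching semantics, using Python's `re` as a stand-in for the new API (the values and pattern here are made up; regex_match keeps rows whose string value matches anywhere, like pandas `str.contains`, as the tests in this PR assume):

```python
import re

# Hypothetical column values for a string column "a".
values = ["alpha-1", "beta-2", "alpha-3", "gamma"]
pattern = re.compile(r"alpha-\d")

# search() matches anywhere in the string, i.e. str.contains-style semantics.
kept = [v for v in values if pattern.search(v)]
print(kept)  # ['alpha-1', 'alpha-3']
```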

Any other comments?

Valid asv run: https://github.com/man-group/ArcticDB/actions/runs/15472639473
The default asv CI triggered by this PR is expected to fail, since it runs the new asv test (which exercises the new API) against the existing master wheel.

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@phoebusm phoebusm added the minor Feature change, should increase minor version label Jun 5, 2025
@phoebusm phoebusm changed the title Query Builder regex support Query Builder regex filter support Jun 5, 2025
@phoebusm phoebusm marked this pull request as ready for review June 5, 2025 16:03
@@ -429,7 +439,7 @@ class QueryBuilder:

Supported filtering operations:

* isna, isnull, notna, and notnull - return all rows where a specified column is/is not NaN or None. isna is
* isna, isnull, notna, notnull and regex_match - return all rows where a specified column is/is not NaN or None. isna is
Collaborator:

This doesn't belong here


def regex_match(self, pattern: str):
    if isinstance(pattern, str):
        _RegexPattern(pattern)  # Validate the regex pattern
Collaborator:

This is a bit hacky, can we have a function that does this?

@@ -450,6 +460,8 @@ class QueryBuilder:

q.isin(1, 2, 3)

regex_match accepts string as pattern and can only filter string columns
Collaborator:

Could mention that it is similar to Pandas contains (which is an awful name)


q = QueryBuilder()
q = q[q["a"].regex_match(pattern_a) & q["c"].regex_match(pattern_c)]
expected = df[df.a.str.contains(pattern_a) & df.c.astype(str).str.contains(pattern_c)]
Collaborator:

How come the astype is needed here?

Collaborator (author):

Removed

assert lib.read(sym, query_builder=q2).data.empty


def test_filter_regex_match_empty_symbol(lmdb_version_store_v1, sym):
Collaborator:

I think this behaviour will change after the modify_schema branch is merged, can just remove this test

auto unique_values = unique_values_for_string_column(column);
remove_nones_and_nans(unique_values);

util::RegexPattern pattern{std::string(str)};
Collaborator:

This will mean the regex is re-compiled for every row-slice, it should only be compiled once
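The cost difference is easy to demonstrate with Python's `re` as a stand-in: compiling once up front and reusing the pattern gives identical matches without paying for a compilation per row-slice.

```python
import re

# Hypothetical row-slices of a string column.
row_slices = [["abc123", "def"], ["ghi456"], ["jkl"]]

# Per-slice recompilation (what the reviewed code effectively does):
slow = [[bool(re.compile(r"\d+").search(v)) for v in s] for s in row_slices]

# Compile once, reuse for every slice:
pat = re.compile(r"\d+")
fast = [[bool(pat.search(v)) for v in s] for s in row_slices]

assert slow == fast  # same results, one compilation instead of one per slice
```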

details::visit_type(val.type().data_type(), [&](auto val_tag) {
    using val_type_info = ScalarTypeInfo<decltype(val_tag)>;
    if constexpr(is_sequence_type(col_type_info::data_type) && is_sequence_type(val_type_info::data_type)) {
        std::string value_string = get_string_from_value_type(column_with_strings, val);
Collaborator:

This won't be correct for fixed-width string types. PCRE supports UTF-32:
https://www.pcre.org/original/doc/html/pcreunicode.html
so we can just use that natively

@@ -159,6 +159,8 @@ VariantData dispatch_binary(const VariantData& left, const VariantData& right, O
        return visit_binary_membership(left, right, IsInOperator{});
    case OperationType::ISNOTIN:
        return visit_binary_membership(left, right, IsNotInOperator{});
    case OperationType::REGEX_MATCH:
        return visit_regex_match_membership(left, right);
Collaborator:

We discussed implementing this as a binary comparator, and adding a Regex type to VariantData. This will also make it simple to only compile the regex once (or I guess twice, once for UTF-8 and once for UTF-32)

Collaborator:

I agree with this.

Just need to be careful if we're passing RegexPatterns around as they only hold a reference to the string and that's a recipe for holding a reference to a destructed object.

Collaborator (author):

Yes, I still remember the discussion. I agree with putting RegexPattern into VariantData to avoid multiple compilation. On putting it in binary_membership, though: I'm just thinking that making the if cases in binary_membership more complicated may not be the best idea, since the datatype checks for regex match are quite different from those for binary comparison.

Collaborator:

Can we add a test for finding members of a comma-separated list of strings? We have a customer use case for this. e.g.
df = pd.DataFrame({"col": ["this,is,a,comma,separated,list,of,strings", ...]})
and matching on both a single element being in the lists, and on multiple elements being present
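A hedged sketch of what such a test could assert, using Python's `re` (the helper and patterns are illustrative, not part of this PR; anchoring on commas or string boundaries avoids partial-token matches like "list" matching "lists"):

```python
import re

def contains_element(cell: str, element: str) -> bool:
    # Match the element only between commas or at the ends of the string.
    return re.search(rf"(^|,){re.escape(element)}(,|$)", cell) is not None

col = ["this,is,a,comma,separated,list,of,strings", "no,list,here"]

# Single element present in the list:
single = [contains_element(c, "list") for c in col]
# Multiple elements present simultaneously:
both = [contains_element(c, "list") and contains_element(c, "comma") for c in col]
print(single, both)  # [True, True] [True, False]
```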

Collaborator:

We should run all of these tests with both fixed-width and dynamic string columns once the implementation is fixed to handle UTF-32

def regex_match(self, pattern: str):
    if isinstance(pattern, str):
        _RegexPattern(pattern)  # Validate the regex pattern
    return self._apply(pattern, _OperationType.REGEX_MATCH)
Collaborator:

I think it would be good if self._apply(pattern, REGEX_MATCH) accepted the pattern only as a RegexPattern.

That way we construct the regex pattern only once, here in Python, and use it throughout.

Related to this comment from Alex

Collaborator (author):

Yes, definitely. I had missed the fact that the operation is performed per segment. Given that, putting the compiled regex pattern in the tree is the logical choice.

@@ -70,6 +70,15 @@ def peakmem_filtering_string_isin(self, num_rows):
        q = q[q["id1"].isin(string_set)]
        self.lib.read(f"{num_rows}_rows", columns=["v3"], query_builder=q)

    def time_filtering_string_regex_match(self, num_rows):
        # Selects about 1% of the rows
        k = min(3, num_rows // 1000)
Collaborator:

The choice of k is weird: 3 will always be less than num_rows // 1000 (the smallest num_rows is 1_000_000), so min(3, num_rows // 1000) is always 3. Maybe you meant min(3, int(math.log10(num_rows)) - 3) or something similar?

Also, the comment about selecting 1% of the rows is misleading: filtering 3-digit numbers selects 1% in the 1_000_000 case but 0.1% in the 10_000_000 case. (I think this would be fixed by the log10 change.)
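The arithmetic behind this observation, for the benchmark sizes mentioned:

```python
# For both benchmark sizes, num_rows // 1000 is at least 1000,
# so min(3, num_rows // 1000) can never pick anything but 3.
ks = {num_rows: min(3, num_rows // 1000) for num_rows in (1_000_000, 10_000_000)}
print(ks)  # {1000000: 3, 10000000: 3}
```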

@@ -159,6 +159,8 @@ VariantData dispatch_binary(const VariantData& left, const VariantData& right, O
return visit_binary_membership(left, right, IsInOperator{});
case OperationType::ISNOTIN:
return visit_binary_membership(left, right, IsNotInOperator{});
case OperationType::REGEX_MATCH:
return visit_regex_match_membership(left, right);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I aggree with this.

Just need to be careful if we're passing RegexPatterns around as they only hold a reference to the string and that's a recipe for holding a reference to a destructed object.

Labels: minor (Feature change, should increase minor version)
3 participants