
Query Builder regex filter support #2385


Open — wants to merge 11 commits into master

Conversation


@phoebusm phoebusm commented Jun 4, 2025

Reference Issues/PRs

https://man312219.monday.com/boards/7852509418/pulses/9238018056

What does this implement or fix?

This adds regex support to QueryBuilder filtering. The new API regex_match only supports filtering str columns.
This PR includes the implementation, docstrings and corresponding tests.
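A quick illustration of the intended matching semantics, using Python's `re` as a stand-in for the new API (the values and pattern here are made up; regex_match keeps rows whose string value matches anywhere, like pandas `str.contains`, as the tests in this PR assume):

```python
import re

# Hypothetical column values for a string column "a".
values = ["alpha-1", "beta-2", "alpha-3", "gamma"]
pattern = re.compile(r"alpha-\d")

# search() matches anywhere in the string, i.e. str.contains-style semantics.
kept = [v for v in values if pattern.search(v)]
print(kept)  # ['alpha-1', 'alpha-3']
```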

Any other comments?

Valid asv run: https://github.com/man-group/ArcticDB/actions/runs/15472639473
The default asv CI triggered by this PR is expected to fail, since it runs the new asv test (which exercises the new API) against the existing master wheel.

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@phoebusm phoebusm added the minor Feature change, should increase minor version label Jun 5, 2025
@phoebusm phoebusm changed the title Query Builder regex support Query Builder regex filter support Jun 5, 2025
@phoebusm phoebusm marked this pull request as ready for review June 5, 2025 16:03
@@ -429,7 +439,7 @@ class QueryBuilder:

Supported filtering operations:

* isna, isnull, notna, and notnull - return all rows where a specified column is/is not NaN or None. isna is
* isna, isnull, notna, notnull and regex_match - return all rows where a specified column is/is not NaN or None. isna is
Collaborator:

This doesn't belong here


def regex_match(self, pattern: str):
    if isinstance(pattern, str):
        _RegexPattern(pattern)  # Validate the regex pattern
Collaborator:

This is a bit hacky, can we have a function that does this?

@@ -450,6 +460,8 @@ class QueryBuilder:

q.isin(1, 2, 3)

regex_match accepts string as pattern and can only filter string columns
Collaborator:

Could mention that it is similar to Pandas contains (which is an awful name)


q = QueryBuilder()
q = q[q["a"].regex_match(pattern_a) & q["c"].regex_match(pattern_c)]
expected = df[df.a.str.contains(pattern_a) & df.c.astype(str).str.contains(pattern_c)]
Collaborator:

How come the astype is needed here?

Collaborator (author):

Removed

assert lib.read(sym, query_builder=q2).data.empty


def test_filter_regex_match_empty_symbol(lmdb_version_store_v1, sym):
Collaborator:

I think this behaviour will change after the modify_schema branch is merged, can just remove this test

auto unique_values = unique_values_for_string_column(column);
remove_nones_and_nans(unique_values);

util::RegexPattern pattern{std::string(str)};
Collaborator:

This will mean the regex is re-compiled for every row-slice, it should only be compiled once
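The cost difference is easy to demonstrate with Python's `re` as a stand-in: compiling once up front and reusing the pattern gives identical matches without paying for a compilation per row-slice.

```python
import re

# Hypothetical row-slices of a string column.
row_slices = [["abc123", "def"], ["ghi456"], ["jkl"]]

# Per-slice recompilation (what the reviewed code effectively does):
slow = [[bool(re.compile(r"\d+").search(v)) for v in s] for s in row_slices]

# Compile once, reuse for every slice:
pat = re.compile(r"\d+")
fast = [[bool(pat.search(v)) for v in s] for s in row_slices]

assert slow == fast  # same results, one compilation instead of one per slice
```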

details::visit_type(val.type().data_type(), [&](auto val_tag) {
    using val_type_info = ScalarTypeInfo<decltype(val_tag)>;
    if constexpr(is_sequence_type(col_type_info::data_type) && is_sequence_type(val_type_info::data_type)) {
        std::string value_string = get_string_from_value_type(column_with_strings, val);
Collaborator:

This won't be correct for fixed-width string types. PCRE supports UTF-32:
https://www.pcre.org/original/doc/html/pcreunicode.html
so we can just use that natively

@@ -159,6 +159,8 @@ VariantData dispatch_binary(const VariantData& left, const VariantData& right, O
        return visit_binary_membership(left, right, IsInOperator{});
    case OperationType::ISNOTIN:
        return visit_binary_membership(left, right, IsNotInOperator{});
    case OperationType::REGEX_MATCH:
        return visit_regex_match_membership(left, right);
Collaborator:

We discussed implementing this as a binary comparator, and adding a Regex type to VariantData. This will also make it simple to only compile the regex once (or I guess twice, once for UTF-8 and once for UTF-32)

Collaborator:

I agree with this.

Just need to be careful if we're passing RegexPatterns around as they only hold a reference to the string and that's a recipe for holding a reference to a destructed object.

Collaborator (author):

Yes, I still remember the discussion. I agree with putting RegexPattern into VariantData to avoid multiple compilation. On putting it in binary_membership, though: I'm just thinking that making the if cases in binary_membership more complicated may not be the best idea, since the datatype checks for regex match are quite different from those for binary comparison.

Collaborator:

Can we add a test for finding members of a comma-separated list of strings? We have a customer use case for this. e.g.
df = pd.DataFrame({"col": ["this,is,a,comma,separated,list,of,strings", ...]})
and matching on both a single element being in the lists, and on multiple elements being present
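A hedged sketch of what such a test could assert, using Python's `re` (the helper and patterns are illustrative, not part of this PR; anchoring on commas or string boundaries avoids partial-token matches like "list" matching "lists"):

```python
import re

def contains_element(cell: str, element: str) -> bool:
    # Match the element only between commas or at the ends of the string.
    return re.search(rf"(^|,){re.escape(element)}(,|$)", cell) is not None

col = ["this,is,a,comma,separated,list,of,strings", "no,list,here"]

# Single element present in the list:
single = [contains_element(c, "list") for c in col]
# Multiple elements present simultaneously:
both = [contains_element(c, "list") and contains_element(c, "comma") for c in col]
print(single, both)  # [True, True] [True, False]
```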

Collaborator:

We should run all of these tests with both fixed-width and dynamic string columns once the implementation is fixed to handle UTF-32

def regex_match(self, pattern: str):
    if isinstance(pattern, str):
        _RegexPattern(pattern)  # Validate the regex pattern
    return self._apply(pattern, _OperationType.REGEX_MATCH)
Collaborator:

I think it would be good if self._apply(pattern, REGEX_MATCH) accepted the pattern only as a RegexPattern.

That way we construct the regex pattern only once, here in Python, and use it throughout.

Related to this comment from Alex

Collaborator (author):

Yes, definitely. I had missed the fact that the operation is performed per segment. Given that, putting the compiled regex pattern in the tree is the logical choice.

@@ -70,6 +70,15 @@ def peakmem_filtering_string_isin(self, num_rows):
        q = q[q["id1"].isin(string_set)]
        self.lib.read(f"{num_rows}_rows", columns=["v3"], query_builder=q)

    def time_filtering_string_regex_match(self, num_rows):
        # Selects about 1% of the rows
        k = min(3, num_rows // 1000)
Collaborator:

The choice of k is weird: 3 will always be less than num_rows // 1000 (the smallest num_rows is 1_000_000), so min(3, num_rows // 1000) is always 3. Maybe you meant min(3, int(math.log10(num_rows)) - 3) or something similar?

Also, the comment about selecting 1% of the rows is misleading: filtering 3-digit numbers selects 1% in the 1_000_000 case but 0.1% in the 10_000_000 case. (I think this would be fixed by the log10 change.)
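The arithmetic behind this observation, for the benchmark sizes mentioned:

```python
# For both benchmark sizes, num_rows // 1000 is at least 1000,
# so min(3, num_rows // 1000) can never pick anything but 3.
ks = {num_rows: min(3, num_rows // 1000) for num_rows in (1_000_000, 10_000_000)}
print(ks)  # {1000000: 3, 10000000: 3}
```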

@@ -159,6 +159,8 @@ VariantData dispatch_binary(const VariantData& left, const VariantData& right, O
return visit_binary_membership(left, right, IsInOperator{});
case OperationType::ISNOTIN:
return visit_binary_membership(left, right, IsNotInOperator{});
case OperationType::REGEX_MATCH:
return visit_regex_match_membership(left, right);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I aggree with this.

Just need to be careful if we're passing RegexPatterns around as they only hold a reference to the string and that's a recipe for holding a reference to a destructed object.

Labels: minor (Feature change, should increase minor version)
3 participants