Query Builder regex filter support #2385
base: master
Conversation
@@ -429,7 +439,7 @@ class QueryBuilder:

        Supported filtering operations:

-       * isna, isnull, notna, and notnull - return all rows where a specified column is/is not NaN or None. isna is
+       * isna, isnull, notna, notnull and regex_match - return all rows where a specified column is/is not NaN or None. isna is
This doesn't belong here
def regex_match(self, pattern: str):
    if isinstance(pattern, str):
        _RegexPattern(pattern)  # Validate the regex pattern
This is a bit hacky, can we have a function that does this?
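A minimal sketch of what such a helper might look like, assuming the existing _RegexPattern binding raises on an invalid pattern; the helper name is hypothetical and not part of this PR:

def _validate_regex_pattern(pattern) -> None:
    # Hypothetical helper: fail fast on a bad pattern instead of at read time.
    if not isinstance(pattern, str):
        raise TypeError(f"regex_match expects a str pattern, got {type(pattern)}")
    _RegexPattern(pattern)  # assumed to raise if the pattern is not a valid regex

regex_match would then call _validate_regex_pattern(pattern) instead of constructing _RegexPattern inline.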
@@ -450,6 +460,8 @@ class QueryBuilder:

        q.isin(1, 2, 3)

+       regex_match accepts string as pattern and can only filter string columns
Could mention that it is similar to Pandas' contains (which is an awful name)
q = QueryBuilder()
q = q[q["a"].regex_match(pattern_a) & q["c"].regex_match(pattern_c)]
expected = df[df.a.str.contains(pattern_a) & df.c.astype(str).str.contains(pattern_c)]
How come the astype is needed here?
Removed
    assert lib.read(sym, query_builder=q2).data.empty


def test_filter_regex_match_empty_symbol(lmdb_version_store_v1, sym):
I think this behaviour will change after the modify_schema branch is merged; can just remove this test.
auto unique_values = unique_values_for_string_column(column);
remove_nones_and_nans(unique_values);

util::RegexPattern pattern{std::string(str)};
This will mean the regex is re-compiled for every row-slice; it should only be compiled once.
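As an illustration only (the code under review is C++), a minimal Python analogy of the compile-once concern using the standard re module:

import re

row_slices = [["apple", "banana"], ["cherry", "apricot"]]  # stand-in for per-slice string data

# Compiling inside the loop parses the pattern again for every check.
slow = [[s for s in chunk if re.compile(r"^a").search(s)] for chunk in row_slices]

# Compiling once up front reuses the same compiled object for every slice.
pattern = re.compile(r"^a")
fast = [[s for s in chunk if pattern.search(s)] for chunk in row_slices]

assert slow == fast == [["apple"], ["apricot"]]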
details::visit_type(val.type().data_type(), [&](auto val_tag) {
    using val_type_info = ScalarTypeInfo<decltype(val_tag)>;
    if constexpr(is_sequence_type(col_type_info::data_type) && is_sequence_type(val_type_info::data_type)) {
        std::string value_string = get_string_from_value_type(column_with_strings, val);
This won't be correct for fixed-width string types. PCRE supports UTF-32 (https://www.pcre.org/original/doc/html/pcreunicode.html), so we can just use that natively.
@@ -159,6 +159,8 @@ VariantData dispatch_binary(const VariantData& left, const VariantData& right, O
            return visit_binary_membership(left, right, IsInOperator{});
        case OperationType::ISNOTIN:
            return visit_binary_membership(left, right, IsNotInOperator{});
+       case OperationType::REGEX_MATCH:
+           return visit_regex_match_membership(left, right);
We discussed implementing this as a binary comparator, and adding a Regex type to VariantData. This will also make it simple to only compile the regex once (or I guess twice, once for UTF-8 and once for UTF-32).
I agree with this. We just need to be careful if we're passing RegexPatterns around, as they only hold a reference to the string and that's a recipe for holding a reference to a destructed object.
Yes, I still remember the discussion. I agree with putting RegexPattern into VariantData to avoid multiple compilation. As for putting it in binary_membership, I was just thinking that making the if cases in binary_membership more complicated may not be the best idea; the datatype checks in regex match are quite different from binary comparison.
Can we add a test for finding members of a comma-separated list of strings? We have a customer use case for this, e.g. df = pd.DataFrame({"col": ["this,is,a,comma,separated,list,of,strings", ...]}), matching on both a single element being in the lists and on multiple elements being present.
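A hedged sketch of such a test; the word-boundary patterns and data are illustrative, and the fixture names are reused from this PR's other tests:

import pandas as pd
from arcticdb import QueryBuilder

def test_filter_regex_match_comma_separated_list(lmdb_version_store_v1, sym):
    lib = lmdb_version_store_v1
    df = pd.DataFrame({"col": ["this,is,a,comma,separated,list,of,strings", "another,list", "empty"]})
    lib.write(sym, df)

    # Single element present anywhere in the comma-separated list.
    q = QueryBuilder()
    q = q[q["col"].regex_match(r"(^|,)list(,|$)")]
    assert len(lib.read(sym, query_builder=q).data) == 2

    # Two elements that never occur in the same row: expect an empty result.
    q = QueryBuilder()
    q = q[q["col"].regex_match(r"(^|,)comma(,|$)") & q["col"].regex_match(r"(^|,)another(,|$)")]
    assert lib.read(sym, query_builder=q).data.empty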
We should run all of these tests with both fixed-width and dynamic string columns once the implementation is fixed to handle UTF-32
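One possible shape for that, assuming the V1 write API's dynamic_strings flag is what switches between fixed-width and dynamic string storage (the flag name here is an assumption):

import pandas as pd
import pytest
from arcticdb import QueryBuilder

@pytest.mark.parametrize("dynamic_strings", [True, False])
def test_filter_regex_match_string_layouts(lmdb_version_store_v1, sym, dynamic_strings):
    lib = lmdb_version_store_v1
    df = pd.DataFrame({"a": ["alpha", "beta", "gamma"]})
    lib.write(sym, df, dynamic_strings=dynamic_strings)  # assumed flag for string storage layout

    q = QueryBuilder()
    q = q[q["a"].regex_match("^a")]
    assert lib.read(sym, query_builder=q).data["a"].tolist() == ["alpha"]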
def regex_match(self, pattern: str):
    if isinstance(pattern, str):
        _RegexPattern(pattern)  # Validate the regex pattern
    return self._apply(pattern, _OperationType.REGEX_MATCH)
I think it would be good if self._apply(pattern, REGEX_MATCH) could take the pattern only as type RegexPattern. This way we will only construct the regex pattern once in Python here and use it throughout. Related to this comment from Alex.
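A rough sketch of that idea, reusing the names already in this diff (_RegexPattern, _apply, _OperationType); whether _apply can take a compiled pattern directly is an assumption:

def regex_match(self, pattern):
    # Compile/validate once in Python and pass the compiled object down.
    if isinstance(pattern, str):
        pattern = _RegexPattern(pattern)
    if not isinstance(pattern, _RegexPattern):
        raise TypeError("regex_match expects a str or _RegexPattern")
    return self._apply(pattern, _OperationType.REGEX_MATCH)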
Yes, definitely. I had missed the fact that the operation is performed per segment. As that is the case, putting the compiled regex pattern in the tree is a logical choice.
@@ -70,6 +70,15 @@ def peakmem_filtering_string_isin(self, num_rows):
        q = q[q["id1"].isin(string_set)]
        self.lib.read(f"{num_rows}_rows", columns=["v3"], query_builder=q)

+   def time_filtering_string_regex_match(self, num_rows):
+       # Selects about 1% of the rows
+       k = min(3, num_rows // 1000)
The choice of k is weird: 3 will always be < num_rows // 1000 (the smallest num_rows is 1_000_000). Maybe you meant min(3, int(math.log10(num_rows)) - 3) or something similar?
Also, the comment on 1% of rows is misleading. It will filter 3-digit numbers, which is 1% for the 1_000_000 case but 0.1% for the 10_000_000 case. (Will be fixed if we apply the log10 change, I think.)
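A quick check of the first point, using the benchmark sizes mentioned above:

# With the smallest benchmark size being 1_000_000, num_rows // 1000 is at least
# 1000, so min(3, num_rows // 1000) is always 3 and k never scales with num_rows.
for num_rows in (1_000_000, 10_000_000):
    assert min(3, num_rows // 1000) == 3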
Reference Issues/PRs
https://man312219.monday.com/boards/7852509418/pulses/9238018056
What does this implement or fix?
This adds regex support to query builder filtering. The new API regex_match only supports filtering string columns. This PR includes the implementation, docstrings and corresponding tests.
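A minimal usage sketch of the new API based on the snippets in this PR; the URI, library and column names are illustrative:

import pandas as pd
from arcticdb import Arctic, QueryBuilder

ac = Arctic("lmdb://./example_db")  # illustrative storage URI
lib = ac.get_library("demo", create_if_missing=True)
lib.write("sym", pd.DataFrame({"a": ["apple", "banana", "apricot"]}))

q = QueryBuilder()
q = q[q["a"].regex_match("^ap")]  # keep rows where column "a" matches the regex
print(lib.read("sym", query_builder=q).data)  # rows containing "apple" and "apricot"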
Any other comments?
Valid asv run: https://github.com/man-group/ArcticDB/actions/runs/15472639473
The default asv CI triggered by this PR is expected to fail, as it runs the new asv test (which uses the new API) against the existing master wheel.
Checklist
Checklist for code changes...