-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parquet: Make row-group filters cooperate to filter #10090
base: main
Are you sure you want to change the base?
Parquet: Make row-group filters cooperate to filter #10090
Conversation
@@ -290,7 +299,7 @@ private <T> boolean shouldRead( | |||
hashValue = bloom.hash(((Number) value).intValue()); | |||
return bloom.findHash(hashValue); | |||
default: | |||
return ROWS_MIGHT_MATCH; | |||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is more readable to keep the constants:
return true; | |
return ROWS_MIGHT_MATCH; |
A few more occurrences below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because the types of these two constants have been changed from boolean
to the generic class Expression
of FindsResidualVisitor
.
For readability, maybe we can add a comment after the boolean value ?
return true; /* rows might match */
@zhongyujiang I'm really sorry for the delayed review on my part. I think this is an important improvement! I will be taking a deeper look at the implementation this week. A while back, I did check this code out locally and run some tests on some samples to get confidence on correctness but of course we'll also want unit test coverage as much as reasonably possible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your patience @zhongyujiang . Had some comments, I think the main one is around test structure.
public Boolean not(Boolean result) { | ||
return !result; | ||
public Expression not(Expression result) { | ||
throw new UnsupportedOperationException("This path shouldn't be reached."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really following why this throws UnsupportedOperationException? Should it have already been rewritten by RewriteNot
or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you're right, it should already been rewritten.
It's just when I was refactoring this part, I found that the dict filter and metric filter handle not
inconsistently compared to the bloom filter. The bloom filter will directly throw an UnsupportedOperationException
because it cannot evaluate not
. In fact, the dict filter and metric filter also cannot handle not since they can only conclude ROWS_CANNOT_MATCH
or ROWS_MIGHT_MATCH
. So, I made their handling of not
consistent with the bloom filter. Anyway, this should not be reached.
Is this ok? I can revert this if not.
Expression expected = Binder.bind(SCHEMA.asStruct(), Expressions.or(bloom, dict), true); | ||
ParquetMetricsRowGroupFilter metricFilter = new ParquetMetricsRowGroupFilter(SCHEMA, expr); | ||
Expression metricResidual = metricFilter.residualFor(parquetSchema, rowGroupMetadata); | ||
assertThat(expected.isEquivalentTo(metricResidual)) | ||
.as("Expected residual: %s, actual residual: %s", expected, metricResidual); | ||
|
||
expected = Binder.bind(SCHEMA.asStruct(), bloom, true); | ||
ParquetDictionaryRowGroupFilter dictFilter = | ||
new ParquetDictionaryRowGroupFilter(SCHEMA, metricResidual); | ||
Expression dictResidual = | ||
dictFilter.residualFor( | ||
parquetSchema, rowGroupMetadata, reader.getDictionaryReader(rowGroupMetadata)); | ||
|
||
assertThat(expected.isEquivalentTo(dictResidual)) | ||
.as("Expected residual: %s, actual residual: %s", expected, dictResidual); | ||
|
||
expected = Expressions.alwaysFalse(); | ||
ParquetBloomRowGroupFilter bloomFilter = new ParquetBloomRowGroupFilter(SCHEMA, dictResidual); | ||
Expression bloomResidual = | ||
bloomFilter.residualFor( | ||
parquetSchema, rowGroupMetadata, reader.getBloomFilterDataReader(rowGroupMetadata)); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels a bit odd to test the combined filter residual logic in this test method in TestBloomRowGroupFilter
. See my comment above on a separate test class which encapsulates the logic, which should also make it easier for testing since we can then move this to a separate test class.
As far as testing goes, here are the cases I think:
1.) We know the existing tests should cover the cases where an individual filter is always true or always false.
2.) So following 1, then we'd want the following tests for the combined tests:
a.) Where the metrics filter has a residual that's always true/false
b.) Where the metrics filter has a residual that's not true/false and the dictionary filter has one that is true/false.
c.) Where the metrics filter has a residual that's not true/false, the dictionary filter does not have one that is true/false, and the bloom filter returns a residual that's not true/false.
d.) Same as c but the bloom filter does return true/false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @amogh-jahagirdar, thanks for the comments.
I'm sorry for the delay, comments have been addressed, please take a look when you have time.
47302b4
to
dd70a4c
Compare
dd70a4c
to
9b8cbe5
Compare
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions. |
Hey @amogh-jahagirdar @Fokko @nastra @danielcweeks, can you help review this when you have time? Thanks |
This PR refactors three Parquet row-group filters into a form that computes residual expressions, allowing it to return a residual expression for the given row-groups. The residual computed by the previous filter can be passed to the next filter, allowing the three Parquet row-group filters to work together. This improves the handling of some
OR
condition queries.For example: Let's assume we have a query
a = 'foo' OR b = 'bar'
, where column a is dictionary-encoded in a Parquet row-group, while column b is not entirely dictionary-encoded in all data pages but has a bloom filter. Therefore,a = 'foo'
can only be evaluated by the dictionary filter, andb = 'bar'
can only be evaluated by the bloom filter. In the current situation, even if both filters evaluate the expressions asROWS_CANNOT_MATCH
individually, because each filter can only evaluate one sub-expression, the final output would still beROWS_MIGHT_MATCH
(let's assume the metric filter evaluates both sub-expressions asROWS_MIGHT_MATCH
).After refactoring into the form of computing residuals, the dictionary filter will compute the residual for
a = 'foo' OR b = 'bar'
asb = 'bar'
. Then this residual expression will be passed to the bloom filter and evaluated asExpressions.alwaysFalse()
. As a result, the reading of this row-group can be skipped.This is a revive of #6893, and can close #10029.
cc @cccs-jc @rdblue @huaxingao @amogh-jahagirdar @RussellSpitzer Could you please review this? Thanks!