Data skipping correctly handles nested columns and column mapping #512
generally seems like a good direction. Left some specific comments
    let field = ApplyNameMapping.transform_struct_field(Cow::Borrowed(self));
    Ok(field.unwrap().into_owned())
    // NOTE: unwrap is safe because the transformer is incapable of returning None
This is true as far as I can tell, but it requires quite a lot of layers to never return `None`, since many of the called functions do something like `transform(..)?`. There are many places where a future change could invalidate this assumption.
Would it make sense to return a `Result` or `Option` here and propagate it with `?`
I don't love making infallible operations seem fallible... it's one more thing to test that can't actually be tested (because it never happens).
The default implementation of the transform never returns `None`, and never will (it's a no-op). So we only have to worry about the one method we actually provide here while implementing the trait. And it doesn't directly return `None`. At worst it uses `?` on a recursive operation (which never returns `None`).
I don't know a way to generically express that idea in the trait, unfortunately... ideas welcome.
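To make the argument above concrete, here is a minimal sketch of the pattern under discussion. All names in it (`Field`, `FieldTransform`, `Rename`, `rename_field`) are hypothetical stand-ins, not the kernel's actual types or trait. It assumes a trait whose default method is a no-op returning `Some`, and a single override that never returns `None`, so the one call site can unwrap.

```rust
use std::borrow::Cow;

// Hypothetical stand-in for a schema field; not the kernel's real type.
#[derive(Clone, Debug, PartialEq)]
struct Field {
    name: String,
}

// Hypothetical transform trait. Returning `None` would mean "drop this field".
trait FieldTransform {
    // The default implementation is a no-op and always returns `Some`.
    fn transform_field<'a>(&mut self, field: Cow<'a, Field>) -> Option<Cow<'a, Field>> {
        Some(field)
    }
}

// A transformer that only renames fields, so it never returns `None`.
struct Rename {
    prefix: String,
}

impl FieldTransform for Rename {
    fn transform_field<'a>(&mut self, field: Cow<'a, Field>) -> Option<Cow<'a, Field>> {
        let renamed = Field {
            name: format!("{}{}", self.prefix, field.name),
        };
        Some(Cow::Owned(renamed))
    }
}

// Infallible wrapper: `expect` records the invariant at the single call site
// instead of pushing an untestable error path onto every caller.
fn rename_field(field: &Field, prefix: &str) -> Field {
    let mut transform = Rename {
        prefix: prefix.to_string(),
    };
    transform
        .transform_field(Cow::Borrowed(field))
        .expect("Rename never drops a field")
        .into_owned()
}
```

The trade-off is the one raised in this thread: the type system does not enforce the invariant, so it holds only as long as every override keeps returning `Some`.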
Codecov Report
Attention: Patch coverage is
@@            Coverage Diff             @@
##             main     #512      +/-   ##
==========================================
+ Coverage   82.80%   83.21%   +0.41%
==========================================
  Files          74       74
  Lines       16536    16775     +239
  Branches    16536    16775     +239
==========================================
+ Hits        13692    13960     +268
+ Misses       2195     2166      -29
  Partials      649      649
    ));
    };
    let name = field.physical_name(global_state.column_mapping_mode)?;
    let name = field.physical_name();
@nicklan -- any idea why we didn't just apply column mapping to the partition columns along with all the others? Why wait until now?
kernel/src/scan/mod.rs
Outdated
    //
    // NOTE: It is possible the predicate resolves to FALSE even ignoring column references,
    // e.g. `col > 10 AND FALSE`. Such predicates can statically skip the whole query.
    fn build_physical_predicate(
Since this is a conversion, how about something like `logical_predicate_to_physical`?
Also this is more a curiosity question: When do you make a static method a member of the struct vs a function defined outside the struct?
IMO associated functions ("static methods") are useful as a namespace grouping technique for the benefit of downstream consumers, and occasionally because a trait wants it. Since this was a private function anyway, it didn't seem super useful to make it associated.
No strong reasons either way tho!
Re naming: It's not just a conversion because it takes multiple inputs and produces multiple outputs (physical expression and physical referenced schema).
No strong feelings either way tho (esp. for a private helper method with one call site)
Namespace specifically for consumers is a good way to put it. I used to think of it as just a logical grouping.
`build` makes sense then 👍
lgtm! Thanks!
looks great, no notes. really awesome how useful the transform frameworks have been!
    // clause has invalid column references. Data skipping is best-effort and the predicate
    // anyway needs to be evaluated against every row of data -- which is impossible if the
    // columns are missing/invalid. Just blow up instead of trying to handle it gracefully.
    return Err(Error::missing_column(format!(
One thing we'll want to watch out for in the future is a predicate like `where id = 2 and _change_type = "insert"`.
The `_change_type` part of the predicate can't be applied to the physical data, so it should be dropped.
I think the same applies to generated columns and partition columns?
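A rough sketch of what dropping the non-physical part could look like, assuming a simplified predicate shape. `Pred` and `retain_physical` are hypothetical and not part of the kernel API. It is only sound for conjuncts of an `AND`: removing a conjunct weakens the predicate, so data skipping keeps more files than strictly needed but never fewer. The same trick would not be safe under `OR` or `NOT`.

```rust
use std::collections::HashSet;

// Hypothetical, simplified predicate: leaves reference exactly one column.
#[derive(Clone, Debug, PartialEq)]
enum Pred {
    Leaf { column: String, text: String },
    And(Vec<Pred>),
}

// Keep only conjuncts whose column exists in the physical data.
// Returns `None` when nothing usable for data skipping is left.
fn retain_physical(pred: Pred, physical: &HashSet<&str>) -> Option<Pred> {
    match pred {
        Pred::Leaf { column, text } => {
            if physical.contains(column.as_str()) {
                Some(Pred::Leaf { column, text })
            } else {
                None
            }
        }
        Pred::And(children) => {
            let kept: Vec<Pred> = children
                .into_iter()
                .filter_map(|child| retain_physical(child, physical))
                .collect();
            if kept.is_empty() {
                None
            } else {
                Some(Pred::And(kept))
            }
        }
    }
}
```

With `id` as the only physical column, `id = 2 AND _change_type = "insert"` reduces to an `AND` containing just the `id = 2` leaf.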
We're checking against the logical schema, which includes partition columns and also (hopefully) includes the `_change_type` and other generated columns?
Meanwhile, we have disabled the optimization in the parquet skipping code, that would treat missing physical columns as all-NULL. If we ever tried to re-enable that optimization, we would need to be very sure we've classified the columns correctly first.
> we have disabled the optimization in the parquet skipping code, that would treat missing physical columns as all-NULL

Aha, I'd expected such an optimization to exist here in the physical predicate. Since we handle that elsewhere, we're good.

> we would need to be very sure we've classified the columns correctly first

If we had access to `Vec<ColumnType>`, we'd know how every column is classified.
Yes, and eventually we'll need that to support data skipping over partition columns, since we do have the metadata for those but it's in a different column and requires slightly different logic.
What changes are proposed in this pull request?
The existing implementation of data skipping has two flaws:
It turns out the two issues are intertwined, because both column mapping and nested column references need a schema traversal. So while we could solve them separately, it's actually easier to just do it all at once.
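As an illustration of why both problems reduce to one schema traversal, here is a minimal sketch that resolves a nested logical column path to its physical path. The `Field` struct and `physical_path` function are hypothetical simplifications, not the kernel's schema types. Each step looks up the next path segment among the current struct's children and records that field's physical name.

```rust
// Hypothetical, simplified schema field: logical name, physical name,
// and nested children (empty for leaf columns).
#[derive(Debug)]
struct Field {
    logical: String,
    physical: String,
    children: Vec<Field>,
}

// Walk the schema along a logical column path such as ["user", "id"] and
// return the matching physical path, or `None` if any segment is missing.
fn physical_path(schema: &[Field], logical_path: &[&str]) -> Option<Vec<String>> {
    let mut fields = schema;
    let mut out = Vec::with_capacity(logical_path.len());
    for name in logical_path {
        // Column mapping: translate this segment's logical name.
        let field = fields.iter().find(|f| f.logical == *name)?;
        out.push(field.physical.clone());
        // Nested reference: descend into the struct's children.
        fields = field.children.as_slice();
    }
    Some(out)
}
```

A missing segment yields `None`, which a caller could turn into a missing-column error rather than silently ignoring the reference.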
Also -- the data skipping predicate we pass around needs an associated "referenced" schema (in order to build a stats schema); if that schema is empty, it means the data skipping predicate is "static" and should be evaluated once to decide whether to even initiate a log scan. That adds some complexity to the log replay path. But it also allows a predicate like the following to be treated as static, in spite of appearing to reference table columns:
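As a rough illustration of the static case, the sketch below uses the `col > 10 AND FALSE` predicate from the code comment in `kernel/src/scan/mod.rs`, not the PR's own example. `Pred` and `eval_static` are hypothetical, and this three-valued evaluation is a stand-in technique; the PR itself detects static predicates through an empty referenced schema.

```rust
// Hypothetical, simplified predicate type.
#[derive(Debug)]
enum Pred {
    Literal(bool),
    // A comparison against a table column; its value is unknown without data.
    ColumnCmp(String),
    And(Box<Pred>, Box<Pred>),
    Or(Box<Pred>, Box<Pred>),
}

// Evaluate without reading any data. `Some(b)` means the result is `b` no
// matter what the columns contain; `None` means it depends on the data.
fn eval_static(pred: &Pred) -> Option<bool> {
    match pred {
        Pred::Literal(b) => Some(*b),
        Pred::ColumnCmp(_) => None,
        Pred::And(left, right) => match (eval_static(left), eval_static(right)) {
            // FALSE dominates AND, even when the other side is unknown.
            (Some(false), _) | (_, Some(false)) => Some(false),
            (Some(true), Some(true)) => Some(true),
            _ => None,
        },
        Pred::Or(left, right) => match (eval_static(left), eval_static(right)) {
            // TRUE dominates OR, even when the other side is unknown.
            (Some(true), _) | (_, Some(true)) => Some(true),
            (Some(false), Some(false)) => Some(false),
            _ => None,
        },
    }
}
```

Here `col > 10 AND FALSE` evaluates to `Some(false)`, so a caller could skip the log scan entirely, while `col > 10 OR FALSE` stays `None` and still needs per-file stats.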
This PR affects the following public APIs
- `scan::Scan::predicate` renamed as `physical_predicate` to eliminate ambiguity
- `scan::log_replay::scan_action_iter` now takes fewer (and different) params.
How was this change tested?
Existing unit tests, plus new unit tests that verify the new behavior.