-
Couldn't load subscription status.
- Fork 118
refactor: Consolidate regular scan and CDF scan field handling #1359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: Consolidate regular scan and CDF scan field handling #1359
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1359 +/- ##
=======================================
Coverage 84.89% 84.90%
=======================================
Files 113 114 +1
Lines 28966 28935 -31
Branches 28966 28935 -31
=======================================
- Hits 24592 24566 -26
Misses 3200 3200
+ Partials 1174 1169 -5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
6bb2d3f to
15e632a
Compare
9127339 to
e49f2f3
Compare
f1ea04a to
1908575
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mostly good, just some style stuff 👍
5d9d80f to
5d64781
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, just some small testing nits
| pub(crate) fn get_cdf_transform_expr( | ||
| scan_file: &CdfScanFile, | ||
| logical_schema: &SchemaRef, | ||
| transform_spec: &TransformSpec, | ||
| state_info: &StateInfo, | ||
| physical_schema: &StructType, | ||
| ) -> DeltaResult<ExpressionRef> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should probably avoid creating a new expression if it's going to end up being the identity expression anyway. Could you add an issue to make this return Option<ExpressionRef>?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then we also get to avoid creating the empty_spec needlessly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally looks great. i had one question
kernel/src/scan/field_classifiers.rs
Outdated
|
|
||
| /// Regular scan field classifier for standard Delta table scans. | ||
| /// Handles partition columns as metadata-derived fields. | ||
| pub(crate) struct DefaultTransformFieldClassifier; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be a ScanTransformFieldClassifier? It's used in scan, not be default right?
| for (index, logical_field) in logical_schema.fields().enumerate() { | ||
| if partition_columns.contains(logical_field.name()) { | ||
| if logical_field.is_metadata_column() { | ||
| let transform = classifier.classify_field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some fields that need a transform also require modifying read_fields as well. For example, if the user requested row ids we need to tell the parquet reader to read both a row-index column and the physical row-id column if it's present. I think we can do that by changing the classify_field call to either return something that includes other fields to read, or by passing in &mut read_fields but just want to make sure that makes sense to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense to me. Would you be able to incorporate that in your row ids PR?
0913eb5 to
1dbb566
Compare
1dbb566 to
3df2019
Compare
3df2019 to
095156b
Compare
…-io#1359) ## What changes are proposed in this pull request? Consolidates regular scan and Change Data Feed (CDF) scan field handling by introducing a TransformFieldClassifier trait pattern. Removed `all_fields` and `ColumnType`. ## How was this change tested? Existing unit tests
What changes are proposed in this pull request?
Consolidates regular scan and Change Data Feed (CDF) scan field handling by introducing a TransformFieldClassifier trait pattern.
Removed
all_fieldsandColumnType.How was this change tested?
Existing unit tests