-
Couldn't load subscription status.
- Fork 118
feat!: Change input to write_json_file to be FilteredEngineData #1312
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat!: Change input to write_json_file to be FilteredEngineData #1312
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1312 +/- ##
==========================================
+ Coverage 84.76% 84.79% +0.03%
==========================================
Files 113 113
Lines 28421 28613 +192
Branches 28421 28613 +192
==========================================
+ Hits 24091 24263 +172
- Misses 3194 3196 +2
- Partials 1136 1154 +18 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally looks good, but one biggish question
kernel/src/engine_data.rs
Outdated
| let len = data.len(); | ||
| Self { | ||
| data, | ||
| selection_vector: vec![true; len], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hrmm, so we've generally had the convention that if your selection vector is shorter than the number of rows, all remaining rows are "selected". This is how things come out of the dv roaring treemap. See docs here for example.
BUT, I see we don't specify that in FilteredEngineData at all, so we really should decide on this semantic, and the document it.
All of which is to say, I think we can just do an empty vec here, which will be way more efficient, but does require slightly more clever code in the processing steps (i.e. don't filter if it's empty, extend if it isn't)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like this convention might be error prone. I guess this is part of the longer term design but it would be nice to abstract the actual selection vector into a trait (maybe backed by roaring) so that engines don't have to think about this case. That is likely part of a broader design on the relationship between EngineData and FilteredEngineData for now I can update to follow this convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, once we have FilteredEngineData more pervasively, we can consider not using Vec<bool> but something more abstracted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the contract to be similar. Also handled edge case in the opposite direction.
960be84 to
0cc2f40
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! couple questions to consider first.
also discussed some offline.
- this is in the right direction of ultimately making all
EngineData'filtered' - need to ensure arrow (and everyone else) can cope with this easily (given NULL != non-selected) - can we make a follow-up to invest more in migrating other APIs to filtered engine data (or rather just making all of EngineData 'filtered')? (if we push this into expression evaluator we can get start to respect removals during evaluation right?)
| } | ||
|
|
||
| impl FilteredEngineData { | ||
| pub fn try_new(data: Box<dyn EngineData>, selection_vector: Vec<bool>) -> DeltaResult<Self> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we consider making fields private to force people through the validation? we could do an into_inner or just getters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
kernel/src/engine/arrow_utils.rs
Outdated
| // Honor the new contract: if selection vector is shorter than the number of rows, | ||
| // then all rows not covered by the selection vector are assumed to be selected | ||
| let num_rows = batch.num_rows(); | ||
| let mut selection_vector = filtered_data.selection_vector.clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we do better than a full clone and then mutable selection vector here? (for example this isn't needed if SV is empty below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, after refactoring to accessors this is not a reference or consumption
kernel/src/engine_data.rs
Outdated
| impl HasSelectionVector for FilteredEngineData { | ||
| /// Returns true if any row in the selection vector is marked as selected | ||
| fn has_selected_rows(&self) -> bool { | ||
| // Per contract if selection is not as long as then at least one row is selected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Per contract if selection is not as long as then at least one row is selected. | |
| // Per contract if selection is not as long as data then at least one row is selected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
kernel/src/checkpoint/tests.rs
Outdated
| let batch = data_iter.next().unwrap()?; | ||
| assert_eq!(batch.selection_vector, [true]); | ||
| // According to the new contract, with_all_rows_selected creates an empty selection vector | ||
| assert_eq!(batch.selection_vector, vec![] as Vec<bool>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need cast?
kernel/src/engine/arrow_utils.rs
Outdated
| } else if selection_vector.len() > num_rows { | ||
| // Take a sublist of the selection vector equal to the data size | ||
| selection_vector = selection_vector[..num_rows].to_vec(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it's an error below, let's treat it as an error here.
| } else if selection_vector.len() > num_rows { | |
| // Take a sublist of the selection vector equal to the data size | |
| selection_vector = selection_vector[..num_rows].to_vec(); | |
| } else if selection_vector.len() > num_rows { | |
| return Err(Error::InvalidSelectionVector("Selection vectors must have fewer or equal rows to data in FilteredEngineData. Data had {num_rows} rows, while selection vector had {selection_vector.len()} rows.")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just want that one extra error case :)
f280f17 to
56e20f5
Compare
|
@emkornfield just needs CI to pass (Example: |
Thanks will look into this, cleaning up a few more comments. Also some tests failing mysteriously for me, I'll see if this replicates on CI. |
🥞 Stacked PR
Use this link to review incremental changes.
What changes are proposed in this pull request?
Change the json handler to use
FilteredEngineDatainstead ofEngineDataThis PR affects the following public APIs
Engine json handler
How was this change tested?
Existing tests pass, plus additional tests for verifyng new utility method and that default engine correctly only writes missing values.
BREAKING CHANGE: write_json_files takes FilteredEngineData instead of EngineData to allow for more complicated transaction types (e.g. remove files will use FilteredEngineData)