@emkornfield
Collaborator

@emkornfield emkornfield commented Sep 17, 2025

🥞 Stacked PR

Use this link to review incremental changes.


What changes are proposed in this pull request?

Change the json handler to use FilteredEngineData instead of EngineData

This PR affects the following public APIs

Engine json handler

How was this change tested?

Existing tests pass, plus additional tests verifying the new utility method and that the default engine correctly only writes missing values.

BREAKING CHANGE: write_json_files takes FilteredEngineData instead of EngineData to allow for more complicated transaction types (e.g. remove files will use FilteredEngineData)

@codecov

codecov bot commented Sep 17, 2025

Codecov Report

❌ Patch coverage is 89.76378% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.79%. Comparing base (fc5ccb0) to head (3551d49).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| kernel/src/engine/arrow_utils.rs | 64.81% | 1 Missing and 18 partials ⚠️ |
| ffi/src/scan.rs | 0.00% | 1 Missing ⚠️ |
| kernel/src/action_reconciliation/log_replay.rs | 92.30% | 0 Missing and 1 partial ⚠️ |
| kernel/src/checkpoint/mod.rs | 66.66% | 0 Missing and 1 partial ⚠️ |
| kernel/src/checkpoint/tests.rs | 90.00% | 0 Missing and 1 partial ⚠️ |
| kernel/src/engine_data.rs | 99.24% | 1 Missing ⚠️ |
| kernel/src/scan/mod.rs | 93.33% | 0 Missing and 1 partial ⚠️ |
| kernel/src/scan/state.rs | 50.00% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1312      +/-   ##
==========================================
+ Coverage   84.76%   84.79%   +0.03%     
==========================================
  Files         113      113              
  Lines       28421    28613     +192     
  Branches    28421    28613     +192     
==========================================
+ Hits        24091    24263     +172     
- Misses       3194     3196       +2     
- Partials     1136     1154      +18     

@emkornfield emkornfield changed the title feat: Change input to write_json_file to be FilteredEngineData feat!: Change input to write_json_file to be FilteredEngineData Sep 17, 2025
@emkornfield emkornfield requested a review from nicklan September 18, 2025 00:51
Collaborator

@nicklan nicklan left a comment

generally looks good, but one biggish question

let len = data.len();
Self {
    data,
    selection_vector: vec![true; len],
Collaborator

hrmm, so we've generally had the convention that if your selection vector is shorter than the number of rows, all remaining rows are "selected". This is how things come out of the dv roaring treemap. See docs here for example.

BUT, I see we don't specify that in FilteredEngineData at all, so we really should decide on this semantic, and then document it.

All of which is to say, I think we can just do an empty vec here, which will be way more efficient, but does require slightly more clever code in the processing steps (i.e. don't filter if it's empty, extend if it isn't)
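The convention described in this comment can be sketched roughly as follows. This is a minimal illustration on plain slices, not the kernel's code: `is_row_selected` and `filter_rows` are hypothetical helper names, and rows are generic values rather than `EngineData`.

```rust
/// Returns whether `row` is selected under the convention discussed above:
/// rows beyond the end of the selection vector count as selected.
/// (Hypothetical helper, not part of the kernel API.)
fn is_row_selected(selection_vector: &[bool], row: usize) -> bool {
    selection_vector.get(row).copied().unwrap_or(true)
}

/// Keeps only the selected rows. An empty selection vector short-circuits
/// to "keep everything", so no mask has to be built or consulted.
fn filter_rows<T: Clone>(rows: &[T], selection_vector: &[bool]) -> Vec<T> {
    if selection_vector.is_empty() {
        return rows.to_vec();
    }
    rows.iter()
        .enumerate()
        .filter(|(i, _)| is_row_selected(selection_vector, *i))
        .map(|(_, row)| row.clone())
        .collect()
}
```

With this reading, an empty vector and a vector of all `true` values select the same rows, which is why the empty form can replace `vec![true; len]`.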

Collaborator Author

This convention feels like it might be error prone. I guess this is part of the longer-term design, but it would be nice to abstract the actual selection vector into a trait (maybe backed by roaring) so that engines don't have to think about this case. That is likely part of a broader design on the relationship between EngineData and FilteredEngineData; for now I can update to follow this convention.

Collaborator

Yeah, once we have FilteredEngineData more pervasively, we can consider not using Vec<bool> but something more abstracted
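One possible shape for the abstraction floated in this thread is sketched below. Everything here is an assumption for illustration: the `RowSelection` trait, the `DeletedRows` type, and `count_selected` do not exist in the kernel, and a `BTreeSet` stands in for a roaring bitmap so the example has no external dependencies.

```rust
use std::collections::BTreeSet;

/// Hypothetical abstraction over "which rows are selected", so callers
/// never need to reason about the length of a boolean mask.
trait RowSelection {
    fn is_selected(&self, row: usize) -> bool;
}

/// Boolean mask following the short-vector convention:
/// rows past the end of the mask are treated as selected.
impl RowSelection for Vec<bool> {
    fn is_selected(&self, row: usize) -> bool {
        self.get(row).copied().unwrap_or(true)
    }
}

/// Stand-in for a bitmap of deleted row indexes (a roaring bitmap in a
/// real engine; a BTreeSet here to keep the sketch dependency-free).
struct DeletedRows(BTreeSet<usize>);

impl RowSelection for DeletedRows {
    fn is_selected(&self, row: usize) -> bool {
        !self.0.contains(&row)
    }
}

/// Counts selected rows without knowing how the selection is stored.
fn count_selected(selection: &dyn RowSelection, num_rows: usize) -> usize {
    (0..num_rows).filter(|&row| selection.is_selected(row)).count()
}
```

The point of the sketch is only that the "shorter than the data" rule lives in one `impl` instead of in every consumer.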

Collaborator Author

Updated the contract to be similar. Also handled edge case in the opposite direction.

@emkornfield emkornfield force-pushed the stack/change_to_filtered_data branch from 960be84 to 0cc2f40 Compare September 22, 2025 22:44
@emkornfield emkornfield requested a review from nicklan September 22, 2025 22:44
Member

@zachschuermann zachschuermann left a comment

LGTM! couple questions to consider first.

also discussed some offline.

  1. this is in the right direction of ultimately making all EngineData 'filtered' - need to ensure arrow (and everyone else) can cope with this easily (given NULL != non-selected)
  2. can we make a follow-up to invest more in migrating other APIs to filtered engine data (or rather just making all of EngineData 'filtered')? (if we push this into the expression evaluator we can start to respect removals during evaluation, right?)

}

impl FilteredEngineData {
    pub fn try_new(data: Box<dyn EngineData>, selection_vector: Vec<bool>) -> DeltaResult<Self> {
Member

should we consider making fields private to force people through the validation? we could do an into_inner or just getters?

Collaborator Author

done.
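The encapsulation suggested in this thread might look roughly like the sketch below. It is a simplified stand-in, not the merged code: `FilteredRows` holds a `Vec<String>` instead of `Box<dyn EngineData>`, the error type is a plain `String` rather than the kernel's error type, and the accessor names (`data`, `selection_vector`, `into_parts`) are assumptions.

```rust
mod filtered {
    /// Simplified stand-in for `FilteredEngineData`. Fields are private to
    /// this module, so outside code must construct values via `try_new`.
    #[derive(Debug)]
    pub struct FilteredRows {
        data: Vec<String>,
        selection_vector: Vec<bool>,
    }

    impl FilteredRows {
        /// Rejects selection vectors that are longer than the data.
        /// Shorter (including empty) vectors are allowed by the contract.
        pub fn try_new(data: Vec<String>, selection_vector: Vec<bool>) -> Result<Self, String> {
            if selection_vector.len() > data.len() {
                return Err(format!(
                    "selection vector has {} entries but data has only {} rows",
                    selection_vector.len(),
                    data.len()
                ));
            }
            Ok(Self { data, selection_vector })
        }

        /// Read-only view of the rows.
        pub fn data(&self) -> &[String] {
            &self.data
        }

        /// Read-only view of the (possibly shorter) selection vector.
        pub fn selection_vector(&self) -> &[bool] {
            &self.selection_vector
        }

        /// Consumes the value and hands both parts back to the caller.
        pub fn into_parts(self) -> (Vec<String>, Vec<bool>) {
            (self.data, self.selection_vector)
        }
    }
}
```

Because the fields cannot be set from outside the module, every instance a consumer sees has already passed the length check.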

// Honor the new contract: if selection vector is shorter than the number of rows,
// then all rows not covered by the selection vector are assumed to be selected
let num_rows = batch.num_rows();
let mut selection_vector = filtered_data.selection_vector.clone();
Member

can we do better than a full clone and then mutable selection vector here? (for example this isn't needed if SV is empty below)

Collaborator Author

Yes, after refactoring to accessors this is now a reference or consumption
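As an illustration of avoiding the unconditional clone, one option is to borrow the selection vector when it already covers every row and allocate only when it has to be extended. This is a sketch of the idea rather than what the PR does; `full_length_mask` is a hypothetical name, and it assumes over-long vectors were rejected earlier.

```rust
use std::borrow::Cow;

/// Returns a mask with exactly `num_rows` entries.
/// Assumes the caller has already rejected selection vectors longer than
/// `num_rows`; this function does not validate that case.
fn full_length_mask(selection_vector: &[bool], num_rows: usize) -> Cow<'_, [bool]> {
    if selection_vector.len() == num_rows {
        // Already full length: borrow, no allocation.
        Cow::Borrowed(selection_vector)
    } else {
        // Shorter than the data: pad with `true` per the contract.
        let mut mask = selection_vector.to_vec();
        mask.resize(num_rows, true);
        Cow::Owned(mask)
    }
}
```

A caller that checks for an empty selection vector first can skip filtering entirely and never reach the allocating branch for that case.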

impl HasSelectionVector for FilteredEngineData {
    /// Returns true if any row in the selection vector is marked as selected
    fn has_selected_rows(&self) -> bool {
        // Per contract if selection is not as long as then at least one row is selected.
Member

Suggested change
- // Per contract if selection is not as long as then at least one row is selected.
+ // Per contract if selection is not as long as data then at least one row is selected.

Collaborator Author

done.
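The reasoning in that comment can be spelled out in a small sketch. It is a free function over a slice and a row count, which is an assumption made to keep it self-contained; the trait method shown in the diff takes only `&self`.

```rust
/// Returns true if at least one row is selected.
/// `num_rows` is the number of rows in the data the mask applies to.
fn has_selected_rows(selection_vector: &[bool], num_rows: usize) -> bool {
    // Per contract, rows not covered by the selection vector are selected,
    // so a mask shorter than the data implies at least one selected row.
    if selection_vector.len() < num_rows {
        return true;
    }
    // Otherwise every row is covered and we need an explicit `true`.
    selection_vector.contains(&true)
}
```

Note that an empty mask over zero rows reports no selected rows, since there is nothing left uncovered.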

let batch = data_iter.next().unwrap()?;
assert_eq!(batch.selection_vector, [true]);
// According to the new contract, with_all_rows_selected creates an empty selection vector
assert_eq!(batch.selection_vector, vec![] as Vec<bool>);
Member

do we need cast?
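For reference, two ways to write the same check without an `as` cast are sketched below. `all_rows_selected_mask` is a hypothetical stand-in for the selection vector produced by `with_all_rows_selected`; whether the cast in the diff is strictly required depends on type inference in the surrounding crate, which this sketch does not settle.

```rust
/// Hypothetical stand-in: an empty mask meaning "all rows selected".
fn all_rows_selected_mask() -> Vec<bool> {
    Vec::new()
}

fn check_empty_mask() {
    let selection_vector = all_rows_selected_mask();
    // A turbofish names the element type directly instead of casting.
    assert_eq!(selection_vector, Vec::<bool>::new());
    // Or avoid comparing against a literal at all.
    assert!(selection_vector.is_empty());
}
```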

Comment on lines 1085 to 1087
} else if selection_vector.len() > num_rows {
// Take a sublist of the selection vector equal to the data size
selection_vector = selection_vector[..num_rows].to_vec();
Collaborator

@OussamaSaoudi OussamaSaoudi Sep 26, 2025

Since it's an error below, let's treat it as an error here.

Suggested change
- } else if selection_vector.len() > num_rows {
-     // Take a sublist of the selection vector equal to the data size
-     selection_vector = selection_vector[..num_rows].to_vec();
+ } else if selection_vector.len() > num_rows {
+     return Err(Error::InvalidSelectionVector(format!(
+         "Selection vectors must have fewer or equal rows to data in FilteredEngineData. Data had {num_rows} rows, while selection vector had {} rows.",
+         selection_vector.len()
+     )));

Collaborator

@OussamaSaoudi OussamaSaoudi left a comment

just want that one extra error case :)

@emkornfield emkornfield force-pushed the stack/change_to_filtered_data branch from f280f17 to 56e20f5 Compare September 29, 2025 21:10
@github-actions github-actions bot added the breaking-change Change that require a major version bump label Sep 29, 2025
@OussamaSaoudi
Collaborator

@emkornfield just needs CI to pass (Example: cargo clippy --benches --tests --all-features -- -D warnings) and we're good to go :)

@emkornfield
Collaborator Author

> @emkornfield just needs CI to pass (Example: cargo clippy --benches --tests --all-features -- -D warnings) and we're good to go :)

Thanks will look into this, cleaning up a few more comments. Also some tests failing mysteriously for me, I'll see if this replicates on CI.

@OussamaSaoudi OussamaSaoudi merged commit a3429b7 into delta-io:main Sep 29, 2025
21 checks passed