Skip to content

Conversation

@scovich
Copy link
Collaborator

@scovich scovich commented Nov 19, 2024

What changes are proposed in this pull request?

The existing implementation of data skipping has two flaws:

  1. Only top-level columns are considered for data skipping (due to how the stats schema is derived)
  2. Column mapping is completely ignored -- the data skipping expression with its logical column references is evaluated against physical data. At best this means data skipping is ineffective (because the column doesn't exist in the parquet). At worst, we would attempt to skip over the wrong column (e.g. if a table was upgraded to use column mapping because a column was dropped and re-added, we'd wrongly work with the old/dropped column because its logical and physical names were the same)

It turns out the two issues are intertwined, because both column mapping and nested column references need a schema traversal. So while we could solve them separately, it's actually easier to just do it all at once.

Also -- the data skipping predicate we pass around needs an associated "referenced" schema (in order to build a stats schema); if that schema is empty, it means the data skipping predicate is "static" and should be evaluated once to decide whether to even initiate a log scan. That adds some complexity to the log replay path. But it also allows a predicate like the following to be treated as static, in spite of appearing to reference table columns:

WHERE my_col < 10 AND FALSE

This PR affects the following public APIs

scan::Scan::predicate renamed as physical_predicate to eliminate ambiguity

scan::log_replay::scan_action_iter now takes fewer (and different) params.

How was this change tested?

Existing unit tests, plus new unit tests that verify the new behavior.

@github-actions github-actions bot added the breaking-change Change that require a major version bump label Nov 19, 2024
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally seems like a good direction. Left some specific comments


let field = ApplyNameMapping.transform_struct_field(Cow::Borrowed(self));
Ok(field.unwrap().into_owned())
// NOTE: unwrap is safe because the transformer is incapable of returning None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true afaict, but it requires quite a lot of layers to never return None since many of the called functions do something like transform(..)?. There are many places we could make changes where this assumption would be invalidated.

Would it make sense to return a Result or Option here and propagate ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love making infallible operations seem fallible... it's one more thing to test that can't actually be tested (because it never happens).

The default implementation of the transform never returns None, and never will (it's a no-op). So we only have to worry about the one method we actually provide here while implementing the trait. And it doesn't directly return None. At worst they use ? on a recursive operation (that never returns None).

I don't know a way to generically express that idea in the trait, unfortunately... ideas welcome

@scovich scovich requested a review from nicklan November 20, 2024 20:28
@codecov
Copy link

codecov bot commented Nov 20, 2024

Codecov Report

Attention: Patch coverage is 93.76947% with 20 lines in your changes missing coverage. Please review.

Project coverage is 83.21%. Comparing base (c1b202a) to head (c49540d).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/scan/mod.rs 96.72% 9 Missing ⚠️
kernel/src/table_changes/scan.rs 61.11% 6 Missing and 1 partial ⚠️
kernel/src/scan/data_skipping.rs 57.14% 0 Missing and 3 partials ⚠️
kernel/src/table_changes/log_replay/tests.rs 91.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #512      +/-   ##
==========================================
+ Coverage   82.80%   83.21%   +0.41%     
==========================================
  Files          74       74              
  Lines       16536    16775     +239     
  Branches    16536    16775     +239     
==========================================
+ Hits        13692    13960     +268     
+ Misses       2195     2166      -29     
  Partials      649      649              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

));
};
let name = field.physical_name(global_state.column_mapping_mode)?;
let name = field.physical_name();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicklan -- any idea why we didn't just apply column mapping to the partition columns along with all the others? Why wait until now?

//
// NOTE: It is possible the predicate resolves to FALSE even ignoring column references,
// e.g. `col > 10 AND FALSE`. Such predicates can statically skip the whole query.
fn build_physical_predicate(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is a conversion, how's something like logical_predicate_to_physical?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this is more a curiosity question: When do you make a static method a member of the struct vs a function defined outside the struct?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO associated functions ("static methods") are useful as a namespace grouping technique for benefit of downstream consumers, and occasionally because a trait wants it. Since this was a private function anyway, it didn't seem super useful to make it associated.

No strong reasons either way tho!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re naming: It's not just a conversion because it takes multiple inputs and produces multiple outputs (physical expression and physical referenced schema).

No strong feelings either way tho (esp. for a private helper method with one call site)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Namespace specifically for consumers is a good way to put it. I used to think of it as just a logical grouping.

build makes sense then 👍

//
// NOTE: It is possible the predicate resolves to FALSE even ignoring column references,
// e.g. `col > 10 AND FALSE`. Such predicates can statically skip the whole query.
fn build_physical_predicate(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this is more a curiosity question: When do you make a static method a member of the struct vs a function defined outside the struct?

@scovich scovich removed the breaking-change Change that require a major version bump label Dec 3, 2024
@scovich scovich changed the title [WIP] Data skipping correctly handles nested columns and column mapping Data skipping correctly handles nested columns and column mapping Dec 3, 2024
@github-actions github-actions bot added the breaking-change Change that require a major version bump label Dec 3, 2024
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! Thanks!

Copy link
Member

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great, no notes. really awesome how useful the transform frameworks have been!

// clause has invalid column references. Data skipping is best-effort and the predicate
// anyway needs to be evaluated against every row of data -- which is impossible if the
// columns are missing/invalid. Just blow up instead of trying to handle it gracefully.
return Err(Error::missing_column(format!(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing we'll want to watch out for in the future is a predicate like where id = 2 and _change_type = "insert".

The _change_type part of the predicate can't be applied to the physical data, so it should be dropped.

Same thing applies to generated columns and partition columns I think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're checking against the logical schema, which includes partition columns and also (hopefully) includes the _change_type and other generated columns?

Meanwhile, we have disabled the optimization in the parquet skipping code, that would treat missing physical columns as all-NULL. If we ever tried to re-enable that optimization, we would need to be very sure we've classified the columns correctly first.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have disabled the optimization in the parquet skipping code, that would treat missing physical columns as all-NULL

Aha I'd expected such an optimization to exist here in the physical predicate. Since we handle that elsewhere, then we're good.

we would need to be very sure we've classified the columns correctly first

If we had access to Vec<ColumnType>, we'd know how every column is classified.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and eventually we'll need that to support data skipping over partition columns, since we do have the metadata for those but it's in a different column and requires slightly different logic.

@scovich scovich merged commit af075a8 into delta-io:main Dec 10, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking-change Change that require a major version bump

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants