Data skipping correctly handles nested columns and column mapping #512

scovich · 2024-11-19T18:32:01Z

What changes are proposed in this pull request?

The existing implementation of data skipping has two flaws:

Only top-level columns are considered for data skipping (due to how the stats schema is derived)
Column mapping is completely ignored -- the data skipping expression with its logical column references is evaluated against physical data. At best this means data skipping is ineffective (because the column doesn't exist in the parquet). At worst, we would attempt to skip over the wrong column (e.g. if a table was upgraded to use column mapping because a column was dropped and re-added, we'd wrongly work with the old/dropped column because its logical and physical names were the same)

It turns out the two issues are intertwined, because both column mapping and nested column references need a schema traversal. So while we could solve them separately, it's actually easier to just do it all at once.

Also -- the data skipping predicate we pass around needs an associated "referenced" schema (in order to build a stats schema); if that schema is empty, it means the data skipping predicate is "static" and should be evaluated once to decide whether to even initiate a log scan. That adds some complexity to the log replay path. But it also allows a predicate like the following to be treated as static, in spite of appearing to reference table columns:

WHERE my_col < 10 AND FALSE

This PR affects the following public APIs

scan::Scan::predicate renamed as physical_predicate to eliminate ambiguity

scan::log_replay::scan_action_iter now takes fewer (and different) params.

How was this change tested?

Existing unit tests, plus new unit tests that verify the new behavior.

nicklan

generally seems like a good direction. Left some specific comments

kernel/src/expressions/mod.rs

kernel/src/scan/mod.rs

nicklan · 2024-11-19T21:10:55Z

kernel/src/schema.rs

-
-        let field = ApplyNameMapping.transform_struct_field(Cow::Borrowed(self));
-        Ok(field.unwrap().into_owned())
+        // NOTE: unwrap is safe because the transformer is incapable of returning None


This is true afaict, but it requires quite a lot of layers to never return None since many of the called functions do something like transform(..)?. There are many places we could make changes where this assumption would be invalidated.

Would it make sense to return a Result or Option here and propagate ?

I don't love making infallible operations seem fallible... it's one more thing to test that can't actually be tested (because it never happens).

The default implementation of the transform never returns None, and never will (it's a no-op). So we only have to worry about the one method we actually provide here while implementing the trait. And it doesn't directly return None. At worst they use ? on a recursive operation (that never returns None).

I don't know a way to generically express that idea in the trait, unfortunately... ideas welcome

kernel/src/table_features/column_mapping.rs

kernel/src/scan/mod.rs

codecov · 2024-11-20T20:31:19Z

Codecov Report

Attention: Patch coverage is 93.76947% with 20 lines in your changes missing coverage. Please review.

Project coverage is 83.21%. Comparing base (c1b202a) to head (c49540d).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
kernel/src/scan/mod.rs	96.72%	9 Missing ⚠️
kernel/src/table_changes/scan.rs	61.11%	6 Missing and 1 partial ⚠️
kernel/src/scan/data_skipping.rs	57.14%	0 Missing and 3 partials ⚠️
kernel/src/table_changes/log_replay/tests.rs	91.66%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #512      +/-   ##
==========================================
+ Coverage   82.80%   83.21%   +0.41%     
==========================================
  Files          74       74              
  Lines       16536    16775     +239     
  Branches    16536    16775     +239     
==========================================
+ Hits        13692    13960     +268     
+ Misses       2195     2166      -29     
  Partials      649      649

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

scovich · 2024-11-21T23:41:10Z

kernel/src/scan/mod.rs

                    ));
                };
-                let name = field.physical_name(global_state.column_mapping_mode)?;
+                let name = field.physical_name();


@nicklan -- any idea why we didn't just apply column mapping to the partition columns along with all the others? Why wait until now?

OussamaSaoudi-db · 2024-11-22T03:37:57Z

kernel/src/scan/mod.rs

+    //
+    // NOTE: It is possible the predicate resolves to FALSE even ignoring column references,
+    // e.g. `col > 10 AND FALSE`. Such predicates can statically skip the whole query.
+    fn build_physical_predicate(


since this is a conversion, how's something like logical_predicate_to_physical?

Also this is more a curiosity question: When do you make a static method a member of the struct vs a function defined outside the struct?

IMO associated functions ("static methods") are useful as a namespace grouping technique for benefit of downstream consumers, and occasionally because a trait wants it. Since this was a private function anyway, it didn't seem super useful to make it associated.

No strong reasons either way tho!

Re naming: It's not just a conversion because it takes multiple inputs and produces multiple outputs (physical expression and physical referenced schema).

No strong feelings either way tho (esp. for a private helper method with one call site)

Namespace specifically for consumers is a good way to put it. I used to think of it as just a logical grouping.

build makes sense then 👍

kernel/src/scan/mod.rs

OussamaSaoudi-db · 2024-11-22T03:45:16Z

kernel/src/scan/mod.rs

+    //
+    // NOTE: It is possible the predicate resolves to FALSE even ignoring column references,
+    // e.g. `col > 10 AND FALSE`. Such predicates can statically skip the whole query.
+    fn build_physical_predicate(


Also this is more a curiosity question: When do you make a static method a member of the struct vs a function defined outside the struct?

kernel/src/scan/mod.rs

kernel/src/schema.rs

nicklan

lgtm! Thanks!

zachschuermann

looks great, no notes. really awesome how useful the transform frameworks have been!

OussamaSaoudi-db · 2024-12-10T17:37:17Z

kernel/src/scan/mod.rs

+            // clause has invalid column references. Data skipping is best-effort and the predicate
+            // anyway needs to be evaluated against every row of data -- which is impossible if the
+            // columns are missing/invalid. Just blow up instead of trying to handle it gracefully.
+            return Err(Error::missing_column(format!(


One thing we'll want to watch out for in the future is a predicate like where id = 2 and _change_type = "insert".

The _change_type part of the predicate can't be applied to the physical data, so it should be dropped.

Same thing applies to generated columns and partition columns I think?

We're checking against the logical schema, which includes partition columns and also (hopefully) includes the _change_type and other generated columns?

Meanwhile, we have disabled the optimization in the parquet skipping code, that would treat missing physical columns as all-NULL. If we ever tried to re-enable that optimization, we would need to be very sure we've classified the columns correctly first.

we have disabled the optimization in the parquet skipping code, that would treat missing physical columns as all-NULL

Aha I'd expected such an optimization to exist here in the physical predicate. Since we handle that elsewhere, then we're good.

we would need to be very sure we've classified the columns correctly first

If we had access to Vec<ColumnType>, we'd know how every column is classified.

Yes, and eventually we'll need that to support data skipping over partition columns, since we do have the metadata for those but it's in a different column and requires slightly different logic.

[WIP] Data skipping correctly handles nested columns and column mapping

5afd8e7

scovich requested review from OussamaSaoudi-db, nicklan and zachschuermann November 19, 2024 18:32

github-actions bot assigned scovich Nov 19, 2024

github-actions bot added the breaking-change Change that require a major version bump label Nov 19, 2024

nicklan reviewed Nov 19, 2024

View reviewed changes

scovich added 4 commits November 19, 2024 15:29

cleanups

56186bb

address reviewer comments

6e4d2a3

test+doctest fixes

755fce4

fmt

bd907b2

scovich requested a review from nicklan November 20, 2024 20:28

scovich commented Nov 21, 2024

View reviewed changes

scovich mentioned this pull request Nov 21, 2024

Implement Builder for Scans on TableChanges #521

Merged

OussamaSaoudi-db reviewed Nov 22, 2024

View reviewed changes

scovich added 3 commits November 22, 2024 08:28

Merge remote-tracking branch 'oss/main' into nested-data-skipping

948ba4c

review comments

9eab3f3

Merge remote-tracking branch 'oss/main' into nested-data-skipping

4e5c6b4

scovich removed the breaking-change Change that require a major version bump label Dec 3, 2024

Merge remote-tracking branch 'oss/main' into nested-data-skipping

fa57dd3

scovich changed the title ~~[WIP] Data skipping correctly handles nested columns and column mapping~~ Data skipping correctly handles nested columns and column mapping Dec 3, 2024

scovich requested a review from OussamaSaoudi-db December 3, 2024 01:07

github-actions bot added the breaking-change Change that require a major version bump label Dec 3, 2024

nicklan approved these changes Dec 4, 2024

View reviewed changes

scovich mentioned this pull request Dec 5, 2024

Implement Log Replay for Change Data Feed #540

Merged

zachschuermann approved these changes Dec 5, 2024

View reviewed changes

scovich added 2 commits December 9, 2024 18:08

add unit tests

3f4949e

Merge remote-tracking branch 'oss/main' into nested-data-skipping

92eecdc

scovich added 2 commits December 9, 2024 21:54

Merge remote-tracking branch 'oss/main' into nested-data-skipping

7732867

CDF uses new physical predicate

47ffab7

OussamaSaoudi-db reviewed Dec 10, 2024

View reviewed changes

OussamaSaoudi-db approved these changes Dec 10, 2024

View reviewed changes

Merge remote-tracking branch 'oss/main' into nested-data-skipping

c49540d

scovich merged commit af075a8 into delta-io:main Dec 10, 2024
20 checks passed

scovich mentioned this pull request Dec 11, 2024

Add TableChangesScan::schema method to get logical schema #589

Merged

Uh oh!

Data skipping correctly handles nested columns and column mapping #512

Data skipping correctly handles nested columns and column mapping #512

Uh oh!

Conversation

scovich commented Nov 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes are proposed in this pull request?

This PR affects the following public APIs

How was this change tested?

Uh oh!

nicklan left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov bot commented Nov 20, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

nicklan left a comment

Choose a reason for hiding this comment

Uh oh!

zachschuermann left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

scovich commented Nov 19, 2024 •

edited

Loading

codecov bot commented Nov 20, 2024 •

edited

Loading