
Conversation

@scovich (Collaborator) commented Nov 26, 2024

What changes are proposed in this pull request?

Today's code scatters column mapping validity checks around the various use sites, which makes them complex and requires propagating the column mapping mode through various function calls.

We can simplify by validating the schema once, up front during snapshot load. In particular:

  1. Ensure that (correct) column mapping annotations are present in the schema exactly and only when column mapping is actually enabled. This simplifies logical -> physical name translation: use the annotation if present, and fall back to the field's logical name otherwise.
  2. Reject column mapping ID mode early. Most code only needs to care whether column mapping is enabled at all.

As a result, most column mapping operations become infallible, with simpler code.

This PR affects the following public APIs:

StructField::physical_name no longer takes a ColumnMapping argument, because it is no longer needed for validation.
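The resulting lookup rule is simple enough to sketch. This is a hypothetical simplification (the field layout and annotation key handling are illustrative, not the kernel's actual definitions):

```rust
use std::collections::HashMap;

// Hypothetical simplified types, not the kernel's actual definitions.
const PHYSICAL_NAME_KEY: &str = "delta.columnMapping.physicalName";

struct StructField {
    name: String,                      // logical name
    metadata: HashMap<String, String>, // column mapping annotations
}

impl StructField {
    // Infallible: the schema was validated once at snapshot load, so an
    // annotation is present exactly and only when column mapping is enabled.
    fn physical_name(&self) -> &str {
        self.metadata
            .get(PHYSICAL_NAME_KEY)
            .map(String::as_str)
            .unwrap_or(&self.name)
    }
}
```

Because up-front validation guarantees the annotation/mode invariant, the caller no longer needs to pass a column mapping mode just so the method can re-check it.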

How was this change tested?

Existing unit tests.

@github-actions github-actions bot added the breaking-change Change that require a major version bump label Nov 26, 2024
@scovich (Collaborator Author) commented Nov 26, 2024

Hmm, this test failure doesn't look good:

---- reader_test::iceberg_compat_v1/test_case_info.json ----
test panicked: called `Result::unwrap()` on an `Err` value: KernelError(InvalidColumnMappingMode("Column mapping is not enabled but field 'letter' is annotated with delta.columnMapping.id"))

According to the Delta spec, column mapping mode should be enabled on a table with iceberg compat v1 enabled.

@codecov codecov bot commented Nov 26, 2024

Codecov Report

Attention: Patch coverage is 97.73585% with 6 lines in your changes missing coverage. Please review.

Project coverage is 81.07%. Comparing base (817ba17) to head (7f3727e).
Report is 1 commit behind head on main.

Files with missing lines                      Patch %   Lines
kernel/src/table_features/column_mapping.rs   98.36%    3 Missing and 1 partial ⚠️
kernel/src/snapshot.rs                         0.00%    0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #543      +/-   ##
==========================================
+ Coverage   80.71%   81.07%   +0.35%     
==========================================
  Files          67       67              
  Lines       14278    14496     +218     
  Branches    14278    14496     +218     
==========================================
+ Hits        11524    11752     +228     
+ Misses       2179     2172       -7     
+ Partials      575      572       -3     


@scovich (Collaborator Author) commented Nov 26, 2024

Hmm, this test failure doesn't look good:

---- reader_test::iceberg_compat_v1/test_case_info.json ----
test panicked: called `Result::unwrap()` on an `Err` value: KernelError(InvalidColumnMappingMode("Column mapping is not enabled but field 'letter' is annotated with delta.columnMapping.id"))

According to the Delta spec, column mapping mode should be enabled on a table with iceberg compat v1 enabled.

Known issue upstream: delta-incubator/dat#52

Meanwhile, the code that attempted to disable the known-broken test was only partially effective; I fixed it so the test is fully skipped now.

@scovich scovich requested a review from nicklan November 27, 2024 15:11
@zachschuermann (Member) left a comment


Looks great! Just one main question on whether or not to attempt encoding the validation in the type system.

Comment on lines 191 to 194
/// NOTE: Caller affirms that the schema was already validated by
/// [`crate::table_features::validate_schema_column_mapping`], to ensure that
/// annotations are always and only present when column mapping mode is enabled.
pub fn make_physical(&self) -> Self {
Member

(tell me if this is overkill)

In an ideal world I think we would encode this in the type system so that it's prohibited to ever have a make_physical() call without calling validate_schema_column_mapping. I was trying to think of relatively easy ways to encode this and came up with two ideas (one easy, one harder):

  1. (easy) rename Schema before it is validated to struct UnvalidatedSchema(Schema) and require a validate method to consume the unvalidated and produce a regular Schema. Considering the approach of this PR is to do the validation up-front this seems to be the easiest?
  2. (harder) We could leverage a couple zero-sized types to tag a Schema as either validated/unvalidated. Something like:
// markers
pub struct Validated;
pub struct Unvalidated;

pub struct Schema<ValidationState = Unvalidated> {
    // ...
    _validation_state: PhantomData<ValidationState>,
}

// then a function on `Schema<Unvalidated>` to create a `Schema<Validated>`
impl Schema<Unvalidated> {
    pub fn validate(self, mode: ColumnMappingMode) -> DeltaResult<Schema<Validated>> {
        // do column mapping validation and other validations in the future
    }
}

Collaborator Author

Very interesting question. I actually have started wondering the same thing about physical vs. logical schema -- it's too easy to mix them up, as evidenced by the data skipping code that was trying to apply an expression full of logical column names to a physical parquet schema. And IIRC, delta-spark has hit more than one bug where somebody forgot which schema they were working with (and there are probably places where they double-apply the logical->physical mapping as well, tho that should be harmless). Probably a pair of newtypes would be the best way to make it explicit: struct PhysicalSchema(StructType) and struct LogicalSchema(StructType). TBD whether they should impl Deref or AsRef -- for convenience they probably should, and anyway it would be a red flag for some class or method that cares about the difference to take a bare StructType.

For this specific case -- the unvalidated schema should be very short-lived. If we changed Metadata::schema (**) to return a newtype UnvalidatedSchema with the validate method you suggest, then nobody else would have to know or care about it -- all bare schemas are known to be validated. We'd have to be careful on the write path any time we modify a schema in a way that could have implications for column mapping validation (create table, add column, enable column mapping, etc). But those operations would just need to produce an UnvalidatedSchema as their output?

(**) BTW, we should rename that method as Metadata::parse_schema to make clear that it's an expensive function to call -- not a getter.
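The short-lived flow described above could be sketched as follows (hypothetical, simplified types and error type; not the kernel's actual API):

```rust
// Hypothetical sketch of idea 1: parse_schema returns an UnvalidatedSchema,
// and the only way to obtain a bare Schema is to consume it via validate().
enum ColumnMappingMode {
    None,
    Name,
}

struct Schema; // stand-in for the real StructType

struct UnvalidatedSchema(Schema);

impl UnvalidatedSchema {
    fn validate(self, _mode: ColumnMappingMode) -> Result<Schema, String> {
        // column mapping annotation checks would go here
        Ok(self.0)
    }
}

fn parse_schema() -> UnvalidatedSchema {
    UnvalidatedSchema(Schema)
}
```

With this shape, any bare Schema in circulation is validated by construction; only the write path (create table, add column, enable column mapping, ...) would need to produce an UnvalidatedSchema.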

Collaborator Author

Update: After trying out the UnvalidatedSchema concept locally, I learned that the validation of protocol, schema, and table properties is all intertwined, so it's not enough to merely validate the schema. Most likely, we'll need to introduce a new TableConfiguration that encapsulates all three (and eventually system domain metadata as well), cross-checks everything, and provides helpful utility methods to expose e.g. the column mapping mode.

IMO that's a bigger effort for a different PR.

Member

yep sounds good! I liked the TableConfiguration idea!!

}
}

impl<'a> SchemaTransform<'a> for ValidateColumnMappings<'a> {
Member

new schema transform at work! nice!!

@zachschuermann (Member) left a comment

LGTM

Comment on lines 181 to 186
     let empty_features = Some::<[String; 0]>([]);
     let protocol =
         Protocol::try_new(3, 7, empty_features.clone(), empty_features.clone()).unwrap();
     assert_eq!(
-        column_mapping_mode(&protocol, &table_properties),
+        column_mapping_mode(&protocol, &table_properties).unwrap(),
         ColumnMappingMode::None
Member

if there are no features with a 3/7 protocol, that implies no column mapping - but I think yours is exposed as an error? (noticed failing test)

@scovich (Collaborator Author) commented Nov 27, 2024

Uh-oh...

failures:

---- table::tests::test_table stdout ----
thread 'table::tests::test_table' panicked at kernel/src/table.rs:157:54:
called `Result::unwrap()` on an `Err` value: InvalidColumnMappingMode("Table does not support column mapping mode, but the table property is set")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- snapshot::tests::test_new_snapshot stdout ----
thread 'snapshot::tests::test_new_snapshot' panicked at kernel/src/snapshot.rs:247:62:
called `Result::unwrap()` on an `Err` value: InvalidColumnMappingMode("Table does not support column mapping mode, but the table property is set")

---- snapshot::tests::test_snapshot_read_metadata stdout ----
thread 'snapshot::tests::test_snapshot_read_metadata' panicked at kernel/src/snapshot.rs:229:65:
called `Result::unwrap()` on an `Err` value: InvalidColumnMappingMode("Table does not support column mapping mode, but the table property is set")

---- table_features::column_mapping::tests::test_column_mapping_mode stdout ----
thread 'table_features::column_mapping::tests::test_column_mapping_mode' panicked at kernel/src/table_features/column_mapping.rs:185:63:
called `Result::unwrap()` on an `Err` value: InvalidColumnMappingMode("Table does not support column mapping mode, but the table property is set")

---- dv_table stdout ----
Error: InvalidColumnMappingMode("Table does not support column mapping mode, but the table property is set")

---- with_predicate_and_removes stdout ----
Error: InvalidColumnMappingMode("Table does not support column mapping mode, but the table property is set")

Almost all of the failures are caused by reading the DAT table table-with-dv-small. This might be another DAT bug?

@zachschuermann (Member) commented Nov 27, 2024

sigh... yep here's the metaData and protocol actions of that table-with-dv-small.

the table property is set but column mapping isn't in the readerFeatures/writerFeatures

{
    "protocol": {
        "minReaderVersion": 3,
        "minWriterVersion": 7,
        "readerFeatures": [
            "deletionVectors"
        ],
        "writerFeatures": [
            "deletionVectors"
        ]
    }
}
{
    "metaData": {
        "id": "testId",
        "format": {
            "provider": "parquet",
            "options": {}
        },
        "schemaString": "{\"type\":\"struct\",\"fields\":[{\"name\":\"value\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}",
        "partitionColumns": [],
        "configuration": {
            "delta.enableDeletionVectors": "true",
            "delta.columnMapping.mode": "none"
        },
        "createdTime": 1677811175819
    }
}

@scovich (Collaborator Author) commented Nov 27, 2024

sigh... yep here's the metaData and protocol actions of that table-with-dv-small.

the table property is set but column mapping isn't in the readerFeatures/writerFeatures

Actually, the Delta spec says this is legal:

The column mapping is governed by the table property delta.columnMapping.mode being one of none, id, and name. The table property should only be honored if the table's protocol has reader and writer versions and/or table features that support the columnMapping table feature.

Reverting the over-zealous check.
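The spec behavior being restored could be sketched as follows (hypothetical, simplified signatures; the real kernel function takes &Protocol and &TableProperties):

```rust
// Hypothetical sketch of the spec rule: the delta.columnMapping.mode table
// property is honored only if the protocol supports the columnMapping
// feature; otherwise the mode resolves to None rather than an error.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnMappingMode {
    None,
    Id,
    Name,
}

fn column_mapping_mode(
    reader_features: &[&str],
    property: Option<ColumnMappingMode>,
) -> ColumnMappingMode {
    if reader_features.contains(&"columnMapping") {
        property.unwrap_or(ColumnMappingMode::None)
    } else {
        // e.g. table-with-dv-small: the mode property is set in the table
        // configuration, but the protocol lacks the feature, so ignore it.
        ColumnMappingMode::None
    }
}
```

Under this reading, the table-with-dv-small combination (property set, feature absent) is legal and simply resolves to no column mapping, which is why the stricter check had to be reverted.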

@zachschuermann (Member)

SGTM! thanks!

path: vec![],
err: None,
};
let _ = validator.transform_struct(schema);
Collaborator

It feels weird to have to manually do error handling here and in all the transform_* methods. Why not make transform_struct return a Result<Option<...>>?

Collaborator

Tho I guess this would require that we make every schema transform fallible 🤔

Collaborator Author

I'd love a way to make this configurable, but I don't see any obvious way. Note that Option already poses similar problems, and it's quite common for transforms to either always return None (if they're just visiting the schema rather than changing it), or to always return Some (if they're modifying the schema but not filtering it in any way). In fact, there are even some that always return Some(Cow::Owned) because they unconditionally transform the schema.

If we wanted to capture this generically, we have three problems to solve:

  1. What do the provided "leaf" methods return as their default "identity" transform? Today, it's Some(Cow::Borrowed(val)); Cow::Borrowed(val).into() would cover that and also cover a Cow<'a, T> return type... but it does not cover DeltaResult<Cow<'a, T>> (which has no blanket impl From).
  2. How should the provided "internal" methods generically handle values that could be None or Err? None means "filter it out and keep going", while Err would mean "abort the traversal immediately".
  3. How to handle the ? operator, if the return type might be just Cow<'a, T>? Or DeltaResult<Option<Cow<'a, T>>>?

To make matters worse, I can think of (at least) twelve return types the visitor might reasonably want to use:

  1. () (visitor)
  2. T (unconditional transform)
  3. Cow<'a, T> (conditional transform)
  4. DeltaResult<()> (checker)
  5. DeltaResult<T> (fallible unconditional transform)
  6. DeltaResult<Cow<'a, T>> (fallible conditional transform)
  7. Option<&T> (filter)
  8. Option<T> (filtering unconditional transform)
  9. Option<Cow<'a, T>> (filtering conditional transform)
  10. DeltaResult<Option<&T>> (fallible filter)
  11. DeltaResult<Option<T>> (fallible filtering unconditional transform)
  12. DeltaResult<Option<Cow<'a, T>>> (fallible filtering conditional transform)

And that's ignoring the "aggregation" case where the visitor returns some completely other value instead of a schema (both fallible and infallible varieties).

Note: If we did decide to make the transforms fallible, the trait would need to know what error type to use. We could perhaps hard-wire it as kernel::Error, but then infallible operations are out of luck. We can't default it because associated type defaults for traits are unstable Rust. So we'd have to force every trait impl to define the type -- which is maybe not the end of the world -- and let infallible transforms specify std::convert::Infallible instead? But that has its own headaches when it's time to unpack the value (the ? is still "conditional" and requires the calling function to return Result, and unwrap et al are still technically a panic risk). But at least the compiler recognizes that let Ok(val) = returns_infallible_result() is irrefutable in spite of not having an else clause.
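A minimal sketch of the Infallible-error idea (hypothetical trait and names, not the kernel's actual SchemaTransform; the example matches on the empty error type, which works on all stable Rust, instead of the irrefutable let):

```rust
use std::convert::Infallible;

// Hypothetical sketch: each transform impl names its own error type, and
// infallible transforms use Infallible.
trait Transform {
    type Error;
    fn transform(&mut self, value: i32) -> Result<i32, Self::Error>;
}

struct Doubler;

impl Transform for Doubler {
    type Error = Infallible;
    fn transform(&mut self, value: i32) -> Result<i32, Infallible> {
        Ok(value * 2)
    }
}

fn run_doubler(value: i32) -> i32 {
    match Doubler.transform(value) {
        Ok(out) => out,
        // Infallible has no values, so this arm can never execute.
        Err(never) => match never {},
    }
}
```

The match-on-empty-enum arm lets the caller unpack the result with no panic path, at the cost of a little ceremony at each call site.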

Collaborator

Woah, this is a big design space. There seem to be two extremes going on here:

  1. Have a single design that we adapt to our use cases (current solution). More of the semantics are baked into the code.
  2. Have a type/trait that encompasses exactly the behaviour we want (all 12+ cases). In this case, the semantics are baked into the type system.

I wonder if we can strike a balance by having three variants:

  • Cow<'a, T> to cover 2 and 3
  • Option<Cow<'a, T>> to cover 1, 7, 8, 9
  • DeltaResult<Option<Cow<'a, T>>> to cover cases 10, 11, 12.

But then there will be semantics that aren't communicated by the type system. Maybe that's okay? For example:

  • 2 will always return a Cow::Owned
  • 7 will only return None or a Some(Cow::Borrowed)
  • 8 will only return None or Some(Cow::Owned)
The idea of casting a wide net by introducing a generic error seems promising.
Regarding the generic defaulting: I could see a SchemaTransform<E: Error> with type InfallibleSchemaTransform = SchemaTransform<Infallible> and type FallibleSchemaTransform = SchemaTransform<kernel::Error>.

This is definitely out of scope for this PR, but I think it's worth considering before implementing all the variety of transforms.

Collaborator Author

Problem is, multiple slightly different types of transforms brings back the exact kind of code duplication the transforms were intended to eliminate...

}

#[cfg(test)]
mod tests {
Collaborator

I would like to see some of the validator error cases exercised. Either in this PR or a followup one :)

Collaborator Author

Done!

@scovich scovich merged commit 391d10c into delta-io:main Dec 3, 2024
20 checks passed