
Conversation

@scovich (Collaborator) commented Nov 26, 2024

What changes are proposed in this pull request?

Today's code scatters column mapping validity checks around the various use sites, which makes them complex and requires propagating the column mapping mode through various function calls.

We can simplify by validating the schema once, up front during snapshot load. In particular:

  1. Ensure that (correct) column mapping annotations are present in the schema exactly and only when column mapping is actually enabled. This simplifies logical -> physical name translation: use the annotation if present, and fall back to the field's logical name otherwise.
  2. Reject column mapping ID mode early. Most code only needs to care whether column mapping is enabled at all.

As a result, most column mapping operations become infallible, with simpler code.

This PR affects the following public APIs:

StructField::physical_name no longer takes a ColumnMapping argument, because it is no longer needed for validation.
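The resulting lookup rule is simple enough to sketch. This is a hypothetical simplification (the field layout and annotation key handling are illustrative, not the kernel's actual definitions):

```rust
use std::collections::HashMap;

// Hypothetical simplified types, not the kernel's actual definitions.
const PHYSICAL_NAME_KEY: &str = "delta.columnMapping.physicalName";

struct StructField {
    name: String,                      // logical name
    metadata: HashMap<String, String>, // column mapping annotations
}

impl StructField {
    // Infallible: the schema was validated once at snapshot load, so an
    // annotation is present exactly and only when column mapping is enabled.
    fn physical_name(&self) -> &str {
        self.metadata
            .get(PHYSICAL_NAME_KEY)
            .map(String::as_str)
            .unwrap_or(&self.name)
    }
}
```

Because up-front validation guarantees the annotation/mode invariant, the caller no longer needs to pass a column mapping mode just so the method can re-check it.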

How was this change tested?

Existing unit tests.

@github-actions github-actions bot added the breaking-change Change that require a major version bump label Nov 26, 2024
@scovich (Collaborator Author) commented Nov 26, 2024

Hmm, this test failure doesn't look good:

---- reader_test::iceberg_compat_v1/test_case_info.json ----
test panicked: called `Result::unwrap()` on an `Err` value: KernelError(InvalidColumnMappingMode("Column mapping is not enabled but field 'letter' is annotated with delta.columnMapping.id"))

According to the Delta spec, column mapping mode should be enabled on a table with iceberg compat v1 enabled.

@codecov codecov bot commented Nov 26, 2024

Codecov Report

Attention: Patch coverage is 97.73585% with 6 lines in your changes missing coverage. Please review.

Project coverage is 81.07%. Comparing base (817ba17) to head (7f3727e).
Report is 1 commit behind head on main.

Files with missing lines                      Patch %   Lines
kernel/src/table_features/column_mapping.rs   98.36%    3 Missing and 1 partial ⚠️
kernel/src/snapshot.rs                         0.00%    0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #543      +/-   ##
==========================================
+ Coverage   80.71%   81.07%   +0.35%     
==========================================
  Files          67       67              
  Lines       14278    14496     +218     
  Branches    14278    14496     +218     
==========================================
+ Hits        11524    11752     +228     
+ Misses       2179     2172       -7     
+ Partials      575      572       -3     


@scovich (Collaborator Author) commented Nov 26, 2024

Hmm, this test failure doesn't look good:

---- reader_test::iceberg_compat_v1/test_case_info.json ----
test panicked: called `Result::unwrap()` on an `Err` value: KernelError(InvalidColumnMappingMode("Column mapping is not enabled but field 'letter' is annotated with delta.columnMapping.id"))

According to the Delta spec, column mapping mode should be enabled on a table with iceberg compat v1 enabled.

Known issue upstream: delta-incubator/dat#52

Meanwhile, the code that attempted to disable the known-broken test was only partially effective; I fixed it so the test is fully skipped now.

@scovich scovich requested a review from nicklan November 27, 2024 15:11
@zachschuermann (Member) left a comment


Looks great! Just one main question on whether or not to attempt encoding the validation in the type system.

Comment on lines 191 to 194
/// NOTE: Caller affirms that the schema was already validated by
/// [`crate::table_features::validate_schema_column_mapping`], to ensure that
/// annotations are always and only present when column mapping mode is enabled.
pub fn make_physical(&self) -> Self {
Member

(tell me if this is overkill)

In an ideal world I think we would encode this in the type system so that it's prohibited to ever have a make_physical() call without calling validate_schema_column_mapping. I was trying to think of relatively easy ways to encode this and came up with two ideas (one easy, one harder):

  1. (easy) rename Schema before it is validated to struct UnvalidatedSchema(Schema) and require a validate method to consume the unvalidated and produce a regular Schema. Considering the approach of this PR is to do the validation up-front this seems to be the easiest?
  2. (harder) We could leverage a couple zero-sized types to tag a Schema as either validated/unvalidated. Something like:
// markers
pub struct Validated;
pub struct Unvalidated;

pub struct Schema<ValidationState = Unvalidated> {
    // ...
    _validation_state: PhantomData<ValidationState>,
}

// then a function on `Schema<Unvalidated>` to create a `Schema<Validated>`
impl Schema<Unvalidated> {
    pub fn validate(self, mode: ColumnMappingMode) -> DeltaResult<Schema<Validated>> {
        // do column mapping validation and other validations in the future
    }
}

Collaborator Author

Very interesting question. I actually have started wondering the same thing about physical vs. logical schema -- it's too easy to mix them up, as evidenced by the data skipping code that was trying to apply an expression full of logical column names to a physical parquet schema. And IIRC, delta-spark has hit more than one bug where somebody forgot which schema they were working with (and there are probably places where they double-apply the logical->physical mapping as well, tho that should be harmless). Probably a pair of newtypes would be the best way to make it explicit: struct PhysicalSchema(StructType) and struct LogicalSchema(StructType). TBD whether they should impl Deref or AsRef -- for convenience they probably should, and anyway it would be a red flag for some class or method that cares about the difference to take a bare StructType.

For this specific case -- the unvalidated schema should be very short-lived. If we changed Metadata::schema (**) to return a newtype UnvalidatedSchema with the validate method you suggest, then nobody else would have to know or care about it -- all bare schemas are known to be validated. We'd have to be careful on the write path any time we modify a schema in a way that could have implications for column mapping validation (create table, add column, enable column mapping, etc). But those operations would just need to produce an UnvalidatedSchema as their output?

(**) BTW, we should rename that method as Metadata::parse_schema to make clear that it's an expensive function to call -- not a getter.
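The short-lived flow described above could be sketched as follows (hypothetical, simplified types and error type; not the kernel's actual API):

```rust
// Hypothetical sketch of idea 1: parse_schema returns an UnvalidatedSchema,
// and the only way to obtain a bare Schema is to consume it via validate().
enum ColumnMappingMode {
    None,
    Name,
}

struct Schema; // stand-in for the real StructType

struct UnvalidatedSchema(Schema);

impl UnvalidatedSchema {
    fn validate(self, _mode: ColumnMappingMode) -> Result<Schema, String> {
        // column mapping annotation checks would go here
        Ok(self.0)
    }
}

fn parse_schema() -> UnvalidatedSchema {
    UnvalidatedSchema(Schema)
}
```

With this shape, any bare Schema in circulation is validated by construction; only the write path (create table, add column, enable column mapping, ...) would need to produce an UnvalidatedSchema.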

Collaborator Author

Update: After trying out the UnvalidatedSchema concept locally, I learned that the validation of protocol, schema, and table properties is all intertwined, so it's not enough to merely validate the schema. Most likely, we'll need to introduce a new TableConfiguration that encapsulates all three (and eventually system domain metadata as well), cross-checks everything, and provides helpful utility methods to expose e.g. the column mapping mode.

IMO that's a bigger effort for a different PR.

Member

yep sounds good! I liked the TableConfiguration idea!!

}
}

impl<'a> SchemaTransform<'a> for ValidateColumnMappings<'a> {
Member

new schema transform at work! nice!!

@zachschuermann (Member) left a comment

LGTM

Comment on lines 181 to 186
     let empty_features = Some::<[String; 0]>([]);
     let protocol =
         Protocol::try_new(3, 7, empty_features.clone(), empty_features.clone()).unwrap();
     assert_eq!(
-        column_mapping_mode(&protocol, &table_properties),
+        column_mapping_mode(&protocol, &table_properties).unwrap(),
         ColumnMappingMode::None
Member

if there are no features with a 3/7 protocol, that implies no column mapping - but I think yours is exposed as an error? (noticed failing test)

@scovich (Collaborator Author) commented Nov 27, 2024

Uh-oh...

failures:

---- table::tests::test_table stdout ----
thread 'table::tests::test_table' panicked at kernel/src/table.rs:157:54:
called `Result::unwrap()` on an `Err` value: InvalidColumnMappingMode("Table does not support column mapping mode, but the table property is set")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- snapshot::tests::test_new_snapshot stdout ----
thread 'snapshot::tests::test_new_snapshot' panicked at kernel/src/snapshot.rs:247:62:
called `Result::unwrap()` on an `Err` value: InvalidColumnMappingMode("Table does not support column mapping mode, but the table property is set")

---- snapshot::tests::test_snapshot_read_metadata stdout ----
thread 'snapshot::tests::test_snapshot_read_metadata' panicked at kernel/src/snapshot.rs:229:65:
called `Result::unwrap()` on an `Err` value: InvalidColumnMappingMode("Table does not support column mapping mode, but the table property is set")

---- table_features::column_mapping::tests::test_column_mapping_mode stdout ----
thread 'table_features::column_mapping::tests::test_column_mapping_mode' panicked at kernel/src/table_features/column_mapping.rs:185:63:
called `Result::unwrap()` on an `Err` value: InvalidColumnMappingMode("Table does not support column mapping mode, but the table property is set")

---- dv_table stdout ----
Error: InvalidColumnMappingMode("Table does not support column mapping mode, but the table property is set")

---- with_predicate_and_removes stdout ----
Error: InvalidColumnMappingMode("Table does not support column mapping mode, but the table property is set")

Almost all of the failures are caused by reading the DAT table table-with-dv-small. This might be another DAT bug?

@zachschuermann (Member) commented Nov 27, 2024

sigh... yep here's the metaData and protocol actions of that table-with-dv-small.

the table property is set but column mapping isn't in the readerFeatures/writerFeatures

{
    "protocol": {
        "minReaderVersion": 3,
        "minWriterVersion": 7,
        "readerFeatures": [
            "deletionVectors"
        ],
        "writerFeatures": [
            "deletionVectors"
        ]
    }
}
{
    "metaData": {
        "id": "testId",
        "format": {
            "provider": "parquet",
            "options": {}
        },
        "schemaString": "{\"type\":\"struct\",\"fields\":[{\"name\":\"value\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}",
        "partitionColumns": [],
        "configuration": {
            "delta.enableDeletionVectors": "true",
            "delta.columnMapping.mode": "none"
        },
        "createdTime": 1677811175819
    }
}

@scovich (Collaborator Author) commented Nov 27, 2024

sigh... yep here's the metaData and protocol actions of that table-with-dv-small.

the table property is set but column mapping isn't in the readerFeatures/writerFeatures

Actually, the Delta spec says this is legal:

The column mapping is governed by the table property delta.columnMapping.mode being one of none, id, and name. The table property should only be honored if the table's protocol has reader and writer versions and/or table features that support the columnMapping table feature.

Reverting the over-zealous check.
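The spec behavior being restored could be sketched as follows (hypothetical, simplified signatures; the real kernel function takes &Protocol and &TableProperties):

```rust
// Hypothetical sketch of the spec rule: the delta.columnMapping.mode table
// property is honored only if the protocol supports the columnMapping
// feature; otherwise the mode resolves to None rather than an error.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnMappingMode {
    None,
    Id,
    Name,
}

fn column_mapping_mode(
    reader_features: &[&str],
    property: Option<ColumnMappingMode>,
) -> ColumnMappingMode {
    if reader_features.contains(&"columnMapping") {
        property.unwrap_or(ColumnMappingMode::None)
    } else {
        // e.g. table-with-dv-small: the mode property is set in the table
        // configuration, but the protocol lacks the feature, so ignore it.
        ColumnMappingMode::None
    }
}
```

Under this reading, the table-with-dv-small combination (property set, feature absent) is legal and simply resolves to no column mapping, which is why the stricter check had to be reverted.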

@zachschuermann (Member)

SGTM! thanks!

path: vec![],
err: None,
};
let _ = validator.transform_struct(schema);
Collaborator

It feels weird to have to manually do error handling here and in all the transform_* methods. Why not make transform_struct return a Result<Option<...>>?

Collaborator

Tho I guess this would require that we make every schema transform fallible 🤔

Collaborator Author

I'd love a way to make this configurable, but I don't see any obvious way. Note that Option already poses similar problems, and it's quite common for transforms to either always return None (if they're just visiting the schema rather than changing it), or to always return Some (if they're modifying the schema but not filtering it in any way). In fact, there are even some that always return Some(Cow::Owned) because they unconditionally transform the schema.

If we wanted to capture this generically, we have three problems to solve:

  1. What do the provided "leaf" methods return as their default "identity" transform? Today, it's Some(Cow::Borrowed(val)); Cow::Borrowed(val).into() would cover that and also cover a Cow<'a, T> return type... but it does not cover DeltaResult<Cow<'a, T>> (which has no blanket impl From).
  2. How should the provided "internal" methods generically handle values that could be None or Err? None means "filter it out and keep going", while Err would mean "abort the traversal immediately".
  3. How to handle the ? operator, if the return type might be just Cow<'a, T>? Or DeltaResult<Option<Cow<'a, T>>>?

To make matters worse, I can think of (at least) twelve return types the visitor might reasonably want to use:

  1. () (visitor)
  2. T (unconditional transform)
  3. Cow<'a, T> (conditional transform)
  4. DeltaResult<()> (checker)
  5. DeltaResult<T> (fallible unconditional transform)
  6. DeltaResult<Cow<'a, T>> (fallible conditional transform)
  7. Option<&T> (filter)
  8. Option<T> (filtering unconditional transform)
  9. Option<Cow<'a, T>> (filtering conditional transform)
  10. DeltaResult<Option<&T>> (fallible filter)
  11. DeltaResult<Option<T>> (fallible filtering unconditional transform)
  12. DeltaResult<Option<Cow<'a, T>>> (fallible filtering conditional transform)

And that's ignoring the "aggregation" case where the visitor returns some completely other value instead of a schema (both fallible and infallible varieties).

Note: If we did decide to make the transforms fallible, the trait would need to know what error type to use. We could perhaps hard-wire it as kernel::Error, but then infallible operations are out of luck. We can't default it because associated type defaults for traits are unstable Rust. So we'd have to force every trait impl to define the type -- which is maybe not the end of the world -- and let infallible transforms specify std::convert::Infallible instead? But that has its own headaches when it's time to unpack the value (the ? is still "conditional" and requires the calling function to return Result, and unwrap et al are still technically a panic risk). But at least the compiler recognizes that let Ok(val) = returns_infallible_result() is irrefutable in spite of not having an else clause.
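A minimal sketch of the Infallible-error idea (hypothetical trait and names, not the kernel's actual SchemaTransform; the example matches on the empty error type, which works on all stable Rust, instead of the irrefutable let):

```rust
use std::convert::Infallible;

// Hypothetical sketch: each transform impl names its own error type, and
// infallible transforms use Infallible.
trait Transform {
    type Error;
    fn transform(&mut self, value: i32) -> Result<i32, Self::Error>;
}

struct Doubler;

impl Transform for Doubler {
    type Error = Infallible;
    fn transform(&mut self, value: i32) -> Result<i32, Infallible> {
        Ok(value * 2)
    }
}

fn run_doubler(value: i32) -> i32 {
    match Doubler.transform(value) {
        Ok(out) => out,
        // Infallible has no values, so this arm can never execute.
        Err(never) => match never {},
    }
}
```

The match-on-empty-enum arm lets the caller unpack the result with no panic path, at the cost of a little ceremony at each call site.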

Collaborator

Woah, this is a big design space. There seem to be two extremes going on here:

  1. Have a single design that we adapt to our use cases (current solution). More of the semantics are baked into the code.
  2. Have a type/trait that encompasses exactly the behaviour we want (all 12+ cases). In this case, the semantics are baked into the type system.

I wonder if we can strike a balance by having three variants:

  • Cow<'a, T> to cover 2 and 3
  • Option<Cow<'a, T>> to cover 1, 7, 8, 9
  • DeltaResult<Option<Cow<'a, T>>> to cover cases 10, 11, 12.

But then there will be semantics that aren't communicated by the type system. Maybe that's okay? For example:

  • 2 will always return a Cow::Owned
  • 7 will only return None or a Some(Cow::Borrowed)
  • 8 will only return None or Some(Cow::Owned)
The idea of casting a wide net by introducing a generic error seems promising.
Regarding the generic defaulting: I could see a SchemaTransform<E: Error> with type InfallibleSchemaTransform = SchemaTransform<Infallible> and type FallibleSchemaTransform = SchemaTransform<kernel::Error>.

This is definitely out of scope for this PR, but I think it's worth considering before implementing all the variety of transforms.

Collaborator Author

Problem is, multiple slightly different types of transforms brings back the exact kind of code duplication the transforms were intended to eliminate...

}

#[cfg(test)]
mod tests {
Collaborator

I would like to see some of the validator error cases exercised. Either in this PR or a followup one :)

Collaborator Author

Done!

@scovich scovich merged commit 391d10c into delta-io:main Dec 3, 2024
20 checks passed