Simplify column mapping mode handling #543
Conversation
Hmm, this test failure doesn't look good: According to the Delta spec, column mapping mode should be enabled on a table with iceberg compat v1 enabled.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #543      +/-   ##
==========================================
+ Coverage   80.71%   81.07%   +0.35%
==========================================
  Files          67       67
  Lines       14278    14496     +218
  Branches    14278    14496     +218
==========================================
+ Hits        11524    11752     +228
+ Misses       2179     2172       -7
+ Partials      575      572       -3
Known issue upstream: delta-incubator/dat#52 Meanwhile, the code that attempted to disable the known-broken test was only partially effective; I fixed it so the test is fully skipped now.
looks great, just one main question on whether or not to attempt encoding the validation in the type system
kernel/src/schema.rs (outdated)

/// NOTE: Caller affirms that the schema was already validated by
/// [`crate::table_features::validate_schema_column_mapping`], to ensure that
/// annotations are always and only present when column mapping mode is enabled.
pub fn make_physical(&self) -> Self {
(tell me if this is overkill)
In an ideal world I think we would encode this in the type system so that it's prohibited to ever have a make_physical() call without calling validate_schema_column_mapping. I was trying to think of relatively easy ways to encode this and came up with two ideas (one easy, one harder)
- (easy) rename `Schema` before it is validated to `struct UnvalidatedSchema(Schema)` and require a `validate` method that consumes the unvalidated schema and produces a regular `Schema`. Considering the approach of this PR is to do the validation up-front, this seems to be the easiest?
- (harder) We could leverage a couple of zero-sized types to tag a `Schema` as either validated/unvalidated. Something like:
// markers
pub struct Validated;
pub struct Unvalidated;
pub struct Schema<ValidationState = Unvalidated> {
// ...
_validation_state: PhantomData<ValidationState>,
}
// then a function on `Schema<Unvalidated>` to create a `Schema<Validated>`
impl Schema<Unvalidated> {
pub fn validate(self, mode: ColumnMappingMode) -> DeltaResult<Schema<Validated>> {
// do column mapping validation and other validations in the future
}
}
Very interesting question. I actually have started wondering the same thing about physical vs. logical schema -- it's too easy to mix them up, as evidenced by the data skipping code that was trying to apply an expression full of logical column names to a physical parquet schema. And IIRC, delta-spark has hit more than one bug where somebody forgot which schema they were working with (and there are probably places where they double-apply the logical->physical mapping as well, tho that should be harmless). Probably a pair of newtypes would be the best way to make it explicit: `struct PhysicalSchema(StructType)` and `struct LogicalSchema(StructType)`. TBD whether they should impl `Deref` or `AsRef` -- for convenience they probably should, and anyway it would be a red flag for some class or method that cares about the difference to take a bare `StructType`.
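For illustration, a minimal sketch of that newtype pair. Everything here is hypothetical (the names are the ones floated in this comment, not actual kernel API), and `StructType` is a bare stand-in for the kernel's real schema type:

```rust
use std::ops::Deref;

// Stand-in for the kernel's StructType; the real type is much richer.
pub struct StructType {
    pub num_fields: usize,
}

/// Schema whose field names are the physical (parquet) names.
pub struct PhysicalSchema(pub StructType);

/// Schema whose field names are the logical (user-facing) names.
pub struct LogicalSchema(pub StructType);

// Deref for convenience, as suggested above: read-only access falls
// through, while any API that cares about the physical/logical
// distinction takes the newtype itself.
impl Deref for PhysicalSchema {
    type Target = StructType;
    fn deref(&self) -> &StructType {
        &self.0
    }
}

impl Deref for LogicalSchema {
    type Target = StructType;
    fn deref(&self) -> &StructType {
        &self.0
    }
}
```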
For this specific case -- the unvalidated schema should be very short-lived. If we changed `Metadata::schema` (**) to return a newtype `UnvalidatedSchema` with the validate method you suggest, then nobody else would have to know or care about it -- all bare schemas are known to be validated. We'd have to be careful on the write path any time we modify a schema in a way that could have implications for column mapping validation (create table, add column, enable column mapping, etc). But those operations would just need to produce an `UnvalidatedSchema` as their output?
(**) BTW, we should rename that method as `Metadata::parse_schema` to make clear that it's an expensive function to call -- not a getter.
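A rough sketch of that short-lived newtype, under the assumptions above. All names (`UnvalidatedSchema`, `ColumnMappingMode`) mirror the discussion, and `Schema` and the string error are simplified stand-ins for the real kernel types:

```rust
/// Stand-in for the kernel's schema type.
pub struct Schema;

/// Simplified stand-in for the kernel's result type.
pub type DeltaResult<T> = Result<T, String>;

#[derive(Clone, Copy)]
pub enum ColumnMappingMode {
    None,
    Id,
    Name,
}

/// What a parse-schema method would return: a schema that has been
/// parsed but not yet cross-checked against the column mapping mode.
pub struct UnvalidatedSchema(pub Schema);

impl UnvalidatedSchema {
    /// Consumes the unvalidated schema, so every bare `Schema` that
    /// escapes this method is known to be validated.
    pub fn validate(self, mode: ColumnMappingMode) -> DeltaResult<Schema> {
        // The real code would run column mapping validation here;
        // this sketch accepts everything.
        let _ = mode;
        Ok(self.0)
    }
}
```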
Update: After trying out the UnvalidatedSchema concept locally, I learned that the validation of protocol, schema, and table properties are all intertwined, so it's not enough to merely validate the schema. Most likely, we'll need to introduce a new TableConfiguration that encapsulates all three (and eventually system domain metadata as well), which cross-checks everything and with helpful utility methods to expose e.g. column mapping mode.
IMO that's a bigger effort for a different PR.
yep sounds good! I liked the TableConfiguration idea!!
}
}

impl<'a> SchemaTransform<'a> for ValidateColumnMappings<'a> {
new schema transform at work! nice!!
LGTM
  let empty_features = Some::<[String; 0]>([]);
  let protocol =
      Protocol::try_new(3, 7, empty_features.clone(), empty_features.clone()).unwrap();
  assert_eq!(
-     column_mapping_mode(&protocol, &table_properties),
+     column_mapping_mode(&protocol, &table_properties).unwrap(),
      ColumnMappingMode::None
if there are no features with a 3/7 protocol, that implies no column mapping - but I think your change exposes this as an error? (noticed the failing test)
Uh-oh... Almost all of the failures are caused by reading the DAT table
sigh... yep, here the table property is set but column mapping isn't in the
Actually, the Delta spec says this is legal:
Reverting the over-zealous check.
SGTM! thanks!
    path: vec![],
    err: None,
};
let _ = validator.transform_struct(schema);
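For context, the error-stashing pattern in the snippet above (the visitor records the first error in its `err` field and the caller checks it after the traversal) looks roughly like the following sketch. The names mirror the PR but the signatures and string error are simplified stand-ins, not the real kernel code:

```rust
// Visitor that walks field names and stashes the first error it hits,
// instead of threading a Result through every transform_* method.
struct ValidateColumnMappings {
    path: Vec<String>,
    err: Option<String>,
}

impl ValidateColumnMappings {
    fn transform_struct(&mut self, field_names: &[&str]) {
        // The real code walks a schema; here we just flag empty names.
        for name in field_names {
            self.path.push(name.to_string());
            if name.is_empty() {
                self.err = Some(format!("empty field name at {:?}", self.path));
                return; // abort the traversal on first error
            }
        }
    }
}

fn validate(field_names: &[&str]) -> Result<(), String> {
    let mut validator = ValidateColumnMappings { path: vec![], err: None };
    validator.transform_struct(field_names);
    // Caller manually converts the stashed error back into a Result.
    match validator.err {
        Some(e) => Err(e),
        None => Ok(()),
    }
}
```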
It feels weird to have to manually do error handling here and in all the `transform_*` methods. Why not make `transform_struct` return a `Result<Option<...>>`?
Tho ig this would require that we make every schema transform fallible 🤔
I'd love a way to make this configurable, but I don't see any obvious way. Note that `Option` already poses similar problems, and it's quite common for transforms to either always return `None` (if they're just visiting the schema rather than changing it), or to always return `Some` (if they're modifying the schema but not filtering it in any way). In fact, there are even some that always return `Some(Cow::Owned)` because they unconditionally transform the schema.
If we wanted to capture this generically, we have three problems to solve:
- What do the provided "leaf" methods return as their default "identity" transform? Today, it's `Some(Cow::Borrowed(val))`; `Cow::Borrowed(val).into()` would cover that and also cover a `Cow<'a, T>` return type... but it does not cover `DeltaResult<Cow<'a, T>>` (which has no blanket `impl From`).
- How should the provided "internal" methods generically handle values that could be `None` or `Err`? `None` means "filter it out and keep going" while `Err` would mean "abort the traversal immediately".
- How to handle the `?` operator, if the return type might be just `Cow<'a, T>`? Or `DeltaResult<Option<Cow<'a, T>>>`?
To make matters worse, I can think of (at least) twelve return types the visitor might reasonably want to use:
- `()` (visitor)
- `T` (unconditional transform)
- `Cow<'a, T>` (conditional transform)
- `DeltaResult<()>` (checker)
- `DeltaResult<T>` (fallible unconditional transform)
- `DeltaResult<Cow<'a, T>>` (fallible conditional transform)
- `Option<&T>` (filter)
- `Option<T>` (filtering unconditional transform)
- `Option<Cow<'a, T>>` (filtering conditional transform)
- `DeltaResult<Option<&T>>` (fallible filter)
- `DeltaResult<Option<T>>` (fallible filtering unconditional transform)
- `DeltaResult<Option<Cow<'a, T>>>` (fallible filtering conditional transform)
And that's ignoring the "aggregation" case where the visitor returns some completely other value instead of a schema (both fallible and infallible varieties).
Note: If we did decide to make the transforms fallible, the trait would need to know what error type to use. We could perhaps hard-wire it as `kernel::Error` but then infallible operations are out of luck. We can't default it because associated type defaults for traits are unstable Rust. So we'd have to force every trait impl to define the type -- which is maybe not the end of the world -- and let infallible transforms specify `std::convert::Infallible` instead? But that has its own headaches when it's time to unpack the value (the `?` is still "conditional" and requires the calling function to return `Result`, and `unwrap` et al. are still technically a panic risk). But at least the compiler recognizes that `let Ok(val) = returns_infallible_result()` is irrefutable in spite of not having an `else` clause.
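To illustrate the `Infallible` route, here is a hedged sketch with a deliberately tiny stand-in trait (the real `SchemaTransform` methods take schema types, not integers). Note the irrefutable `let Ok(val) = ...` form mentioned above only compiles on newer rustc; the `match` on the empty enum shown here works on any stable compiler:

```rust
use std::convert::Infallible;

// A fallible transform trait that forces every impl to name its error type.
trait Transform {
    type Error;
    fn transform(&mut self, input: u32) -> Result<u32, Self::Error>;
}

// An infallible transform plugs in Infallible, an empty enum.
struct Doubler;

impl Transform for Doubler {
    type Error = Infallible;
    fn transform(&mut self, input: u32) -> Result<u32, Infallible> {
        Ok(input * 2)
    }
}

// Unpacking still goes through Result, but the Err arm matches on an
// enum with no variants, so there is no real error handling to write.
fn unpack(result: Result<u32, Infallible>) -> u32 {
    match result {
        Ok(v) => v,
        Err(e) => match e {}, // Infallible: statically unreachable
    }
}
```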
Woah this is a big design space.
There seem to be two extremes going on here:
- Have a single design that we adapt to our usecases (current solution). More semantics is baked into the code.
- Have a type/trait that encompasses exactly the behaviour we want (all 12+ cases). In this case, the semantics are baked into the type system.
I wonder if we can strike a balance by having three variants:
- `Cow<'a, T>` to cover 2 and 3
- `Option<Cow<'a, T>>` to cover 1, 7, 8, 9
- `DeltaResult<Option<Cow<'a, T>>>` to cover cases 10, 11, 12

But then there will be semantics that aren't communicated by the type system. Maybe that's okay? For example:
- 2 will always return a `Cow::Owned`
- 7 will only return `None` or a `Some(Cow::Borrowed)`
- 8 will only return `None` or `Some(Cow::Owned)`
The idea of casting a wide net by introducing a generic error seems promising.
Regarding the generic defaulting: I could see a `SchemaTransform<E: Error>` with `type InfallibleSchemaTransform = SchemaTransform<Infallible>` and `type FallibleSchemaTransform = SchemaTransform<kernel::Error>`.
This is definitely out of scope for this PR, but I think it's worth considering before implementing all the variety of transforms.
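One wrinkle with the alias idea: plain trait aliases (`type Foo = SomeTrait<...>`) are not stable Rust, though a supertrait plus a blanket impl gets close. A rough sketch under that assumption; all names here are hypothetical and the trait is drastically simplified from the real `SchemaTransform`:

```rust
use std::convert::Infallible;

// The error type as an ordinary generic parameter on the trait.
trait SchemaTransform<E> {
    fn transform_field(&mut self, name: &str) -> Result<String, E>;
}

// Trait aliases are unstable, but a supertrait with a blanket impl
// provides a usable "alias" for the infallible case.
trait InfallibleSchemaTransform: SchemaTransform<Infallible> {}
impl<T: SchemaTransform<Infallible>> InfallibleSchemaTransform for T {}

// An infallible transform: uppercases field names, can never fail.
struct Upcase;

impl SchemaTransform<Infallible> for Upcase {
    fn transform_field(&mut self, name: &str) -> Result<String, Infallible> {
        Ok(name.to_uppercase())
    }
}

// Callers of the infallible flavor can unpack without error handling.
fn apply(t: &mut impl InfallibleSchemaTransform, name: &str) -> String {
    match t.transform_field(name) {
        Ok(s) => s,
        Err(e) => match e {}, // Infallible has no variants
    }
}
```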
Problem is, multiple slightly different types of transforms brings back the exact kind of code duplication the transforms were intended to eliminate...
}

#[cfg(test)]
mod tests {
I would like to see some of the validator error cases exercised. Either in this PR or a followup one :)
Done!
What changes are proposed in this pull request?
Today's code scatters column mapping validity checks around the various use sites, which makes them complex and requires propagating the column mapping mode through various function calls.
We can simplify by validating the schema once, up front during snapshot load. In particular:
As a result, most column mapping operations become infallible, with simpler code.
This PR affects the following public APIs
- `StructField::physical_name` no longer takes a `ColumnMapping` argument, because it is no longer needed for validation.

How was this change tested?
Existing unit tests.