-
Notifications
You must be signed in to change notification settings - Fork 1.8k
[Kernel] Add ability to store type changes on StructField #4519
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -0,0 +1,176 @@ | |||
/* | |||
* Copyright (2023) The Delta Lake Project Authors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Copyright (2023) The Delta Lake Project Authors. | |
* Copyright (2025) The Delta Lake Project Authors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly qns
* where the metadata is attached depend on if the change is for nested arrays/maps or primitive | ||
* types. | ||
*/ | ||
public static class TypeChange { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be pull this out, given this is going to be a public API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
DataType dataType, | ||
boolean nullable, | ||
FieldMetadata metadata, | ||
List<TypeChange> typeChanges) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm.. we need a builder for StructField soon. (not related to this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep :)
@@ -95,6 +162,11 @@ public boolean isNullable() { | |||
return nullable; | |||
} | |||
|
|||
/** @return the list of type changes for this field */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a list because the filed can undergo more than one type change? Also is the last one the latest one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, updated docs.
|
||
FieldMetadata collationMetadata = fetchCollationMetadata(); | ||
this.metadata = | ||
new FieldMetadata.Builder().fromMetadata(metadata).fromMetadata(collationMetadata).build(); | ||
if (!this.typeChanges.isEmpty() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the map contains a struct either as key or value, is the type change allowed in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the idea of this assert is it enforces types changes are only recorded on leaf nodes of the schema. We could also assert this in the other way I think that DataType is a primitive type.
I think this is a core design question that we should answer which option we want to pursue here.
- By adding these to leaf nodes in the schema we have to add code similar to collations that will construct appropriate FieldMetadata on construction (and do the inverse operation for when parsing) for nested Map/Array types.
- The other option would be not add the field at all and solely work with TypeChanges are a function of FieldMetadata (which would require less book-keeping but might be more error prone in the long run. This is how field assignment works for icebergCompatV2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the question is which pattern we want to follow (and then migrate all of these common elements to the common pattern).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@allisonport-db this is somewhat related to the bug on populating nested field IDs, not sure if you have thoughts on which approach you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little confused because I think this is the level where we would want the type changes for map/array? From the protocol https://github.com/delta-io/delta/blob/master/PROTOCOL.md#type-change-metadata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We wouldn't want to support TypeChanges on their element or key/map types though.
I think I maybe understand what you're saying however, we could choose to store them at either level and then serialize them at this level?
I honestly find the use of StructField for the key/value/element really confusing since there is no such thing as a key's fieldMetadata (it does not exist!). For that reason, I'd prefer to store it here at this level. And honestly I've been meaning to take a look at that code and see if there's a better way to make sure there isn't confusing when using StructField in those data types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then we need to add the additional "fieldPath" to the TypeChange class right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also feel free to disagree, I think choosing where to store this info (and other stuff like the nested ids) is an important discussion to be had
this.name = name; | ||
this.dataType = dataType; | ||
this.nullable = nullable; | ||
this.typeChanges = typeChanges == null ? Collections.emptyList() : typeChanges; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we should discourage null
ever being passed in. I agree this is a safe check but ... would be nice to never have to think about null
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 probably can either (1) requireNonNull if we want this to always be present but empty for no type widening or (2) make it optional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I was behind on my reviews, left a few comments mostly about the nesting levels for array/map
this.name = name; | ||
this.dataType = dataType; | ||
this.nullable = nullable; | ||
this.typeChanges = typeChanges == null ? Collections.emptyList() : typeChanges; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 probably can either (1) requireNonNull if we want this to always be present but empty for no type widening or (2) make it optional
|
||
FieldMetadata collationMetadata = fetchCollationMetadata(); | ||
this.metadata = | ||
new FieldMetadata.Builder().fromMetadata(metadata).fromMetadata(collationMetadata).build(); | ||
if (!this.typeChanges.isEmpty() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little confused because I think this is the level where we would want the type changes for map/array? From the protocol https://github.com/delta-io/delta/blob/master/PROTOCOL.md#type-change-metadata
|
||
FieldMetadata collationMetadata = fetchCollationMetadata(); | ||
this.metadata = | ||
new FieldMetadata.Builder().fromMetadata(metadata).fromMetadata(collationMetadata).build(); | ||
if (!this.typeChanges.isEmpty() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We wouldn't want to support TypeChanges on their element or key/map types though.
I think I maybe understand what you're saying however, we could choose to store them at either level and then serialize them at this level?
I honestly find the use of StructField for the key/value/element really confusing since there is no such thing as a key's fieldMetadata (it does not exist!). For that reason, I'd prefer to store it here at this level. And honestly I've been meaning to take a look at that code and see if there's a better way to make sure there isn't confusing when using StructField in those data types.
|
||
FieldMetadata collationMetadata = fetchCollationMetadata(); | ||
this.metadata = | ||
new FieldMetadata.Builder().fromMetadata(metadata).fromMetadata(collationMetadata).build(); | ||
if (!this.typeChanges.isEmpty() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then we need to add the additional "fieldPath" to the TypeChange class right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I guess probably a very important question, how do we expect connectors to communicate type widening schema changes?
(1) We expect them to add the full metadata with the TypeChanges
class to the schema?
(2) They just provide a schema with the new type and behind the scenes we add this TypeChanges
metadata
- If this one, it seems like once again we come back to this being another good place where it would be preferred to have an internal vs external interface
Maybe we want to support both?
Which Delta project/connector is this regarding?
Description
This change adds a new
TypeChange
class and the ability store a list of type changes on FieldStruct. This will allow for persisting/updating type changes (and is modelled similar to how collation works due to the complexity of working with maps/arrays).How was this patch tested?
Add unit tests.
Does this PR introduce any user-facing changes?
This is a API change only.