Skip to content

[Kernel] Add ability to store type changes on StructField #4519

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
May 15, 2025

Conversation

emkornfield
Copy link
Collaborator

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

This change adds a new TypeChange class and the ability store a list of type changes on FieldStruct. This will allow for persisting/updating type changes (and is modelled similar to how collation works due to the complexity of working with maps/arrays).

How was this patch tested?

Add unit tests.

Does this PR introduce any user-facing changes?

This is a API change only.

@emkornfield emkornfield requested a review from vkorukanti May 9, 2025 19:24
@emkornfield emkornfield assigned emkornfield and unassigned vkorukanti May 9, 2025
@@ -0,0 +1,176 @@
/*
* Copyright (2023) The Delta Lake Project Authors.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Copyright (2023) The Delta Lake Project Authors.
* Copyright (2025) The Delta Lake Project Authors.

Copy link
Collaborator

@vkorukanti vkorukanti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly qns

* where the metadata is attached depend on if the change is for nested arrays/maps or primitive
* types.
*/
public static class TypeChange {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be pull this out, given this is going to be a public API

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

DataType dataType,
boolean nullable,
FieldMetadata metadata,
List<TypeChange> typeChanges) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm.. we need a builder for StructField soon. (not related to this PR)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep :)

@@ -95,6 +162,11 @@ public boolean isNullable() {
return nullable;
}

/** @return the list of type changes for this field */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a list because the filed can undergo more than one type change? Also is the last one the latest one?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, updated docs.


FieldMetadata collationMetadata = fetchCollationMetadata();
this.metadata =
new FieldMetadata.Builder().fromMetadata(metadata).fromMetadata(collationMetadata).build();
if (!this.typeChanges.isEmpty()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the map contains a struct either as key or value, is the type change allowed in this case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the idea of this assert is it enforces types changes are only recorded on leaf nodes of the schema. We could also assert this in the other way I think that DataType is a primitive type.

I think this is a core design question that we should answer which option we want to pursue here.

  1. By adding these to leaf nodes in the schema we have to add code similar to collations that will construct appropriate FieldMetadata on construction (and do the inverse operation for when parsing) for nested Map/Array types.
  2. The other option would be not add the field at all and solely work with TypeChanges are a function of FieldMetadata (which would require less book-keeping but might be more error prone in the long run. This is how field assignment works for icebergCompatV2

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the question is which pattern we want to follow (and then migrate all of these common elements to the common pattern).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@allisonport-db this is somewhat related to the bug on populating nested field IDs, not sure if you have thoughts on which approach you prefer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little confused because I think this is the level where we would want the type changes for map/array? From the protocol https://github.com/delta-io/delta/blob/master/PROTOCOL.md#type-change-metadata

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We wouldn't want to support TypeChanges on their element or key/map types though.

I think I maybe understand what you're saying however, we could choose to store them at either level and then serialize them at this level?

I honestly find the use of StructField for the key/value/element really confusing since there is no such thing as a key's fieldMetadata (it does not exist!). For that reason, I'd prefer to store it here at this level. And honestly I've been meaning to take a look at that code and see if there's a better way to make sure there isn't confusing when using StructField in those data types.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then we need to add the additional "fieldPath" to the TypeChange class right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also feel free to disagree, I think choosing where to store this info (and other stuff like the nested ids) is an important discussion to be had

@emkornfield emkornfield requested a review from vkorukanti May 13, 2025 20:46
this.name = name;
this.dataType = dataType;
this.nullable = nullable;
this.typeChanges = typeChanges == null ? Collections.emptyList() : typeChanges;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should discourage null ever being passed in. I agree this is a safe check but ... would be nice to never have to think about null?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 probably can either (1) requireNonNull if we want this to always be present but empty for no type widening or (2) make it optional

@scottsand-db scottsand-db merged commit 2dd5a37 into delta-io:master May 15, 2025
22 checks passed
Copy link
Collaborator

@allisonport-db allisonport-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I was behind on my reviews, left a few comments mostly about the nesting levels for array/map

this.name = name;
this.dataType = dataType;
this.nullable = nullable;
this.typeChanges = typeChanges == null ? Collections.emptyList() : typeChanges;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 probably can either (1) requireNonNull if we want this to always be present but empty for no type widening or (2) make it optional


FieldMetadata collationMetadata = fetchCollationMetadata();
this.metadata =
new FieldMetadata.Builder().fromMetadata(metadata).fromMetadata(collationMetadata).build();
if (!this.typeChanges.isEmpty()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little confused because I think this is the level where we would want the type changes for map/array? From the protocol https://github.com/delta-io/delta/blob/master/PROTOCOL.md#type-change-metadata


FieldMetadata collationMetadata = fetchCollationMetadata();
this.metadata =
new FieldMetadata.Builder().fromMetadata(metadata).fromMetadata(collationMetadata).build();
if (!this.typeChanges.isEmpty()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We wouldn't want to support TypeChanges on their element or key/map types though.

I think I maybe understand what you're saying however, we could choose to store them at either level and then serialize them at this level?

I honestly find the use of StructField for the key/value/element really confusing since there is no such thing as a key's fieldMetadata (it does not exist!). For that reason, I'd prefer to store it here at this level. And honestly I've been meaning to take a look at that code and see if there's a better way to make sure there isn't confusing when using StructField in those data types.


FieldMetadata collationMetadata = fetchCollationMetadata();
this.metadata =
new FieldMetadata.Builder().fromMetadata(metadata).fromMetadata(collationMetadata).build();
if (!this.typeChanges.isEmpty()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then we need to add the additional "fieldPath" to the TypeChange class right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I guess probably a very important question, how do we expect connectors to communicate type widening schema changes?

(1) We expect them to add the full metadata with the TypeChanges class to the schema?
(2) They just provide a schema with the new type and behind the scenes we add this TypeChanges metadata
- If this one, it seems like once again we come back to this being another good place where it would be preferred to have an internal vs external interface

Maybe we want to support both?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants