-
Couldn't load subscription status.
- Fork 118
feat: Include Parquet Field-IDs on ToSchema #1298
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e038aad to
39bab70
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1298 +/- ##
==========================================
+ Coverage 84.42% 84.45% +0.02%
==========================================
Files 112 112
Lines 27819 27920 +101
Branches 27819 27920 +101
==========================================
+ Hits 23486 23579 +93
- Misses 3199 3204 +5
- Partials 1134 1137 +3 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
39bab70 to
d642bf1
Compare
…to fd-struct-field-id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, tho it seems like the error checking is a bit baroque?
derive-macros/src/lib.rs
Outdated
| }); | ||
| // Validate field_id attribute and collect any errors | ||
| let mut field_id_errors = Vec::new(); | ||
| let _field_id: Option<i64> = field.attrs.iter().find_map(|attr| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The leading underscore was... misleading... given that it's actually used.
Also, rule of 30 says this should be a new function that can leverage ?
Also, it's a bit odd that we carefully collect multiple errors (which would imply multiple invalid field_id attributes were specified), but we stop at the first valid one without complaining of duplicates?
Seems like this operation should either stop at the first field id it finds (Result<i64, Error>) or find all field ids (Vec<Result<i64, Error>>) and then blow up unless the list is a single Ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with Result<Some(i64), Error> LMKWYT
917bf55 to
9da18b0
Compare
9da18b0 to
b473435
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whoops sorry left an unfinished review, flushing a comment
| // Then, add field-id metadata if present | ||
| match get_field_id(&field.attrs) { | ||
| Ok(Some(id)) => { | ||
| quote_spanned! { field.span() => #base_call.add_metadata([("parquet.field.id", #id)]) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I worry that this won't generate the full column mapping data, making this an invalid schema.
Here's an example logical Delta schema:
{
"name" : "e",
"type" : {
"type" : "array",
"elementType" : {
"type" : "struct",
"fields" : [ {
"name" : "d",
"type" : "integer",
"nullable" : false,
"metadata" : {
"delta.columnMapping.id": 5,
"delta.columnMapping.physicalName": "col-a7f4159c-53be-4cb0-b81a-f7e5240cfc49"
}
} ]
},
"containsNull" : true
},
"nullable" : true,
"metadata" : {
"delta.columnMapping.id": 4,
"delta.columnMapping.physicalName": "col-5f422f40-de70-45b2-88ab-1d5c90e94db1"
}
}
Since name mode is the default, kernel has to decide to convert this schema into one that contains parquet.field.id.
Here's the corresponding physical schema if column mapping mode is ID:
{
"name" : "col-5f422f40-de70-45b2-88ab-1d5c90e94db1",
"type" : {
"type" : "array",
"elementType" : {
"type" : "struct",
"fields" : [ {
"name" : "col-a7f4159c-53be-4cb0-b81a-f7e5240cfc49",
"type" : "integer",
"nullable" : false,
"metadata" : {
"parquet.field.id": 5,
}
} ]
},
"containsNull" : true
},
"nullable" : true,
"metadata" : {
"parquet.field.id": 4,
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting parquet.field.id directly without checking column mapping mode will likely lead to misuse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the delta spec requires that both the physical name and column mapping id be present (even if column mapping mode is just one or the other)
The structs to represent the V4 metadata. - Will annotate the Field-IDs when delta-io#1298 gets in. - Decided to do this in a seprate module to avoid conflicts with OSS upstream.
What changes are proposed in this pull request?
This PR enables setting the Field-ID on a struct. This will be included in the
ToSchemaschema generation in the form of theparquet.field.id.How was this change tested?
With a new test