Conversation

@Fokko
Collaborator

@Fokko Fokko commented Sep 15, 2025

What changes are proposed in this pull request?

This PR enables setting a Field-ID on a struct field. The ID is then included in the schema generated by ToSchema, as parquet.field.id metadata.

How was this change tested?

With a new test

@Fokko Fokko changed the title Include Parquet Field-IDs on ToSchema feat: Include Parquet Field-IDs on ToSchema Sep 15, 2025
@Fokko Fokko force-pushed the fd-struct-field-id branch from e038aad to 39bab70 Compare September 15, 2025 21:13
@codecov

codecov bot commented Sep 15, 2025

Codecov Report

❌ Patch coverage is 86.25954% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.45%. Comparing base (1ddc026) to head (e270198).
⚠️ Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
derive-macros/src/lib.rs 85.47% 12 Missing and 5 partials ⚠️
kernel/src/schema/derive_macro_utils.rs 92.85% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1298      +/-   ##
==========================================
+ Coverage   84.42%   84.45%   +0.02%     
==========================================
  Files         112      112              
  Lines       27819    27920     +101     
  Branches    27819    27920     +101     
==========================================
+ Hits        23486    23579      +93     
- Misses       3199     3204       +5     
- Partials     1134     1137       +3     

@Fokko Fokko force-pushed the fd-struct-field-id branch from 39bab70 to d642bf1 Compare September 16, 2025 08:45
Collaborator

@scovich scovich left a comment

LGTM, tho it seems like the error checking is a bit baroque?

});
// Validate field_id attribute and collect any errors
let mut field_id_errors = Vec::new();
let _field_id: Option<i64> = field.attrs.iter().find_map(|attr| {
Collaborator

The leading underscore was... misleading... given that it's actually used.

Also, rule of 30 says this should be a new function that can leverage the ? operator.

Also, it's a bit odd that we carefully collect multiple errors (which would imply multiple invalid field_id attributes were specified), but we stop at the first valid one without complaining of duplicates?

Seems like this operation should either stop at the first field id it finds (Result<i64, Error>) or find all field ids (Vec<Result<i64, Error>>) and then blow up unless the list is a single Ok?

Collaborator Author

I went with Result<Option<i64>, Error>. LMKWYT
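
For illustration, here is a minimal sketch of the single-pass shape discussed in this thread: a helper that returns Ok(None) when no field ID is given, Ok(Some(id)) for exactly one valid attribute, and an error for an invalid value or a duplicate. The Attr struct is a hypothetical stand-in for a parsed attribute and the String error type is a simplification; the real macro works on syn attributes, and the duplicate check is an assumption about the desired behavior, not something the PR is confirmed to do.

```rust
/// Hypothetical stand-in for a parsed attribute such as `field_id = 42`.
/// The real derive macro inspects `syn` attributes instead.
struct Attr {
    name: &'static str,
    value: &'static str,
}

/// Returns Ok(None) if no `field_id` attribute is present, Ok(Some(id)) if
/// exactly one valid attribute is present, and Err otherwise.
fn get_field_id(attrs: &[Attr]) -> Result<Option<i64>, String> {
    let mut found: Option<i64> = None;
    for attr in attrs.iter().filter(|a| a.name == "field_id") {
        // `?` stops at the first invalid value instead of collecting errors.
        let id: i64 = attr
            .value
            .parse()
            .map_err(|e| format!("invalid field_id `{}`: {e}", attr.value))?;
        // `replace` hands back the previous value, so Some(_) means a duplicate.
        if found.replace(id).is_some() {
            return Err("duplicate field_id attribute".to_string());
        }
    }
    Ok(found)
}
```

Returning early through ? keeps the function short, at the cost of reporting only the first problem found.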

@Fokko Fokko force-pushed the fd-struct-field-id branch from 917bf55 to 9da18b0 Compare September 17, 2025 14:24
@Fokko Fokko force-pushed the fd-struct-field-id branch from 9da18b0 to b473435 Compare September 17, 2025 14:30
Member

@zachschuermann zachschuermann left a comment

whoops sorry left an unfinished review, flushing a comment

// Then, add field-id metadata if present
match get_field_id(&field.attrs) {
Ok(Some(id)) => {
quote_spanned! { field.span() => #base_call.add_metadata([("parquet.field.id", #id)]) }
Collaborator

Hmm I worry that this won't generate the full column mapping data, making this an invalid schema.

Here's an example logical Delta schema:

{
    "name" : "e",
    "type" : {
      "type" : "array",
      "elementType" : {
        "type" : "struct",
        "fields" : [ {
          "name" : "d",
          "type" : "integer",
          "nullable" : false,
          "metadata" : { 
            "delta.columnMapping.id": 5,
            "delta.columnMapping.physicalName": "col-a7f4159c-53be-4cb0-b81a-f7e5240cfc49"
          }
        } ]
      },
      "containsNull" : true
    },
    "nullable" : true,
    "metadata" : { 
      "delta.columnMapping.id": 4,
      "delta.columnMapping.physicalName": "col-5f422f40-de70-45b2-88ab-1d5c90e94db1"
    }
  }

Since name mode is the default, kernel has to decide to convert this schema into one that contains parquet.field.id.

Here's the corresponding physical schema if column mapping mode is ID:

{
    "name" : "col-5f422f40-de70-45b2-88ab-1d5c90e94db1",
    "type" : {
      "type" : "array",
      "elementType" : {
        "type" : "struct",
        "fields" : [ {
          "name" : "col-a7f4159c-53be-4cb0-b81a-f7e5240cfc49",
          "type" : "integer",
          "nullable" : false,
          "metadata" : {
            "parquet.field.id": 5
          }
        } ]
      },
      "containsNull" : true
    },
    "nullable" : true,
    "metadata" : {
      "parquet.field.id": 4
    }
  }

Collaborator

Setting parquet.field.id directly without checking column mapping mode will likely lead to misuse.

Collaborator

Also, the Delta spec requires that both the physical name and the column mapping ID be present (even if column mapping mode is just one or the other).
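
To make the logical-to-physical conversion in this thread concrete, here is a rough sketch of the ID-mode case only, mirroring the two JSON examples above. The Field struct and the to_physical_id_mode function are hypothetical and are not kernel APIs; nested types are flattened into a children list, and metadata values are kept as strings for simplicity. The sketch rejects any field that lacks either annotation, on the assumption that both are required as noted above.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified schema field: nested struct or array element
/// fields are flattened into `children`. Not the kernel's StructField.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    metadata: HashMap<String, String>,
    children: Vec<Field>,
}

/// Converts a logical field into its ID-mode physical form: the name becomes
/// the physical name and the column mapping id becomes `parquet.field.id`.
fn to_physical_id_mode(logical: &Field) -> Result<Field, String> {
    let id = logical
        .metadata
        .get("delta.columnMapping.id")
        .ok_or_else(|| format!("field `{}` has no column mapping id", logical.name))?;
    let physical_name = logical
        .metadata
        .get("delta.columnMapping.physicalName")
        .ok_or_else(|| format!("field `{}` has no physical name", logical.name))?;
    // Recurse so nested fields are renamed and annotated as well.
    let children = logical
        .children
        .iter()
        .map(to_physical_id_mode)
        .collect::<Result<Vec<_>, _>>()?;
    Ok(Field {
        name: physical_name.clone(),
        metadata: HashMap::from([("parquet.field.id".to_string(), id.clone())]),
        children,
    })
}
```

Deriving parquet.field.id from the column mapping annotations, rather than accepting it directly, is one way to avoid a schema that carries a field ID without the rest of the column mapping data.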

Fokko added a commit to Fokko/delta-kernel-rs that referenced this pull request Sep 29, 2025
The structs to represent the V4 metadata.

- Will annotate the Field-IDs when delta-io#1298 gets in.
- Decided to do this in a separate module to avoid conflicts with
  OSS upstream.
@Fokko Fokko closed this Oct 1, 2025

5 participants