feat: Re-work behavior of arrow_schema parameter on sink_parquet #26621

Draft
nameexhaustion wants to merge 18 commits into main from nxs/to-arrow-mode

@nameexhaustion (Collaborator) commented Feb 19, 2026

Pre-work for supporting arrow schemas generated by PyIceberg.

Existing behavior

If arrow_schema is provided, its dtypes must exactly match the dtypes that Polars generates under CompatLevel::oldest(). Any additional metadata in the provided schema is then copied to the output.

This made it possible to correctly write the PARQUET:field_id metadata that Iceberg needs. However, beyond custom metadata, Iceberg also requires the ability to specify the exact arrow type to export to (e.g. Binary -> FixedLenBinary), a requirement that wasn't anticipated during the initial design.

New behavior after this PR

If arrow_schema is provided, we will convert to the exact types specified in the arrow schema, raising an error if this isn't possible.

In effect, the exported arrow type is now defined and controlled by the arrow_schema parameter instead of a hardcoded CompatLevel::oldest(). We will use this later when writing Iceberg, e.g. to write a Binary column as FixedLenBinary (Iceberg fixed(n) type).
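The difference between the existing and new behavior can be sketched as a toy model in plain Python. This is not the Polars implementation: dtypes are represented as strings, and both `DEFAULT_EXPORT` and `CASTABLE` are hypothetical tables invented for the example (the real set of supported conversions is defined by the PR's Rust code).

```python
# Toy model only. DEFAULT_EXPORT stands in for the types chosen under
# CompatLevel::oldest(); CASTABLE is a hypothetical allow-list of conversions.
DEFAULT_EXPORT = {"String": "LargeUtf8", "Binary": "LargeBinary"}
CASTABLE = {
    ("String", "LargeUtf8"),
    ("String", "Utf8View"),
    ("Binary", "LargeBinary"),
    ("Binary", "BinaryView"),
}


def export_type_old(polars_dtype, requested):
    """Existing behavior: the requested type must equal the default export type."""
    default = DEFAULT_EXPORT[polars_dtype]
    if requested != default:
        raise TypeError(
            f"provided dtype ({requested}) does not match output dtype ({default})"
        )
    return default


def export_type_new(polars_dtype, requested):
    """New behavior: export to the requested type, or fail if it cannot be converted."""
    if (polars_dtype, requested) not in CASTABLE:
        raise TypeError(f"cannot convert {polars_dtype} to {requested}")
    return requested


# The old rule rejects a non-default type; the new rule honors it.
try:
    export_type_old("String", "Utf8View")
except TypeError as exc:
    print("old:", exc)
print("new:", export_type_new("String", "Utf8View"))
```

In this model the schema passed by the caller becomes the source of truth for the output type, and the default mapping is only used when no schema is given.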

Example

import polars as pl
import pyarrow as pa

pl.DataFrame(
    {
        "large_utf8": "A",
        "large_binary": [b"B"],
        "utf8view": "C",
        "binaryview": [b"D"],
    }
).write_parquet(
    ...,
    arrow_schema=pa.schema(
        [
            pa.field("large_utf8", pa.large_string()),
            pa.field("large_binary", pa.large_binary()),
            pa.field("utf8view", pa.string_view()),
            pa.field("binaryview", pa.binary_view()),
        ]
    )
)

Example - Existing behavior
Errors with SchemaError: to_arrow(): provided dtype (Utf8View) does not match output dtype (LargeUtf8)

Example - New behavior
Successfully writes a Parquet file with the following schema:

large_utf8: large_string
large_binary: large_binary
utf8view: string_view
binaryview: binary_view

@github-actions bot added the A-io-parquet, enhancement, python, and rust labels Feb 19, 2026
@github-actions (Contributor)

The uncompressed lib size after this PR is 53.7169 MB.

@github-actions (Contributor)

The uncompressed lib size after this PR is 53.7229 MB.

@codecov bot commented Feb 19, 2026

Codecov Report

❌ Patch coverage is 93.18885% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.19%. Comparing base (4929540) to head (c6a8754).
⚠️ Report is 4 commits behind head on main.

Files with missing lines | Patch % | Lines
crates/polars-core/src/series/into.rs | 94.50% | 10 Missing ⚠️
crates/polars-arrow/src/datatypes/mapper.rs | 75.00% | 6 Missing ⚠️
...tes/polars-core/src/series/categorical_to_arrow.rs | 91.89% | 3 Missing ⚠️
...lars-core/src/chunked_array/logical/categorical.rs | 0.00% | 1 Missing ⚠️
...es/polars-core/src/chunked_array/object/builder.rs | 90.00% | 1 Missing ⚠️
crates/polars-plan/src/plans/schema.rs | 94.11% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #26621      +/-   ##
==========================================
- Coverage   81.37%   81.19%   -0.19%     
==========================================
  Files        1794     1795       +1     
  Lines      244998   245086      +88     
  Branches     3079     3080       +1     
==========================================
- Hits       199379   198989     -390     
- Misses      44833    45311     +478     
  Partials      786      786              

@github-actions (Contributor)

The uncompressed lib size after this PR is 53.7225 MB.

@github-actions (Contributor)

The uncompressed lib size after this PR is 53.7228 MB.

@github-actions (Contributor)

The uncompressed lib size after this PR is 53.7237 MB.
