feat: Re-work behavior of arrow_schema parameter on sink_parquet #26621
Draft
nameexhaustion wants to merge 18 commits into main
Contributor
The uncompressed lib size after this PR is 53.7169 MB.
Contributor
The uncompressed lib size after this PR is 53.7229 MB.
Codecov Report
@@ Coverage Diff @@
## main #26621 +/- ##
==========================================
- Coverage 81.37% 81.19% -0.19%
==========================================
Files 1794 1795 +1
Lines 244998 245086 +88
Branches 3079 3080 +1
==========================================
- Hits 199379 198989 -390
- Misses 44833 45311 +478
Partials 786 786
Contributor
The uncompressed lib size after this PR is 53.7225 MB.
42e6ea9 to bb6af05
Contributor
The uncompressed lib size after this PR is 53.7228 MB.
Contributor
The uncompressed lib size after this PR is 53.7237 MB.
Pre-work for supporting arrow schemas generated by PyIceberg.
Existing behavior
If arrow_schema is provided, its dtypes must match the dtypes generated by us according to CompatLevel::oldest(). We will then copy any additional metadata in the provided schema.
This made it possible to correctly write the PARQUET:field_id that is needed by Iceberg. However, on top of being able to write custom metadata, Iceberg also required the ability to specify the exact arrow type to export to (e.g. Binary -> FixedLenBinary), a requirement that wasn't anticipated during the initial design.
New behavior after this PR
If arrow_schema is provided, we will convert to the exact types specified in the arrow schema, raising an error if this isn't possible.
Essentially, this makes it so that the exported arrow type is defined and controllable by the arrow_schema parameter, rather than being defined by a hardcoded CompatLevel::oldest(). We will use this later when writing Iceberg to e.g. write a Binary column as FixedLenBinary (the Iceberg fixed(n) type).
Example
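The difference between the two behaviors described above can be sketched with a toy model. This is not the Polars implementation: dtypes are plain strings, and the names DEFAULT_EXPORT, CONVERTIBLE, resolve_old and resolve_new are hypothetical, as is the assumption about which conversions are supported. It only illustrates "must match the default export type" versus "convert to the requested type or raise".

```python
# Toy model only: dtype names are plain strings, not real Polars/Arrow types.


class SchemaError(Exception):
    """Stand-in for the SchemaError raised by the writer."""


# Hypothetical stand-in for the types chosen via CompatLevel::oldest().
DEFAULT_EXPORT = {"String": "LargeUtf8", "Binary": "LargeBinary"}

# Hypothetical set of (column dtype, requested arrow dtype) conversions
# that the writer is assumed to be able to perform.
CONVERTIBLE = {
    ("String", "LargeUtf8"),
    ("String", "Utf8View"),
    ("Binary", "LargeBinary"),
    ("Binary", "FixedLenBinary"),
}


def resolve_old(column_dtype: str, requested: str) -> str:
    """Existing behavior: the requested dtype must equal the default export dtype."""
    default = DEFAULT_EXPORT[column_dtype]
    if requested != default:
        raise SchemaError(
            f"provided dtype ({requested}) does not match output dtype ({default})"
        )
    return default


def resolve_new(column_dtype: str, requested: str) -> str:
    """New behavior: export exactly the requested dtype, or raise if not convertible."""
    if (column_dtype, requested) not in CONVERTIBLE:
        raise SchemaError(f"cannot convert {column_dtype} to {requested}")
    return requested
```

Under this model, resolve_old("String", "Utf8View") raises, while resolve_new("String", "Utf8View") and resolve_new("Binary", "FixedLenBinary") return the requested type unchanged.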
Example - Existing behavior
Errors with SchemaError: to_arrow(): provided dtype (Utf8View) does not match output dtype (LargeUtf8)
Example - New behavior
Successfully write a parquet file with the following schema