fix: propagate inner-field metadata through composite-type constructors#21984
Draft
adriangb wants to merge 2 commits intoapache:mainfrom
Draft
fix: propagate inner-field metadata through composite-type constructors#21984adriangb wants to merge 2 commits intoapache:mainfrom
adriangb wants to merge 2 commits intoapache:mainfrom
Conversation
…rs (apache#21982) `make_array`, `array_agg`, `array_repeat`, `arrays_zip`, `map`, `range` / `generate_series`, and `struct` only overrode `return_type` and synthesized fresh inner fields at runtime, so each one silently dropped the input field's metadata when wrapping it into a `List`/`Struct`/`Map`. In practice this broke Arrow extension types (`ARROW:extension:name` / `ARROW:extension:metadata`) round-tripping through SQL list/struct/map constructors. Add two helpers in `datafusion-common::utils` — `list_inner_field_from` and `struct_inner_fields_from` — that wrap a source `Field` into the inner field of a list/struct while preserving its metadata, and extend `SingleRowListArrayBuilder::with_field` to copy metadata too. Use these helpers consistently from each affected function's `return_field_from_args` / `return_field`, and thread the resulting metadata-bearing inner field into the runtime construction paths (including the `array_agg` accumulators, which now carry a `FieldRef` instead of a bare `DataType`). Adds Rust unit tests for each affected function plus an end-to-end `array_metadata_propagation.slt` that asserts metadata survives every constructor by string-matching the rendered data type. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes loss of Arrow inner-Field metadata (notably ARROW:extension:*) when DataFusion UDFs/UDAFs construct composite types like List, Struct, and Map, ensuring extension-type identity survives through planning and runtime construction.
Changes:
- Add
list_inner_field_from/struct_inner_fields_fromhelpers and extendSingleRowListArrayBuilder::with_fieldto propagate metadata. - Update composite constructors (
make_array,array_repeat,range/generate_series,arrays_zip,map,struct) andarray_aggaccumulators to thread metadata-bearingFieldRefs through runtime array building. - Add unit tests plus an end-to-end sqllogictest covering metadata propagation across affected functions.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| datafusion/common/src/utils/mod.rs | Adds helpers for metadata-preserving list/struct field construction; extends SingleRowListArrayBuilder to carry metadata. |
| datafusion/functions-nested/src/make_array.rs | Uses planning-time inner FieldRef in return-field and runtime construction to preserve metadata. |
| datafusion/functions-nested/src/repeat.rs | Threads inner FieldRef through array_repeat runtime paths to preserve list inner-field metadata. |
| datafusion/functions-nested/src/range.rs | Adds return_field_from_args and runtime grafting of the planned inner field for range/generate_series. |
| datafusion/functions-nested/src/arrays_zip.rs | Preserves element-field metadata in planned struct members and reuses planned struct fields at runtime. |
| datafusion/functions-nested/src/map.rs | Builds map entries/key/value fields using metadata-bearing planned fields when available. |
| datafusion/functions/src/core/struct.rs | Adds return_field_from_args to preserve per-member metadata in struct(...). |
| datafusion/functions-aggregate/src/array_agg.rs | Preserves metadata by storing a FieldRef in accumulators and using it at all list-construction sites. |
| datafusion/spark/src/function/aggregate/collect.rs | Updates Spark collect accumulators to pass FieldRef into array_agg accumulators. |
| datafusion/functions-aggregate/benches/array_agg.rs | Adjusts bench to new accumulator constructor signature (&FieldRef). |
| datafusion/sqllogictest/test_files/array_metadata_propagation.slt | New SLT regression coverage for metadata propagation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
83
to
90
| fn accumulator(&self, acc_args: AccumulatorArgs) -> Result<Box<dyn Accumulator>> { | ||
| let field = &acc_args.expr_fields[0]; | ||
| let data_type = field.data_type().clone(); | ||
| let ignore_nulls = true; | ||
| Ok(Box::new(NullToEmptyListAccumulator::new( | ||
| ArrayAggAccumulator::try_new(&data_type, ignore_nulls)?, | ||
| ArrayAggAccumulator::try_new(field, ignore_nulls)?, | ||
| data_type, | ||
| ))) |
Comment on lines
141
to
148
| fn accumulator(&self, acc_args: AccumulatorArgs) -> Result<Box<dyn Accumulator>> { | ||
| let field = &acc_args.expr_fields[0]; | ||
| let data_type = field.data_type().clone(); | ||
| let ignore_nulls = true; | ||
| Ok(Box::new(NullToEmptyListAccumulator::new( | ||
| DistinctArrayAggAccumulator::try_new(&data_type, None, ignore_nulls)?, | ||
| DistinctArrayAggAccumulator::try_new(field, None, ignore_nulls)?, | ||
| data_type, | ||
| ))) |
|
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch). Details |
Replace `list_inner_field_from` and `struct_inner_fields_from` with a single `nullable_inner_field_from(inner, name)` that renames + forces nullable to match Arrow's list/struct member conventions while preserving metadata. `nullable_list_item_field_from` is a thin wrapper using `Field::LIST_FIELD_DEFAULT_NAME`. The map function uses `Field::with_name` / `with_nullable` directly since key/value need different nullability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Rationale for this change
Several built-in UDFs and UDAFs that wrap an input column into a composite type (
List,Struct,Map, …) drop the input field's metadata when constructing the output's inner Field. In practice this matters most for Arrow extension types (ARROW:extension:name/ARROW:extension:metadata— e.g.arrow.json,arrow.uuid), because SQL-constructed lists/structs/maps silently lose extension-type identity. Any downstream operation that compares the producedDataTypeagainst a type carrying inner-field metadata (union, aggregate merging, IPC roundtrip) sees them as different types.Each affected function has the same shape: only
return_type(DataType-only) is overridden, and the runtime path synthesizes a fresh innerFieldwith no metadata. The fix is therefore systematic rather than per-function.What changes are included in this PR?
Two new helpers in
datafusion-common::utils:list_inner_field_from(&Field) -> FieldRef— builds the canonical inner field of aList/LargeList/FixedSizeListfrom a source field, preserving the source's data type and metadata.struct_inner_fields_from(...) -> Fields— builds named struct member fields from a sequence of(name, &Field)pairs, preserving each input's metadata.SingleRowListArrayBuilder::with_fieldwas extended to also propagate metadata.These helpers are then used consistently from each affected function's
return_field_from_args(UDF) /return_field(UDAF), and the metadata-bearing inner field is threaded into the runtime construction paths:make_arrayreturn_field_from_args; runtime usesargs.return_field's inner fieldarray_agg(incl. distinct, ordered, groups accumulator)return_fieldandstate_fields; accumulators now carry aFieldRefinstead of a bareDataTypeand uselist_inner_field_fromat every list-construction sitearray_repeatreturn_field_from_args; runtime threads inner field througharrays_zipreturn_field_from_args(preserves struct member metadata); runtime uses planning-time struct fieldsmapreturn_field_from_args;entriesfield threaded throughmake_map_array_internal/make_map_array_from_fixed_size_list/build_map_arrayrange/generate_seriesreturn_field_from_args; runtime grafts the planning-time inner field onto results fromListBuilder-based pathsstructreturn_field_from_args(preserves member field metadata)spark.collect_list/collect_set(which reuseArrayAggAccumulator/DistinctArrayAggAccumulator) were updated to pass the inputFieldRefthrough.Are these changes tested?
Yes:
return_field_from_args/return_field(e.g.make_array_preserves_inner_field_metadata,array_agg_preserves_inner_field_metadata,struct_preserves_member_metadata,arrays_zip_preserves_struct_member_metadata,array_repeat_preserves_inner_field_metadata,map_preserves_key_value_field_metadata,range_preserves_inner_field_metadata).datafusion/sqllogictest/test_files/array_metadata_propagation.sltexercises every affected constructor throughwith_metadata→ wrapping function →arrow_field(...)['data_type'], asserting the rendered data type contains the expected inner-field metadata. This coversmake_array,array_repeat,range,generate_series,struct,arrays_zip,map, andarray_agg(default,DISTINCT, andORDER BYpaths).datafusion-common,datafusion-functions,datafusion-functions-aggregate,datafusion-functions-nestedcontinue to pass; the full SLT suite (465 files) passes; clippy is clean.Are there any user-facing changes?
Yes — but they are bug fixes rather than breaking changes: SQL-constructed lists/structs/maps now retain Arrow extension-type identity from their input fields. The accumulator constructors (
ArrayAggAccumulator::try_new,DistinctArrayAggAccumulator::try_new,OrderSensitiveArrayAggAccumulator::try_new,ArrayAggGroupsAccumulator::new) now take&FieldRefinstead of&DataType; this is apubAPI change for downstream code that constructs these accumulators directly.🤖 Generated with Claude Code