fix: propagate inner-field metadata through composite-type constructors #21984

Draft
adriangb wants to merge 2 commits into apache:main from pydantic:fix-list-field-metadata

adriangb commented May 2, 2026

Which issue does this PR close?

Rationale for this change

Several built-in UDFs and UDAFs that wrap an input column into a composite type (List, Struct, Map, …) drop the input field's metadata when constructing the output's inner Field. In practice this matters most for Arrow extension types (ARROW:extension:name / ARROW:extension:metadata — e.g. arrow.json, arrow.uuid), because SQL-constructed lists/structs/maps silently lose extension-type identity. Any downstream operation that compares the produced DataType against a type carrying inner-field metadata (union, aggregate merging, IPC roundtrip) sees them as different types.

Each affected function has the same shape: only return_type (DataType-only) is overridden, and the runtime path synthesizes a fresh inner Field with no metadata. The fix is therefore systematic rather than per-function.
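To make the failure mode concrete, here is a minimal sketch of why a dropped inner-field metadata entry yields a different type. It uses simplified stand-in `Field` and `DataType` types rather than the real `arrow_schema` ones; the function names `list_type_dropping_metadata` and `list_type_preserving_metadata` and the column name `payload` are hypothetical and exist only for illustration. The one property it relies on is that field equality includes metadata.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Simplified stand-in for an Arrow data type (not the real `arrow_schema::DataType`).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Utf8,
    List(Arc<Field>),
}

/// Simplified stand-in for an Arrow field; derived equality includes metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub nullable: bool,
    pub metadata: BTreeMap<String, String>,
}

/// A Utf8 column tagged with the `arrow.json` extension name.
pub fn json_column() -> Field {
    let mut metadata = BTreeMap::new();
    metadata.insert("ARROW:extension:name".to_string(), "arrow.json".to_string());
    Field {
        name: "payload".to_string(),
        data_type: DataType::Utf8,
        nullable: true,
        metadata,
    }
}

/// Pre-fix shape: the inner field is rebuilt from the data type alone.
pub fn list_type_dropping_metadata(source: &Field) -> DataType {
    DataType::List(Arc::new(Field {
        name: "item".to_string(),
        data_type: source.data_type.clone(),
        nullable: true,
        metadata: BTreeMap::new(),
    }))
}

/// Post-fix shape: the inner field also copies the source metadata.
pub fn list_type_preserving_metadata(source: &Field) -> DataType {
    DataType::List(Arc::new(Field {
        name: "item".to_string(),
        data_type: source.data_type.clone(),
        nullable: true,
        metadata: source.metadata.clone(),
    }))
}
```

In this model the two list types compare unequal even though both are lists of `Utf8`, which is the mismatch that union, aggregate merging, and IPC roundtrips would observe.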

What changes are included in this PR?

Two new helpers in datafusion-common::utils:

  • list_inner_field_from(&Field) -> FieldRef — builds the canonical inner field of a List/LargeList/FixedSizeList from a source field, preserving the source's data type and metadata.
  • struct_inner_fields_from(...) -> Fields — builds named struct member fields from a sequence of (name, &Field) pairs, preserving each input's metadata.

SingleRowListArrayBuilder::with_field was extended to also propagate metadata.
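The two helpers could look roughly like the sketch below. This is not the PR's code: `Field`, `DataType`, and `FieldRef` are simplified stand-ins, the return type `Vec<FieldRef>` substitutes for Arrow's `Fields`, and the parameter list of `struct_inner_fields_from` is an assumption because the description elides it. Using `"item"` as the inner name and forcing `nullable = true` are also assumptions.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Simplified stand-in for an Arrow data type.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int64,
    Utf8,
}

/// Simplified stand-in for an Arrow field.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub nullable: bool,
    pub metadata: BTreeMap<String, String>,
}

pub type FieldRef = Arc<Field>;

/// Builds a list's inner field from `source`, keeping its data type and metadata.
/// The name "item" and forced nullability are assumptions of this sketch.
pub fn list_inner_field_from(source: &Field) -> FieldRef {
    Arc::new(Field {
        name: "item".to_string(),
        data_type: source.data_type.clone(),
        nullable: true,
        metadata: source.metadata.clone(),
    })
}

/// Builds named struct member fields from `(name, &Field)` pairs,
/// keeping each input's data type and metadata.
pub fn struct_inner_fields_from<'a, I>(members: I) -> Vec<FieldRef>
where
    I: IntoIterator<Item = (&'a str, &'a Field)>,
{
    members
        .into_iter()
        .map(|(name, source)| {
            Arc::new(Field {
                name: name.to_string(),
                data_type: source.data_type.clone(),
                nullable: true,
                metadata: source.metadata.clone(),
            })
        })
        .collect()
}

/// Example input: a Utf8 column tagged with the `arrow.json` extension name.
pub fn json_column() -> Field {
    let mut metadata = BTreeMap::new();
    metadata.insert("ARROW:extension:name".to_string(), "arrow.json".to_string());
    Field {
        name: "payload".to_string(),
        data_type: DataType::Utf8,
        nullable: false,
        metadata,
    }
}

/// Example input: an Int64 column with no metadata.
pub fn plain_column() -> Field {
    Field {
        name: "n".to_string(),
        data_type: DataType::Int64,
        nullable: true,
        metadata: BTreeMap::new(),
    }
}
```

Calling `struct_inner_fields_from(vec![("c0", &json_column()), ("c1", &plain_column())])` in this model gives a `c0` member that still carries the extension name and a `c1` member with empty metadata.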

These helpers are then used consistently from each affected function's return_field_from_args (UDF) / return_field (UDAF), and the metadata-bearing inner field is threaded into the runtime construction paths:

| Function | Change |
| --- | --- |
| make_array | new return_field_from_args; runtime uses args.return_field's inner field |
| array_agg (incl. distinct, ordered, groups accumulator) | new return_field and state_fields; accumulators now carry a FieldRef instead of a bare DataType and use list_inner_field_from at every list-construction site |
| array_repeat | new return_field_from_args; runtime threads inner field through |
| arrays_zip | new return_field_from_args (preserves struct member metadata); runtime uses planning-time struct fields |
| map | new return_field_from_args; entries field threaded through make_map_array_internal / make_map_array_from_fixed_size_list / build_map_array |
| range / generate_series | new return_field_from_args; runtime grafts the planning-time inner field onto results from ListBuilder-based paths |
| struct | new return_field_from_args (preserves member field metadata) |

spark.collect_list / collect_set (which reuse ArrayAggAccumulator / DistinctArrayAggAccumulator) were updated to pass the input FieldRef through.

Are these changes tested?

Yes:

  • Rust unit tests in each affected file directly assert metadata propagation through return_field_from_args / return_field (e.g. make_array_preserves_inner_field_metadata, array_agg_preserves_inner_field_metadata, struct_preserves_member_metadata, arrays_zip_preserves_struct_member_metadata, array_repeat_preserves_inner_field_metadata, map_preserves_key_value_field_metadata, range_preserves_inner_field_metadata).
  • New end-to-end SLT datafusion/sqllogictest/test_files/array_metadata_propagation.slt exercises every affected constructor through with_metadata → wrapping function → arrow_field(...)['data_type'], asserting the rendered data type contains the expected inner-field metadata. This covers make_array, array_repeat, range, generate_series, struct, arrays_zip, map, and array_agg (default, DISTINCT, and ORDER BY paths).
  • All existing tests in datafusion-common, datafusion-functions, datafusion-functions-aggregate, datafusion-functions-nested continue to pass; the full SLT suite (465 files) passes; clippy is clean.

Are there any user-facing changes?

Yes. The behavioral changes are bug fixes: SQL-constructed lists/structs/maps now retain Arrow extension-type identity from their input fields. There is also one change to the pub API: the accumulator constructors (ArrayAggAccumulator::try_new, DistinctArrayAggAccumulator::try_new, OrderSensitiveArrayAggAccumulator::try_new, ArrayAggGroupsAccumulator::new) now take &FieldRef instead of &DataType, so downstream code that constructs these accumulators directly must pass the input field rather than only its data type.
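For downstream callers, the constructor change amounts to handing over the whole input field. The sketch below illustrates that shape with a hypothetical `SketchArrayAgg` type and stand-in `Field`/`DataType`/`FieldRef` definitions; it is not the real `ArrayAggAccumulator`, and the method names other than `try_new` are invented for illustration.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Simplified stand-in for an Arrow data type.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Utf8,
    List(Arc<Field>),
}

/// Simplified stand-in for an Arrow field.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub nullable: bool,
    pub metadata: BTreeMap<String, String>,
}

pub type FieldRef = Arc<Field>;

/// Hypothetical accumulator that stores the input field, not just its data type.
pub struct SketchArrayAgg {
    input_field: FieldRef,
    values: Vec<String>,
}

impl SketchArrayAgg {
    /// Mirrors the new constructor shape: takes `&FieldRef` instead of `&DataType`.
    pub fn try_new(input_field: &FieldRef) -> Result<Self, String> {
        Ok(Self {
            input_field: Arc::clone(input_field),
            values: Vec::new(),
        })
    }

    /// Records one value.
    pub fn update(&mut self, value: &str) {
        self.values.push(value.to_string());
    }

    /// Number of values collected so far.
    pub fn len(&self) -> usize {
        self.values.len()
    }

    /// Output list type whose inner field copies the stored field's metadata.
    pub fn output_type(&self) -> DataType {
        DataType::List(Arc::new(Field {
            name: "item".to_string(),
            data_type: self.input_field.data_type.clone(),
            nullable: true,
            metadata: self.input_field.metadata.clone(),
        }))
    }
}

/// Example input field tagged with the `arrow.json` extension name.
pub fn json_field() -> FieldRef {
    let mut metadata = BTreeMap::new();
    metadata.insert("ARROW:extension:name".to_string(), "arrow.json".to_string());
    Arc::new(Field {
        name: "payload".to_string(),
        data_type: DataType::Utf8,
        nullable: true,
        metadata,
    })
}
```

Because the accumulator keeps the `FieldRef`, every place it builds a list type can reach the metadata; with only a `DataType` in hand, there would be nothing to copy.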

🤖 Generated with Claude Code

…rs (apache#21982)

`make_array`, `array_agg`, `array_repeat`, `arrays_zip`, `map`, `range` /
`generate_series`, and `struct` only overrode `return_type` and synthesized
fresh inner fields at runtime, so each one silently dropped the input
field's metadata when wrapping it into a `List`/`Struct`/`Map`. In practice
this broke Arrow extension types (`ARROW:extension:name` /
`ARROW:extension:metadata`) round-tripping through SQL list/struct/map
constructors.

Add two helpers in `datafusion-common::utils` — `list_inner_field_from`
and `struct_inner_fields_from` — that wrap a source `Field` into the inner
field of a list/struct while preserving its metadata, and extend
`SingleRowListArrayBuilder::with_field` to copy metadata too. Use these
helpers consistently from each affected function's
`return_field_from_args` / `return_field`, and thread the resulting
metadata-bearing inner field into the runtime construction paths
(including the `array_agg` accumulators, which now carry a `FieldRef`
instead of a bare `DataType`).

Adds Rust unit tests for each affected function plus an end-to-end
`array_metadata_propagation.slt` that asserts metadata survives every
constructor by string-matching the rendered data type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the sqllogictest, common, functions, and spark labels May 2, 2026
@adriangb adriangb marked this pull request as draft May 2, 2026 06:39
@adriangb adriangb requested a review from Copilot May 2, 2026 06:40
Copilot AI left a comment

Pull request overview

Fixes loss of Arrow inner-Field metadata (notably ARROW:extension:*) when DataFusion UDFs/UDAFs construct composite types like List, Struct, and Map, ensuring extension-type identity survives through planning and runtime construction.

Changes:

  • Add list_inner_field_from / struct_inner_fields_from helpers and extend SingleRowListArrayBuilder::with_field to propagate metadata.
  • Update composite constructors (make_array, array_repeat, range/generate_series, arrays_zip, map, struct) and array_agg accumulators to thread metadata-bearing FieldRefs through runtime array building.
  • Add unit tests plus an end-to-end sqllogictest covering metadata propagation across affected functions.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| datafusion/common/src/utils/mod.rs | Adds helpers for metadata-preserving list/struct field construction; extends SingleRowListArrayBuilder to carry metadata. |
| datafusion/functions-nested/src/make_array.rs | Uses planning-time inner FieldRef in return-field and runtime construction to preserve metadata. |
| datafusion/functions-nested/src/repeat.rs | Threads inner FieldRef through array_repeat runtime paths to preserve list inner-field metadata. |
| datafusion/functions-nested/src/range.rs | Adds return_field_from_args and runtime grafting of the planned inner field for range/generate_series. |
| datafusion/functions-nested/src/arrays_zip.rs | Preserves element-field metadata in planned struct members and reuses planned struct fields at runtime. |
| datafusion/functions-nested/src/map.rs | Builds map entries/key/value fields using metadata-bearing planned fields when available. |
| datafusion/functions/src/core/struct.rs | Adds return_field_from_args to preserve per-member metadata in struct(...). |
| datafusion/functions-aggregate/src/array_agg.rs | Preserves metadata by storing a FieldRef in accumulators and using it at all list-construction sites. |
| datafusion/spark/src/function/aggregate/collect.rs | Updates Spark collect accumulators to pass FieldRef into array_agg accumulators. |
| datafusion/functions-aggregate/benches/array_agg.rs | Adjusts bench to new accumulator constructor signature (&FieldRef). |
| datafusion/sqllogictest/test_files/array_metadata_propagation.slt | New SLT regression coverage for metadata propagation. |

Comment on lines 83 to 90
      fn accumulator(&self, acc_args: AccumulatorArgs) -> Result<Box<dyn Accumulator>> {
          let field = &acc_args.expr_fields[0];
          let data_type = field.data_type().clone();
          let ignore_nulls = true;
          Ok(Box::new(NullToEmptyListAccumulator::new(
    -         ArrayAggAccumulator::try_new(&data_type, ignore_nulls)?,
    +         ArrayAggAccumulator::try_new(field, ignore_nulls)?,
              data_type,
          )))
Comment on lines 141 to 148
      fn accumulator(&self, acc_args: AccumulatorArgs) -> Result<Box<dyn Accumulator>> {
          let field = &acc_args.expr_fields[0];
          let data_type = field.data_type().clone();
          let ignore_nulls = true;
          Ok(Box::new(NullToEmptyListAccumulator::new(
    -         DistinctArrayAggAccumulator::try_new(&data_type, None, ignore_nulls)?,
    +         DistinctArrayAggAccumulator::try_new(field, None, ignore_nulls)?,
              data_type,
          )))
github-actions Bot commented May 2, 2026

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning origin/main
    Building datafusion-common v53.1.0 (current)
error: running cargo-doc on crate 'datafusion-common' failed with output:
-----
   Compiling proc-macro2 v1.0.106
   Compiling unicode-ident v1.0.24
   Compiling quote v1.0.45
   Compiling libc v0.2.186
   Compiling libm v0.2.16
   Compiling autocfg v1.5.0
    Checking cfg-if v1.0.4
   Compiling num-traits v0.2.19
   Compiling syn v2.0.117
    Checking memchr v2.8.0
   Compiling find-msvc-tools v0.1.9
   Compiling shlex v1.3.0
   Compiling zerocopy v0.8.48
   Compiling serde_core v1.0.228
    Checking itoa v1.0.18
   Compiling zmij v1.0.21
    Checking bytes v1.11.1
   Compiling jobserver v0.1.34
   Compiling serde_json v1.0.149
   Compiling cc v1.2.61
    Checking num-integer v0.1.46
   Compiling getrandom v0.3.4
    Checking stable_deref_trait v1.2.1
    Checking iana-time-zone v0.1.65
   Compiling version_check v0.9.5
    Checking siphasher v1.0.2
    Checking phf_shared v0.12.1
   Compiling ahash v0.8.12
    Checking chrono v0.4.44
    Checking num-bigint v0.4.6
   Compiling synstructure v0.13.2
   Compiling chrono-tz v0.10.4
    Checking arrow-schema v58.1.0
    Checking phf v0.12.1
    Checking once_cell v1.21.4
    Checking num-complex v0.4.6
    Checking hashbrown v0.16.1
   Compiling pkg-config v0.3.33
    Checking litemap v0.8.2
    Checking writeable v0.6.3
    Checking lexical-util v1.0.7
   Compiling zstd-sys v2.0.16+zstd.1.5.7
    Checking smallvec v1.15.1
   Compiling zerocopy-derive v0.8.48
   Compiling zerofrom-derive v0.1.7
   Compiling yoke-derive v0.8.2
   Compiling zerovec-derive v0.11.3
    Checking zerofrom v0.1.7
    Checking yoke v0.8.2
   Compiling displaydoc v0.2.5
    Checking zerotrie v0.2.4
    Checking zerovec v0.11.6
   Compiling object v0.37.3
    Checking utf8_iter v1.0.4
   Compiling icu_normalizer_data v2.2.0
   Compiling icu_properties_data v2.2.0
    Checking tinystr v0.8.3
    Checking icu_locale_core v2.2.0
    Checking potential_utf v0.1.5
    Checking icu_collections v2.2.0
    Checking icu_provider v2.2.0
   Compiling semver v1.0.28
   Compiling rustc_version v0.4.1
    Checking lexical-write-integer v1.0.6
    Checking lexical-parse-integer v1.0.6
   Compiling zstd-safe v7.2.4
    Checking lexical-parse-float v1.0.6
    Checking half v2.7.1
    Checking arrow-buffer v58.1.0
    Checking lexical-write-float v1.0.6
    Checking icu_properties v2.2.0
    Checking arrow-data v58.1.0
    Checking icu_normalizer v2.2.0
    Checking arrow-array v58.1.0
   Compiling flatbuffers v25.12.19
    Checking aho-corasick v1.1.4
   Compiling ar_archive_writer v0.5.1
    Checking ryu v1.0.23
    Checking pin-project-lite v0.2.17
    Checking base64 v0.22.1
   Compiling parking_lot_core v0.9.12
    Checking unicode-segmentation v1.13.2
    Checking arrow-select v58.1.0
    Checking unicode-width v0.2.2
    Checking futures-sink v0.3.32
    Checking futures-core v0.3.32
    Checking regex-syntax v0.8.10
    Checking comfy-table v7.2.2
    Checking futures-channel v0.3.32
   Compiling psm v0.1.31
    Checking arrow-ord v58.1.0
    Checking idna_adapter v1.2.2
    Checking lexical-core v1.0.6
   Compiling futures-macro v0.3.32
    Checking atoi v2.0.0
    Checking equivalent v1.0.2
    Checking foldhash v0.2.0
    Checking futures-io v0.3.32
    Checking scopeguard v1.2.0
    Checking twox-hash v2.1.2
    Checking allocator-api2 v0.2.21
    Checking slab v0.4.12
    Checking regex-automata v0.4.14
   Compiling thiserror v2.0.18
    Checking alloc-no-stdlib v2.0.4
    Checking futures-task v0.3.32
    Checking percent-encoding v2.3.2
    Checking bitflags v2.11.1
    Checking form_urlencoded v1.2.2
    Checking futures-util v0.3.32
    Checking alloc-stdlib v0.2.2
    Checking hashbrown v0.17.0
    Checking lz4_flex v0.13.0
    Checking lock_api v0.4.14
    Checking regex v1.12.3
    Checking arrow-cast v58.1.0
    Checking idna v1.1.0
   Compiling thiserror-impl v2.0.18
   Compiling ring v0.17.14
   Compiling stacker v0.1.24
    Checking csv-core v0.1.13
   Compiling getrandom v0.4.2
   Compiling paste v1.0.15
    Checking either v1.15.0
    Checking simdutf8 v0.1.5
   Compiling snap v1.1.1
    Checking itertools v0.14.0
    Checking csv v1.4.0
    Checking parking_lot v0.12.5
    Checking url v2.5.8
    Checking indexmap v2.14.0
    Checking brotli-decompressor v5.0.0
   Compiling tokio-macros v2.7.0
   Compiling async-trait v0.1.89
    Checking zstd v0.13.3
    Checking arrow-ipc v58.1.0
    Checking http v1.4.0
    Checking ordered-float v2.10.1
    Checking getrandom v0.2.17
    Checking integer-encoding v3.0.4
    Checking untrusted v0.9.0
    Checking humantime v2.3.0
    Checking byteorder v1.5.0
    Checking zlib-rs v0.6.3
    Checking object_store v0.13.2
    Checking thrift v0.17.0
    Checking tokio v1.52.1
    Checking brotli v8.0.2
    Checking flate2 v1.1.9
    Checking arrow-json v58.1.0
    Checking arrow-csv v58.1.0
    Checking futures v0.3.32
    Checking arrow-string v58.1.0
    Checking arrow-row v58.1.0
    Checking arrow-arith v58.1.0
   Compiling recursive-proc-macro-impl v0.1.1
   Compiling sqlparser_derive v0.5.0
    Checking log v0.4.29
   Compiling seq-macro v0.3.6
    Checking recursive v0.1.1
    Checking uuid v1.23.1
    Checking hex v0.4.3
    Checking arrow v58.1.0
    Checking parquet v58.1.0
    Checking sqlparser v0.61.0
error[E0432]: unresolved import `object_store::buffered`
   --> /home/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parquet-58.1.0/src/arrow/async_writer/store.rs:25:19
    |
 25 | use object_store::buffered::BufWriter;
    |                   ^^^^^^^^ could not find `buffered` in `object_store`
    |
note: found an item that was configured out
   --> /home/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/object_store-0.13.2/src/lib.rs:545:9
    |
544 | #[cfg(feature = "tokio")]
    |       ----------------- the item is gated behind the `tokio` feature
545 | pub mod buffered;
    |         ^^^^^^^^

error[E0282]: type annotations needed
   --> /home/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parquet-58.1.0/src/arrow/async_writer/store.rs:98:13
    |
 98 | /             self.w
 99 | |                 .put(bs)
100 | |                 .await
    | |______________________^ cannot infer type

error[E0282]: type annotations needed
   --> /home/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parquet-58.1.0/src/arrow/async_writer/store.rs:107:13
    |
107 | /             self.w
108 | |                 .shutdown()
109 | |                 .await
    | |______________________^ cannot infer type

Some errors have detailed explanations: E0282, E0432.
For more information about an error, try `rustc --explain E0282`.
error: could not compile `parquet` (lib) due to 3 previous errors
warning: build failed, waiting for other jobs to finish...

-----

error: failed to build rustdoc for crate datafusion-common v53.1.0
note: this is usually due to a compilation error in the crate,
      and is unlikely to be a bug in cargo-semver-checks
note: the following command can be used to reproduce the error:
      cargo new --lib example &&
          cd example &&
          echo '[workspace]' >> Cargo.toml &&
          cargo add --path /home/runner/work/datafusion/datafusion/datafusion/common --features backtrace,force_hash_collisions,object_store,parquet,parquet_encryption,recursive_protection,sql,sqlparser &&
          cargo check &&
          cargo doc

    Building datafusion-functions v53.1.0 (current)
       Built [  27.599s] (current)
     Parsing datafusion-functions v53.1.0 (current)
      Parsed [   0.072s] (current)
    Building datafusion-functions v53.1.0 (baseline)
       Built [  27.283s] (baseline)
     Parsing datafusion-functions v53.1.0 (baseline)
      Parsed [   0.071s] (baseline)
    Checking datafusion-functions v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.482s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  57.670s] datafusion-functions
    Building datafusion-functions-aggregate v53.1.0 (current)
       Built [  27.822s] (current)
     Parsing datafusion-functions-aggregate v53.1.0 (current)
      Parsed [   0.042s] (current)
    Building datafusion-functions-aggregate v53.1.0 (baseline)
       Built [  27.881s] (baseline)
     Parsing datafusion-functions-aggregate v53.1.0 (baseline)
      Parsed [   0.042s] (baseline)
    Checking datafusion-functions-aggregate v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.219s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  57.224s] datafusion-functions-aggregate
    Building datafusion-functions-nested v53.1.0 (current)
       Built [  32.252s] (current)
     Parsing datafusion-functions-nested v53.1.0 (current)
      Parsed [   0.032s] (current)
    Building datafusion-functions-nested v53.1.0 (baseline)
       Built [  31.798s] (baseline)
     Parsing datafusion-functions-nested v53.1.0 (baseline)
      Parsed [   0.033s] (baseline)
    Checking datafusion-functions-nested v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.215s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  65.801s] datafusion-functions-nested
    Building datafusion-spark v53.1.0 (current)
       Built [  49.408s] (current)
     Parsing datafusion-spark v53.1.0 (current)
      Parsed [   0.054s] (current)
    Building datafusion-spark v53.1.0 (baseline)
       Built [  49.442s] (baseline)
     Parsing datafusion-spark v53.1.0 (baseline)
      Parsed [   0.055s] (baseline)
    Checking datafusion-spark v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.365s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 101.009s] datafusion-spark
    Building datafusion-sqllogictest v53.1.0 (current)
       Built [ 126.226s] (current)
     Parsing datafusion-sqllogictest v53.1.0 (current)
      Parsed [   0.021s] (current)
    Building datafusion-sqllogictest v53.1.0 (baseline)
       Built [ 124.539s] (baseline)
     Parsing datafusion-sqllogictest v53.1.0 (baseline)
      Parsed [   0.021s] (baseline)
    Checking datafusion-sqllogictest v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.099s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 255.034s] datafusion-sqllogictest
error: aborting due to failure to build rustdoc for crate datafusion-common v53.1.0

Replace `list_inner_field_from` and `struct_inner_fields_from` with a single
`nullable_inner_field_from(inner, name)` that renames + forces nullable to
match Arrow's list/struct member conventions while preserving metadata.
`nullable_list_item_field_from` is a thin wrapper using
`Field::LIST_FIELD_DEFAULT_NAME`. The map function uses `Field::with_name` /
`with_nullable` directly since key/value need different nullability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
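
As a rough illustration of the consolidated helper described in this commit message, the sketch below models `nullable_inner_field_from(inner, name)` and its list-item wrapper on a stand-in `Field` type. The stand-in omits the data type for brevity; the `map_entry_fields` function and the `"key"` / `"value"` names are assumptions added to show why map fields are handled separately, and `LIST_FIELD_DEFAULT_NAME` is redefined locally with the value `"item"`.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Local copy of the default list item name used by this sketch.
pub const LIST_FIELD_DEFAULT_NAME: &str = "item";

/// Simplified stand-in for an Arrow field (data type omitted for brevity).
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub nullable: bool,
    pub metadata: BTreeMap<String, String>,
}

pub type FieldRef = Arc<Field>;

impl Field {
    /// Returns the field with a new name; metadata is untouched.
    pub fn with_name(mut self, name: &str) -> Self {
        self.name = name.to_string();
        self
    }

    /// Returns the field with a new nullability; metadata is untouched.
    pub fn with_nullable(mut self, nullable: bool) -> Self {
        self.nullable = nullable;
        self
    }
}

/// Renames `inner` and forces it nullable while keeping its metadata.
pub fn nullable_inner_field_from(inner: &Field, name: &str) -> FieldRef {
    Arc::new(inner.clone().with_name(name).with_nullable(true))
}

/// Thin wrapper that uses the default list item name.
pub fn nullable_list_item_field_from(inner: &Field) -> FieldRef {
    nullable_inner_field_from(inner, LIST_FIELD_DEFAULT_NAME)
}

/// Hypothetical map case: keys stay non-nullable, values are nullable,
/// so the shared helper does not fit and the builders are called directly.
pub fn map_entry_fields(key: &Field, value: &Field) -> (FieldRef, FieldRef) {
    (
        Arc::new(key.clone().with_name("key").with_nullable(false)),
        Arc::new(value.clone().with_name("value").with_nullable(true)),
    )
}

/// Example non-nullable input tagged with the `arrow.json` extension name.
pub fn json_column() -> Field {
    let mut metadata = BTreeMap::new();
    metadata.insert("ARROW:extension:name".to_string(), "arrow.json".to_string());
    Field {
        name: "payload".to_string(),
        nullable: false,
        metadata,
    }
}
```

A single rename-and-force-nullable helper covers both list items and struct members; only the map path needs per-field nullability.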
Labels

common, functions, spark, sqllogictest

Development

Successfully merging this pull request may close these issues.

make_array / array_agg drop inner-Field metadata when constructing List<T>

2 participants