Describe the bug
Several built-in UDFs and UDAFs that wrap an input column/field into a composite output type (List, Struct, Map, …) drop the input field's metadata when constructing the output's inner Field.
In practice this matters most for Arrow extension types (ARROW:extension:name / ARROW:extension:metadata — e.g. arrow.json, arrow.uuid), because SQL-constructed lists/structs silently lose extension-type identity. Any downstream operation that compares the produced DataType against a type carrying inner-field metadata (union, aggregate merging, IPC roundtrip) will see them as different types.
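To make the type mismatch concrete, here is a minimal, std-only Rust sketch. `Field`, `DataType`, `list_of_utf8`, and `same_type` are simplified stand-ins invented for illustration, not the real arrow-rs or DataFusion definitions; they only model the behavior described above, where type equality includes the inner field's metadata.

```rust
use std::collections::BTreeMap;

// Simplified stand-ins for Arrow's Field and DataType (assumption: these are
// NOT the real arrow-rs types; they only model equality that includes metadata).
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub nullable: bool,
    pub metadata: BTreeMap<String, String>,
}

#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Utf8,
    List(Box<Field>),
}

/// Hypothetical helper: a List<Utf8> type whose inner field carries `metadata`.
pub fn list_of_utf8(metadata: &[(&str, &str)]) -> DataType {
    let metadata = metadata
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect();
    DataType::List(Box::new(Field {
        name: "item".to_string(),
        data_type: DataType::Utf8,
        nullable: true,
        metadata,
    }))
}

/// True when two types match, inner-field metadata included.
pub fn same_type(a: &DataType, b: &DataType) -> bool {
    a == b
}
```

In this model, `list_of_utf8(&[("ARROW:extension:name", "arrow.json")])` and `list_of_utf8(&[])` compare as different types even though both are lists of Utf8, which is the situation a union or aggregate merge runs into once one side has lost its metadata.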
MRE
Using the existing with_metadata and arrow_field SQL functions:
-- make_array: inner field's metadata is dropped.
SELECT arrow_field(make_array(with_metadata('a', 'key', 'value')));
+-----------------------------------------------------------------------------+
| arrow_field(make_array(with_metadata(Utf8("a"),Utf8("key"),Utf8("value")))) |
+-----------------------------------------------------------------------------+
| {name: lit, data_type: List(Utf8), nullable: false, metadata: {}} |
+-----------------------------------------------------------------------------+
array_agg suffers from the same issue.
Root cause
Both functions implement only the type-only return_type hook and never look at input fields, so they have no way to read source metadata:
datafusion/functions-nested/src/make_array.rs:97-106 — return_type returns DataType::new_list(element_type, true).
datafusion/functions-aggregate/src/array_agg.rs:107-112 — return_type returns DataType::List(Arc::new(Field::new_list_field(arg_types[0].clone(), true))).
The runtime paths construct fresh fields with no metadata as well:
array_array in make_array.rs:237 builds Arc::new(Field::new(field_name, data_type, true)).
The various array_agg accumulators (array_agg.rs:663, :734, etc.) build Arc::new(Field::new_list_field(self.datatype.clone(), true)).
Both UDF (return_field_from_args) and UDAF (return_field) hooks accept input FieldRefs and could propagate metadata, but neither function overrides them.
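The difference between the two kinds of hook can be sketched with a small std-only Rust model. Everything here is an assumption made for illustration: `Field` and `DataType` are simplified stand-ins rather than the arrow-rs types, and `list_from_type` / `list_from_field` are hypothetical functions that only mimic the shape of a type-only hook and a field-aware hook. The real DataFusion hooks have different signatures (they return a field rather than a bare type, for example).

```rust
use std::collections::BTreeMap;

// Simplified stand-ins for Arrow's Field and DataType (assumption: not the
// real arrow-rs definitions).
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub nullable: bool,
    pub metadata: BTreeMap<String, String>,
}

#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Utf8,
    List(Box<Field>),
}

/// Models a type-only hook: it receives just the element type, so there is
/// no metadata to copy and the inner field starts out empty.
pub fn list_from_type(element_type: &DataType) -> DataType {
    DataType::List(Box::new(Field {
        name: "item".to_string(),
        data_type: element_type.clone(),
        nullable: true,
        metadata: BTreeMap::new(),
    }))
}

/// Models a field-aware hook: it receives the input field and copies its
/// metadata onto the inner field of the list it builds.
pub fn list_from_field(input: &Field) -> DataType {
    DataType::List(Box::new(Field {
        name: "item".to_string(),
        data_type: input.data_type.clone(),
        nullable: true,
        metadata: input.metadata.clone(),
    }))
}

/// Hypothetical helper: a Utf8 input field with a single metadata entry.
pub fn utf8_field_with(key: &str, value: &str) -> Field {
    let mut metadata = BTreeMap::new();
    metadata.insert(key.to_string(), value.to_string());
    Field {
        name: "lit".to_string(),
        data_type: DataType::Utf8,
        nullable: false,
        metadata,
    }
}

/// Returns the metadata of a list type's inner field, or None for non-lists.
pub fn inner_metadata(data_type: &DataType) -> Option<&BTreeMap<String, String>> {
    match data_type {
        DataType::List(inner) => Some(&inner.metadata),
        DataType::Utf8 => None,
    }
}
```

With `let input = utf8_field_with("key", "value")`, the list built by `list_from_type(&input.data_type)` has an empty inner metadata map, while `list_from_field(&input)` keeps the `key` entry. The type-only path cannot do better because the metadata never reaches it.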
Audit: other built-in UDFs/UDAFs with the same pattern
The same shape — only return_type is overridden, runtime synthesizes a fresh inner Field — was found in:
| Function | File:line of return_type | Notes |
| --- | --- | --- |
| make_array | functions-nested/src/make_array.rs:97 | known; runtime at L237 builds Field::new(name, dt, true) |
| array_agg | functions-aggregate/src/array_agg.rs:107 | known; runtime accumulators (L663, L734) use Field::new_list_field(self.datatype, true) |
| map | functions-nested/src/map.rs:381 | constructs fresh Field::new("key", …) / Field::new("value", …) at L384–L393 |
| range / generate_series | functions-nested/src/range.rs:226 | Field::new_list_field(element_type, true) at L234, L238, L243 (low-impact since args are usually scalars) |
| arrays_zip | functions-nested/src/arrays_zip.rs:117 | builds struct field at L133 with Field::new(format!(...), element_type, true), dropping metadata for each zipped column |
| repeat | functions-nested/src/repeat.rs:118 | Field::new_list_field(element_type, true) at L121, L125 |
| struct | functions/src/core/struct.rs:102 | constructs Field::new(format!("c{pos}"), dt.clone(), true) at L110 with no metadata |
Verified to be correct (already override return_field_from_args and propagate input field metadata):
named_struct (functions/src/core/named_struct.rs:98)
coalesce (input field is returned at L89)
array_remove, array_remove_n, map_values
Not affected (output is a primitive scalar or input-independent type, so there is no inner field to lose metadata on): array_sort, array_replace, array_concat, array_append, array_prepend, array_distinct, array_slice, array_pop_front, array_pop_back, set_ops (union/intersect/except), min/max on arrays, string_agg.
flatten (functions-nested/src/flatten.rs:91) is borderline: it overrides only return_type but the runtime path mostly reuses the original inner field via destructuring, so simple cases preserve metadata. Worth a focused test pass to confirm.