feat: support min/max for struct #15667

chenkovsky · 2025-04-10T04:13:36Z

Which issue does this PR close?

Closes min/max aggregation function for struct #15666.

Rationale for this change

DataFusion doesn't support min/max for struct types.

What changes are included in this PR?

Modify min_max.rs to support it.

The performance for struct is worse than for primitive types. Does anyone have suggestions for optimizing it?

Are these changes tested?

Unit tests

Are there any user-facing changes?

No

alamb · 2025-04-10T18:13:12Z

datafusion/functions-aggregate/src/min_max.rs

        _ => min_max_batch!(values, min),
    })
 }

+fn min_max_batch_struct(array: &ArrayRef, ordering: Ordering) -> Result<ScalarValue> {


What definition of "greater / less than" does this use? I don't think it was clear how to compare structs (that is, what it means for one struct to be bigger than another).

I updated partial_cmp_struct to make it more clear.

alamb · 2025-04-10T18:13:33Z

the performance for struct is worse than primitive types. does anyone have suggestions about optimizing it.

To make it faster you could implement a GroupsAccumulator
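To illustrate why a grouped accumulator helps, here is a minimal sketch of the idea: keep one slot per group in a flat vector and update all groups in a single pass over the batch, instead of running one accumulator object per group. `GroupedMax` and its methods are hypothetical names for this example. They only loosely mirror the shape of DataFusion's `GroupsAccumulator` trait, and structs are modeled as plain two-field tuples.

```rust
/// Hypothetical, simplified grouped max accumulator.
/// This is NOT DataFusion's `GroupsAccumulator` trait, only an illustration of the idea.
struct GroupedMax {
    /// Current maximum per group; `None` until the group has seen a value.
    maxes: Vec<Option<(i64, i64)>>,
}

impl GroupedMax {
    fn new() -> Self {
        Self { maxes: Vec::new() }
    }

    /// Update every group touched by this batch in one pass.
    /// `group_indices[i]` is the group that `values[i]` belongs to.
    fn update_batch(
        &mut self,
        values: &[(i64, i64)],
        group_indices: &[usize],
        total_num_groups: usize,
    ) {
        assert_eq!(values.len(), group_indices.len());
        if self.maxes.len() < total_num_groups {
            self.maxes.resize(total_num_groups, None);
        }
        for (value, &group) in values.iter().zip(group_indices) {
            let slot = &mut self.maxes[group];
            match *slot {
                // Tuples compare field by field, so the current max already wins here.
                Some(current) if current >= *value => {}
                _ => *slot = Some(*value),
            }
        }
    }

    /// Per-group results, indexed by group.
    fn evaluate(&self) -> &[Option<(i64, i64)>] {
        &self.maxes
    }
}
```

The point of the design is that the per-row work is an index into a vector plus one comparison, with no per-group allocation or dynamic dispatch.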

chenkovsky · 2025-04-29T06:09:47Z

@alamb could you please review it again? I see min/max for dict is supported now, and struct is similar.

alamb

Thanks @chenkovsky -- this looks really close to me. All I think is necessary is a few more tests (I left comments about which one specifically) and comments about safety. Otherwise this looks good to me

I also double checked that the definition of min/max for structs is consistent with duckdb (namely that each field is compared one at a time) which it is ✅

D select max(col0) from (values({"a":1,"b":2}), ({"a":2,"b":1}));
┌──────────────────────────────┐
│          max(col0)           │
│ struct(a integer, b integer) │
├──────────────────────────────┤
│ {'a': 2, 'b': 1}             │
└──────────────────────────────┘
D
D select max(col0) from (values({"a":1,"b":2,"c":3}), ({"a":1,"b":2,"c":4}));
┌─────────────────────────────────────────┐
│                max(col0)                │
│ struct(a integer, b integer, c integer) │
├─────────────────────────────────────────┤
│ {'a': 1, 'b': 2, 'c': 4}                │
└─────────────────────────────────────────┘
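The field-by-field rule checked against DuckDB above can be sketched in a few lines of Rust. This is only an illustration of the ordering, not the code in this PR: struct rows are modeled as slices of `i64`, NULL handling is ignored, and `cmp_fields`, `max_row`, and `min_row` are hypothetical helper names.

```rust
use std::cmp::Ordering;

/// Compare two rows field by field, left to right.
/// The first field that differs decides the result.
fn cmp_fields(left: &[i64], right: &[i64]) -> Ordering {
    for (l, r) in left.iter().zip(right.iter()) {
        match l.cmp(r) {
            Ordering::Equal => continue,
            other => return other,
        }
    }
    // All shared fields are equal. Rows of the same struct type have the same
    // number of fields, so this is `Equal` in practice.
    left.len().cmp(&right.len())
}

/// Largest row under the field-by-field ordering, or `None` for empty input.
fn max_row(rows: &[Vec<i64>]) -> Option<&Vec<i64>> {
    rows.iter().max_by(|a, b| cmp_fields(a, b))
}

/// Smallest row under the field-by-field ordering, or `None` for empty input.
fn min_row(rows: &[Vec<i64>]) -> Option<&Vec<i64>> {
    rows.iter().min_by(|a, b| cmp_fields(a, b))
}
```

With the two DuckDB inputs above, `max_row` picks `[2, 1]` over `[1, 2]` because the first field decides, and `[1, 2, 4]` over `[1, 2, 3]` because the comparison falls through to the third field.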

alamb · 2025-05-03T10:02:46Z

datafusion/common/src/scalar/mod.rs

@@ -616,7 +616,20 @@ fn partial_cmp_list(arr1: &dyn Array, arr2: &dyn Array) -> Option<Ordering> {
    Some(arr1.len().cmp(&arr2.len()))
 }

-fn partial_cmp_struct(s1: &Arc<StructArray>, s2: &Arc<StructArray>) -> Option<Ordering> {
+fn expand_struct_columns<'a>(array: &'a StructArray, columns: &mut Vec<&'a ArrayRef>) {


I think a more standard name for this operation is "flatten" but since this is an internal function I think it is fine too
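For readers unfamiliar with the "flatten" operation mentioned here, a rough sketch follows. It assumes the helper walks nested structs depth first and collects the leaf columns in field order, which the diff excerpt doesn't show. `Column` and `flatten_columns` are hypothetical stand-ins, not Arrow's `StructArray` or the PR's `expand_struct_columns`.

```rust
/// Hypothetical column model: a leaf holds values, a struct holds child columns.
#[derive(Debug, PartialEq)]
enum Column {
    Leaf(Vec<i64>),
    Struct(Vec<Column>),
}

/// Append every leaf column under `column` to `out`, depth first, in field order.
fn flatten_columns<'a>(column: &'a Column, out: &mut Vec<&'a Vec<i64>>) {
    match column {
        Column::Leaf(values) => out.push(values),
        Column::Struct(children) => {
            for child in children {
                flatten_columns(child, out);
            }
        }
    }
}
```

Once the columns are flattened, comparing two struct rows reduces to comparing the same row position across a flat list of leaf columns.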

alamb · 2025-05-03T10:07:50Z

datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/nulls.rs

@@ -193,6 +193,14 @@ pub fn set_nulls_dyn(input: &dyn Array, nulls: Option<NullBuffer>) -> Result<Arr
                ))
            }
        }
+        DataType::Struct(_) => unsafe {
+            let input = input.as_struct();
+            Arc::new(StructArray::new_unchecked(


Can you please add a safety note here? (You can probably just reuse the same one as above.)
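As a reminder of the convention being asked for, a `// SAFETY:` comment states why the invariants of the unchecked call hold at this call site. The sketch below uses a standard-library function rather than `StructArray::new_unchecked`, and `roundtrip` is a hypothetical helper. The reasoning follows the same pattern: the inputs come unchanged from a value that was already valid.

```rust
/// Rebuild a `String` from bytes that were copied out of a valid `&str`.
fn roundtrip(input: &str) -> String {
    let bytes = input.as_bytes().to_vec();
    // SAFETY: `bytes` is an unmodified copy of the bytes of `input`, and `input`
    // is a `&str`, so the bytes are valid UTF-8 as `from_utf8_unchecked` requires.
    unsafe { String::from_utf8_unchecked(bytes) }
}
```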

alamb · 2025-05-03T10:10:15Z

datafusion/functions-aggregate/src/min_max.rs

+            $VALUE.clone()
+        } else {
+            match $VALUE.partial_cmp(&$DELTA) {
+                Some(choose_min_max!($OP)) => $DELTA.clone(),


should this also use deep_clone?

$VALUE and $DELTA are both already deep cloned, so there's no need to clone again.

alamb · 2025-05-03T10:11:58Z

datafusion/common/src/scalar/mod.rs

+    /// Performs a deep clone of the ScalarValue, creating new copies of all nested data structures.
+    /// This is different from the standard `clone()` which may share data through `Arc`.
+    /// Aggregation functions like `max` will cost a lot of memory if the data is not cloned.
+    pub fn deep_clone(&self) -> Self {


I think deep_clone is a fine name. Another one might be force_clone or something to indicate it is cloning the underlying data (not just array refs)
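The memory concern in the doc comment can be illustrated with plain `Arc`. This is a sketch of the intent as described in that comment, not the PR's implementation: `SharedScalar` is a hypothetical type holding one value inside a larger shared buffer. A regular `clone()` keeps the whole buffer alive, while the deep clone copies out only the value it needs.

```rust
use std::sync::Arc;

/// Hypothetical scalar that points at one element of a larger shared buffer.
#[derive(Debug, Clone)]
struct SharedScalar {
    buffer: Arc<Vec<i64>>,
    offset: usize,
}

impl SharedScalar {
    fn value(&self) -> i64 {
        self.buffer[self.offset]
    }

    /// Copy only the referenced element into a fresh one-element buffer,
    /// so the original (possibly large) buffer is no longer kept alive by this value.
    fn deep_clone(&self) -> Self {
        Self {
            buffer: Arc::new(vec![self.value()]),
            offset: 0,
        }
    }
}
```

An accumulator that stores the shallow clone as its running max would pin the entire input batch in memory. Storing the deep clone pins a single value.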

alamb · 2025-05-03T10:16:26Z

datafusion/functions-aggregate/src/min_max.rs

+        }
+    }};
+}
+
 /// dynamically-typed max(array) -> ScalarValue
 pub fn max_batch(values: &ArrayRef) -> Result<ScalarValue> {


I wonder if we could potentially remove the duplicated code in this function by calling into MinAccumulator / max accumulator as necessary. I can't remember why there is a second copy of this code

This might be something good to do as a follow on PR

I will create another PR.

Both the max accumulator and the array_max UDF use max_batch, so maybe it's not duplicated code.

alamb · 2025-05-03T10:21:41Z

datafusion/sqllogictest/test_files/aggregate.slt

+WITH t AS (SELECT i as c1, i + 1 as c2 FROM generate_series(1, 10) t(i))
+SELECT MIN(c), MAX(c) FROM (SELECT STRUCT(c1 AS 'a', c2 AS 'b') AS c, (c1 % 2) AS key FROM t LIMIT 0) GROUP BY key
+----
+


Can you please add additional tests for:

1. A struct with just one field (all the tests in this file so far have 2 fields)
2. Structs where the first field is equal, so the comparison has to go to the second field

For 2 I am thinking something like this (where the first two fields are equal)

> select min(column1),max(column1) from ( values ({"a":1,"b":2,"c":3}), ({"a":1,"b":2,"c":4}) );
+--------------------+--------------------+
| min(column1)       | max(column1)       |
+--------------------+--------------------+
| {a: 1, b: 2, c: 3} | {a: 1, b: 2, c: 4} |
+--------------------+--------------------+
1 row(s) fetched.
Elapsed 0.020 seconds.

alamb

Thank you @chenkovsky

feat: support min/max for struct

a52386f

chenkovsky marked this pull request as ready for review April 10, 2025 04:13

github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels Apr 10, 2025

alamb reviewed Apr 10, 2025

groups aggregator

c0cf133

github-actions bot added the common (Related to common crate) label Apr 14, 2025

Merge branch 'main' into feature/min_max_struct2

a773cb1

Merge branch 'main' into feature/min_max_struct2

8b87911

alamb reviewed May 3, 2025

update based on lamb's suggestion

3dcf38d

alamb mentioned this pull request May 5, 2025

feat(datafusion-functions-aggregate): add support for lists and other nested types in min and max #15857

Closed

alamb approved these changes May 5, 2025

alamb merged commit e60b260 into apache:main May 6, 2025
27 checks passed

alamb mentioned this pull request May 6, 2025

fix: overcounting of memory in first/last. #15924

Merged

gabotechs mentioned this pull request May 12, 2025

Support MIN and MAX for DataType::List #16025

Merged

fmonjalet pushed a commit to DataDog/datafusion that referenced this pull request May 15, 2025

feat: support min/max for struct (apache#15667)

9f59afe

* feat: support min/max for struct
* groups aggregator
* update based on lamb's suggestion

(cherry picked from commit e60b260)
