Skip to content

feat: support min/max for struct #15667

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 6, 2025
Merged

Conversation

chenkovsky
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

datafusion doesn't support min/max for struct.

What changes are included in this PR?

modify min_max.rs to support it.

the performance for struct is worse than primitive types. does anyone have suggestions about optimizing it.

Are these changes tested?

UT

Are there any user-facing changes?

NO

@chenkovsky chenkovsky marked this pull request as ready for review April 10, 2025 04:13
@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Apr 10, 2025
_ => min_max_batch!(values, min),
})
}

fn min_max_batch_struct(array: &ArrayRef, ordering: Ordering) -> Result<ScalarValue> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What definition of "greater / less than" does this use? I think figuring out how to compare structs (like what does it mean for one struct to be bigger than the other) was not clear

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated partial_cmp_struct to make it more clear.

@alamb
Copy link
Contributor

alamb commented Apr 10, 2025

the performance for struct is worse than primitive types. does anyone have suggestions about optimizing it.

Do make it faster you could implement a GroupsAccumulator

@github-actions github-actions bot added the common Related to common crate label Apr 14, 2025
@chenkovsky
Copy link
Contributor Author

@alamb could you please review it again ? I see min/max for dict is supported now. struct is similar.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @chenkovsky -- this looks really close to me. All I think is necessary is a few more tests (I left comments about which one specifically) and comments about safety. Otherwise this looks good to me

I also double checked that the definition of min/max for structs is consistent with duckdb (namely that each field is compared one at a time) which it is ✅

D select max(col0) from (values({"a":1,"b":2}), ({"a":2,"b":1}));
┌──────────────────────────────┐
│          max(col0)           │
│ struct(a integer, b integer) │
├──────────────────────────────┤
│ {'a': 2, 'b': 1}             │
└──────────────────────────────┘
D
D select max(col0) from (values({"a":1,"b":2,"c":3}), ({"a":1,"b":2,"c":4}));
┌─────────────────────────────────────────┐
│                max(col0)                │
│ struct(a integer, b integer, c integer) │
├─────────────────────────────────────────┤
│ {'a': 1, 'b': 2, 'c': 4}                │
└─────────────────────────────────────────┘

@@ -616,7 +616,20 @@ fn partial_cmp_list(arr1: &dyn Array, arr2: &dyn Array) -> Option<Ordering> {
Some(arr1.len().cmp(&arr2.len()))
}

fn partial_cmp_struct(s1: &Arc<StructArray>, s2: &Arc<StructArray>) -> Option<Ordering> {
fn expand_struct_columns<'a>(array: &'a StructArray, columns: &mut Vec<&'a ArrayRef>) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a more standard name for this operation is "flatten" but since this is an internal function I think it is fine too

@@ -193,6 +193,14 @@ pub fn set_nulls_dyn(input: &dyn Array, nulls: Option<NullBuffer>) -> Result<Arr
))
}
}
DataType::Struct(_) => unsafe {
let input = input.as_struct();
Arc::new(StructArray::new_unchecked(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please add a safety note here (probably can just reuse the same as above)

$VALUE.clone()
} else {
match $VALUE.partial_cmp(&$DELTA) {
Some(choose_min_max!($OP)) => $DELTA.clone(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this also use deep_clone?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the $VALUE and $DELTA are all deep cloned. so there's no need to clone again.

/// Performs a deep clone of the ScalarValue, creating new copies of all nested data structures.
/// This is different from the standard `clone()` which may share data through `Arc`.
/// Aggregation functions like `max` will cost a lot of memory if the data is not cloned.
pub fn deep_clone(&self) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think deep_clone is a fine name. Another one might be force_clone or something to indicate it is cloning the underlying data (not just array refs)

}
}};
}

/// dynamically-typed max(array) -> ScalarValue
pub fn max_batch(values: &ArrayRef) -> Result<ScalarValue> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could potentially remove the duplicated code in this function by calling into MinAccumulator / max accumulator as necessary. I can't remember why there is a second copy of this code

This might be something good to do as a follow on PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will create another PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both max accumulator and udf array_max use max_batch. so maybe it's not duplicated code.

WITH t AS (SELECT i as c1, i + 1 as c2 FROM generate_series(1, 10) t(i))
SELECT MIN(c), MAX(c) FROM (SELECT STRUCT(c1 AS 'a', c2 AS 'b') AS c, (c1 % 2) AS key FROM t LIMIT 0) GROUP BY key
----

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can. you please add additional tests for:

  1. A struct with just one field (all the tests in this file so far have 2 fields)
  2. Structs where the first field is equal, so the comparison has to go to the second field

For 2 I am thinking something like this (where the first two fields are equal)

> select min(column1),max(column1) from (
values
  ({"a":1,"b":2,"c":3}),
  ({"a":1,"b":2,"c":4})
);
+--------------------+--------------------+
| min(column1)       | max(column1)       |
+--------------------+--------------------+
| {a: 1, b: 2, c: 3} | {a: 1, b: 2, c: 4} |
+--------------------+--------------------+
1 row(s) fetched.
Elapsed 0.020 seconds.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @chenkovsky

@alamb alamb merged commit e60b260 into apache:main May 6, 2025
27 checks passed
fmonjalet pushed a commit to DataDog/datafusion that referenced this pull request May 15, 2025
* feat: support min/max for struct

* groups aggregator

* update based on lamb's suggestion

(cherry picked from commit e60b260)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

min/max aggregation function for struct
2 participants