-
Notifications
You must be signed in to change notification settings - Fork 1.5k
feat: support min/max for struct #15667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
_ => min_max_batch!(values, min), | ||
}) | ||
} | ||
|
||
fn min_max_batch_struct(array: &ArrayRef, ordering: Ordering) -> Result<ScalarValue> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What definition of "greater / less than" does this use? I think figuring out how to compare structs (like what does it mean for one struct to be bigger than the other) was not clear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated partial_cmp_struct to make it more clear.
Do make it faster you could implement a |
@alamb could you please review it again ? I see min/max for dict is supported now. struct is similar. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @chenkovsky -- this looks really close to me. All I think is necessary is a few more tests (I left comments about which one specifically) and comments about safety. Otherwise this looks good to me
I also double checked that the definition of min/max for structs is consistent with duckdb (namely that each field is compared one at a time) which it is ✅
D select max(col0) from (values({"a":1,"b":2}), ({"a":2,"b":1}));
┌──────────────────────────────┐
│ max(col0) │
│ struct(a integer, b integer) │
├──────────────────────────────┤
│ {'a': 2, 'b': 1} │
└──────────────────────────────┘
D
D select max(col0) from (values({"a":1,"b":2,"c":3}), ({"a":1,"b":2,"c":4}));
┌─────────────────────────────────────────┐
│ max(col0) │
│ struct(a integer, b integer, c integer) │
├─────────────────────────────────────────┤
│ {'a': 1, 'b': 2, 'c': 4} │
└─────────────────────────────────────────┘
datafusion/common/src/scalar/mod.rs
Outdated
@@ -616,7 +616,20 @@ fn partial_cmp_list(arr1: &dyn Array, arr2: &dyn Array) -> Option<Ordering> { | |||
Some(arr1.len().cmp(&arr2.len())) | |||
} | |||
|
|||
fn partial_cmp_struct(s1: &Arc<StructArray>, s2: &Arc<StructArray>) -> Option<Ordering> { | |||
fn expand_struct_columns<'a>(array: &'a StructArray, columns: &mut Vec<&'a ArrayRef>) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a more standard name for this operation is "flatten" but since this is an internal function I think it is fine too
@@ -193,6 +193,14 @@ pub fn set_nulls_dyn(input: &dyn Array, nulls: Option<NullBuffer>) -> Result<Arr | |||
)) | |||
} | |||
} | |||
DataType::Struct(_) => unsafe { | |||
let input = input.as_struct(); | |||
Arc::new(StructArray::new_unchecked( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please add a safety note here (probably can just reuse the same as above)
$VALUE.clone() | ||
} else { | ||
match $VALUE.partial_cmp(&$DELTA) { | ||
Some(choose_min_max!($OP)) => $DELTA.clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this also use deep_clone?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the $VALUE and $DELTA are all deep cloned. so there's no need to clone again.
datafusion/common/src/scalar/mod.rs
Outdated
/// Performs a deep clone of the ScalarValue, creating new copies of all nested data structures. | ||
/// This is different from the standard `clone()` which may share data through `Arc`. | ||
/// Aggregation functions like `max` will cost a lot of memory if the data is not cloned. | ||
pub fn deep_clone(&self) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think deep_clone
is a fine name. Another one might be force_clone
or something to indicate it is cloning the underlying data (not just array refs)
} | ||
}}; | ||
} | ||
|
||
/// dynamically-typed max(array) -> ScalarValue | ||
pub fn max_batch(values: &ArrayRef) -> Result<ScalarValue> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could potentially remove the duplicated code in this function by calling into MinAccumulator / max accumulator as necessary. I can't remember why there is a second copy of this code
This might be something good to do as a follow on PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will create another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both max accumulator and udf array_max use max_batch. so maybe it's not duplicated code.
WITH t AS (SELECT i as c1, i + 1 as c2 FROM generate_series(1, 10) t(i)) | ||
SELECT MIN(c), MAX(c) FROM (SELECT STRUCT(c1 AS 'a', c2 AS 'b') AS c, (c1 % 2) AS key FROM t LIMIT 0) GROUP BY key | ||
---- | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can. you please add additional tests for:
- A struct with just one field (all the tests in this file so far have 2 fields)
- Structs where the first field is equal, so the comparison has to go to the second field
For 2 I am thinking something like this (where the first two fields are equal)
> select min(column1),max(column1) from (
values
({"a":1,"b":2,"c":3}),
({"a":1,"b":2,"c":4})
);
+--------------------+--------------------+
| min(column1) | max(column1) |
+--------------------+--------------------+
| {a: 1, b: 2, c: 3} | {a: 1, b: 2, c: 4} |
+--------------------+--------------------+
1 row(s) fetched.
Elapsed 0.020 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @chenkovsky
* feat: support min/max for struct * groups aggregator * update based on lamb's suggestion (cherry picked from commit e60b260)
Which issue does this PR close?
Rationale for this change
datafusion doesn't support min/max for struct.
What changes are included in this PR?
modify min_max.rs to support it.
the performance for struct is worse than primitive types. does anyone have suggestions about optimizing it.
Are these changes tested?
UT
Are there any user-facing changes?
NO