Support WITHIN GROUP syntax to standardize certain existing aggregate functions #13511

Garamda
Contributor

@Garamda Garamda commented Nov 21, 2024

Which issue does this PR close?

Closes #11732. (cc. #12824)

Rationale for this change

As described in #11732, certain aggregate functions need to be standardized as ordered-set aggregate functions.

What changes are included in this PR?

  • SQL
    • utilize WITHIN GROUP clause
  • Logical plan
    • Add and handle within_group field
  • Physical plan
    • handle within_group field with existing function arguments
    • support descending order (DESC) in accumulator
  • Dataframe
    • change function signature to take within_group as a parameter
  • Session state
    • add ordered set aggregate function information in session (since this needs to be handled specifically in certain cases)
  • Substrait
    • add within_group field in proto
    • handle within_group in producer & consumer
  • Test
    • reorganize existing test cases for modified syntax
    • add new cases
  • Docs

Are these changes tested?

  • Yes. (with existing / modified / new test cases)

Are there any user-facing changes?

  • Yes
    • approx_percentile_cont
      • AS-IS : approx_percentile_cont(expression, percentile, centroids)
      • TO-BE : approx_percentile_cont(percentile, centroids) WITHIN GROUP (ORDER BY expression)
    • approx_percentile_cont_with_weight
      • AS-IS : approx_percentile_cont_with_weight(expression, weight, percentile)
      • TO-BE : approx_percentile_cont_with_weight(weight, percentile) WITHIN GROUP (ORDER BY expression)
  • Documents are updated upon those changes.
  • Adding api change label may be required, which I am not authorized to do.

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions labels Nov 21, 2024
@alamb
Contributor

alamb commented Nov 21, 2024

FYI @Dandandan

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Jan 21, 2025
@Garamda
Contributor Author

Garamda commented Jan 21, 2025

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

Task is done in my local repository, and I will commit changes and write comments this week after final self review.

@github-actions github-actions bot added core Core DataFusion crate functions Changes to functions implementation labels Jan 21, 2025
@github-actions github-actions bot added the proto Related to proto crate label Jan 21, 2025
@github-actions github-actions bot added the substrait Changes to the substrait crate label Jan 21, 2025
@github-actions github-actions bot removed the Stale PR has not had any activity for some time label Jan 22, 2025
@github-actions github-actions bot added catalog Related to the catalog crate execution Related to the execution crate labels Jan 23, 2025
* Ensure compatibility with new `within_group` and `order_by` handling.

* Adjust tests and examples to align with the new logic.
@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Jan 25, 2025
* Add test cases for changed signature

* Update signature in docs
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jan 27, 2025
@github-actions github-actions bot removed the substrait Changes to the substrait crate label Feb 28, 2025
Contributor

@vbarua vbarua left a comment

Thanks for updating this @Garamda, it's definitely much simpler now ✨
I've left some additional comments.

/// Otherwise return None (the default)
fn supports_null_handling_clause(&self) -> Option<bool> {
    None
}
Contributor

Is this something we need? From what I know, there aren't any aggregate functions that have options for null handling. At the moment, the 2 overrides you have of this both return Some(false), which is what I would consider the default value anyways.

Speaking of which, if we do need this, do we need to return an Option<bool>, or could we just return bool directly?

Contributor Author

@Garamda Garamda Mar 1, 2025

Some aggregate functions in DataFusion currently use null handling.
(Note: if this needs to be discussed or fixed, I can open a separate issue, or I can refactor it in this PR. I am raising it because I am not 100% sure what the SQL standard says.)

And I refactored the function to just return bool.
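
A minimal sketch of what returning a plain bool could look like. It uses a hypothetical `AggregateFn` trait rather than DataFusion's real trait. The type names, the `check_null_handling` helper, and the permissive default of `true` are assumptions for illustration. Only the method name and the error text come from this PR.

```rust
/// Hypothetical stand-in for an aggregate function trait (not DataFusion's API).
trait AggregateFn {
    fn name(&self) -> &str;

    /// Whether `[IGNORE | RESPECT] NULLS` may follow the call.
    /// Assumption: permissive default, overridden by ordered-set aggregates.
    fn supports_null_handling_clause(&self) -> bool {
        true
    }
}

struct FirstValue;
struct ApproxPercentileCont;

impl AggregateFn for FirstValue {
    fn name(&self) -> &str {
        "first_value"
    }
}

impl AggregateFn for ApproxPercentileCont {
    fn name(&self) -> &str {
        "approx_percentile_cont"
    }

    // Ordered-set aggregates opt out of the null-handling clause.
    fn supports_null_handling_clause(&self) -> bool {
        false
    }
}

/// Reject a null-handling clause for functions that do not accept one.
fn check_null_handling(f: &dyn AggregateFn, has_clause: bool) -> Result<(), String> {
    if has_clause && !f.supports_null_handling_clause() {
        return Err(format!(
            "[IGNORE | RESPECT] NULLS are not permitted for {}",
            f.name()
        ));
    }
    Ok(())
}
```

With a bool, the caller needs no `unwrap_or` step: the trait default already encodes the fallback that `None` used to stand for.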

Contributor

This was smelling odd, so I dug a bit deeper. I think you've inadvertently stumbled into something even weirder than you anticipated.

The example you've linked is

SELECT FIRST_VALUE(column1) RESPECT NULLS FROM t;

which I don't think is a valid query because first_value should not be an aggregate function, or at the very least the above query is not valid in most SQL dialects. first_value is actually a window function in other engines (e.g. Trino, Postgres, MySQL).

If you try running something like

SELECT first_value(column1) FROM t;

against Postgres you get an error like

Query Error: window function first_value requires an OVER clause

dbfiddle

The RESPECT NULLS | IGNORE NULLS options are only a property of certain window functions, hence we shouldn't need to track them for aggregate functions.

I'm going to file a ticket for the above.

Contributor

Filed #15006

Contributor

@vbarua vbarua left a comment

Overall, this looks good to me ✨

I did leave a couple of minor comments, as well as some bigger ones, but I think this is ready for review by someone else. Thanks for bearing with me as I reviewed this, it was a good chance for me to look at new parts of DataFusion.

One big question we might need to answer before merging is if we need a migration strategy for this. Because we now require WITHIN GROUP for these functions, any users who have queries stored outside of DataFusion will experience breakages that they can't work around. If we want to provide a migration path, we may need to support having both forms of calling these functions, as in

SELECT approx_percentile_cont(column_name, 0.75, 100) FROM table_name;
SELECT approx_percentile_cont(0.75, 100) WITHIN GROUP (ORDER BY column_name) FROM table_name;

for at least 1 release so folks can migrate their queries.
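
For anyone migrating stored queries, the rewrite between the two forms above is mechanical. The following is a rough sketch only: `within_group_call` is a hypothetical string-building helper, not part of DataFusion, and it assumes the caller has already split the old call into the ordered expression and the remaining arguments.

```rust
/// Build the WITHIN GROUP form of an ordered-set aggregate call.
/// `within_group_call` is a hypothetical helper, not part of DataFusion.
///
/// * `func`       - function name, e.g. "approx_percentile_cont"
/// * `order_expr` - the expression that used to be the first argument
/// * `other_args` - the remaining arguments, in their original order
fn within_group_call(func: &str, order_expr: &str, other_args: &[&str]) -> String {
    format!(
        "{}({}) WITHIN GROUP (ORDER BY {})",
        func,
        other_args.join(", "),
        order_expr
    )
}
```

Calling `within_group_call("approx_percentile_cont", "column_name", &["0.75", "100"])` produces the call expression used in the second query above.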

Comment on lines +157 to +161
let percentile = if is_descending {
    1.0 - percentile
} else {
    percentile
};
Contributor

This seems reasonable to me, but I don't have that much experience on the execution side of things.
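
To see why flipping the percentile handles DESC, here is a small sketch in which an exact nearest-rank quantile stands in for the approximate accumulator. The function names are hypothetical and the input is assumed to be non-empty. The idea is that the q-th quantile of the descending order is the (1 - q)-th quantile of the ascending order, so the data never needs to be re-sorted.

```rust
/// Exact nearest-rank quantile over ascending-sorted data (illustration only).
/// Assumes `sorted` is non-empty and `q` is in [0, 1].
fn nearest_rank(sorted: &[f64], q: f64) -> f64 {
    let idx = ((sorted.len() - 1) as f64 * q).round() as usize;
    sorted[idx]
}

/// Mirror of the snippet under review: flip the percentile for DESC.
fn effective_percentile(percentile: f64, is_descending: bool) -> f64 {
    if is_descending {
        1.0 - percentile
    } else {
        percentile
    }
}

/// Hypothetical helper: always sort ascending, then look up the
/// (possibly flipped) percentile.
fn percentile_within_group(values: &[f64], percentile: f64, is_descending: bool) -> f64 {
    let mut asc = values.to_vec();
    asc.sort_by(|a, b| a.total_cmp(b));
    nearest_rank(&asc, effective_percentile(percentile, is_descending))
}
```

For the values 1 through 5, the 0.25 percentile is 2 in ascending order and 4 in descending order, which is the same element as the ascending 0.75 percentile.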

"[IGNORE | RESPECT] NULLS are not permitted for {}",
fm.name()
);
}
Contributor

Per my point about [IGNORE | RESPECT] NULLS being a property of window functions, I don't think we need this check here.

Contributor Author

Thanks for bearing with me as I reviewed this, it was a good chance for me to look at new parts of DataFusion.

Thank you again for the thorough review. 👍
This PR has become much simpler, clearer, and better now.

Contributor Author

One big question we might need to answer before merging is if we need a migration strategy for this. Because we now require WITHIN GROUP for these functions, any users who have queries stored outside of DataFusion will experience breakages that they can't work around. If we want to provide a migration path, we may need to support having both forms of calling these functions, as in

SELECT approx_percentile_cont(column_name, 0.75, 100) FROM table_name;
SELECT approx_percentile_cont(0.75, 100) WITHIN GROUP (ORDER BY column_name) FROM table_name;

for at least 1 release so folks can migrate their queries.

This was one of my biggest concerns when I started working on this feature.
If the community decides on that migration strategy, I will support both syntaxes.
I will also file an issue to track the plan so that the current syntax can be removed on schedule (if I am authorized to do so).

Contributor Author

This was smelling odd, so I dug a bit deeper. I think you've inadvertently stumbled into something even weirder than you anticipated.
...
The RESPECT NULLS | IGNORE NULLS options are only a property of certain window functions, hence we shouldn't need to track them for aggregate functions.

I'm going to file a ticket for the above.
...
Per my point about [IGNORE | RESPECT] NULLS being a property of window functions, I don't think we need this check here.

I understand and agree with your guidance.
I will follow what is decided in the issue you filed and remove the related code once that is settled.

Contributor Author

Note: I have applied all the review comments you tagged as 'minor', since I agreed with them.

@vbarua
Contributor

vbarua commented Mar 5, 2025

Pinging @Dandandan for committer review as they filed the ticket this fix is for.

Garamda added 2 commits March 5, 2025 14:56
* Uses order by consistently after done with sql

* Remove redundant comment

* Serve more clear error msg

* Handle error cases in the same code block
@jayzhan211
Contributor

I didn't see the notification for this one. I will find time to review it.

@jayzhan211 jayzhan211 added the api change Changes the API exposed to users of the crate label Apr 23, 2025
@jayzhan211
Contributor

I will send the conflict fix

@jayzhan211 jayzhan211 merged commit e41c02c into apache:main Apr 23, 2025
29 checks passed
@jayzhan211
Contributor

Thanks @Garamda and @vbarua!

@Garamda
Contributor Author

Garamda commented Apr 23, 2025

@jayzhan211 Thank you for reviewing!

However, I have one concern.

Is it okay to merge this PR right away, considering #13511 (review) ?

One big question we might need to answer before merging is if we need a migration strategy for this. Because we now require WITHIN GROUP for these functions, any users who have queries stored outside of DataFusion will experience breakages that they can't work around. If we want to provide a migration path, we may need to support having both forms of calling these functions, as in

SELECT approx_percentile_cont(column_name, 0.75, 100) FROM table_name;
SELECT approx_percentile_cont(0.75, 100) WITHIN GROUP (ORDER BY column_name) FROM table_name;

for at least 1 release so folks can migrate their queries.

Also, please check my comment too.

#13511 (comment)

This was one of my biggest concerns when I started working on this feature.
If the community decides on that migration strategy, I will support both syntaxes.
I will also file an issue to track the plan so that the current syntax can be removed on schedule.

I could not decide this on my own, so I was waiting for agreement on a decision.

@jayzhan211
Contributor

I think supporting both forms would be confusing. If we plan to end up supporting only the new syntax, it is better not to keep the old syntax.

@Garamda
Contributor Author

Garamda commented Apr 24, 2025

@jayzhan211 Alright, I see. Thank you!

nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
… functions (apache#13511)

* Add within group variable to aggregate function and arguments

* Support within group and disable null handling for ordered set aggregate functions (apache#13511)

* Refactored function to match updated signature

* Modify proto to support within group clause

* Modify physical planner and accumulator to support ordered set aggregate function

* Support session management for ordered set aggregate functions

* Align code, tests, and examples with changes to aggregate function logic

* Ensure compatibility with new `within_group` and `order_by` handling.

* Adjust tests and examples to align with the new logic.

* Fix typo in existing comments

* Enhance test

* Add test cases for changed signature

* Update signature in docs

* Fix bug : handle missing within_group when applying children tree node

* Change the signature of approx_percentile_cont for consistency

* Add missing within_group for expr display

* Handle edge case when over and within group clause are used together

* Apply clippy advice: avoids too many arguments

* Add new test cases using descending order

* Apply cargo fmt

* Revert unintended submodule changes

* Apply prettier guidance

* Apply doc guidance by update_function_doc.sh

* Rollback WITHIN GROUP and related logic after converting it into expr

* Make it not to handle redundant logic

* Rollback ordered set aggregate functions from session to save same info in udf itself

* Convert within group to order by when converting sql to expr

* Add function to determine it is ordered-set aggregate function

* Rollback within group from proto

* Utilize within group as order by in functions-aggregate

* Apply clippy

* Convert order by to within group

* Apply cargo fmt

* Remove plain line breaks

* Remove duplicated column arg in schema name

* Refactor boolean functions to just return primitive type

* Make within group necessary in the signature of existing ordered set aggr funcs

* Apply cargo fmt

* Support a single ordering expression in the signature

* Apply cargo fmt

* Add dataframe function test cases to verify descending ordering

* Apply cargo fmt

* Apply code reviews

* Uses order by consistently after done with sql

* Remove redundant comment

* Serve more clear error msg

* Handle error cases in the same code block

* Update error msg in test as corresponding code changed

* fix

---------

Co-authored-by: Jay Zhan <[email protected]>
@alamb
Contributor

alamb commented May 2, 2025

One of the extended clickbench queries also needed to be updated to the new syntax. I made a PR to do this here:

Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate documentation Improvements or additions to documentation functions Changes to functions implementation logical-expr Logical plan and expressions proto Related to proto crate sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Standardize APPROX_PERCENTILE_CONT / PERCENTILE_CONT and similar aggregation functions
6 participants