feat: add grouping aggregate function #824

bestbeforetoday · 2025-07-03T08:15:26Z

Support for the SQL grouping function that is used with group by statements. The SQL grouping function typically has a single column parameter. However, many dialects also provide a grouping_id function, which behaves as an extended grouping function, allowing multiple parameters to be specified.

Calcite treats grouping and grouping_id as synonyms and allows multiple parameters for either function name. This implementation follows the Calcite model by allowing multiple parameters. The behavior for a single parameter should be the same as grouping, while multiple parameters should behave the same as grouping_id.

Relates to substrait-io/substrait-java#71

wackywendell · 2025-07-03T15:16:11Z

Hey everyone,

I've been looking into the GROUPING function mentioned in the recent PR, and since it was something new to me, I decided to dive in and understand how it works.

It turns out that the GROUPING function (and its related GROUPING_ID) is quite widely supported across various SQL dialects. Here's what I found:

MySQL:
- GROUPING(expr [, expr] ...) (doc link)
PostgreSQL:
- GROUPING ( group_by_expression(s) ) → integer (doc link)
- Here's a quick example query if you want to play with it!
Snowflake:
- GROUPING( <expr1> [, <expr2> , ... ] ) (doc link)
- GROUPING_ID is an alias for GROUPING.
SQL Server:
- GROUPING ( <column_expression> ) - This version takes only one argument.
- GROUPING_ID ( <column_expression> [ , ...n ] ) (link) - This takes multiple arguments and behaves like the multi-argument GROUPING in other engines.

(And possibly others; I wasn't entirely exhaustive).

My current understanding is that GROUPING (and GROUPING_ID) can generally take one or more columns and return a value (0 or 1 for a single column, or a bitmask for multiple columns) indicating whether each column was part of the current grouping set or was "rolled up" into a subtotal/grand total. I'm also curious if it's in the SQL standard!
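A minimal sketch of that bitmask, assuming the PostgreSQL-style convention in which the rightmost argument maps to the least-significant bit and a bit is 1 when the column is not part of the current grouping set. The `grouping` helper and the column names are illustrative only and do not correspond to any engine's API.

```python
def grouping(args, grouping_set):
    """Return a GROUPING-style bitmask for `args` against one grouping set."""
    mask = 0
    for col in args:
        # Shift earlier arguments left so the last argument ends up in bit 0.
        # A bit is 0 when the column is grouped, 1 when it was rolled up.
        mask = (mask << 1) | (0 if col in grouping_set else 1)
    return mask


# Grouping sets produced by GROUP BY ROLLUP (c1, c2).
rollup_sets = [("c1", "c2"), ("c1",), ()]

for grouping_set in rollup_sets:
    print(grouping_set, grouping(["c1"], grouping_set), grouping(["c1", "c2"], grouping_set))
```

With a single argument the result is just 0 or 1; with two arguments the grand-total row `()` yields binary 11, that is 3.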

One thought that came to mind, given how GROUPING operates, is its placement within the Substrait model. Unlike "normal" aggregate functions (like SUM or COUNT) that process a series of input values, GROUPING acts more like a metadata marker. Its output is determined by the grouping keys and the specific grouping set that generated the row, rather than aggregating data values themselves. For instance, properties like decomposable and intermediate aren't really meaningful for it.

This makes me wonder if, from a semantic purity standpoint, it might fit more cleanly as an additional (perhaps optional) output field directly on the Aggregate relation or within the Measure protobuf, rather than as a new aggregate function. The trade-off, as I understand it, would be that changing the core protobuf definitions is a bigger lift than just adding a new function. That might be a conceptually cleaner representation in the long run, although adding it like this is quite pragmatic.

Anyway, that's what I've learned from digging into this! Hope these thoughts are helpful for the discussion.

EpsilonPrime · 2025-07-04T07:16:30Z

How does grouping_id correspond to the additional grouping set output of the aggregate relation described here?

https://substrait.io/relations/logical_relations/#aggregate-operation

bestbeforetoday · 2025-07-04T09:26:58Z

Let me experiment to see if I can make it work using the additional disambiguation column value instead of adding a new extension function.

Support for the SQL grouping function that is used with group by statements. The SQL grouping function typically has a single column parameter. However, many dialects also provide a grouping_id function, which behaves as an extended grouping function, allowing multiple parameters to be specified. Calcite treats grouping and grouping_id as synonyms and allows multiple parameters for either function name. This implementation follows the Calcite model by allowing multiple parameters. The behavior for a single parameter should be the same as grouping, while multiple parameters should behave the same as grouping_id. Signed-off-by: Mark S. Lewis <[email protected]>

bestbeforetoday · 2025-07-22T13:47:15Z

I pushed a change to avoid the Python test failure due to the change in function coverage.

bestbeforetoday · 2025-07-22T15:29:11Z

I did try to implement a solution in substrait-java using only the existing Substrait capability. Assuming I am correctly understanding the suggestions above, this involves essentially rewriting the query to a different form. An SQL query similar to:

SELECT GROUPING(c1) FROM t GROUP BY ROLLUP (c1, c2)

could be rewritten as a UNION ALL of a set of SELECT ... GROUP BY statements. For each of those statements, the value of the column to which the GROUPING function is applied is known based on the set of columns specified in the GROUP BY clause.
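As a rough illustration of that rewrite, the sketch below generates the UNION ALL text for a ROLLUP, replacing GROUPING(col) with the constant that is known in each branch. The function name `rollup_as_union_all` and the output alias `g` are hypothetical, and `GROUP BY ()` is used for the grand-total branch on the assumption that the target dialect accepts it (not all do).

```python
def rollup_as_union_all(table, columns, grouping_arg):
    """Return SQL that replaces GROUPING(grouping_arg) with per-branch constants."""
    selects = []
    # ROLLUP (c1, c2) expands to the grouping sets (c1, c2), (c1), ().
    for size in range(len(columns), -1, -1):
        grouping_set = columns[:size]
        # Within a single branch the GROUPING value is a known constant.
        flag = 0 if grouping_arg in grouping_set else 1
        group_by = ", ".join(grouping_set) if grouping_set else "()"
        selects.append(f"SELECT {flag} AS g FROM {table} GROUP BY {group_by}")
    return "\nUNION ALL\n".join(selects)


sql = rollup_as_union_all("t", ["c1", "c2"], "c1")
print(sql)
```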

What gets generated by Calcite for the initial query is something like:

LogicalAggregate(group=[{0, 1}], groups=[[{0, 1}, {0}, {}]], EXPR$0=[GROUPING($1)])

I have not managed to produce a working conversion from Calcite to Substrait without using a GROUPING aggregate function, as described above. I don't doubt that this is down to my lack of expertise in the codebase. I also do not see any clear path to producing equivalent SQL from the Substrait representation generated by this method.

I propose adding a Substrait GROUPING aggregate function, as implemented by this pull request. This is an approach that maps closely to the Calcite representation, and that I can make work pretty easily in both directions.

tokoko · 2025-07-22T20:45:57Z

@bestbeforetoday I don't think a query rewrite to a set of unions was the suggestion. You can specify multiple groupings in the Substrait Aggregate Rel, similar to the Calcite example you provided. If you set the grouping_expressions field to refer to fields c1 and c2 and then set groupings to a 3-long array containing [[0, 1], [0], []] (the expression_references field), the Aggregate will effectively function like a rollup.

Most importantly, according to the spec that was linked above: To further disambiguate which record belongs to which grouping set, an aggregate relation with more than one grouping set receives an extra i32 column on the right-hand side. The value of this field will be the zero-based index of the grouping set that yielded the record. This should produce something similar to what you're looking for... you will get an integer that refers to an index rather than a bitmask though.
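The behaviour described above can be sketched in plain Python. This is a toy model, not the Substrait or substrait-java API: the `aggregate` function, the row layout, and the single SUM measure are assumptions made for illustration, and keys that are absent from a grouping set are emitted as `None`.

```python
from collections import defaultdict


def aggregate(rows, grouping_expressions, groupings, measure_col):
    """Sum `measure_col` per grouping set and append the grouping set index."""
    out = []
    for set_index, refs in enumerate(groupings):
        keys_in_set = [grouping_expressions[i] for i in refs]
        totals = defaultdict(int)
        for row in rows:
            totals[tuple(row[k] for k in keys_in_set)] += row[measure_col]
        for key, total in totals.items():
            values = dict(zip(keys_in_set, key))
            # Keys not in this grouping set are emitted as None (NULL).
            record = [values.get(name) for name in grouping_expressions]
            # The extra column on the right-hand side is the zero-based set index.
            out.append(tuple(record + [total, set_index]))
    return out


rows = [
    {"c1": "a", "c2": "x", "v": 1},
    {"c1": "a", "c2": "y", "v": 2},
    {"c1": "b", "c2": "x", "v": 3},
]
result = aggregate(rows, ["c1", "c2"], [[0, 1], [0], []], "v")
for record in result:
    print(record)
```

Each output tuple is `(c1, c2, sum, grouping_set_index)`; the last tuple is the grand total with index 2.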

bestbeforetoday · 2025-07-23T09:42:25Z

Thank you for the explanation. It isn't the GROUP BY ROLLUP I am having trouble with, and I wasn't actually trying to rewrite the SQL. Instead I tried to use the disambiguation field to determine the actual value to insert in place of the GROUPING(...) aggregate function applied to the result, similar to the rewriting of the query in SQL where the value is known for each SELECT without actually having to use the GROUPING aggregate function. I am happy to take pointers on how to implement this successfully in the substrait-java codebase.

There is still the issue that I don't see an easy way to convert back from a Substrait representation (using specific values and the disambiguation field) to SQL using a GROUPING aggregate function. Any ideas?

tokoko · 2025-07-24T06:26:44Z

@bestbeforetoday I misread the comment, sorry about that. If we are talking about going from calcite to substrait, I think you should be able to rewrite grouping functions into a single case statement that relies on the disambiguation field. In your rollup example:

GROUPING(c1, c2) will become

CASE WHEN DisambiguationField = 0 THEN 0 -- 00 because 1st grouping set contains both fields
     WHEN DisambiguationField = 1 THEN 1 -- 01 because 2nd grouping set doesn't contain c2
     WHEN DisambiguationField = 2 THEN 3 -- 11 because 3rd grouping set contains neither of them
END

GROUPING(c1) will become

CASE WHEN DisambiguationField = 0 THEN 0 -- because 1st grouping set contains c1
     WHEN DisambiguationField = 1 THEN 0 -- because 2nd grouping set also contains c1
     WHEN DisambiguationField = 2 THEN 1 -- because 3rd grouping set doesn't contain c1
END

GROUPING(c2) will become

CASE WHEN DisambiguationField = 0 THEN 0 -- because 1st grouping set contains c2
     WHEN DisambiguationField = 1 THEN 1 -- because 2nd grouping set doesn't contain c2
     WHEN DisambiguationField = 2 THEN 1 -- because 3rd grouping set doesn't contain c2
END

I also modified @wackywendell's postgres sqlfiddle example to show these case statements.
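The same mapping can be derived mechanically from the groupings. The sketch below is only an illustration: `grouping_case` is a hypothetical helper, arguments are given as indices into grouping_expressions, and `DisambiguationField` stands in for however the extra i32 column is actually referenced.

```python
def grouping_case(arg_refs, groupings, field="DisambiguationField"):
    """Build a CASE expression mapping the grouping set index to a GROUPING value."""
    branches = []
    for set_index, refs in enumerate(groupings):
        mask = 0
        for ref in arg_refs:
            # Bit is 0 if the argument is in this grouping set, otherwise 1.
            mask = (mask << 1) | (0 if ref in refs else 1)
        branches.append(f"WHEN {field} = {set_index} THEN {mask}")
    return "CASE " + " ".join(branches) + " END"


rollup = [[0, 1], [0], []]          # ROLLUP (c1, c2)
print(grouping_case([0, 1], rollup))  # GROUPING(c1, c2)
print(grouping_case([0], rollup))     # GROUPING(c1)
print(grouping_case([1], rollup))     # GROUPING(c2)
```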

Going the other way should also be possible with a case statement. You would essentially have to run a case statement over the SQL GROUPING(c1, c2) value and use the same cases as in my GROUPING(c1, c2) example, except the values will be flipped to go from bit values to indices. It should work fine as long as there are no exact duplicates in groupings, which is an interesting edge case 😆 (if it's even valid SQL... fwiw postgres seems to have no problem running a grouping set with duplicate groupings)
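A small sketch of that reverse lookup, including the duplicate-grouping edge case. The helper name `index_by_grouping_id` is hypothetical, and the bitmask is assumed to cover all grouping expressions in order, as GROUPING(c1, c2) does in the rollup example.

```python
def index_by_grouping_id(groupings, n_expressions):
    """Map each GROUPING bitmask back to its zero-based grouping set index."""
    lookup = {}
    for set_index, refs in enumerate(groupings):
        mask = 0
        for ref in range(n_expressions):
            mask = (mask << 1) | (0 if ref in refs else 1)
        if mask in lookup:
            # Two identical grouping sets share a bitmask, so the index is ambiguous.
            raise ValueError(f"grouping sets {lookup[mask]} and {set_index} are duplicates")
        lookup[mask] = set_index
    return lookup


mapping = index_by_grouping_id([[0, 1], [0], []], 2)
print(mapping)
```

Passing duplicate groupings such as `[[0], [0]]` raises `ValueError`, which is the case where the bitmask alone cannot recover the index.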

bestbeforetoday requested review from EpsilonPrime, cpcloud, jacques-n, vbarua and westonpace as code owners July 3, 2025 08:15

bestbeforetoday marked this pull request as draft July 4, 2025 09:27

bestbeforetoday force-pushed the grouping branch from ae0c24f to 42e1639 Compare July 22, 2025 13:36

bestbeforetoday marked this pull request as ready for review July 22, 2025 15:30

bestbeforetoday marked this pull request as draft September 23, 2025 09:13
