@bestbeforetoday bestbeforetoday commented Jul 3, 2025

Support for the SQL grouping function that is used with group by statements. The SQL grouping function typically has a single column parameter. However, many dialects also provide a grouping_id function, which behaves as an extended grouping function, allowing multiple parameters to be specified.

Calcite treats grouping and grouping_id as synonyms and allows multiple parameters for either function name. This implementation follows the Calcite model by allowing multiple parameters. The behavior for a single parameter should be the same as grouping, while multiple parameters should behave the same as grouping_id.

Relates to substrait-io/substrait-java#71

@wackywendell
Contributor

Hey everyone,

I've been looking into the GROUPING function mentioned in the recent PR, and since it was something new to me, I decided to dive in and understand how it works.

It turns out that the GROUPING function (and its related GROUPING_ID) is quite widely supported across various SQL dialects. Here's what I found:

  • MySQL:
    • GROUPING(expr [, expr] ...) (doc link)
  • PostgreSQL:
  • Snowflake:
  • SQL Server:
    • GROUPING ( <column_expression> ) - This version takes only one argument.
    • GROUPING_ID ( <column_expression> [ , ...n ] ) (link) - This takes multiple arguments and behaves like the multi-argument GROUPING in other engines.

(And possibly others; I wasn't entirely exhaustive).

My current understanding is that GROUPING (and GROUPING_ID) can generally take one or more columns and return a value (0 or 1 for a single column, or a bitmask for multiple columns) indicating whether each column was part of the current grouping set or was "rolled up" into a subtotal/grand total. I'm also curious if it's in the SQL standard!
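
To make the bitmask concrete, here is a minimal Python sketch. It assumes the convention PostgreSQL documents: the rightmost argument maps to the least-significant bit, and a bit is 1 when the column is not part of the current grouping set. The function name `grouping_value` is hypothetical and not part of any engine or of Substrait.

```python
from typing import List, Sequence, Set


def grouping_value(args: Sequence[str], grouping_set: Set[str]) -> int:
    """Return the GROUPING(args...) value for rows produced by one grouping set.

    Assumed convention: the leftmost argument is the most-significant bit,
    and a bit is 1 when the column is NOT in the grouping set (rolled up).
    """
    value = 0
    for column in args:
        value = (value << 1) | (0 if column in grouping_set else 1)
    return value


# ROLLUP (c1, c2) expands to the grouping sets (c1, c2), (c1), ().
rollup_sets: List[Set[str]] = [{"c1", "c2"}, {"c1"}, set()]

single = [grouping_value(["c1"], s) for s in rollup_sets]
multi = [grouping_value(["c1", "c2"], s) for s in rollup_sets]
```

With a single argument the result is just the 0/1 flag; with several arguments the same flags are packed into one integer.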

One thought that came to mind, given how GROUPING operates, is its placement within the Substrait model. Unlike "normal" aggregate functions (like SUM or COUNT) that process a series of input values, GROUPING acts more like a metadata marker. Its output is determined by the grouping keys and the specific grouping set that generated the row, rather than aggregating data values themselves. For instance, properties like decomposable and intermediate aren't really meaningful for it.

This makes me wonder if, from a semantic purity standpoint, it might fit more cleanly as an additional (perhaps optional) output field directly on the Aggregate relation or within the Measure protobuf, rather than as a new aggregate function. The trade-off, as I understand it, would be that changing the core protobuf definitions is a bigger lift than just adding a new function. That might be a conceptually cleaner representation in the long run, although adding it like this is quite pragmatic.

Anyway, that's what I've learned from digging into this! Hope these thoughts are helpful for the discussion.

@EpsilonPrime
Member

EpsilonPrime commented Jul 4, 2025

How does grouping id correspond to the additional output of the aggregation relation described (grouping set) here?

https://substrait.io/relations/logical_relations/#aggregate-operation

@bestbeforetoday
Author

Let me experiment to see if I can make it work using the additional disambiguation column value instead of adding a new extension function.

@bestbeforetoday bestbeforetoday marked this pull request as draft July 4, 2025 09:27
Support for the SQL grouping function that is used with group by
statements. The SQL grouping function typically has a single column
parameter. However, many dialects also provide a grouping_id function,
which behaves as an extended grouping function, allowing multiple
parameters to be specified.

Calcite treats grouping and grouping_id as synonyms and allows multiple
parameters for either function name. This implementation follows the
Calcite model by allowing multiple parameters. The behavior for a single
parameter should be the same as grouping, while multiple parameters
should behave the same as grouping_id.

Signed-off-by: Mark S. Lewis <[email protected]>
@bestbeforetoday
Author

I pushed a change to avoid the Python test failure due to the change in function coverage.

@bestbeforetoday
Author

bestbeforetoday commented Jul 22, 2025

I did try to implement a solution in substrait-java using only the existing Substrait capability. Assuming I am correctly understanding the suggestions above, this involves essentially rewriting the query to a different form. An SQL query similar to:

SELECT GROUPING(c1) FROM t GROUP BY ROLLUP (c1, c2)

could be rewritten as a UNION ALL of a set of SELECT ... GROUP BY statements. For each of those statements, the value of the column to which the GROUPING function is applied is known based on the set of columns specified in the GROUP BY clause.

What gets generated by Calcite for the initial query is something like:

LogicalAggregate(group=[{0, 1}], groups=[[{0, 1}, {0}, {}]], EXPR$0=[GROUPING($1)])

I have not been able to produce a working conversion from Calcite to Substrait that avoids a GROUPING aggregate function, as described above. I don't doubt that this is down to my lack of expertise in the codebase. I also do not see any clear path to producing equivalent SQL from the Substrait representation generated by this method.

I propose adding a Substrait GROUPING aggregate function, as implemented by this pull request. This is an approach that maps closely to the Calcite representation, and that I can make work pretty easily in both directions.

@bestbeforetoday bestbeforetoday marked this pull request as ready for review July 22, 2025 15:30
@tokoko
Contributor

tokoko commented Jul 22, 2025

@bestbeforetoday I don't think a query rewrite to a set of unions was the suggestion. You can specify multiple groupings in the Substrait Aggregate Rel, similar to the Calcite example you provided. If you set the grouping_expressions field to refer to fields c1 and c2 and then set groupings to a three-element array containing [[0, 1], [0], []] (the expression_references field), the Aggregate will effectively function like a rollup.

Most importantly, according to the spec that was linked above: "To further disambiguate which record belongs to which grouping set, an aggregate relation with more than one grouping set receives an extra i32 column on the right-hand side. The value of this field will be the zero-based index of the grouping set that yielded the record." This should produce something similar to what you're looking for... you will get an integer that refers to an index rather than a bitmask though.
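
As a rough illustration of that extra column, here is a small Python simulation. It is not Substrait or substrait-java code: the function `aggregate_with_grouping_sets`, the SUM measure, the sample rows, and the choice to emit None for keys missing from a grouping set are all assumptions made for the sketch. Only the names grouping_expressions and groupings, and the trailing zero-based index, come from the discussion above.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence, Tuple


def aggregate_with_grouping_sets(
    rows: Sequence[Dict[str, Any]],
    grouping_expressions: Sequence[str],
    groupings: Sequence[Sequence[int]],
    measure_column: str,
) -> List[Tuple[Any, ...]]:
    """Sum measure_column once per grouping set.

    Each output row is (key values..., sum, grouping set index), with the
    zero-based grouping set index as the rightmost column.
    """
    output: List[Tuple[Any, ...]] = []
    for set_index, refs in enumerate(groupings):
        kept = [grouping_expressions[i] for i in refs]
        totals: Dict[Tuple[Any, ...], int] = defaultdict(int)
        for row in rows:
            totals[tuple(row[c] for c in kept)] += row[measure_column]
        for key, total in totals.items():
            present = dict(zip(kept, key))
            # Keys that are not in this grouping set are emitted as None.
            keys_out = tuple(present.get(c) for c in grouping_expressions)
            output.append(keys_out + (total, set_index))
    return output


rows = [
    {"c1": "a", "c2": "x", "v": 1},
    {"c1": "a", "c2": "y", "v": 2},
    {"c1": "b", "c2": "x", "v": 4},
]
# [[0, 1], [0], []] mirrors ROLLUP (c1, c2).
result = aggregate_with_grouping_sets(rows, ["c1", "c2"], [[0, 1], [0], []], "v")
```

The last element of each tuple plays the role of the disambiguation field: 0 for the detail rows, 1 for the per-c1 subtotals, and 2 for the grand total.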

@bestbeforetoday
Author

Thank you for the explanation. It isn't the GROUP BY ROLLUP I am having trouble with, and I wasn't actually trying to rewrite the SQL. Instead I tried to use the disambiguation field to determine the actual value to insert in place of the GROUPING(...) aggregate function applied to the result, similar to the rewriting of the query in SQL where the value is known for each SELECT without actually having to use the GROUPING aggregate function. I am happy to take pointers on how to implement this successfully in the substrait-java codebase.

There is still the issue that I don't see an easy way to convert back from a Substrait representation (using specific values and the disambiguation field) to SQL using a GROUPING aggregate function. Any ideas?

@tokoko
Contributor

tokoko commented Jul 24, 2025

@bestbeforetoday I misread the comment, sorry about that. If we are talking about going from Calcite to Substrait, I think you should be able to rewrite grouping functions into a single CASE expression that relies on the disambiguation field. In your rollup example:

  • GROUPING(c1, c2) will become
CASE WHEN DisambiguationField = 0 THEN 0 -- 00 because 1st grouping set contains both fields
     WHEN DisambiguationField = 1 THEN 1 -- 01 because 2nd grouping set doesn't contain c2
     WHEN DisambiguationField = 2 THEN 3 -- 11 because 3rd grouping set contains neither of them
END
  • GROUPING(c1) will become
CASE WHEN DisambiguationField = 0 THEN 0 -- because 1st grouping set contains c1
     WHEN DisambiguationField = 1 THEN 0 -- because 2nd grouping set also contains c1
     WHEN DisambiguationField = 2 THEN 1 -- because 3rd grouping set doesn't contain c1
END
  • GROUPING(c2) will become
CASE WHEN DisambiguationField = 0 THEN 0 -- because 1st grouping set contains c2
     WHEN DisambiguationField = 1 THEN 1 -- because 2nd grouping set doesn't contain c2
     WHEN DisambiguationField = 2 THEN 1 -- because 3rd grouping set doesn't contain c2
END
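
The three rewrites above follow one mechanical rule, which could be sketched in Python roughly as follows. The helper `grouping_case_expression` is hypothetical, DisambiguationField is the placeholder name used above rather than a real Substrait field name, and the bit order assumes the leftmost GROUPING argument is the most-significant bit.

```python
from typing import Sequence


def grouping_case_expression(
    args: Sequence[str],
    grouping_expressions: Sequence[str],
    groupings: Sequence[Sequence[int]],
    index_field: str = "DisambiguationField",
) -> str:
    """Render GROUPING(args...) as a CASE over the grouping set index."""
    branches = []
    for set_index, refs in enumerate(groupings):
        in_set = {grouping_expressions[i] for i in refs}
        value = 0
        for column in args:
            # A bit is 1 when the column is absent from this grouping set.
            value = (value << 1) | (0 if column in in_set else 1)
        branches.append(f"WHEN {index_field} = {set_index} THEN {value}")
    return "CASE " + " ".join(branches) + " END"


rollup = [[0, 1], [0], []]  # ROLLUP (c1, c2)
both = grouping_case_expression(["c1", "c2"], ["c1", "c2"], rollup)
only_c1 = grouping_case_expression(["c1"], ["c1", "c2"], rollup)
```

Because every branch value is known at plan time, the conversion only needs the list of grouping sets and the argument list, not the data.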

I also modified @wackywendell's postgres sqlfiddle example to show these case statements.

Going the other way should also be possible with a CASE expression. You would essentially have to run a CASE expression over the SQL GROUPING(c1, c2) value and use the same cases as in my GROUPING(c1, c2) example, except the values will be flipped to go from bit values to indices. It should work fine as long as there are no exact duplicates in groupings, which is an interesting edge case 😆 (if it's even valid SQL.. fwiw postgres seems to have no problem running a grouping set with duplicate groupings)
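
A minimal Python sketch of that reverse direction, including the duplicate-grouping edge case, might look like this. The function `index_by_grouping_value` is hypothetical, and raising an error on duplicates is just one possible way to handle them.

```python
from typing import Dict, Sequence


def index_by_grouping_value(
    grouping_expressions: Sequence[str],
    groupings: Sequence[Sequence[int]],
) -> Dict[int, int]:
    """Map the GROUPING(all keys) bitmask back to the grouping set index.

    Raises ValueError when two grouping sets produce the same bitmask,
    since the index can then no longer be recovered from the bitmask alone.
    """
    mapping: Dict[int, int] = {}
    for set_index, refs in enumerate(groupings):
        in_set = {grouping_expressions[i] for i in refs}
        value = 0
        for column in grouping_expressions:
            value = (value << 1) | (0 if column in in_set else 1)
        if value in mapping:
            raise ValueError(
                f"grouping sets {mapping[value]} and {set_index} are duplicates"
            )
        mapping[value] = set_index
    return mapping


# ROLLUP (c1, c2): bitmask 0 -> index 0, bitmask 1 -> index 1, bitmask 3 -> index 2.
rollup_mapping = index_by_grouping_value(["c1", "c2"], [[0, 1], [0], []])
```

Each entry corresponds to one WHEN branch of the reversed CASE expression, with the bitmask as the condition and the index as the result.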

@bestbeforetoday bestbeforetoday marked this pull request as draft September 23, 2025 09:13