-
Notifications
You must be signed in to change notification settings - Fork 187
feat: add grouping aggregate function #824
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Hey everyone, I've been looking into the It turns out that the
(And possibly others; I wasn't entirely exhaustive). My current understanding is that One thought that came to mind, given how This makes me wonder if, from a semantic purity standpoint, it might fit more cleanly as an additional (perhaps optional) output field directly on the Anyway, that's what I've learned from digging into this! Hope these thoughts are helpful for the discussion. |
|
How does grouping id correspond to the additional output of the aggregation relation described (grouping set) here? https://substrait.io/relations/logical_relations/#aggregate-operation |
|
Let me experiment to see if I can make it work using the additional disambiguation column value instead of adding a new extension function. |
Support for the SQL grouping function that is used with group by statements. The SQL grouping function typically has a single column parameter. However, many dialects also provide a grouping_id function, which behaves as an extended grouping function, allowing multiple parameters to be specified. Calcite treats grouping and grouing_id as synonyms and allows multiple parameters for either function name. This implementation follows the Calcite model by allowing multiple parameters. The behavior for a single parameter should be the same as grouping, while multiple parameters should behave the same as grouping_id. Signed-off-by: Mark S. Lewis <[email protected]>
ae0c24f to
42e1639
Compare
|
I pushed a change to avoid the Python test failure due to the change in function coverage. |
|
I did try to implement a solution in substrait-java using only the existing Substrait capability. Assuming I am correctly understanding the suggestions above, this involves essentially rewriting the query to a different form. An SQL query similar to: SELECT GROUPING(c1) FROM t GROUP BY ROLLUP (c1, c2)could be rewritten as a UNION ALL of a set of What gets generated by Calcite for the initial query is something like: I have not been successful in implementing a working implementation to convert from Calcite to Substrait without using a GROUPING aggregate function, as described above. I don't doubt that this is down to my lack of expertise in the codebase. I do not see any clear path to producing equivalent SQL from the Substrait representation generated by this method. I propose adding a Substrait GROUPING aggregate function, as implemented by this pull request. This is an approach that maps closely to the Calcite representation, and that I can make work pretty easily in both directions. |
|
@bestbeforetoday I don't think query rewrite to a set of unions was the suggestion. you can specify multiple groupings in Substrait Most importantly, according to the spec that was linked above: |
|
Thank you for the explanation. It isn't the GROUP BY ROLLUP I am having trouble with, and I wasn't actually trying to rewrite the SQL. Instead I tried to use the disambiguation field to determine the actual value to insert in place of the GROUPING(...) aggregate function applied to the result, similar to the rewriting of the query in SQL where the value is known for each SELECT without actually having to use the GROUPING aggregate function. I am happy to take pointers on how to implement this successfully in the substrait-java codebase. There is still the issue that I don't see an easy way to convert back from a Substrait representation (using specific values and the disambiguation field) to SQL using a GROUPING aggregate function. Any ideas? |
|
@bestbeforetoday I misread the comment, sorry about that. If we are talking about going from calcite to substrait, I think you should be able to rewrite grouping functions into a single case statement that relies on the disambiguation field. In your rollup example:
I also modified @wackywendell's postgres sqlfiddle example to show these case statements. Going the other should also be possible with a case statement. You would essentially have to run a case statement over sql |
Support for the SQL grouping function that is used with group by statements. The SQL grouping function typically has a single column parameter. However, many dialects also provide a grouping_id function, which behaves as an extended grouping function, allowing multiple parameters to be specified.
Calcite treats grouping and grouing_id as synonyms and allows multiple parameters for either function name. This implementation follows the Calcite model by allowing multiple parameters. The behavior for a single parameter should be the same as grouping, while multiple parameters should behave the same as grouping_id.
Relates to substrait-io/substrait-java#71