Description
Background
Velox's AggregationNode was likely designed following the Presto style, where the step is a property of the aggregation operator. By comparison, Spark's AggregateMode is bound to a specific AggregateFunction, making it possible for one Spark aggregate operator to contain aggregate functions with different modes (steps). There was also an old discussion around this difference.
Apache Gluten has been relying on some tricks to align Spark with Velox here: the planner translates Spark's aggregation modes into different types of Velox companion functions, then always assigns the single step to Velox's AggregationNode. This sidesteps many issues caused by the mismatch. For example, we can naturally support spillable partial aggregation, which Velox does not support. We can also avoid unwanted flushing, which causes result mismatches with some query plans generated by Spark.
The solution basically worked as expected until we found that, for some functions, the intermediate / final companion functions cannot be resolved from the intermediate input types. This happens because the function's overloads accept different input types but use the same intermediate data type. For example, this one.
The issue is hard to fix completely because companion functions are all treated as normal aggregate functions in Velox. Even when we have the input types of the original aggregate function from Spark, we can't always use them to find the intermediate / final companion functions accurately, because in Velox those are resolved with intermediate data types. Velox does provide suffixed versions of final companion functions to distinguish between them, but in theory these are not reliable, since in the SQL world overloaded functions can only be distinguished by function name and input data types.
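The ambiguity can be illustrated with a toy model. This is only a sketch under stated assumptions: `Signature`, `resolveByIntermediate`, and `exampleOverloads` are hypothetical names, the type strings are made up, and nothing here is a Velox API. It shows that when two overloads share one intermediate type, a lookup keyed on that type alone returns more than one candidate.

```cpp
#include <string>
#include <vector>

// Toy model only: none of these types are Velox APIs.
struct Signature {
  std::string rawInput;      // type accepted by the original function
  std::string intermediate;  // accumulator type exchanged between steps
  std::string result;        // final result type
};

// Returns every overload whose intermediate type matches the given one.
// This mimics resolving a merge / final companion from its input type alone.
std::vector<Signature> resolveByIntermediate(
    const std::vector<Signature>& overloads,
    const std::string& intermediate) {
  std::vector<Signature> matches;
  for (const auto& signature : overloads) {
    if (signature.intermediate == intermediate) {
      matches.push_back(signature);
    }
  }
  return matches;
}

// Hypothetical function with two overloads that share one intermediate type
// but differ in raw input type and result type.
std::vector<Signature> exampleOverloads() {
  return {
      {"bigint", "row(double,bigint)", "double"},
      {"decimal", "row(double,bigint)", "decimal"},
  };
}
```

In this model, looking up the shared intermediate type yields two candidates with different result types, so the caller cannot tell which one the original Spark function corresponds to without the raw input type.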
Proposed Changes
- Move AggregationNode::Step to AggregationNode::Aggregate::Step and make it work.
- Add a flag to AggregationNode that allows users to actively disable flushing.
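A rough sketch of what the proposed shape could look like follows. It is a simplified stand-in, not the real Velox classes: `AggregationNodeSketch`, `flushingEnabled`, `makeUniform`, and `hasMixedSteps` are hypothetical names chosen for illustration, and the actual field and flag names would be decided in the implementation.

```cpp
#include <string>
#include <vector>

// Simplified stand-in for the proposed API shape; not the real Velox classes.
struct AggregationNodeSketch {
  enum class Step { kPartial, kIntermediate, kFinal, kSingle };

  struct Aggregate {
    std::string functionName;
    Step step;  // proposed: the step is carried by each aggregate
  };

  std::vector<Aggregate> aggregates;
  bool flushingEnabled = true;  // hypothetical name for the proposed flag
};

// Presto-style construction: every aggregate shares the same step.
AggregationNodeSketch makeUniform(
    const std::vector<std::string>& names,
    AggregationNodeSketch::Step step) {
  AggregationNodeSketch node;
  for (const auto& name : names) {
    node.aggregates.push_back({name, step});
  }
  return node;
}

// True if the node contains aggregates with different steps, which is the
// Spark-style case the current node-level step cannot express.
bool hasMixedSteps(const AggregationNodeSketch& node) {
  for (const auto& aggregate : node.aggregates) {
    if (aggregate.step != node.aggregates.front().step) {
      return true;
    }
  }
  return false;
}
```

A Presto-style caller would use something like `makeUniform({"sum", "count"}, Step::kPartial)`, while a Spark-style caller could fill `aggregates` with different steps and set the flag to `false`.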
The changes will make Velox's aggregation API fully compatible with both Spark and Presto, and possibly with other databases, because the API becomes finer grained. In Presto, we can just pass the same aggregation step to all aggregate functions in the operator; in Spark, we can now translate Spark's AggregateMode into Velox's AggregationNode::Aggregate::Step for each aggregate function. Spark doesn't do flushing, so in Spark we can normally just disable flushing. We can also stop relying on companion functions.
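The per-function translation could be sketched as below. The four Spark modes (Partial, PartialMerge, Final, Complete) and the four Velox steps (kPartial, kIntermediate, kFinal, kSingle) exist today, but the pairing shown is an assumption based on their usual meaning, and `SparkAggregateMode`, `VeloxStep`, and `toVeloxStep` are hypothetical names, not Gluten or Velox APIs.

```cpp
// Local mirrors of the two enums, declared here so the sketch is
// self-contained; they are not the real Spark or Velox definitions.
enum class SparkAggregateMode { Partial, PartialMerge, Final, Complete };
enum class VeloxStep { kPartial, kIntermediate, kFinal, kSingle };

// Assumed one-to-one mapping from a Spark mode to a per-aggregate step.
constexpr VeloxStep toVeloxStep(SparkAggregateMode mode) {
  switch (mode) {
    case SparkAggregateMode::Partial:
      return VeloxStep::kPartial;       // raw input -> intermediate
    case SparkAggregateMode::PartialMerge:
      return VeloxStep::kIntermediate;  // intermediate -> intermediate
    case SparkAggregateMode::Final:
      return VeloxStep::kFinal;         // intermediate -> final result
    case SparkAggregateMode::Complete:
      return VeloxStep::kSingle;        // raw input -> final result
  }
  return VeloxStep::kSingle;  // not reached for valid enum values
}
```

With a mapping like this, the planner would no longer need to pick a companion function by name; each aggregate would keep its original function and carry its own step.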