Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CALCITE-6830] AVG() returns double precision by default when its argument type is INT_TYPES #4193

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

LantaoJin
Copy link
Contributor

To reproduce (agg.iq)

!use post
!set outputformat mysql

SELECT avg(deptno) as a FROM emp;

Should return 22.1429, but returns 22.

The current default logic of Calcite setting the return type to be same as the input column. For example, if the input column is INT, the return type for AVG() is INT.
This behavior is different from other databases. For example:

Postgres

numeric for any integer type argument, double precision for a floating-point argument, otherwise the same as the argument data type

MySQL

The SUM() and AVG() functions return a DECIMAL value for exact-value arguments (integer or DECIMAL), and a DOUBLE value for approximate-value arguments (FLOAT or DOUBLE).

Presto

avg(x) → double. (https://prestodb.io/docs/current/functions/aggregate.html)

Besides, since the AVG agg with integer argument returns integer type, the behaviors of var_pop, var_samp, stddev_pop, stddev_samp and stddev are same as avg.

This PR targets to change the default return type of avg to double precision when its argument type are TINYINT, SMALLINT, INTEGER or BIGINT.

@LantaoJin LantaoJin marked this pull request as ready for review February 13, 2025 14:51
@mihaibudiu
Copy link
Contributor

DOUBLE is almost never the right type in SQL, since computations on double values are in general non-deterministic. If any type is right, a variant of DECIMAL should be used for AVG. Can you propose an algorithm to choose the precision and scale for the result? Ideally Calcite would follow what other databases do - at least databases that have proper DECIMAL types; the Postgres DECIMAL type is not a standard one, for example.

For functions like STDDEV it's a bit more complicated, since they involve a square root at the last step.

Moreover, this change is backwards-incompatible, and it may cause problems for all projects that have relied on this behavior. At the very least this change should be documented in history.md as such.

@LantaoJin
Copy link
Contributor Author

LantaoJin commented Feb 14, 2025

DOUBLE is almost never the right type in SQL, since computations on double values are in general non-deterministic. If any type is right, a variant of DECIMAL should be used for AVG. Can you propose an algorithm to choose the precision and scale for the result?

The CUME_DIST and PERCENT_RANK return DOUBLE type in Calcite too. And in Calcite, the default precision is 15 for DOUBLE, 17 for DECIMAL. Will using the default DOUBLE precision in AVG result in non-deterministic? If not, any specific reason to use DECIMAL for AVG? In Postgres, the DOUBLE precision for AVG calculated based on the input values as follows (not quite sure for now):

precision = max(precision(input_values) + ceil(log10(count(input_values))), scale(input_values) + decimal_places)

Not sure what precision algorithm would be the best, but definitely not return integer for AVG by default (with INT_TYPES argument).

@mihaibudiu
Copy link
Contributor

In general, the result of double addition aggregation depends on the order in which the numbers are processed, whereas the addition of decimals gives always the same result.

It's true that this can only happen for very large integers, since double has 53 bits of mantissa. So any integer value which requires less than 53 bits will be represented exactly, and if all intermediate addition results are less than 53 bits the result will be deterministic and exact; the only imprecision will be produced by the final division. The result of division in FP and DECIMAL can also be slightly different. In general for financial-type computations you should always use DECIMAL.

Your formula depends on the number of input values, which is not known when you compile the SQL program. Calcite is statically-typed, so it has to choose the type at compile-time.

@@ -1619,9 +1619,9 @@ LogicalAggregate(group=[{0}], EXPR$1=[STDDEV_POP($1)], EXPR$2=[AVG($1)], EXPR$3=
</Resource>
<Resource name="planAfter">
<![CDATA[
LogicalProject(NAME=[$0], EXPR$1=[CAST(POWER(/(-($1, /(*($2, $2), $3)), $3), 0.5:DECIMAL(2, 1))):INTEGER NOT NULL], EXPR$2=[CAST(/($2, $3)):INTEGER NOT NULL], EXPR$3=[CAST(POWER(/(-($1, /(*($2, $2), $3)), CASE(=($3, 1), null:BIGINT, -($3, 1))), 0.5:DECIMAL(2, 1))):INTEGER], EXPR$4=[CAST(/(-($1, /(*($2, $2), $3)), $3)):INTEGER NOT NULL], EXPR$5=[CAST(/(-($1, /(*($2, $2), $3)), CASE(=($3, 1), null:BIGINT, -($3, 1)))):INTEGER NOT NULL])
LogicalProject(NAME=[$0], EXPR$1=[POWER(/(-($1, /(*(CAST($2):DOUBLE NOT NULL, CAST($2):DOUBLE NOT NULL), $3)), $3), 0.5:DECIMAL(2, 1))], EXPR$2=[/(CAST($2):DOUBLE NOT NULL, $3)], EXPR$3=[POWER(/(-($1, /(*(CAST($2):DOUBLE NOT NULL, CAST($2):DOUBLE NOT NULL), $3)), CASE(=($3, 1), null:BIGINT, -($3, 1))), 0.5:DECIMAL(2, 1))], EXPR$4=[/(-($1, /(*(CAST($2):DOUBLE NOT NULL, CAST($2):DOUBLE NOT NULL), $3)), $3)], EXPR$5=[CAST(/(-($1, /(*(CAST($2):DOUBLE NOT NULL, CAST($2):DOUBLE NOT NULL), $3)), CASE(=($3, 1), null:BIGINT, -($3, 1)))):DOUBLE NOT NULL])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you see, the type inferred for the result will cascade through the query plan, influencing the type all values that depend on the aggregation result.

@@ -363,7 +363,12 @@ && getDefaultPrecision(typeName) != RelDataType.PRECISION_NOT_SPECIFIED) {

@Override public RelDataType deriveAvgAggType(RelDataTypeFactory typeFactory,
RelDataType argumentType) {
return argumentType;
if (SqlTypeName.INT_TYPES.contains(argumentType.getSqlTypeName())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the type system is actually pluggable. See feldera/feldera#3588 for the right way to do this. We should not change RelDataTypeSystemImpl.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants