[CALCITE-6830] AVG() returns double precision by default when its argument type is INT_TYPES #4193

LantaoJin · 2025-02-13T08:50:49Z

To reproduce (agg.iq)

!use post
!set outputformat mysql

SELECT avg(deptno) as a FROM emp;

Should return 22.1429, but returns 22.
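
The truncation can be reproduced outside Calcite. The following is only a minimal sketch: the `deptno` values are hypothetical (chosen so that the sum is 155 over 7 rows, which matches the expected 22.1429) and are not the actual contents of `emp`, and the class and method names are made up for illustration.

```java
import java.util.Locale;

public class AvgIntegerDemo {
  /** Average computed in integer arithmetic: the fractional part is truncated. */
  public static int avgAsInt(int[] values) {
    long sum = 0;
    for (int v : values) {
      sum += v;
    }
    return (int) (sum / values.length);
  }

  /** Same sum, but the final division is done in double precision. */
  public static double avgAsDouble(int[] values) {
    long sum = 0;
    for (int v : values) {
      sum += v;
    }
    return (double) sum / values.length;
  }

  public static void main(String[] args) {
    // Hypothetical values, not the real emp table: sum = 155, count = 7.
    int[] deptno = {10, 10, 20, 20, 30, 30, 35};
    System.out.println(avgAsInt(deptno));                                        // 22
    System.out.println(String.format(Locale.ROOT, "%.4f", avgAsDouble(deptno))); // 22.1429
  }
}
```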

By default, Calcite gives AVG() the same return type as its input column. For example, if the input column is INT, the return type of AVG() is INT.
This behavior differs from other databases. For example:

Postgres

numeric for any integer type argument, double precision for a floating-point argument, otherwise the same as the argument data type

MySQL

The SUM() and AVG() functions return a DECIMAL value for exact-value arguments (integer or DECIMAL), and a DOUBLE value for approximate-value arguments (FLOAT or DOUBLE).

Presto

avg(x) → double. (https://prestodb.io/docs/current/functions/aggregate.html)

In addition, because AVG with an integer argument returns an integer type, var_pop, var_samp, stddev_pop, stddev_samp and stddev are affected in the same way as avg.

This PR changes the default return type of avg to double precision when its argument type is TINYINT, SMALLINT, INTEGER or BIGINT.

mihaibudiu · 2025-02-13T17:14:43Z

DOUBLE is almost never the right type in SQL, since computations on double values are in general non-deterministic. If any type is right, a variant of DECIMAL should be used for AVG. Can you propose an algorithm to choose the precision and scale for the result? Ideally Calcite would follow what other databases do - at least databases that have proper DECIMAL types; the Postgres DECIMAL type is not a standard one, for example.

For functions like STDDEV it's a bit more complicated, since they involve a square root at the last step.

Moreover, this change is backwards-incompatible, and it may cause problems for all projects that have relied on this behavior. At the very least this change should be documented in history.md as such.

LantaoJin · 2025-02-14T09:27:08Z

> DOUBLE is almost never the right type in SQL, since computations on double values are in general non-deterministic. If any type is right, a variant of DECIMAL should be used for AVG. Can you propose an algorithm to choose the precision and scale for the result?

CUME_DIST and PERCENT_RANK return the DOUBLE type in Calcite too. Also, in Calcite the default precision is 15 for DOUBLE and 17 for DECIMAL. Would using the default DOUBLE precision in AVG make the result non-deterministic? If not, is there a specific reason to use DECIMAL for AVG? In Postgres, the DOUBLE precision for AVG is calculated based on the input values as follows (not quite sure for now):

precision = max(precision(input_values) + ceil(log10(count(input_values))), scale(input_values) + decimal_places)

I'm not sure which precision algorithm would be best, but AVG should definitely not return an integer by default (with an INT_TYPES argument).

sonarqubecloud · 2025-02-14T10:06:14Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
23.1% Duplication on New Code

See analysis details on SonarQube Cloud

mihaibudiu · 2025-02-14T18:42:01Z

In general, the result of aggregating doubles by addition depends on the order in which the numbers are processed, whereas the addition of decimals always gives the same result.

It's true that this can only happen for very large integers, since double has a 53-bit mantissa. Any integer value that fits in 53 bits is represented exactly, and if all intermediate sums also fit in 53 bits the result will be deterministic and exact; the only imprecision will be produced by the final division. The result of division in FP and DECIMAL can also be slightly different. In general, for financial-type computations you should always use DECIMAL.
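
To make the order dependence concrete, here is a minimal sketch. The class and method names are made up for illustration, and the inputs are chosen deliberately at the 53-bit boundary. Summing the same three BIGINT values as doubles gives different results depending on the order, while an exact decimal sum does not.

```java
import java.math.BigDecimal;

public class SumOrderDemo {
  /** Sums the values left to right in double precision. */
  public static double sumAsDouble(long[] values) {
    double sum = 0;
    for (long v : values) {
      sum += (double) v;
    }
    return sum;
  }

  /** Sums the values left to right in exact decimal arithmetic. */
  public static BigDecimal sumAsDecimal(long[] values) {
    BigDecimal sum = BigDecimal.ZERO;
    for (long v : values) {
      sum = sum.add(BigDecimal.valueOf(v));
    }
    return sum;
  }

  public static void main(String[] args) {
    long big = 1L << 53; // 2^53: above this, not every integer is a double
    long[] bigFirst = {big, 1, 1}; // each +1 is rounded away
    long[] bigLast = {1, 1, big};  // 1 + 1 = 2 survives the final addition

    System.out.println(sumAsDouble(bigFirst) == sumAsDouble(bigLast));             // false
    System.out.println(sumAsDecimal(bigFirst).compareTo(sumAsDecimal(bigLast)) == 0); // true
  }
}
```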

Your formula depends on the number of input values, which is not known when you compile the SQL program. Calcite is statically-typed, so it has to choose the type at compile-time.
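
For illustration, a compile-time rule can only look at the declared precision and scale of the argument, never at the row count. The sketch below is purely hypothetical: the constants `MAX_PRECISION` and `EXTRA_SCALE`, the "add four fractional digits" rule, and the HALF_UP rounding are assumptions made for this example, not the documented behavior of Calcite or of any particular database.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class StaticAvgTypeDemo {
  static final int MAX_PRECISION = 38; // hypothetical system limit
  static final int EXTRA_SCALE = 4;    // hypothetical number of added fractional digits

  /**
   * Derives {precision, scale} for AVG from the argument's declared type only.
   * Nothing here depends on how many rows are aggregated.
   */
  public static int[] deriveAvgDecimalType(int argPrecision, int argScale) {
    int precision = Math.min(MAX_PRECISION, argPrecision + EXTRA_SCALE);
    int scale = Math.min(MAX_PRECISION, argScale + EXTRA_SCALE);
    return new int[] {precision, scale};
  }

  /** Final division of AVG, carried out at the statically chosen scale. */
  public static BigDecimal avg(long sum, long count, int scale) {
    return BigDecimal.valueOf(sum)
        .divide(BigDecimal.valueOf(count), scale, RoundingMode.HALF_UP);
  }

  public static void main(String[] args) {
    // Treat INTEGER as DECIMAL(10, 0) for the purpose of this sketch.
    int[] type = deriveAvgDecimalType(10, 0);
    System.out.println("DECIMAL(" + type[0] + ", " + type[1] + ")"); // DECIMAL(14, 4)
    System.out.println(avg(155, 7, type[1]));                        // 22.1429
  }
}
```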

mihaibudiu · 2025-02-14T18:44:52Z

core/src/test/resources/org/apache/calcite/test/RelOptRulesTest.xml

@@ -1619,9 +1619,9 @@ LogicalAggregate(group=[{0}], EXPR$1=[STDDEV_POP($1)], EXPR$2=[AVG($1)], EXPR$3=
    </Resource>
    <Resource name="planAfter">
      <![CDATA[
-LogicalProject(NAME=[$0], EXPR$1=[CAST(POWER(/(-($1, /(*($2, $2), $3)), $3), 0.5:DECIMAL(2, 1))):INTEGER NOT NULL], EXPR$2=[CAST(/($2, $3)):INTEGER NOT NULL], EXPR$3=[CAST(POWER(/(-($1, /(*($2, $2), $3)), CASE(=($3, 1), null:BIGINT, -($3, 1))), 0.5:DECIMAL(2, 1))):INTEGER], EXPR$4=[CAST(/(-($1, /(*($2, $2), $3)), $3)):INTEGER NOT NULL], EXPR$5=[CAST(/(-($1, /(*($2, $2), $3)), CASE(=($3, 1), null:BIGINT, -($3, 1)))):INTEGER NOT NULL])
+LogicalProject(NAME=[$0], EXPR$1=[POWER(/(-($1, /(*(CAST($2):DOUBLE NOT NULL, CAST($2):DOUBLE NOT NULL), $3)), $3), 0.5:DECIMAL(2, 1))], EXPR$2=[/(CAST($2):DOUBLE NOT NULL, $3)], EXPR$3=[POWER(/(-($1, /(*(CAST($2):DOUBLE NOT NULL, CAST($2):DOUBLE NOT NULL), $3)), CASE(=($3, 1), null:BIGINT, -($3, 1))), 0.5:DECIMAL(2, 1))], EXPR$4=[/(-($1, /(*(CAST($2):DOUBLE NOT NULL, CAST($2):DOUBLE NOT NULL), $3)), $3)], EXPR$5=[CAST(/(-($1, /(*(CAST($2):DOUBLE NOT NULL, CAST($2):DOUBLE NOT NULL), $3)), CASE(=($3, 1), null:BIGINT, -($3, 1)))):DOUBLE NOT NULL])


As you can see, the type inferred for the result will cascade through the query plan, influencing the type of all values that depend on the aggregation result.

mihaibudiu · 2025-02-21T05:45:41Z

core/src/main/java/org/apache/calcite/rel/type/RelDataTypeSystemImpl.java

@@ -363,7 +363,12 @@ && getDefaultPrecision(typeName) != RelDataType.PRECISION_NOT_SPECIFIED) {

  @Override public RelDataType deriveAvgAggType(RelDataTypeFactory typeFactory,
      RelDataType argumentType) {
-    return argumentType;
+    if (SqlTypeName.INT_TYPES.contains(argumentType.getSqlTypeName())) {


The type system is actually pluggable. See feldera/feldera#3588 for the right way to do this. We should not change RelDataTypeSystemImpl.
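
As a rough, self-contained model of what "pluggable" means here: the classes below are simplified stand-ins written for this example, not Calcite's real classes. In Calcite the corresponding override point is `deriveAvgAggType`, shown in the diff above. The default keeps returning the argument type, and a project that wants a different AVG type supplies its own subclass instead of changing the shared default.

```java
import java.util.EnumSet;
import java.util.Set;

public class PluggableAvgTypeDemo {
  /** Simplified stand-in for SQL type names. */
  public enum SqlType { TINYINT, SMALLINT, INTEGER, BIGINT, DOUBLE, DECIMAL }

  public static final Set<SqlType> INT_TYPES =
      EnumSet.of(SqlType.TINYINT, SqlType.SMALLINT, SqlType.INTEGER, SqlType.BIGINT);

  /** Stand-in for the shared default: AVG keeps the argument type. */
  public static class DefaultTypeSystem {
    public SqlType deriveAvgAggType(SqlType argumentType) {
      return argumentType;
    }
  }

  /** Project-specific type system: integer arguments average to DOUBLE. */
  public static class DoubleAvgTypeSystem extends DefaultTypeSystem {
    @Override public SqlType deriveAvgAggType(SqlType argumentType) {
      if (INT_TYPES.contains(argumentType)) {
        return SqlType.DOUBLE;
      }
      return super.deriveAvgAggType(argumentType);
    }
  }

  public static void main(String[] args) {
    System.out.println(new DefaultTypeSystem().deriveAvgAggType(SqlType.INTEGER));   // INTEGER
    System.out.println(new DoubleAvgTypeSystem().deriveAvgAggType(SqlType.INTEGER)); // DOUBLE
    System.out.println(new DoubleAvgTypeSystem().deriveAvgAggType(SqlType.DECIMAL)); // DECIMAL
  }
}
```

With this shape, projects that rely on the existing behavior are unaffected, which addresses the backwards-compatibility concern raised earlier in the thread.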

LantaoJin marked this pull request as ready for review February 13, 2025 14:51

[CALCITE-6830] AVG() returns double precision by default when its arg…

e481c75

…ument type is INT_TYPES

LantaoJin force-pushed the pr/CALCITE-6830 branch from dc5a42b to e481c75 Compare February 14, 2025 10:09

F21 force-pushed the main branch from 7d38212 to cacf36a Compare February 17, 2025 03:33
