Summary
Improve optimizer cardinality estimation for multi-column GROUP BY by introducing optional Hilbert-based joint statistics.
Currently, aggregate cardinality estimation in src/query/sql/src/planner/plans/aggregate.rs uses per-column NDV and a fixed correlation coefficient:
estimated_groups = min(input_rows, ndv(c1) * ndv(c2) * ndv(c3) * correlation_factor ...)
This can be inaccurate for strongly correlated columns such as (country, city), (date, month), (x, y), or multi-column cluster keys.
Databend already has Hilbert clustering infrastructure:
- hilbert_range_index(...) maps discretized multi-dimensional values into a one-dimensional Hilbert key.
- Hilbert recluster already computes per-dimension range bounds and Hilbert index bounds.
- Current optimizer statistics only expose single-column ColumnStat, so this information is not available for aggregate cost estimation.
Proposal
Introduce optional table-level multi-column statistics for selected column groups, especially Hilbert cluster keys and frequent multi-column GROUP BY keys.
A practical design could store:
JointStats {
columns: Vec<ColumnId>,
dimension_bounds: Vec<Vec<Scalar>>,
hilbert_histogram: Vec<Bucket>,
}
Bucket {
hilbert_lower: bytes,
hilbert_upper: bytes,
row_count: u64,
ndv: Option<u64>,
tuple_hll: Option<MetaHLL>,
}
The optimizer could use these statistics to estimate:
- GROUP BY (a, b, c) cardinality.
- Group cardinality after filters on part of the same column group.
- Partial aggregate output rows, shuffle rows, and hash table memory cost.
Expected space cost:
- Minimal tuple HLL per column group: about 128B ~ 1KB.
- Hilbert joint histogram with current default hilbert_num_range_ids = 1000: about 50KB ~ 100KB per column group.
- Stronger version with bucket-level tuple HLL: about 180KB ~ 300KB per column group.
This should be stored only for selected column groups, not all combinations. Good candidates are:
- The table cluster key when it is Hilbert clustered.
- Column groups explicitly requested by ANALYZE.
- High-frequency multi-column GROUP BY keys from query history.
This would provide a more principled alternative to the current independent-NDV product model while keeping metadata size bounded.
Summary
Improve optimizer cardinality estimation for multi-column
GROUP BYby introducing optional Hilbert-based joint statistics.Currently, aggregate cardinality estimation in
src/query/sql/src/planner/plans/aggregate.rsuses per-column NDV and a fixed correlation coefficient:This can be inaccurate for strongly correlated columns such as (country, city), (date, month), (x, y), or multi-column cluster keys.
Databend already has Hilbert clustering infrastructure:
Proposal
Introduce optional table-level multi-column statistics for selected column groups, especially Hilbert cluster keys and frequent multi-column GROUP BY keys.
A practical design could store:
The optimizer could use these statistics to estimate:
Expected space cost:
This should be stored only for selected column groups, not all combinations. Good candidates are:
This would provide a more principled alternative to the current independent-NDV product model while keeping metadata size bounded.