Feature:  Use Hilbert joint statistics for multi-column aggregate cardinality estimation

**Summary**

Improve optimizer cardinality estimation for multi-column `GROUP BY` by introducing optional Hilbert-based joint statistics.

  Currently, aggregate cardinality estimation in `src/query/sql/src/planner/plans/aggregate.rs` uses per-column NDV and a fixed correlation coefficient:

  ```text
  estimated_groups = min(input_rows, ndv(c1) * ndv(c2) * ndv(c3) * correlation_factor ...)
```

  This can be inaccurate for strongly correlated columns such as (country, city), (date, month), (x, y), or multi-column cluster keys.

  Databend already has Hilbert clustering infrastructure:

  - hilbert_range_index(...) maps discretized multi-dimensional values into a one-dimensional Hilbert key.
  - Hilbert recluster already computes per-dimension range bounds and Hilbert index bounds.
  - Current optimizer statistics only expose single-column ColumnStat, so this information is not available for aggregate cost estimation.

**Proposal**

  Introduce optional table-level multi-column statistics for selected column groups, especially Hilbert cluster keys and frequent multi-column GROUP BY keys.

  A practical design could store:

```rust
  JointStats {
    columns: Vec<ColumnId>,
    dimension_bounds: Vec<Vec<Scalar>>,
    hilbert_histogram: Vec<Bucket>,
  }

  Bucket {
    hilbert_lower: bytes,
    hilbert_upper: bytes,
    row_count: u64,
    ndv: Option<u64>,
    tuple_hll: Option<MetaHLL>,
  }
```

  The optimizer could use these statistics to estimate:

  - GROUP BY (a, b, c) cardinality.
  - Group cardinality after filters on part of the same column group.
  - Partial aggregate output rows, shuffle rows, and hash table memory cost.

  Expected space cost:

  - Minimal tuple HLL per column group: about 128B ~ 1KB.
  - Hilbert joint histogram with current default hilbert_num_range_ids = 1000: about 50KB ~ 100KB per column group.
  - Stronger version with bucket-level tuple HLL: about 180KB ~ 300KB per column group.

  This should be stored only for selected column groups, not all combinations. Good candidates are:

  - The table cluster key when it is Hilbert clustered.
  - Column groups explicitly requested by ANALYZE.
  - High-frequency multi-column GROUP BY keys from query history.

  This would provide a more principled alternative to the current independent-NDV product model while keeping metadata size bounded.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: Use Hilbert joint statistics for multi-column aggregate cardinality estimation #19788

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Feature: Use Hilbert joint statistics for multi-column aggregate cardinality estimation #19788

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions