Skip to content

Feature: Use Hilbert joint statistics for multi-column aggregate cardinality estimation #19788

@forsaken628

Description

@forsaken628

Summary

Improve optimizer cardinality estimation for multi-column GROUP BY by introducing optional Hilbert-based joint statistics.

Currently, aggregate cardinality estimation in src/query/sql/src/planner/plans/aggregate.rs uses per-column NDV and a fixed correlation coefficient:

estimated_groups = min(input_rows, ndv(c1) * ndv(c2) * ndv(c3) * correlation_factor ...)

This can be inaccurate for strongly correlated columns such as (country, city), (date, month), (x, y), or multi-column cluster keys.

Databend already has Hilbert clustering infrastructure:

  • hilbert_range_index(...) maps discretized multi-dimensional values into a one-dimensional Hilbert key.
  • Hilbert recluster already computes per-dimension range bounds and Hilbert index bounds.
  • Current optimizer statistics only expose single-column ColumnStat, so this information is not available for aggregate cost estimation.

Proposal

Introduce optional table-level multi-column statistics for selected column groups, especially Hilbert cluster keys and frequent multi-column GROUP BY keys.

A practical design could store:

  JointStats {
    columns: Vec<ColumnId>,
    dimension_bounds: Vec<Vec<Scalar>>,
    hilbert_histogram: Vec<Bucket>,
  }

  Bucket {
    hilbert_lower: bytes,
    hilbert_upper: bytes,
    row_count: u64,
    ndv: Option<u64>,
    tuple_hll: Option<MetaHLL>,
  }

The optimizer could use these statistics to estimate:

  • GROUP BY (a, b, c) cardinality.
  • Group cardinality after filters on part of the same column group.
  • Partial aggregate output rows, shuffle rows, and hash table memory cost.

Expected space cost:

  • Minimal tuple HLL per column group: about 128B ~ 1KB.
  • Hilbert joint histogram with current default hilbert_num_range_ids = 1000: about 50KB ~ 100KB per column group.
  • Stronger version with bucket-level tuple HLL: about 180KB ~ 300KB per column group.

This should be stored only for selected column groups, not all combinations. Good candidates are:

  • The table cluster key when it is Hilbert clustered.
  • Column groups explicitly requested by ANALYZE.
  • High-frequency multi-column GROUP BY keys from query history.

This would provide a more principled alternative to the current independent-NDV product model while keeping metadata size bounded.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions