Skip to content

Latest commit

 

History

History
490 lines (358 loc) · 17.2 KB

File metadata and controls

490 lines (358 loc) · 17.2 KB

RFC: KLL-Based Histogram Generation for ANALYZE TABLE

Summary

Databend can optionally generate column histograms during ANALYZE TABLE, but the current implementation builds each supported column histogram by planning and executing a SQL query with NTILE(100) OVER (ORDER BY col), followed by GROUP BY quantile.

This produces row-order equi-depth buckets, but it requires a full column scan, a global sort/window step, and distinct aggregation for every histogram column. The implementation is expensive enough that enable_analyze_histogram is disabled by default.

This RFC proposes adding a selectable histogram generation algorithm for ANALYZE TABLE. The existing SQL/window implementation remains available, and a new storage-native KLL algorithm can be selected through settings, table options, or new ANALYZE TABLE syntax. The KLL builder estimates equi-depth bucket boundaries with bounded rank error, then computes bucket statistics through a streaming aggregation pass. The result keeps the existing Histogram metadata format and optimizer-facing semantics, while avoiding the global sort and reducing memory pressure when the KLL algorithm is selected.

Motivation

The current histogram path is implemented in AnalyzeTableInterpreter. When both enable_analyze_histogram and enable_table_snapshot_stats are enabled, Databend creates one query per supported column:

SELECT
    quantile,
    COUNT(DISTINCT col) AS ndv,
    MAX(col) AS max_value,
    MIN(col) AS min_value,
    COUNT() AS count
FROM (
    SELECT
        col,
        NTILE(100) OVER (ORDER BY col) AS quantile
    FROM db.table
    WHERE col IS DISTINCT FROM NULL
)
GROUP BY quantile
ORDER BY quantile

This has several problems:

  • The window step needs a global order for each analyzed column.
  • Large tables may spill or OOM while sorting.
  • The work is repeated independently for every histogram column.
  • The generated buckets are row-order equi-depth buckets, not disjoint value-domain partitions. Hot values may cross bucket boundaries and produce repeated or overlapping bucket ranges.

The equi-depth model itself is useful for skewed data. The problem is the way Databend currently constructs the buckets.

KLL sketches provide a mergeable, bounded-memory approximation of quantile positions. They fit Databend's pipeline model better than ORDER BY + NTILE: workers can build local sketches, merge them, then derive approximate bucket boundaries without materializing and sorting all values.

Goals

  • Add a selectable histogram generation algorithm for ANALYZE TABLE.
  • Allow users to choose between the current SQL/window algorithm and a new KLL algorithm.
  • Generate KLL-based ANALYZE histograms without a global sort/window query per column.
  • Keep histogram generation streaming and bounded-memory.
  • Keep the existing Histogram and HistogramBucket metadata representation.
  • Preserve the optimizer-facing bucket fields: lower_bound, upper_bound, num_values, and num_distinct.
  • Support mergeable partial states so the implementation works with parallel and distributed execution.
  • Allow incremental adoption behind existing or new settings.

Non-Goals

  • This RFC does not require exact equi-depth bucket boundaries.
  • This RFC does not require changing optimizer selectivity formulas in the first implementation.
  • This RFC does not require solving fully incremental histogram maintenance in the first implementation.

Existing Behavior

ANALYZE TABLE has two related but separate statistics paths:

  • The NDV path reads segment/block metadata and builds or merges column HLLs. Missing block HLLs can be generated by scanning blocks.
  • The histogram path is built by AnalyzeTableInterpreter as extra query pipelines. Each supported column has a separate SQL query whose result is sent to SinkAnalyzeState through HistogramInfoSink.

SinkAnalyzeState::create_histogram converts the query output rows into HistogramBuckets. During commit, those buckets are converted into Histogram::try_from_buckets(true, buckets, None) and stored in TableSnapshotStatistics.

Supported histogram columns currently follow the same type filter as range statistics:

  • Number
  • Date
  • Timestamp
  • TimestampTz
  • String
  • Decimal

Internally, histogram bounds are stored through Datum variants: Int, UInt, Float, and Bytes.

Proposed Design

Introduce an analyze histogram algorithm choice.

The first two algorithms are:

  • window: the current implementation based on NTILE(DEFAULT_HISTOGRAM_BUCKETS) OVER (ORDER BY col) and GROUP BY quantile;
  • kll: a new analyze-native streaming implementation based on KLL sketches.

The default algorithm can remain window for compatibility while the KLL path is experimental. Users can opt into kll through settings, table options, or new ANALYZE TABLE syntax.

The KLL builder has two phases.

Phase 1: Build KLL Sketches

Scan supported histogram columns and update one KLL sketch per column.

In parallel execution, each worker builds local sketches:

worker 1: col_a -> KLL A1, col_b -> KLL B1
worker 2: col_a -> KLL A2, col_b -> KLL B2
worker 3: col_a -> KLL A3, col_b -> KLL B3

Then sketches are merged by column:

merge(A1, A2, A3) -> global KLL for col_a
merge(B1, B2, B3) -> global KLL for col_b

The merge is only across partial states for the same column. Columns remain independent.

After merge, query the global sketch for bucket boundary ranks:

q0, q1, q2, ..., qN

where N is DEFAULT_HISTOGRAM_BUCKETS.

The first implementation can use the existing bucket count of 100.

Phase 2: Fill Buckets

Scan the same supported columns again and assign each non-null value to a bucket based on the KLL-derived boundaries.

For each bucket, maintain:

  • row count;
  • minimum observed value;
  • maximum observed value;
  • per-bucket distinct-value statistics.

At the end, emit the existing histogram bucket representation:

HistogramBucket::try_from_bounds(
    lower_bound,
    upper_bound,
    count as f64,
    bucket_ndv as f64,
)

The accuracy flag must reflect the semantics of num_distinct. Today, TypedHistogram::ndv() treats accurate histograms as exact by summing bucket num_distinct values and returning NdvEstimate::exact. Therefore, if per-bucket NDV is produced by HLL, Databend must not publish those approximate values as accuracy = true.

The implementation should choose one of these policies:

  • compute exact per-bucket distinct counts and keep accuracy = true; or
  • use per-bucket HLL and store the histogram as non-exact, for example accuracy = false; or
  • extend the histogram accuracy semantics so bucket NDV can distinguish exact counts from approximate HLL estimates.

The bucket boundaries are approximate quantile boundaries. Bucket row counts are computed from the second pass over actual rows, while bucket NDV exactness depends on the policy above.

Why Two Passes

KLL can estimate quantile boundaries, but it does not directly provide all histogram bucket fields required by Databend's optimizer.

Databend needs:

  • lower_bound;
  • upper_bound;
  • num_values;
  • num_distinct.

The first KLL pass gives approximate rank boundaries. The second pass fills the actual bucket statistics under those boundaries. This avoids global sorting while preserving useful bucket payloads.

A one-pass version could assume each bucket has approximately row_count / bucket_count rows, but that would make bucket counts and NDVs too weak for selectivity and join estimation.

Handling Repeated Boundaries

KLL-derived adjacent boundaries may be equal when a value is very frequent. This is not a bug; it reflects the data distribution.

The bucket assignment logic must define stable behavior for equal boundaries. For example:

  • assign values by rank target when possible during the second pass; or
  • assign values by boundary intervals with deterministic tie handling; or
  • collapse adjacent equal-boundary buckets into a single bucket when bucket ranges cannot be made meaningful.

The first implementation should prefer correctness and stable metadata over always producing exactly DEFAULT_HISTOGRAM_BUCKETS buckets. Producing fewer non-empty buckets is acceptable.

The resulting histogram is still an equi-depth approximation, not a disjoint value-domain histogram. Optimizer consumers should continue treating it as a row-distribution summary.

Integration Points

Analyze Pipeline

AnalyzeTableInterpreter should choose the histogram builder from the selected algorithm.

For window, Databend can keep the current query-pipeline based behavior. For kll, FuseTable::do_analyze can receive a histogram collection mode, or the KLL histogram builder can be integrated into the existing analyze source/sink flow.

Potential placement:

  • Extend AnalyzeCollectNDVSource or add a sibling source that reads projected histogram columns and emits KLL states.
  • Extend SinkAnalyzeState to collect merged KLL states, derive boundaries, request/fill bucket statistics, and commit histograms.
  • Keep histogram commit in TableSnapshotStatistics, matching the existing storage format.

Storage Metadata

No metadata format change is required for the first implementation. The final stored output remains HashMap<ColumnId, Histogram>.

If future work stores reusable KLL sketches at block or segment level, that should be a separate metadata extension.

This future extension is valuable because KLL sketches are naturally incremental. Newly written blocks can produce block-level sketches, and table or segment-level boundary discovery can merge persisted sketches instead of scanning raw column data again. Even if bucket filling still needs to scan rows to compute num_values and per-bucket NDV under the final boundaries, persisted KLL sketches can avoid the first pass used only to discover approximate equi-depth boundaries.

Algorithm Selection

The implementation can reuse these existing feature gates:

  • enable_analyze_histogram
  • enable_table_snapshot_stats

In addition, Databend should expose a way to choose the histogram generation algorithm. Possible configuration surfaces:

  • Session or global setting:
analyze_histogram_algorithm = 'window' | 'kll'
  • Table option, so a table can keep a stable analyze policy.

  • New ANALYZE TABLE syntax, for one-off control:

ANALYZE TABLE db.table WITH HISTOGRAM ALGORITHM = 'kll';

The exact syntax can be decided during implementation, but the API should make the algorithm choice explicit. The current SQL/window path should remain a valid algorithm rather than an implicit secondary path.

Error Parameter Configuration

KLL exposes a tunable accuracy/memory tradeoff. When the selected algorithm is kll, the implementation should provide a conservative default, but users and tests should be able to override the error target explicitly.

Possible configuration surfaces:

  • Session or global setting, for example:

    analyze_kll_error_rate = 0.01
    
  • Table option, so frequently analyzed tables can keep a stable histogram accuracy policy.

  • New ANALYZE TABLE syntax, for one-off control:

    ANALYZE TABLE db.table WITH HISTOGRAM ALGORITHM = 'kll', ERROR_RATE = 0.01;

The exact syntax can be decided during implementation. The important API property is that KLL-specific parameters are explicit, validated, and recorded in the analyze path rather than hidden as implementation constants.

Type Support

The first implementation can be conservative.

Recommended first step:

  • numeric types;
  • decimal types if converted with the same care as current histogram Datum conversion;
  • date and timestamp types through their existing integer-like scalar representation.

String support can come later. KLL over strings is possible for ordered byte values, but it needs careful memory accounting because values have variable length and sketch samples may retain owned bytes.

Error Model

KLL provides bounded rank error, not exact value error.

For histogram generation, that means:

  • bucket boundaries are approximate quantile boundaries;
  • bucket row counts may deviate from exact equi-depth targets;
  • the second pass computes exact observed counts under the chosen boundaries;
  • per-bucket NDV must be marked according to how it is computed. Exact distinct counts may keep exact histogram NDV semantics, while HLL-derived bucket NDV must not be published as exact.

The RFC intentionally keeps the stored histogram format unchanged. The KLL error parameters should be controlled by implementation constants or settings, and documented as rank-error based.

If the user selects the KLL algorithm and specifies an error parameter through settings, table options, or new ANALYZE TABLE syntax, Databend should validate that value before building the analyze pipeline. Invalid values should fail fast rather than silently falling back to a default.

Rollout Plan

  1. Implement an internal KLL sketch state for numeric-like histogram values. The state must support update, merge, serialization, and quantile boundary extraction.

  2. Add analyze pipeline support for an explicit histogram algorithm option. Keep window as a supported algorithm and add kll as an experimental algorithm.

  3. Generate KLL boundaries in phase 1 and fill bucket stats in phase 2. Store the result in TableSnapshotStatistics using the existing histogram format.

  4. Add tests comparing KLL histograms against exact NTILE histograms with tolerance on bucket row counts and boundary ranks.

  5. Evaluate whether the default algorithm should remain window or move to kll after correctness and performance validation. This should be a separate compatibility decision.

Testing Strategy

Unit tests:

  • KLL update and quantile extraction on sorted, reverse-sorted, random, and duplicate-heavy inputs.
  • KLL merge equivalence between single-state and multi-state ingestion.
  • Serialization and deserialization of KLL state.
  • Boundary extraction for empty input and all-equal input.

Analyze tests:

  • Numeric column histogram generation with approximate equi-depth buckets.
  • Duplicate-heavy data where repeated boundaries are expected.
  • Nullable columns: nulls are excluded from histogram buckets.
  • Multi-block and multi-segment tables produce the same shape as single-block tables within tolerance.

Optimizer smoke tests:

  • Existing histogram consumers can read KLL-generated histograms.
  • Range predicate selectivity remains finite and bounded in [0, 1].
  • Join estimation does not produce invalid cardinalities for duplicate-heavy histograms.

Performance tests:

  • Compare current SQL/window path and KLL path on wide and large tables.
  • Track peak memory, spill behavior, and elapsed time.
  • Validate that memory scales with columns * sketch_size, not with table row count.

Compatibility

The final metadata remains compatible with existing histogram readers because TableSnapshotStatistics.histograms is unchanged.

The generated bucket boundaries may differ from the current exact NTILE implementation, but both represent equi-depth-style row distributions. Query plans may change because selectivity estimates may change.

This should be treated as an optimizer statistics change, not a SQL behavior change.

Persisted KLL sketches, if added later, would be additional metadata rather than a replacement for stored histograms. They would improve incremental boundary discovery but would not by themselves make the whole histogram fully incremental, because bucket counts and per-bucket NDV depend on the final boundaries.

Open Questions

  • What KLL error parameter should Databend use by default for 100 histogram buckets?
  • Which configuration surface should be supported first: setting, table option, new ANALYZE TABLE syntax, or a combination of them?
  • What should the default histogram algorithm be during and after the experimental period?
  • Should KLL bucket NDV use exact distinct counting, HLL with non-exact histogram semantics, or an extended accuracy model?
  • Should all-equal input produce one bucket or many equal-boundary buckets?
  • Should string histograms be supported in the first version or deferred?
  • Should KLL sketches be persisted at block or segment level in future metadata?
  • If KLL sketches are persisted, when should Databend reuse them and when should it rebuild them from raw data?
  • Should table-level algorithm options override session settings, or should one-off ANALYZE TABLE syntax always have highest priority?

Alternatives

Keep the Current SQL Path

This requires no new sketch implementation, but keeps the global sort/window cost and memory risk.

Use TDigest

TDigest is useful for approximate percentile estimation, but KLL provides a clearer rank-error model for equi-depth histogram boundaries. Since histograms are fundamentally about rank partitioning, KLL is a better fit.

Build Histograms From Block Min/Max

Block min/max can be read cheaply, but it cannot describe the distribution inside each block well enough to build equi-depth histograms or per-bucket NDV.

One-Pass KLL-Only Histogram

A one-pass implementation could use KLL boundaries and assume uniform bucket counts, but it would not produce reliable num_values or num_distinct. Databend's optimizer already consumes those fields, so this RFC proposes the two-phase design.