Databend can optionally generate column histograms during ANALYZE TABLE, but
the current implementation builds each supported column histogram by planning
and executing a SQL query with NTILE(100) OVER (ORDER BY col), followed by
GROUP BY quantile.
This produces row-order equi-depth buckets, but it requires a full column scan,
a global sort/window step, and distinct aggregation for every histogram column.
The implementation is expensive enough that enable_analyze_histogram is
disabled by default.
This RFC proposes adding a selectable histogram generation algorithm for
ANALYZE TABLE. The existing SQL/window implementation remains available, and
a new storage-native KLL algorithm can be selected through settings, table
options, or new ANALYZE TABLE syntax. The KLL builder estimates equi-depth
bucket boundaries with bounded rank error, then computes bucket statistics
through a streaming aggregation pass. The result keeps the existing Histogram
metadata format and optimizer-facing semantics, while avoiding the global sort
and reducing memory pressure when the KLL algorithm is selected.
The current histogram path is implemented in
AnalyzeTableInterpreter. When both enable_analyze_histogram and
enable_table_snapshot_stats are enabled, Databend creates one query per
supported column:
SELECT
quantile,
COUNT(DISTINCT col) AS ndv,
MAX(col) AS max_value,
MIN(col) AS min_value,
COUNT() AS count
FROM (
SELECT
col,
NTILE(100) OVER (ORDER BY col) AS quantile
FROM db.table
WHERE col IS DISTINCT FROM NULL
)
GROUP BY quantile
ORDER BY quantileThis has several problems:
- The window step needs a global order for each analyzed column.
- Large tables may spill or OOM while sorting.
- The work is repeated independently for every histogram column.
- The generated buckets are row-order equi-depth buckets, not disjoint value-domain partitions. Hot values may cross bucket boundaries and produce repeated or overlapping bucket ranges.
The equi-depth model itself is useful for skewed data. The problem is the way Databend currently constructs the buckets.
KLL sketches provide a mergeable, bounded-memory approximation of quantile
positions. They fit Databend's pipeline model better than ORDER BY + NTILE:
workers can build local sketches, merge them, then derive approximate bucket
boundaries without materializing and sorting all values.
- Add a selectable histogram generation algorithm for
ANALYZE TABLE. - Allow users to choose between the current SQL/window algorithm and a new KLL algorithm.
- Generate KLL-based ANALYZE histograms without a global sort/window query per column.
- Keep histogram generation streaming and bounded-memory.
- Keep the existing
HistogramandHistogramBucketmetadata representation. - Preserve the optimizer-facing bucket fields:
lower_bound,upper_bound,num_values, andnum_distinct. - Support mergeable partial states so the implementation works with parallel and distributed execution.
- Allow incremental adoption behind existing or new settings.
- This RFC does not require exact equi-depth bucket boundaries.
- This RFC does not require changing optimizer selectivity formulas in the first implementation.
- This RFC does not require solving fully incremental histogram maintenance in the first implementation.
ANALYZE TABLE has two related but separate statistics paths:
- The NDV path reads segment/block metadata and builds or merges column HLLs. Missing block HLLs can be generated by scanning blocks.
- The histogram path is built by
AnalyzeTableInterpreteras extra query pipelines. Each supported column has a separate SQL query whose result is sent toSinkAnalyzeStatethroughHistogramInfoSink.
SinkAnalyzeState::create_histogram converts the query output rows into
HistogramBuckets. During commit, those buckets are converted into
Histogram::try_from_buckets(true, buckets, None) and stored in
TableSnapshotStatistics.
Supported histogram columns currently follow the same type filter as range statistics:
- Number
- Date
- Timestamp
- TimestampTz
- String
- Decimal
Internally, histogram bounds are stored through Datum variants:
Int, UInt, Float, and Bytes.
Introduce an analyze histogram algorithm choice.
The first two algorithms are:
window: the current implementation based onNTILE(DEFAULT_HISTOGRAM_BUCKETS) OVER (ORDER BY col)andGROUP BY quantile;kll: a new analyze-native streaming implementation based on KLL sketches.
The default algorithm can remain window for compatibility while the KLL
path is experimental. Users can opt into kll through settings, table options,
or new ANALYZE TABLE syntax.
The KLL builder has two phases.
Scan supported histogram columns and update one KLL sketch per column.
In parallel execution, each worker builds local sketches:
worker 1: col_a -> KLL A1, col_b -> KLL B1
worker 2: col_a -> KLL A2, col_b -> KLL B2
worker 3: col_a -> KLL A3, col_b -> KLL B3
Then sketches are merged by column:
merge(A1, A2, A3) -> global KLL for col_a
merge(B1, B2, B3) -> global KLL for col_b
The merge is only across partial states for the same column. Columns remain independent.
After merge, query the global sketch for bucket boundary ranks:
q0, q1, q2, ..., qN
where N is DEFAULT_HISTOGRAM_BUCKETS.
The first implementation can use the existing bucket count of 100.
Scan the same supported columns again and assign each non-null value to a bucket based on the KLL-derived boundaries.
For each bucket, maintain:
- row count;
- minimum observed value;
- maximum observed value;
- per-bucket distinct-value statistics.
At the end, emit the existing histogram bucket representation:
HistogramBucket::try_from_bounds(
lower_bound,
upper_bound,
count as f64,
bucket_ndv as f64,
)The accuracy flag must reflect the semantics of num_distinct. Today,
TypedHistogram::ndv() treats accurate histograms as exact by summing bucket
num_distinct values and returning NdvEstimate::exact. Therefore, if
per-bucket NDV is produced by HLL, Databend must not publish those approximate
values as accuracy = true.
The implementation should choose one of these policies:
- compute exact per-bucket distinct counts and keep
accuracy = true; or - use per-bucket HLL and store the histogram as non-exact, for example
accuracy = false; or - extend the histogram accuracy semantics so bucket NDV can distinguish exact counts from approximate HLL estimates.
The bucket boundaries are approximate quantile boundaries. Bucket row counts are computed from the second pass over actual rows, while bucket NDV exactness depends on the policy above.
KLL can estimate quantile boundaries, but it does not directly provide all histogram bucket fields required by Databend's optimizer.
Databend needs:
lower_bound;upper_bound;num_values;num_distinct.
The first KLL pass gives approximate rank boundaries. The second pass fills the actual bucket statistics under those boundaries. This avoids global sorting while preserving useful bucket payloads.
A one-pass version could assume each bucket has approximately
row_count / bucket_count rows, but that would make bucket counts and NDVs too
weak for selectivity and join estimation.
KLL-derived adjacent boundaries may be equal when a value is very frequent. This is not a bug; it reflects the data distribution.
The bucket assignment logic must define stable behavior for equal boundaries. For example:
- assign values by rank target when possible during the second pass; or
- assign values by boundary intervals with deterministic tie handling; or
- collapse adjacent equal-boundary buckets into a single bucket when bucket ranges cannot be made meaningful.
The first implementation should prefer correctness and stable metadata over
always producing exactly DEFAULT_HISTOGRAM_BUCKETS buckets. Producing fewer
non-empty buckets is acceptable.
The resulting histogram is still an equi-depth approximation, not a disjoint value-domain histogram. Optimizer consumers should continue treating it as a row-distribution summary.
AnalyzeTableInterpreter should choose the histogram builder from the selected
algorithm.
For window, Databend can keep the current query-pipeline based behavior.
For kll, FuseTable::do_analyze can receive a histogram collection mode, or
the KLL histogram builder can be integrated into the existing analyze
source/sink flow.
Potential placement:
- Extend
AnalyzeCollectNDVSourceor add a sibling source that reads projected histogram columns and emits KLL states. - Extend
SinkAnalyzeStateto collect merged KLL states, derive boundaries, request/fill bucket statistics, and commit histograms. - Keep histogram commit in
TableSnapshotStatistics, matching the existing storage format.
No metadata format change is required for the first implementation. The final
stored output remains HashMap<ColumnId, Histogram>.
If future work stores reusable KLL sketches at block or segment level, that should be a separate metadata extension.
This future extension is valuable because KLL sketches are naturally
incremental. Newly written blocks can produce block-level sketches, and table or
segment-level boundary discovery can merge persisted sketches instead of
scanning raw column data again. Even if bucket filling still needs to scan rows
to compute num_values and per-bucket NDV under the final boundaries, persisted
KLL sketches can avoid the first pass used only to discover approximate
equi-depth boundaries.
The implementation can reuse these existing feature gates:
enable_analyze_histogramenable_table_snapshot_stats
In addition, Databend should expose a way to choose the histogram generation algorithm. Possible configuration surfaces:
- Session or global setting:
analyze_histogram_algorithm = 'window' | 'kll'
-
Table option, so a table can keep a stable analyze policy.
-
New
ANALYZE TABLEsyntax, for one-off control:
ANALYZE TABLE db.table WITH HISTOGRAM ALGORITHM = 'kll';The exact syntax can be decided during implementation, but the API should make the algorithm choice explicit. The current SQL/window path should remain a valid algorithm rather than an implicit secondary path.
KLL exposes a tunable accuracy/memory tradeoff. When the selected algorithm is
kll, the implementation should provide a conservative default, but users and
tests should be able to override the error target explicitly.
Possible configuration surfaces:
-
Session or global setting, for example:
analyze_kll_error_rate = 0.01 -
Table option, so frequently analyzed tables can keep a stable histogram accuracy policy.
-
New
ANALYZE TABLEsyntax, for one-off control:ANALYZE TABLE db.table WITH HISTOGRAM ALGORITHM = 'kll', ERROR_RATE = 0.01;
The exact syntax can be decided during implementation. The important API property is that KLL-specific parameters are explicit, validated, and recorded in the analyze path rather than hidden as implementation constants.
The first implementation can be conservative.
Recommended first step:
- numeric types;
- decimal types if converted with the same care as current histogram Datum conversion;
- date and timestamp types through their existing integer-like scalar representation.
String support can come later. KLL over strings is possible for ordered byte values, but it needs careful memory accounting because values have variable length and sketch samples may retain owned bytes.
KLL provides bounded rank error, not exact value error.
For histogram generation, that means:
- bucket boundaries are approximate quantile boundaries;
- bucket row counts may deviate from exact equi-depth targets;
- the second pass computes exact observed counts under the chosen boundaries;
- per-bucket NDV must be marked according to how it is computed. Exact distinct counts may keep exact histogram NDV semantics, while HLL-derived bucket NDV must not be published as exact.
The RFC intentionally keeps the stored histogram format unchanged. The KLL error parameters should be controlled by implementation constants or settings, and documented as rank-error based.
If the user selects the KLL algorithm and specifies an error parameter through
settings, table options, or new ANALYZE TABLE syntax, Databend should validate
that value before building the analyze pipeline. Invalid values should fail fast
rather than silently falling back to a default.
-
Implement an internal KLL sketch state for numeric-like histogram values. The state must support update, merge, serialization, and quantile boundary extraction.
-
Add analyze pipeline support for an explicit histogram algorithm option. Keep
windowas a supported algorithm and addkllas an experimental algorithm. -
Generate KLL boundaries in phase 1 and fill bucket stats in phase 2. Store the result in
TableSnapshotStatisticsusing the existing histogram format. -
Add tests comparing KLL histograms against exact
NTILEhistograms with tolerance on bucket row counts and boundary ranks. -
Evaluate whether the default algorithm should remain
windowor move tokllafter correctness and performance validation. This should be a separate compatibility decision.
Unit tests:
- KLL update and quantile extraction on sorted, reverse-sorted, random, and duplicate-heavy inputs.
- KLL merge equivalence between single-state and multi-state ingestion.
- Serialization and deserialization of KLL state.
- Boundary extraction for empty input and all-equal input.
Analyze tests:
- Numeric column histogram generation with approximate equi-depth buckets.
- Duplicate-heavy data where repeated boundaries are expected.
- Nullable columns: nulls are excluded from histogram buckets.
- Multi-block and multi-segment tables produce the same shape as single-block tables within tolerance.
Optimizer smoke tests:
- Existing histogram consumers can read KLL-generated histograms.
- Range predicate selectivity remains finite and bounded in
[0, 1]. - Join estimation does not produce invalid cardinalities for duplicate-heavy histograms.
Performance tests:
- Compare current SQL/window path and KLL path on wide and large tables.
- Track peak memory, spill behavior, and elapsed time.
- Validate that memory scales with
columns * sketch_size, not with table row count.
The final metadata remains compatible with existing histogram readers because
TableSnapshotStatistics.histograms is unchanged.
The generated bucket boundaries may differ from the current exact NTILE
implementation, but both represent equi-depth-style row distributions. Query
plans may change because selectivity estimates may change.
This should be treated as an optimizer statistics change, not a SQL behavior change.
Persisted KLL sketches, if added later, would be additional metadata rather than a replacement for stored histograms. They would improve incremental boundary discovery but would not by themselves make the whole histogram fully incremental, because bucket counts and per-bucket NDV depend on the final boundaries.
- What KLL error parameter should Databend use by default for 100 histogram buckets?
- Which configuration surface should be supported first: setting, table option,
new
ANALYZE TABLEsyntax, or a combination of them? - What should the default histogram algorithm be during and after the experimental period?
- Should KLL bucket NDV use exact distinct counting, HLL with non-exact histogram semantics, or an extended accuracy model?
- Should all-equal input produce one bucket or many equal-boundary buckets?
- Should string histograms be supported in the first version or deferred?
- Should KLL sketches be persisted at block or segment level in future metadata?
- If KLL sketches are persisted, when should Databend reuse them and when should it rebuild them from raw data?
- Should table-level algorithm options override session settings, or should
one-off
ANALYZE TABLEsyntax always have highest priority?
This requires no new sketch implementation, but keeps the global sort/window cost and memory risk.
TDigest is useful for approximate percentile estimation, but KLL provides a clearer rank-error model for equi-depth histogram boundaries. Since histograms are fundamentally about rank partitioning, KLL is a better fit.
Block min/max can be read cheaply, but it cannot describe the distribution inside each block well enough to build equi-depth histograms or per-bucket NDV.
A one-pass implementation could use KLL boundaries and assume uniform bucket
counts, but it would not produce reliable num_values or num_distinct.
Databend's optimizer already consumes those fields, so this RFC proposes the
two-phase design.