feat: add aggregate-stats foundation types and helpers by discord9 · Pull Request #8112 · GreptimeTeam/greptimedb

discord9 · 2026-05-13T10:52:19Z

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

Split from feat: aggr stat query at runtime #8046 (aggregate-stats runtime)
Tracking issue: tracking: aggregate-stats scan optimization #8126
This is PR1 of a multi-PR split: data types and foundation only

What's changed and what's your intention?

Add the aggregate-stats type system and foundation helpers required by the aggregate-stats runtime feature (PR2/PR3). No execution path is wired yet.

store-api

New aggregate-stats data types in region_engine.rs:

SupportedStatAggr — optimizer-visible requirement enum (CountRows, CountNonNull, MinValue, MaxValue)
StatsAwareFallbackReason — why a file stays on row-scan path
StatsAwareFileDecision + StatsAwareFileDecisionSnapshot — per-file decisions and display summary
FileStatsItem / RowGroupStatsItem — parquet-metadata-backed file and row-group structures
SendableFileStatsStream — type alias for async stats transport

common-query

New aggr_stats module that interprets parquet row-group metadata into typed column statistics:

FileColumnStats / StatsCandidateFile — collects null counts, min/max per column from row groups
Requirement matching and partition-expression filtering
Arrow DataType-driven conversion from parquet physical stats to logical scalar values (handles Timestamp, Date32/64, etc.)
Bounds checks for schema-evolved parquet files (column may be absent in older files)

table

RowGroupPruningStatistics — PruningStatistics implementation backed by parquet row groups
Predicate stats module promoted from #[cfg(test)] to pub(crate), with matching logical-type-aware conversion

What's NOT in this PR (deferred)

StatsAwareFileDecisionState (runtime mutex cache)
Scanner Properties / PrepareRequest stats fields
RegionScanner::scan_stats trait method and implementations
Optimizer AggrStats rewrite rule
SQLness tests

PR Checklist

I have written the necessary rustdoc comments.
I have added the necessary unit tests and integration tests.
This PR requires documentation updates.
API changes are backward compatible.
Schema or data changes are backward compatible.

gemini-code-assist

Code Review

This pull request introduces infrastructure for aggregate statistics, including new data structures like FileStatsItem and StatsCandidateFile to support efficient query execution by leveraging Parquet metadata. The changes also include updates to the RowGroupPruningStatistics logic to handle various data types more robustly. The review identified a potential compilation issue with the use of is_none_or in older Rust versions and suggested refactoring the duplicated logic between aggr_stats.rs and predicate/stats.rs into a shared helper function to improve maintainability.

gemini-code-assist · 2026-05-13T10:59:57Z

+        let should_replace = best.as_ref().is_none_or(|current| {
+            value
+                .partial_cmp(current)
+                .is_some_and(|ordering| ordering == target)
+        });


The method is_none_or is not a standard method on Option<T> in older Rust versions and will cause a compilation error. You can achieve this with map_or(true, |...|) or a match statement.

Suggested change

let should_replace = best.as_ref().is_none_or(|current| {

value

.partial_cmp(current)

.is_some_and(|ordering| ordering == target)

});

let should_replace = best.as_ref().map_or(true, |current| {

value

.partial_cmp(current)

.is_some_and(|ordering| ordering == target)

});

gemini-code-assist · 2026-05-13T10:59:57Z

+fn parquet_bound_scalar(
+    stats: &ParquetStats,
+    target: Ordering,
+    arrow_type: &DataType,
+) -> Option<ScalarValue> {
+    let use_min = target == Ordering::Less;
+
+    match stats {
+        ParquetStats::Boolean(stats) => {
+            if !matches!(arrow_type, DataType::Boolean) {
+                return None;
+            }
+            Some(ScalarValue::Boolean(Some(if use_min {
+                *stats.min_opt()?
+            } else {
+                *stats.max_opt()?
+            })))
+        }
+        ParquetStats::Int32(stats) => {
+            let raw = if use_min {
+                *stats.min_opt()?
+            } else {
+                *stats.max_opt()?
+            };
+            match arrow_type {
+                DataType::Int32
+                | DataType::UInt32
+                | DataType::Int16
+                | DataType::UInt16
+                | DataType::Int8
+                | DataType::UInt8
+                | DataType::Time32(_) => Some(ScalarValue::Int32(Some(raw))),
+                DataType::Date32 => Some(ScalarValue::Date32(Some(raw))),
+                _ => None,
+            }
+        }
+        ParquetStats::Int64(stats) => {
+            let raw = if use_min {
+                *stats.min_opt()?
+            } else {
+                *stats.max_opt()?
+            };
+            match arrow_type {
+                DataType::Int64 | DataType::UInt64 => Some(ScalarValue::Int64(Some(raw))),
+                DataType::Timestamp(TimeUnit::Second, _) => {
+                    Some(ScalarValue::TimestampSecond(Some(raw), None))
+                }
+                DataType::Timestamp(TimeUnit::Millisecond, _) => {
+                    Some(ScalarValue::TimestampMillisecond(Some(raw), None))
+                }
+                DataType::Timestamp(TimeUnit::Microsecond, _) => {
+                    Some(ScalarValue::TimestampMicrosecond(Some(raw), None))
+                }
+                DataType::Timestamp(TimeUnit::Nanosecond, _) => {
+                    Some(ScalarValue::TimestampNanosecond(Some(raw), None))
+                }
+                DataType::Date64 => Some(ScalarValue::Date64(Some(raw))),
+                DataType::Duration(_) => Some(ScalarValue::Int64(Some(raw))),
+                _ => None,
+            }
+        }
+        ParquetStats::Int96(_) => None,
+        ParquetStats::Float(stats) => {
+            if !matches!(arrow_type, DataType::Float32) {
+                return None;
+            }
+            Some(ScalarValue::Float32(Some(if use_min {
+                *stats.min_opt()?
+            } else {
+                *stats.max_opt()?
+            })))
+        }
+        ParquetStats::Double(stats) => {
+            if !matches!(arrow_type, DataType::Float64) {
+                return None;
+            }
+            Some(ScalarValue::Float64(Some(if use_min {
+                *stats.min_opt()?
+            } else {
+                *stats.max_opt()?
+            })))
+        }
+        ParquetStats::ByteArray(stats) => {
+            let bytes = if use_min {
+                stats.min_bytes_opt()?
+            } else {
+                stats.max_bytes_opt()?
+            };
+            match arrow_type {
+                DataType::Utf8 | DataType::LargeUtf8 => String::from_utf8(bytes.to_owned())
+                    .ok()
+                    .map(|s| ScalarValue::Utf8(Some(s))),
+                _ => None,
+            }
+        }
+        ParquetStats::FixedLenByteArray(_) => None,
+    }
+}


The logic in parquet_bound_scalar for converting Parquet statistics to a datafusion::scalar::ScalarValue is nearly identical to the logic inside the impl_min_max_values! macro in src/table/src/predicate/stats.rs. To improve maintainability and reduce code duplication, refactor this logic into a shared helper function.

pub fn parquet_stats_to_scalar( stats: &ParquetStats, data_type: &DataType, target: Ordering, ) -> Option<ScalarValue>

References

Refactor duplicated logic into shared helper functions to improve maintainability and reduce code duplication.

- Add SupportedStatAggr, StatsAwareFallbackReason, StatsAwareFileDecision, StatsAwareFileDecisionSnapshot, FileStatsItem, RowGroupStatsItem, and SendableFileStatsStream to store-api as pure data types - Add aggr_stats foundation module to common-query with requirement matching, partition filtering, and column-stats collection from parquet metadata - Add RowGroupPruningStatistics to table::predicate for parquet row-group pruning - Promote table predicate stats from cfg(test) to pub(crate) - Drive parquet physical-to-logical scalar conversion with Arrow DataType - Add bounds checks for schema-evolved parquet columns This is PR1 of a multi-PR split: data types and foundation only. No scanner runtime, optimizer rewrites, or sqlness. Signed-off-by: discord9 <discord9@163.com>

github-actions Bot added size/M docs-not-required This change does not impact docs. labels May 13, 2026

gemini-code-assist Bot reviewed May 13, 2026

View reviewed changes

discord9 force-pushed the feat/aggr-stat-foundation-pr1 branch from 270466f to 40011c5 Compare May 14, 2026 03:33

discord9 marked this pull request as ready for review May 15, 2026 03:36

discord9 requested review from a team, evenyag and v0y4g3r as code owners May 15, 2026 03:36

discord9 force-pushed the feat/aggr-stat-foundation-pr1 branch from 40011c5 to 9b379a1 Compare May 15, 2026 11:54

github-actions Bot added size/L and removed size/M labels May 15, 2026

discord9 mentioned this pull request May 18, 2026

tracking: aggregate-stats scan optimization #8126

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add aggregate-stats foundation types and helpers#8112

feat: add aggregate-stats foundation types and helpers#8112
discord9 wants to merge 1 commit into
GreptimeTeam:mainfrom
discord9:feat/aggr-stat-foundation-pr1

discord9 commented May 13, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 13, 2026

Uh oh!

gemini-code-assist Bot May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

discord9 commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Refer to a related PR or issue link (optional)

What's changed and what's your intention?

store-api

common-query

table

What's NOT in this PR (deferred)

PR Checklist

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

discord9 commented May 13, 2026 •

edited

Loading