feat: aggr stat query at runtime by discord9 · Pull Request #8046 · GreptimeTeam/greptimedb

discord9 · 2026-04-29T03:09:54Z

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

What's changed and what's your intention?

PR Checklist

Please convert it to a draft if some of the following conditions are not met.

I have written the necessary rustdoc comments.
I have added the necessary unit tests and integration tests.
This PR requires documentation updates.
API changes are backward compatible.
Schema or data changes are backward compatible.

Signed-off-by: discord9 <discord9@163.com>

gemini-code-assist

Code Review

This pull request introduces a significant optimization for aggregate queries by leveraging file statistics (min/max, null counts, row counts) to compute partial aggregates without scanning actual row data. This is achieved through a new physical optimizer rule, AggrStatsPhysicalRule, which rewrites aggregate query plans into a StatsScanExec for stats-based aggregation and a fallback RegionScanExec for files not covered by statistics. The RegionScanner trait and its implementations are extended to provide file statistics and manage the stats-aware skip logic. Review comments suggest improvements for correctness and efficiency in handling Parquet byte array statistics, better compatibility with older Rust toolchains by replacing is_none_or with map_or, and refining partition expression matching to support unpartitioned regions. Additionally, code hygiene improvements were suggested by removing unnecessary #[allow(dead_code)] and #[allow(unused)] attributes.

gemini-code-assist · 2026-04-29T03:16:19Z

+        ParquetStats::ByteArray(stats) => {
+            let bytes = if use_min {
+                stats.min_bytes_opt()?
+            } else {
+                stats.max_bytes_opt()?
+            };
+            Some(ScalarValue::Utf8(String::from_utf8(bytes.to_owned()).ok()))
+        }


There are two issues here:

Correctness: If String::from_utf8 fails (e.g., for binary data), it returns None, resulting in ScalarValue::Utf8(None). This represents a SQL NULL. Since the file actually contains data, returning NULL is incorrect. It should return None from parquet_bound_scalar to trigger a fallback to row scanning.

Efficiency: bytes.to_owned() performs an unnecessary allocation. Using std::str::from_utf8 is more efficient as it validates the bytes in place.

Suggested change

ParquetStats::ByteArray(stats) => {

let bytes = if use_min {

stats.min_bytes_opt()?

} else {

stats.max_bytes_opt()?

};

Some(ScalarValue::Utf8(String::from_utf8(bytes.to_owned()).ok()))

}

ParquetStats::ByteArray(stats) => {

let bytes = if use_min {

stats.min_bytes_opt()?

} else {

stats.max_bytes_opt()?

};

let s = std::str::from_utf8(bytes).ok()?;

Some(ScalarValue::Utf8(Some(s.to_string())))

}

gemini-code-assist · 2026-04-29T03:16:19Z

+fn matches_partition_expr(
+    file_partition_expr: Option<&str>,
+    region_partition_expr: Option<&str>,
+) -> bool {
+    matches!(
+        (file_partition_expr, region_partition_expr),
+        (Some(file_expr), Some(region_expr)) if file_expr == region_expr
+    )
+}


The current implementation of matches_partition_expr only returns true if both expressions are Some. This prevents stats-aware skip from working on regions that are not partitioned (where region_partition_expr is None). Comparing the Options directly allows the optimization to work when both are None.

fn matches_partition_expr( file_partition_expr: Option<&str>, region_partition_expr: Option<&str>, ) -> bool { file_partition_expr == region_partition_expr }

gemini-code-assist · 2026-04-29T03:16:19Z

+        let should_replace = best.as_ref().is_none_or(|current| {
+            value
+                .partial_cmp(current)
+                .is_some_and(|ordering| ordering == target)
+        });


is_none_or was stabilized in Rust 1.82. To maintain compatibility with older toolchains (e.g., 1.80 or 1.81), consider using map_or(true, ...) or a match statement.

Suggested change

let should_replace = best.as_ref().is_none_or(|current| {

value

.partial_cmp(current)

.is_some_and(|ordering| ordering == target)

});

let should_replace = best.as_ref().map_or(true, |current| {

value

.partial_cmp(current)

.is_some_and(|ordering| ordering == target)

});

gemini-code-assist · 2026-04-29T03:16:19Z

+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#![allow(dead_code)]


The #![allow(dead_code)] attribute should be removed if all functions in this file are used. This allows the compiler to help identify truly unused code in the future.

gemini-code-assist · 2026-04-29T03:16:20Z

+        }
+    }
+
+    if columns.first().is_none_or(|column| column.is_empty()) {


is_none_or is a very recent addition to the standard library (Rust 1.82). Using map_or(true, ...) ensures better compatibility with older toolchains.

Suggested change

if columns.first().is_none_or(|column| column.is_empty()) {

if columns.first().map_or(true, |column| column.is_empty()) {

gemini-code-assist · 2026-04-29T03:16:20Z

+}
+
+/// TODO(discord9): support more kind of aggr
+#[allow(unused)]


The #[allow(unused)] attribute on RewriteTarget is unnecessary as the enum is used in the optimization logic. It's better to remove it.

discord9 added 4 commits April 23, 2026 12:41

feat: stats scan support

18590e3

Signed-off-by: discord9 <discord9@163.com>

feat: region scan skip&rewrite

08c66ab

Signed-off-by: discord9 <discord9@163.com>

feat: support flat fields

481b21e

Signed-off-by: discord9 <discord9@163.com>

refactor: per ai review

97993ce

Signed-off-by: discord9 <discord9@163.com>

github-actions Bot added size/XXL docs-not-required This change does not impact docs. labels Apr 29, 2026

discord9 mentioned this pull request Apr 29, 2026

feat: aggr stats partial eval #7995

Draft

5 tasks

gemini-code-assist Bot reviewed Apr 29, 2026

View reviewed changes

This was referenced May 13, 2026

feat: add aggregate-stats foundation types and helpers #8112

Open

tracking: aggregate-stats scan optimization #8126

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: aggr stat query at runtime#8046

feat: aggr stat query at runtime#8046
discord9 wants to merge 4 commits into
mainfrom
feat/aggr-stat-runtime

discord9 commented Apr 29, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	if columns.first().is_none_or(\|column\| column.is_empty()) {
	if columns.first().map_or(true, \|column\| column.is_empty()) {

Conversation

discord9 commented Apr 29, 2026

Refer to a related PR or issue link (optional)

What's changed and what's your intention?

PR Checklist

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant