Skip to content

feat: aggr stat query at runtime#8046

Draft
discord9 wants to merge 4 commits into
mainfrom
feat/aggr-stat-runtime
Draft

feat: aggr stat query at runtime#8046
discord9 wants to merge 4 commits into
mainfrom
feat/aggr-stat-runtime

Conversation

@discord9
Copy link
Copy Markdown
Contributor

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

What's changed and what's your intention?

PR Checklist

Please convert it to a draft if some of the following conditions are not met.

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR requires documentation updates.
  • API changes are backward compatible.
  • Schema or data changes are backward compatible.

Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
@github-actions github-actions Bot added size/XXL docs-not-required This change does not impact docs. labels Apr 29, 2026
@discord9 discord9 mentioned this pull request Apr 29, 2026
5 tasks
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant optimization for aggregate queries by leveraging file statistics (min/max, null counts, row counts) to compute partial aggregates without scanning actual row data. This is achieved through a new physical optimizer rule, AggrStatsPhysicalRule, which rewrites aggregate query plans into a StatsScanExec for stats-based aggregation and a fallback RegionScanExec for files not covered by statistics. The RegionScanner trait and its implementations are extended to provide file statistics and manage the stats-aware skip logic. Review comments suggest improvements for correctness and efficiency in handling Parquet byte array statistics, better compatibility with older Rust toolchains by replacing is_none_or with map_or, and refining partition expression matching to support unpartitioned regions. Additionally, code hygiene improvements were suggested by removing unnecessary #[allow(dead_code)] and #[allow(unused)] attributes.

Comment on lines +246 to +253
ParquetStats::ByteArray(stats) => {
let bytes = if use_min {
stats.min_bytes_opt()?
} else {
stats.max_bytes_opt()?
};
Some(ScalarValue::Utf8(String::from_utf8(bytes.to_owned()).ok()))
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There are two issues here:

  1. Correctness: If String::from_utf8 fails (e.g., for binary data), it returns None, resulting in ScalarValue::Utf8(None). This represents a SQL NULL. Since the file actually contains data, returning NULL is incorrect. It should return None from parquet_bound_scalar to trigger a fallback to row scanning.
  2. Efficiency: bytes.to_owned() performs an unnecessary allocation. Using std::str::from_utf8 is more efficient as it validates the bytes in place.
Suggested change
ParquetStats::ByteArray(stats) => {
let bytes = if use_min {
stats.min_bytes_opt()?
} else {
stats.max_bytes_opt()?
};
Some(ScalarValue::Utf8(String::from_utf8(bytes.to_owned()).ok()))
}
ParquetStats::ByteArray(stats) => {
let bytes = if use_min {
stats.min_bytes_opt()?
} else {
stats.max_bytes_opt()?
};
let s = std::str::from_utf8(bytes).ok()?;
Some(ScalarValue::Utf8(Some(s.to_string())))
}

Comment on lines +109 to +117
fn matches_partition_expr(
file_partition_expr: Option<&str>,
region_partition_expr: Option<&str>,
) -> bool {
matches!(
(file_partition_expr, region_partition_expr),
(Some(file_expr), Some(region_expr)) if file_expr == region_expr
)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation of matches_partition_expr only returns true if both expressions are Some. This prevents stats-aware skip from working on regions that are not partitioned (where region_partition_expr is None). Comparing the Options directly allows the optimization to work when both are None.

fn matches_partition_expr(
    file_partition_expr: Option<&str>,
    region_partition_expr: Option<&str>,
) -> bool {
    file_partition_expr == region_partition_expr
}

Comment on lines +203 to +207
let should_replace = best.as_ref().is_none_or(|current| {
value
.partial_cmp(current)
.is_some_and(|ordering| ordering == target)
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

is_none_or was stabilized in Rust 1.82. To maintain compatibility with older toolchains (e.g., 1.80 or 1.81), consider using map_or(true, ...) or a match statement.

Suggested change
let should_replace = best.as_ref().is_none_or(|current| {
value
.partial_cmp(current)
.is_some_and(|ordering| ordering == target)
});
let should_replace = best.as_ref().map_or(true, |current| {
value
.partial_cmp(current)
.is_some_and(|ordering| ordering == target)
});

// See the License for the specific language governing permissions and
// limitations under the License.

#![allow(dead_code)]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The #![allow(dead_code)] attribute should be removed if all functions in this file are used. This allows the compiler to help identify truly unused code in the future.

}
}

if columns.first().is_none_or(|column| column.is_empty()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

is_none_or is a very recent addition to the standard library (Rust 1.82). Using map_or(true, ...) ensures better compatibility with older toolchains.

Suggested change
if columns.first().is_none_or(|column| column.is_empty()) {
if columns.first().map_or(true, |column| column.is_empty()) {

}

/// TODO(discord9): support more kind of aggr
#[allow(unused)]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The #[allow(unused)] attribute on RewriteTarget is unnecessary as the enum is used in the optimization logic. It's better to remove it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs-not-required This change does not impact docs. size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant