feat: integrate aisle for metadata-driven Parquet pruning #568

Merged
ethe merged 16 commits into dev from feat/aisle-migration
Jan 27, 2026

@belveryin (Collaborator) commented Jan 11, 2026

Summary

Integrates aisle for metadata-driven Parquet pruning, enabling significant I/O reduction for selective queries by evaluating filter predicates against Parquet metadata before reading data.

Key Changes

  • Row-group pruning: Skip entire row groups based on min/max statistics
  • Page-level pruning: Fine-grained skipping using column/offset indexes
  • Row filtering: Push predicates into Parquet reader for in-stream filtering
  • Commit timestamp pruning: Skip SSTs entirely when min_commit_ts > read_ts
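
To make the row-group and commit-timestamp bullets concrete, here is a minimal, self-contained sketch of the idea. It does not use aisle or the parquet crate: `RowGroupStats`, `Pred`, `may_match`, `select_row_groups`, and `sst_visible` are hypothetical names invented for illustration, and the statistics are reduced to a single `i64` column.

```rust
/// Hypothetical min/max statistics for one i64 column in one row group.
#[derive(Debug, Clone, Copy)]
struct RowGroupStats {
    min: i64,
    max: i64,
}

/// Hypothetical predicate over that column.
#[derive(Debug, Clone, Copy)]
enum Pred {
    Eq(i64),
    GtEq(i64),
}

/// Returns false only when the statistics prove that no row can match.
fn may_match(stats: RowGroupStats, pred: Pred) -> bool {
    match pred {
        Pred::Eq(v) => stats.min <= v && v <= stats.max,
        Pred::GtEq(v) => stats.max >= v,
    }
}

/// Indexes of the row groups that still have to be read.
fn select_row_groups(groups: &[RowGroupStats], pred: Pred) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, stats)| may_match(**stats, pred))
        .map(|(index, _)| index)
        .collect()
}

/// An SST whose smallest commit timestamp is newer than the read
/// timestamp holds nothing visible to the reader and can be skipped.
fn sst_visible(min_commit_ts: u64, read_ts: u64) -> bool {
    min_commit_ts <= read_ts
}

fn main() {
    let groups = [
        RowGroupStats { min: 0, max: 49 },
        RowGroupStats { min: 50, max: 99 },
        RowGroupStats { min: 100, max: 149 },
    ];
    println!("{:?}", select_row_groups(&groups, Pred::Eq(80)));
    println!("{:?}", select_row_groups(&groups, Pred::GtEq(60)));
    println!("{}", sst_visible(120, 100));
}
```

Pruning of this kind is conservative: a surviving row group may still contain no matching rows, which is why the PR also pushes a row filter into the reader.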

API Changes (Breaking)

// Before
Predicate::eq(ColumnRef::new("score"), ScalarValue::from(80_i64))
Predicate::gte(ColumnRef::new("age"), ScalarValue::from(18))

// After
Expr::eq("score", ScalarValue::from(80_i64))
Expr::gt_eq("age", ScalarValue::from(18))

  • Predicate → Expr (re-exported from aisle)
  • ColumnRef::new("col") → "col" (direct string)
  • ScalarValue now from datafusion_common
  • Method renames: gte → gt_eq, lte → lt_eq
  • expr.not() → Expr::not(expr)

Dependencies

  • Added aisle with row_filter feature
  • Added datafusion-common for ScalarValue
  • Upgraded arrow/parquet 56 → 57, fusio 0.5 → 0.6
  • Removed tonbo-predicate crate

Implementation Details

  1. SSTable writes now emit page-level statistics and offset indexes
  2. Scan planning pre-computes row group/page selections and caches Parquet metadata
  3. Residual evaluation rewritten with precise int/float comparisons (no precision loss)
  4. Delete sidecars reuse metadata from plan phase
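
Item 3 mentions int/float comparisons without precision loss. The PR's actual implementation is not shown here, so the following is only a sketch of one common way to do it: compare an `i64` with an `f64` without casting the integer to a float, since that cast rounds values above 2^53. The function name `cmp_i64_f64` is hypothetical.

```rust
use std::cmp::Ordering;

/// Compares an i64 with an f64 exactly. Returns None when `f` is NaN.
fn cmp_i64_f64(i: i64, f: f64) -> Option<Ordering> {
    if f.is_nan() {
        return None;
    }
    // 2^63 is exactly representable as f64 and exceeds every i64.
    const TWO_63: f64 = 9_223_372_036_854_775_808.0;
    if f >= TWO_63 {
        return Some(Ordering::Less);
    }
    if f < -TWO_63 {
        return Some(Ordering::Greater);
    }
    // Here -2^63 <= f < 2^63, so the truncated value fits in i64 exactly.
    let truncated = f.trunc();
    let whole = truncated as i64;
    match i.cmp(&whole) {
        // Same integer part: the fractional part of `f` decides.
        Ordering::Equal => truncated.partial_cmp(&f),
        other => Some(other),
    }
}

fn main() {
    let i: i64 = 9_007_199_254_740_993; // 2^53 + 1
    let f: f64 = 9_007_199_254_740_992.0; // 2^53
    // The naive cast rounds `i` to 2^53 and reports equality.
    println!("naive equal: {}", (i as f64) == f);
    println!("exact: {:?}", cmp_i64_f64(i, f));
}
```

The design choice is to move the float into integer space only after range checks, so every intermediate value is exact.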

Test Plan

  • Existing tests pass (226 tests)
  • New tests for row-group/page pruning
  • New tests for commit_ts pruning at plan time
  • New tests for missing page index errors
  • New tests for limit with residual predicates
  • New tests for row filter with tombstones

Future Work

  • Query metrics: Track pruning effectiveness (row groups/pages skipped, I/O bytes saved)
  • Benchmarks: Measure I/O reduction and latency improvements for selective queries at varying selectivity levels (1%, 10%, 50%)

@belveryin requested review from ethe and removed request for ethe January 11, 2026 20:23
@belveryin marked this pull request as ready for review January 11, 2026 20:39
if scalar_matches_column(schema, column, value) {
    Some(predicate.clone())
} else {
    eprintln!(

Member

eprintln should be avoided; using a logger or returning a Result explicitly would be better.
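
A minimal sketch of the "return a Result" option, assuming a type check similar to the one in the snippet above. The types `ColumnType`, `Scalar`, and `TypeMismatch` and the function `check_literal` are hypothetical stand-ins, not the PR's code; the real check takes a schema and a `ScalarValue`.

```rust
use std::fmt;

/// Hypothetical column type, standing in for a schema field's data type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Int64,
    Utf8,
}

/// Hypothetical literal, standing in for `ScalarValue`.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int64(i64),
    Utf8(String),
}

/// Error returned to the caller instead of being printed to stderr.
#[derive(Debug, PartialEq)]
struct TypeMismatch {
    column: String,
    expected: ColumnType,
    found: Scalar,
}

impl fmt::Display for TypeMismatch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "literal {:?} does not match column `{}` of type {:?}",
            self.found, self.column, self.expected
        )
    }
}

impl std::error::Error for TypeMismatch {}

/// Accepts the literal when its type matches the column, else reports why not.
fn check_literal(
    column: &str,
    expected: ColumnType,
    value: Scalar,
) -> Result<Scalar, TypeMismatch> {
    let matches_type = matches!(
        (expected, &value),
        (ColumnType::Int64, Scalar::Int64(_)) | (ColumnType::Utf8, Scalar::Utf8(_))
    );
    if matches_type {
        Ok(value)
    } else {
        Err(TypeMismatch { column: column.to_string(), expected, found: value })
    }
}

fn main() {
    match check_literal("age", ColumnType::Int64, Scalar::Utf8("18".into())) {
        Ok(value) => println!("accepted {:?}", value),
        Err(err) => println!("rejected: {err}"),
    }
}
```

Returning the error lets the caller decide whether to fail the scan, drop the predicate from pruning, or log it through whatever logging facade the project uses.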

    inclusive,
} => self.evaluate_between(column, low, high, *inclusive, row),
Expr::InList { column, values } => self.evaluate_in_list(column, values, row),
Expr::BloomFilterEq { .. } | Expr::BloomFilterInList { .. } => Ok(TriState::True),

Member

The residual evaluation treats BloomFilterEq and BloomFilterInList as always true, which would produce incorrect results if these expressions were exposed publicly. Either restrict their exposure or implement their true semantics.
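
A small sketch of why "always true" is unsafe in a residual evaluator. It uses a hypothetical three-variant `TriState` and a toy `Expr` over a single `i64` value; only the name `TriState::True` and the bloom-filter variant names come from the snippet above, everything else is assumed for illustration.

```rust
/// Hypothetical three-valued logic result.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TriState {
    True,
    False,
    Unknown,
}

impl TriState {
    fn not(self) -> TriState {
        match self {
            TriState::True => TriState::False,
            TriState::False => TriState::True,
            TriState::Unknown => TriState::Unknown,
        }
    }
}

/// Toy expression over one i64 column (not aisle's `Expr`).
enum Expr {
    Eq(i64),
    BloomFilterEq(i64),
    Not(Box<Expr>),
}

/// Residual evaluation that mirrors the reviewed behavior:
/// the bloom-filter variant is treated as always true.
fn evaluate(expr: &Expr, row: i64) -> TriState {
    match expr {
        Expr::Eq(v) => {
            if row == *v {
                TriState::True
            } else {
                TriState::False
            }
        }
        Expr::BloomFilterEq(_) => TriState::True,
        Expr::Not(inner) => evaluate(inner, row).not(),
    }
}

fn main() {
    let row = 7;
    // A row that does not equal 5 is kept anyway.
    println!("{:?}", evaluate(&Expr::BloomFilterEq(5), row));
    // Under NOT, the same row is dropped even though 7 != 5 holds.
    println!("{:?}", evaluate(&Expr::Not(Box::new(Expr::BloomFilterEq(5))), row));
    // The plain equality gives the expected answer.
    println!("{:?}", evaluate(&Expr::Not(Box::new(Expr::Eq(5))), row));
}
```

Treating the bloom-filter variants as true is harmless only while they are used purely as pruning hints that a user cannot construct or negate, which is the restriction the comment asks for.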

@ethe (Member) commented Jan 26, 2026

The direction is basically good, but there are some blockers we should fix before we merge this.

@belveryin
Collaborator Author

addressed comments @ethe

@belveryin requested a review from ethe January 27, 2026 13:13
@ethe merged commit ef9c2c1 into dev Jan 27, 2026
6 checks passed
@ethe deleted the feat/aisle-migration branch January 27, 2026 17:54