Skip to content

feat: globally reorder files and row groups by statistics for TopK queries#21956

Open
zhuqi-lucas wants to merge 1 commit intoapache:mainfrom
zhuqi-lucas:feat/rg-reorder-by-statistics
Open

feat: globally reorder files and row groups by statistics for TopK queries#21956
zhuqi-lucas wants to merge 1 commit intoapache:mainfrom
zhuqi-lucas:feat/rg-reorder-by-statistics

Conversation

@zhuqi-lucas
Copy link
Copy Markdown
Contributor

@zhuqi-lucas zhuqi-lucas commented Apr 30, 2026

Which issue does this PR close?

Rationale for this change

TopK queries (ORDER BY col LIMIT K) on parquet files with multiple out-of-order row groups are suboptimal — the dynamic filter threshold converges slowly because row groups are read in arbitrary order. By reordering row groups so the "best" ones (containing optimal values) are read first, the threshold tightens quickly and subsequent row groups are pruned at runtime.

What changes are included in this PR?

Row group reorder by statistics:

  • PreparedAccessPlan::reorder_by_statistics(): sorts row groups by min values (ASC) using parquet column statistics. Direction (DESC) is handled by existing reverse() applied after reorder. The two compose correctly for both sorted and unsorted data.
  • AccessPlanOptimizer trait: extensible interface for row group access plan optimizations applied after pruning.

DynamicFilter sort metadata:

  • DynamicFilterPhysicalExpr gains sort_options and fetch fields, set by SortExec::create_filter(). This lets the parquet reader determine reorder direction for any TopK query (not just sort-pushdown path).
  • Fix: SortExec::with_fetch now sets fetch before calling create_filter() so the DynamicFilter gets the correct K value.

File-level reorder in shared work queue:

  • FileSource::reorder_files() trait method + parquet implementation: reorders files in the shared work queue by statistics so multi-file TopK reads the most promising files first across all partitions.

Are these changes tested?

  • SLT tests: Tests H (mixed RGs), J (scrambled non-overlapping), K (overlapping), L (multi-key ORDER BY), P (multi-file reorder)
  • All existing sort_pushdown SLTs pass
  • 98 parquet lib unit tests pass
  • Clippy clean, rustdoc clean

Are there any user-facing changes?

No API changes. TopK queries on parquet with multiple row groups will automatically benefit from better row group ordering. This is a performance optimization only — query results are unchanged.

Copilot AI review requested due to automatic review settings April 30, 2026 09:23
@github-actions github-actions Bot added physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt) datasource Changes to the datasource crate physical-plan Changes to the physical-plan crate labels Apr 30, 2026
@zhuqi-lucas zhuqi-lucas force-pushed the feat/rg-reorder-by-statistics branch from 864a3d3 to 6e56cae Compare April 30, 2026 09:28
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves TopK (ORDER BY ... LIMIT K) performance for Parquet scans by using statistics to reorder work (files and row groups) so the dynamic filter threshold converges faster and more data can be pruned during execution.

Changes:

  • Propagate TopK sort metadata (sort_options, fetch) via DynamicFilterPhysicalExpr and fix SortExec::with_fetch ordering so the dynamic filter sees the correct K.
  • Add file-level reordering hook (FileSource::reorder_files) and implement statistics-based file reordering for Parquet.
  • Add row-group access plan optimization plumbing and statistics-based row-group reordering (composable with reverse scanning), plus SLT coverage.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
datafusion/sqllogictest/test_files/sort_pushdown.slt Adds SLT coverage for row-group/file reordering behavior across multiple TopK scenarios
datafusion/physical-plan/src/sorts/sort.rs Passes sort options + fetch into dynamic filter and adjusts fetch/filter initialization order
datafusion/physical-expr/src/expressions/dynamic_filters.rs Extends DynamicFilterPhysicalExpr with optional sort metadata and fetch limit
datafusion/datasource/src/file_stream/work_source.rs Calls FileSource::reorder_files before queueing shared work
datafusion/datasource/src/file.rs Introduces FileSource::reorder_files default extension point
datafusion/datasource-parquet/src/source.rs Implements Parquet file reordering using column statistics; wires fallback sort info
datafusion/datasource-parquet/src/opener.rs Applies row-group access plan optimizers (reorder + reverse) based on TopK metadata
datafusion/datasource-parquet/src/mod.rs Registers new access plan optimizer module
datafusion/datasource-parquet/src/access_plan_optimizer.rs Adds AccessPlanOptimizer trait plus ReverseRowGroups / ReorderByStatistics implementations
datafusion/datasource-parquet/src/access_plan.rs Adds PreparedAccessPlan::reorder_by_statistics (min-stat based RG ordering)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread datafusion/physical-plan/src/sorts/sort.rs
Comment thread datafusion/datasource/src/file_stream/work_source.rs
Comment thread datafusion/datasource-parquet/src/opener.rs
Comment thread datafusion/datasource-parquet/src/opener.rs
Comment thread datafusion/sqllogictest/test_files/sort_pushdown.slt Outdated
@zhuqi-lucas zhuqi-lucas force-pushed the feat/rg-reorder-by-statistics branch from 6e56cae to f0f4058 Compare April 30, 2026 09:34
@zhuqi-lucas zhuqi-lucas marked this pull request as draft April 30, 2026 09:36
@zhuqi-lucas zhuqi-lucas force-pushed the feat/rg-reorder-by-statistics branch 3 times, most recently from 587297d to 235c4e1 Compare April 30, 2026 09:46
@zhuqi-lucas zhuqi-lucas changed the title feat: reorder row groups by statistics for TopK queries feat: reorder files and row groups by statistics for TopK queries Apr 30, 2026
@zhuqi-lucas zhuqi-lucas changed the title feat: reorder files and row groups by statistics for TopK queries feat: globally reorder files and row groups by statistics for TopK queries Apr 30, 2026
@zhuqi-lucas zhuqi-lucas marked this pull request as ready for review April 30, 2026 09:47
@zhuqi-lucas
Copy link
Copy Markdown
Contributor Author

run benchmark sort_pushdown_inexact

@adriangbot
Copy link
Copy Markdown

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4351456416-1947-jscm6 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing feat/rg-reorder-by-statistics (235c4e1) to 0144570 (merge-base) diff using: sort_pushdown_inexact
Results will be posted here when complete


File an issue against this benchmark runner

@zhuqi-lucas zhuqi-lucas force-pushed the feat/rg-reorder-by-statistics branch from 235c4e1 to 0c9c3d8 Compare April 30, 2026 10:02
@github-actions github-actions Bot added the core Core DataFusion crate label Apr 30, 2026
@adriangbot
Copy link
Copy Markdown

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

Comparing HEAD and feat_rg-reorder-by-statistics
--------------------
Benchmark sort_pushdown_inexact.json
--------------------
┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃                           HEAD ┃  feat_rg-reorder-by-statistics ┃        Change ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1    │    7.22 / 8.08 ±0.90 / 9.82 ms │   6.71 / 7.47 ±1.27 / 10.00 ms │ +1.08x faster │
│ Q2    │    6.79 / 6.88 ±0.08 / 7.01 ms │    6.87 / 7.08 ±0.23 / 7.48 ms │     no change │
│ Q3    │ 21.76 / 22.52 ±0.64 / 23.37 ms │ 21.99 / 22.88 ±0.57 / 23.66 ms │     no change │
│ Q4    │ 20.82 / 21.86 ±0.84 / 22.82 ms │ 20.73 / 21.74 ±0.77 / 22.71 ms │     no change │
└───────┴────────────────────────────────┴────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Benchmark Summary                            ┃         ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ Total Time (HEAD)                            │ 59.34ms │
│ Total Time (feat_rg-reorder-by-statistics)   │ 59.17ms │
│ Average Time (HEAD)                          │ 14.83ms │
│ Average Time (feat_rg-reorder-by-statistics) │ 14.79ms │
│ Queries Faster                               │       1 │
│ Queries Slower                               │       0 │
│ Queries with No Change                       │       3 │
│ Queries with Failure                         │       0 │
└──────────────────────────────────────────────┴─────────┘

Resource Usage

sort_pushdown_inexact — base (merge-base)

Metric Value
Wall time 5.0s
Peak memory 4.6 GiB
Avg memory 4.6 GiB
CPU user 2.5s
CPU sys 0.4s
Peak spill 0 B

sort_pushdown_inexact — branch

Metric Value
Wall time 5.0s
Peak memory 4.6 GiB
Avg memory 4.6 GiB
CPU user 2.6s
CPU sys 0.3s
Peak spill 0 B

File an issue against this benchmark runner

When a parquet file has multiple row groups with out-of-order or
overlapping statistics, TopK queries benefit from reading "best" row
groups first so the dynamic filter threshold tightens quickly.

This PR adds:

1. `reorder_by_statistics`: sorts row groups by min values (ASC) based
   on parquet column statistics. Direction (DESC) is handled by the
   existing `reverse()` applied after reorder. The two steps compose:
   - Sorted data: reorder is a no-op, reverse gives perfect DESC order
   - Unsorted data: reorder fixes the order, reverse flips for DESC

2. `AccessPlanOptimizer` trait: extensible interface for row group
   access plan optimizations (reorder, reverse) applied after pruning.

3. `DynamicFilterPhysicalExpr.sort_options/fetch`: SortExec now passes
   sort direction and fetch limit to the dynamic filter, enabling the
   parquet reader to determine reorder direction for any TopK query.

4. `FileSource::reorder_files`: file-level reordering in the shared
   work queue so multi-file TopK reads the most promising files first.

5. Fix `SortExec::with_fetch` ordering: fetch must be set before
   `create_filter()` so the DynamicFilter gets the correct K value.
@zhuqi-lucas zhuqi-lucas force-pushed the feat/rg-reorder-by-statistics branch from 0c9c3d8 to f7d9156 Compare April 30, 2026 10:13
@zhuqi-lucas
Copy link
Copy Markdown
Contributor Author

The benchmark results are expected — RG reorder alone doesn't skip any row groups, it only changes the read order so that TopK's dynamic filter threshold converges faster.

The significant speedup (2-3x on sort_pushdown_inexact) comes from stats init + cumulative RG prune which will be in the follow-up PR. Those optimizations depend on RG reorder as a foundation:

  1. RG reorder: put best RGs first (this PR)
  2. Stats init: initialize TopK threshold from RG statistics before reading → prune RGs upfront (next PR)
  3. Cumulative prune: after reorder, truncate remaining RGs once enough rows are collected (next PR)

Without reorder, cumulative prune might truncate the wrong RGs. Reorder ensures the best RGs come first, making truncation safe and effective.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate datasource Changes to the datasource crate physical-expr Changes to the physical-expr crates physical-plan Changes to the physical-plan crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: global file reorder in shared work queue for TopK optimization Sort pushdown: reorder row groups by statistics within each file

3 participants