
feat: OFFSET pushdown for multi-file parquet scans #21915

@zhuqi-lucas

Description


Background

#21828 implements OFFSET pushdown for single-file parquet queries. Multi-file queries still use GlobalLimitExec for offset handling.

Problem

For a multi-file query such as SELECT * FROM directory/ LIMIT 5 OFFSET 1000000, the offset is handled by GlobalLimitExec, which reads all rows and then discards the first 1M. With multiple files, we could instead skip entire files whose cumulative row count falls within the offset.
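The arithmetic behind skipping whole files can be sketched as below (hypothetical helper, not DataFusion's actual API): walk the files in order, and as long as a file's row count fits inside the remaining offset, skip it without opening.

```rust
// Sketch, not DataFusion's actual API: given per-file row counts, decide how
// many leading files an OFFSET can skip entirely and what residual offset
// remains for the first file that must actually be opened.
fn plan_offset_skip(file_rows: &[usize], offset: usize) -> (usize, usize) {
    let mut remaining = offset;
    let mut skipped_files = 0;
    for &rows in file_rows {
        if rows <= remaining {
            // The whole file falls inside the offset: skip it without opening.
            remaining -= rows;
            skipped_files += 1;
        } else {
            break;
        }
    }
    (skipped_files, remaining)
}

fn main() {
    // Three files of 400k rows each with OFFSET 1_000_000: the first two files
    // are skipped entirely; a residual offset of 200k applies to the third.
    let (skipped, residual) = plan_offset_skip(&[400_000, 400_000, 400_000], 1_000_000);
    println!("skip {skipped} files, residual offset {residual}");
}
```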

Challenge

File read order is non-deterministic when target_partitions > 1 with dynamic scheduling (#21351). A shared counter (Arc&lt;AtomicUsize&gt;) across file openers would work for single-partition sequential reads, but across multiple partitions the read order is undefined.
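The shared-counter idea could look roughly like this (names and shape are hypothetical, not DataFusion's API): each file opener atomically claims rows against the remaining offset before emitting anything. The atomics make the bookkeeping race-free, but the *result* is only meaningful when files are opened in a deterministic order, which is exactly the limitation described above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Sketch of the shared-counter idea (hypothetical, not DataFusion's API):
// each file opener atomically claims rows against the remaining offset.
// Correct only when files are opened in a deterministic order.
struct OffsetTracker {
    remaining: AtomicUsize,
}

impl OffsetTracker {
    fn new(offset: usize) -> Arc<Self> {
        Arc::new(Self { remaining: AtomicUsize::new(offset) })
    }

    /// Returns how many of this file's `file_rows` still fall inside the
    /// offset and must be discarded (== file_rows means: skip the whole file).
    fn claim(&self, file_rows: usize) -> usize {
        let mut claimed = 0;
        let _ = self.remaining.fetch_update(Ordering::SeqCst, Ordering::SeqCst, |rem| {
            claimed = rem.min(file_rows);
            Some(rem - claimed)
        });
        claimed
    }
}

fn main() {
    let tracker = OffsetTracker::new(1_000_000);
    // File 1 (400k rows) is fully inside the offset: discard all of it.
    println!("file 1 discards {}", tracker.claim(400_000));
    // File 2 (700k rows) discards the remaining 600k, then emits 100k rows.
    println!("file 2 discards {}", tracker.claim(700_000));
}
```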

Proposed approach

  1. Single partition (preserve_order=true): files are read in deterministic order → a shared counter tracks the consumed offset across files → skip entire files and row groups
  2. Multi-partition: keep GlobalLimitExec (read order is undefined without ORDER BY)
  3. Use file-level statistics (PartitionedFile.statistics.num_rows) to skip entire files before opening them
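Step 3 hinges on row-count statistics being optional per file. A minimal sketch of that rule, with hypothetical types standing in for PartitionedFile.statistics: a file can be skipped before opening only while every preceding file's row count is known; the first file with unknown statistics (or one the offset only partially covers) must be opened, so skipping stops there.

```rust
// Hypothetical stand-in for PartitionedFile.statistics.num_rows, which may
// be absent for some files.
struct FileStats {
    num_rows: Option<usize>,
}

// Returns how many leading files can be skipped without opening them.
fn skippable_prefix(files: &[FileStats], offset: usize) -> usize {
    let mut remaining = offset;
    let mut skipped = 0;
    for f in files {
        match f.num_rows {
            Some(rows) if rows <= remaining => {
                remaining -= rows;
                skipped += 1;
            }
            // Unknown statistics, or a file the offset only partially covers:
            // it must be opened, so stop skipping here.
            _ => break,
        }
    }
    skipped
}

fn main() {
    let files = vec![
        FileStats { num_rows: Some(600_000) },
        FileStats { num_rows: Some(300_000) },
        FileStats { num_rows: None },
        FileStats { num_rows: Some(500_000) },
    ];
    // The first two files (900k rows) are skipped; the third has no stats,
    // so it and everything after it must be opened.
    println!("skip {} files", skippable_prefix(&files, 1_000_000));
}
```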
