Skip to content

fix: Improve TopK filter pushdown#28

Merged
peasee merged 6 commits into
spiceai-52.5from
peasee/260421-topk-improvements
Apr 22, 2026
Merged

fix: Improve TopK filter pushdown#28
peasee merged 6 commits into
spiceai-52.5from
peasee/260421-topk-improvements

Conversation

@peasee
Copy link
Copy Markdown

@peasee peasee commented Apr 21, 2026

  • TopK filters weren't reliably being pushed down into scans as their dynamic filter link over Arc<T> was breaking when they were serialized/deserialized into executors. This resulted in each executor creating its own independent TopK filter, resulting in more rows scanned than necessary
  • Collapsed plans with multiple stages where a TopK LIMIT N has a small N - if the N is small enough, there is no need to shuffle data as we can merge it all from the single stage. This avoids the overhead of 1) coordinating launching a new stage and 2) reading/writing from shuffle for only a handful of rows.

@peasee peasee self-assigned this Apr 21, 2026
Copilot AI review requested due to automatic review settings April 21, 2026 07:44
@peasee peasee added the bug Something isn't working label Apr 21, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves distributed TopK execution in Ballista by (1) ensuring TopK-related dynamic filter links survive scheduler→executor protobuf roundtrips, and (2) avoiding unnecessary stage breaks/shuffles for small TopK limits.

Changes:

  • Skip the SortPreservingMergeExec stage break for “small” TopK fetch/limits to avoid shuffle overhead.
  • Re-run DataFusion’s post-optimization filter pushdown on executors to re-establish dynamic filter linkages after deserialization.
  • Add a regression test for TopK stage planning and introduce tempfile as a scheduler dev-dependency.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
ballista/scheduler/src/planner.rs Adds TopK stage-collapsing logic for small fetch on SortPreservingMergeExec and a new test validating stage counts.
ballista/scheduler/Cargo.toml Adds tempfile for the new planner test.
ballista/executor/src/execution_engine.rs Re-runs FilterPushdown(Post) on the executor side to restore dynamic filter wiring after serde.
ballista/core/src/execution_plans/shuffle_writer.rs Removes manual write-time timer handling for the non-partitioned shuffle write path.
Cargo.lock Records the added tempfile dependency.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ballista/executor/src/execution_engine.rs Outdated
Comment thread ballista/scheduler/src/planner.rs
Comment thread ballista/core/src/execution_plans/shuffle_writer.rs
Copilot AI review requested due to automatic review settings April 22, 2026 03:26
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves distributed TopK execution in Ballista by (1) ensuring dynamic filter pushdown is re-established after plan serialization/deserialization and (2) avoiding shuffle-stage overhead for small ORDER BY ... LIMIT N queries by collapsing certain stage breaks.

Changes:

  • Skip creating a stage break at SortPreservingMergeExec when fetch/limit is small to avoid shuffle overhead for TopK.
  • Re-run DataFusion’s post-optimization FilterPushdown on the executor to restore dynamic filter links lost during protobuf round-trip.
  • Add a scheduler-side regression test for TopK stage-planning behavior and include tempfile for the test.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.

File Description
ballista/scheduler/src/planner.rs Adds TopK-specific stage-planning heuristic and a regression test asserting stage count behavior.
ballista/scheduler/Cargo.toml Adds tempfile as a dev-dependency for the new test.
ballista/executor/src/execution_engine.rs Re-runs FilterPushdown after deserialization to restore dynamic filter pushdown links.
Cargo.lock Locks tempfile inclusion.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ballista/scheduler/src/planner.rs
Comment thread ballista/executor/src/execution_engine.rs Outdated
Comment thread ballista/executor/src/execution_engine.rs
@peasee peasee merged commit 8bc4d75 into spiceai-52.5 Apr 22, 2026
30 checks passed
@peasee peasee deleted the peasee/260421-topk-improvements branch April 22, 2026 03:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working needs-upstream

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants