Skip to content

fix(stats): close race in RuntimeStatsManager shutdown#6891

Merged
samstokes merged 4 commits into
mainfrom
claude/fix-limit-test-flaky-623yY
May 6, 2026
Merged

fix(stats): close race in RuntimeStatsManager shutdown#6891
samstokes merged 4 commits into
mainfrom
claude/fix-limit-test-flaky-623yY

Conversation

@samstokes
Copy link
Copy Markdown
Collaborator

@samstokes samstokes commented May 6, 2026

Summary

take_input_snapshot could return empty stats for fast-finishing pipelines (e.g. a hit LIMIT), surfacing as the flaky test_limit_without_estimated_rows failure on rust-tests-platform.

Race

When a pipeline finishes early, run_execution_loop fires finish_tx. The manager's biased select! could pick finish_rx over still-queued RegisterRuntimeStats / TakeInputSnapshot messages on node_rx, then break the loop and only build finished_snapshots after the break. A concurrent take_input_snapshot landing in that window saw finished_snapshots == None, sent on a channel the manager no longer read, and got Err back — empty stats.

Fix (src/daft-local-execution/src/runtime_stats/mod.rs)

  • Drain node_rx (try_recv) inside the finish_rx arm so late register messages aren't dropped.
  • Build & publish finished_snapshots before break, so it's visible the moment the loop exits.
  • In take_input_snapshot, fall back to finished_snapshots after a channel send/recv failure.

Locally repro'd at ~10% on the test pre-fix; 0/100 with the fix.

https://claude.ai/code/session_01NVUu3zCZwk3rdaiXpPnYKD

… lost stats for fast-finishing pipelines

When a pipeline finishes early (e.g. a hit LIMIT), `run_execution_loop`
fires `finish_tx`, which the manager's `select!` can pick over pending
`RegisterRuntimeStats` / `TakeInputSnapshot` messages on `node_rx`. The
old shutdown path broke out of the loop without draining `node_rx` and
only built `finished_snapshots` after the break, leaving a window in
which `take_input_snapshot` saw `None` snapshots, sent on a channel the
manager no longer read, and got `Err` back — surfacing as zero-stats
events for completed tasks.

Fix:
- drain `node_rx` (try_recv) inside the `finish_rx` arm so late-arriving
  register/snapshot messages aren't dropped;
- publish `finished_snapshots` before `break` so it's visible the moment
  the loop exits;
- in `take_input_snapshot`, fall back to `finished_snapshots` after a
  channel send or response failure, since both can race with the
  manager's final shutdown step.

Reproed locally as the flaky `test_limit_without_estimated_rows`
failure on rust-tests-platform; with this change it passes 100/100.

https://claude.ai/code/session_01NVUu3zCZwk3rdaiXpPnYKD
@samstokes samstokes requested a review from a team as a code owner May 6, 2026 17:19
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 6, 2026

Rust Dependency Diff

Head: 11a1aedb104667c9e817e84239ffa944eb20c3d6 vs Base: 63f71392386916cf561b51bf5b690026034c41bd.

OK: Within budget.

  • New Crates: 0
  • Removed Crates: 0

@samstokes samstokes changed the title Fix race condition in RuntimeStatsManager shutdown fix(local-execution): close race in RuntimeStatsManager shutdown May 6, 2026
@github-actions github-actions Bot added the fix label May 6, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 6, 2026

Greptile Summary

This PR closes a shutdown race in RuntimeStatsManager that caused take_input_snapshot to return empty stats for fast-finishing pipelines (e.g. a hit LIMIT). The fix tightens the shutdown sequence so no message or snapshot can be lost between the finish_rx signal and the task exit.

  • Drain before break: when finish_rx fires, the new code drains all still-queued node_rx messages via try_recv before proceeding, so RegisterRuntimeStats and TakeInputSnapshot messages that beat the finish signal but lost the select! race are not discarded.
  • Publish finished_snapshots before break: finalized snapshot data is now written to the shared Mutex inside the finish_rx arm rather than after the loop, eliminating the window between break and assignment where a concurrent channel send would queue but never be read.
  • Fallback in take_input_snapshot: both the send-failure path and the rx.await error path now fall back to finished_snapshots (which is guaranteed to be populated before the task exits), converting what was a hard InternalError into a successful stat retrieval for the fast-finish case.

Confidence Score: 5/5

Safe to merge — the change closes a documented race without altering the normal hot path, and all three shutdown windows are addressed consistently.

The fix is logically tight: draining the channel before finalizing nodes, publishing finished_snapshots before break, and falling back gracefully in take_input_snapshot all compose correctly. The invariant that finished_snapshots is set before node_rx is dropped holds in all paths.

No files require special attention.

Important Files Changed

Filename Overview
src/daft-local-execution/src/runtime_stats/mod.rs Fixes a three-part shutdown race in RuntimeStatsManager: drain node_rx before break, publish finished_snapshots before break, and fall back to finished_snapshots on channel errors in take_input_snapshot. Logic is correct and all race windows are addressed.

Sequence Diagram

sequenceDiagram
    participant Caller as take_input_snapshot
    participant Chan as mpsc channel (node_rx)
    participant Mgr as Manager event loop
    participant FS as finished_snapshots (Mutex)

    note over Mgr: finish_rx fires (pipeline done)
    Mgr->>Chan: try_recv() drain loop
    Chan-->>Mgr: RegisterRuntimeStats / TakeInputSnapshot (if queued)
    Mgr->>Mgr: flush_and_finalize_node for active nodes
    Mgr->>FS: write finished snapshots (before break)
    Mgr->>Mgr: break, task exits, node_rx dropped

    Caller->>FS: take_finished_snapshot (fast path)
    alt already published
        FS-->>Caller: Some(stats)
    else not yet published
        FS-->>Caller: None
        Caller->>Chan: send TakeInputSnapshot
        alt send fails (node_rx dropped)
            Caller->>FS: take_finished_snapshot (fallback)
            FS-->>Caller: Some(stats)
        else send succeeds, await rx
            Mgr-->>Caller: Ok(stats) via oneshot
        else rx.await Err (responder dropped)
            Caller->>FS: take_finished_snapshot (fallback)
            FS-->>Caller: Some(stats)
        end
    end
Loading

Reviews (3): Last reviewed commit: "refactor(stats): extract handle_message ..." | Re-trigger Greptile

The closure mutates passed-in `&mut` references, not captured state, so
`mut` on the binding itself is unused. Caught by clippy in the style CI.

https://claude.ai/code/session_01NVUu3zCZwk3rdaiXpPnYKD
@samstokes samstokes changed the title fix(local-execution): close race in RuntimeStatsManager shutdown fix(stats): close race in RuntimeStatsManager shutdown May 6, 2026
@samstokes
Copy link
Copy Markdown
Collaborator Author

@greptileai please take another look

@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

Codecov Report

❌ Patch coverage is 81.96721% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.39%. Comparing base (044b833) to head (c699e0b).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-local-execution/src/runtime_stats/mod.rs 81.96% 22 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #6891      +/-   ##
==========================================
+ Coverage   75.37%   75.39%   +0.02%     
==========================================
  Files        1135     1135              
  Lines      160346   160375      +29     
==========================================
+ Hits       120857   120913      +56     
+ Misses      39489    39462      -27     
Files with missing lines Coverage Δ
src/daft-local-execution/src/runtime_stats/mod.rs 92.09% <81.96%> (-2.54%) ⬇️

... and 17 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

The closure had grown to ~50 lines and only captured shared references
that can be passed explicitly, mirroring flush_and_finalize_node's
signature. Reads more easily as a method.

https://claude.ai/code/session_01NVUu3zCZwk3rdaiXpPnYKD
@samstokes
Copy link
Copy Markdown
Collaborator Author

@greptileai one more time please

@samstokes samstokes merged commit 8b81448 into main May 6, 2026
38 checks passed
@samstokes samstokes deleted the claude/fix-limit-test-flaky-623yY branch May 6, 2026 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants