Commit 7e9872a
feat(scheduler): detect and report stuck queries (#39)
Adds an operator-facing warning when a distributed query stops making
progress, with enough detail in-line to diagnose without rerunning
with debug logging. Targets spiceai/spiceai#10832 where the production
wedge is rare and high-impact: by the time it is noticed the cluster
needs restarting, so any debug-level telemetry would no longer be
in scope.
Single diagnostic surface, single signal visible at the default info level:
1. A 30s background loop on the scheduler samples per-stage progress
   for every active job. Capture uses a 500ms try_read budget per
   graph so a held write lock is itself recorded as a separate state
   ("Locked") rather than blocking the snapshot.
2. After four consecutive samples with no per-stage movement, while
   the cluster has live executors and the job is not terminal, the
   loop emits one block of warn lines. The block re-fires every 30s
   while the condition holds and goes silent the moment progress
   resumes.
3. The block contains:
   - Primary line: query id, elapsed stuck time, pending task count
   - One line per Running stage with unassigned/incomplete partitions
   - "Lock could not be read" line if the snapshot timed out
   - Alive executor count
   - The event handler currently in flight on the scheduler event
     loop with its elapsed time, if a handler has been running
     longer than 30s
   Lines after the primary are emitted only when their underlying
   check fires, so the block reads as a coherent diagnosis of which
   code path is hung.
4. To support (3), EventLoop now publishes an EventInFlight slot
   populated around each on_receive call and exposed via
   EventLoop::in_flight_signal(). EventAction gains an event_label
   method (default empty) that the scheduler implements to give each
   QueryStageSchedulerEvent variant a stable label for the warn block.
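The sampling rule in items 1 and 2 can be sketched roughly as follows.
This is a minimal illustration, not the code in this commit: StallTracker,
Sample and observe are hypothetical names, the per-stage progress is
reduced to a vector of completed-partition counts, and the sketch assumes
that "four consecutive samples" counts the first sample of an unchanged
run and that repeated "Locked" samples count as no movement.

```rust
use std::collections::HashMap;

/// Result of one sampling attempt for a job (hypothetical shape).
/// `Locked` stands for a snapshot whose read lock was not acquired in budget.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Sample {
    /// Completed-partition count per stage.
    Progress(Vec<usize>),
    Locked,
}

/// Consecutive identical samples required before warning (from the text above).
const STALL_SAMPLES: u32 = 4;

#[derive(Default)]
struct StallTracker {
    /// Job id -> (last sample, length of the current run of identical samples).
    seen: HashMap<String, (Sample, u32)>,
}

impl StallTracker {
    /// Records one sample and returns true when a warn block should be emitted.
    fn observe(
        &mut self,
        job_id: &str,
        sample: Sample,
        alive_executors: usize,
        terminal: bool,
    ) -> bool {
        // Extend the run if nothing moved, otherwise start a new run at 1.
        let streak = match self.seen.get(job_id) {
            Some((prev, n)) if *prev == sample => n + 1,
            _ => 1,
        };
        self.seen.insert(job_id.to_string(), (sample, streak));
        // Warn only for non-terminal jobs on a cluster with live executors.
        streak >= STALL_SAMPLES && alive_executors > 0 && !terminal
    }
}

fn main() {
    let mut tracker = StallTracker::default();
    let stuck = Sample::Progress(vec![3, 0]);
    for tick in 1..=5 {
        let fire = tracker.observe("job-1", stuck.clone(), 2, false);
        println!("tick {tick}: warn = {fire}");
    }
    // Progress resumes: the run resets and the warning goes silent.
    let fire = tracker.observe("job-1", Sample::Progress(vec![4, 0]), 2, false);
    println!("after progress: warn = {fire}");
}
```

Because the run length is not cleared after firing, the warning repeats
on every later sample until a stage moves, which matches the re-fire
behaviour described in item 2.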
Volume: zero lines from this instrumentation at default RUST_LOG=info
when the cluster is healthy. When a query is stuck, ~3-6 lines per
30s while the condition holds.
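To make the line count concrete, here is a hypothetical sketch of how a
block like the one in item 3 could be assembled. StuckSnapshot,
StageStatus, warn_block and the message wording are all illustrative
assumptions, not the helpers added in this commit. The sketch also
assumes the alive executor line is always present.

```rust
/// Counters for one stage in the Running state (hypothetical shape).
struct StageStatus {
    stage_id: usize,
    unassigned: usize,
    incomplete: usize,
}

/// Everything the warn block needs for one stuck query (hypothetical shape).
struct StuckSnapshot {
    job_id: String,
    stuck_secs: u64,
    pending_tasks: usize,
    running_stages: Vec<StageStatus>,
    lock_timed_out: bool,
    alive_executors: usize,
    /// Label and elapsed seconds of the handler in flight, if any.
    in_flight: Option<(String, u64)>,
}

/// A handler is reported only after running longer than this (from the text above).
const HANDLER_THRESHOLD_SECS: u64 = 30;

/// Builds the primary line plus the conditional follow-up lines.
fn warn_block(s: &StuckSnapshot) -> Vec<String> {
    let mut lines = vec![format!(
        "query {} stuck for {}s, {} pending tasks",
        s.job_id, s.stuck_secs, s.pending_tasks
    )];
    for st in &s.running_stages {
        if st.unassigned > 0 || st.incomplete > 0 {
            lines.push(format!(
                "  stage {}: {} unassigned, {} incomplete partitions",
                st.stage_id, st.unassigned, st.incomplete
            ));
        }
    }
    if s.lock_timed_out {
        lines.push("  lock could not be read".to_string());
    }
    lines.push(format!("  alive executors: {}", s.alive_executors));
    if let Some((label, secs)) = &s.in_flight {
        if *secs > HANDLER_THRESHOLD_SECS {
            lines.push(format!("  event handler in flight: {} ({}s)", label, secs));
        }
    }
    lines
}

fn main() {
    let snapshot = StuckSnapshot {
        job_id: "job-1".to_string(),
        stuck_secs: 120,
        pending_tasks: 8,
        running_stages: vec![StageStatus { stage_id: 2, unassigned: 8, incomplete: 8 }],
        lock_timed_out: false,
        alive_executors: 3,
        in_flight: Some(("example_event".to_string(), 45)),
    };
    for line in warn_block(&snapshot) {
        println!("{line}");
    }
}
```

In a real scheduler each line would go through the logging macro at warn
level rather than println!.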
Verified: cargo build, cargo test -p ballista-core -p ballista-scheduler
(83 passed, 1 ignored, includes 4 new tests covering the warn-block
helpers), cargo clippy -D warnings, cargo fmt --check all clean.
1 parent 8afc1b7 commit 7e9872a
4 files changed
Lines changed: 506 additions & 3 deletions
File tree
- ballista
- core/src
- scheduler/src
- scheduler_server
- state