fix: broadcast job state correctly#30
Conversation
There was a problem hiding this comment.
Pull request overview
This PR reorders job state broadcasting in the scheduler so terminal JobStateEvents (Completed/Failed/Cancelled) are emitted after attempting to persist the corresponding terminal status in the Ballista job store, reducing a race where subscribers observe a terminal event before the store reflects it.
Changes:
- Move “Completed” broadcast to occur after
task_manager.succeed_jobis invoked. - Move “Failed” broadcast to occur after
task_manager.abort_jobis invoked (and clonefail_messageto avoid move). - Move “Cancelled” broadcast to occur after
task_manager.cancel_jobis invoked.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Adjusts scheduler job-state broadcasting to avoid races where subscribers observe a terminal JobStateEvent before the terminal JobStatus is persisted in the Ballista job store.
Changes:
- Reordered terminal job-state broadcasts to occur after persisting
succeed_job/abort_job/cancel_joboutcomes. - Simplified several map iterations by switching from
.iter()with unused keys to.values().
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
ballista/scheduler/src/scheduler_server/query_stage_scheduler.rs |
Persist terminal job status before broadcasting Completed/Failed/Cancelled events to subscribers. |
ballista/scheduler/src/state/executor_manager.rs |
Use .values() when iterating executor heartbeats where keys are unused. |
ballista/scheduler/src/state/execution_graph.rs |
Use .values() when iterating stages where keys are unused in running_tasks(). |
ballista/scheduler/src/state/aqe/mod.rs |
Use .values() when iterating stages where keys are unused in running_tasks(). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
This PR aims to eliminate a race between job-state broadcasts and persisted job status by broadcasting terminal job state only after writing the final status to the Ballista job store, so consumers can read the correct terminal state immediately.
Changes:
- Reordered terminal job-state broadcasts (
Failed,Completed,Cancelled) to occur after persisting job terminal status in the task manager. - Adjusted cancellation/task-cancellation event posting behavior and error handling in the scheduler event loop.
- Simplified several
HashMapiterations from.iter()to.values()where keys are unused.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| ballista/scheduler/src/state/executor_manager.rs | Iterates heartbeats via .values() instead of .iter() where executor IDs aren’t needed. |
| ballista/scheduler/src/state/execution_graph.rs | Iterates stages via .values() for running task enumeration. |
| ballista/scheduler/src/state/aqe/mod.rs | Same .values() simplification for adaptive execution graph running task enumeration. |
| ballista/scheduler/src/scheduler_server/query_stage_scheduler.rs | Reorders terminal-state persistence vs broadcast; modifies cancellation event posting/error handling and cleanup timing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
COMPLETEwhen we have not yet updated it in the Ballista job store. A consumer would then look up the job from the broadcast notification, find it is in fact notCOMPLETE, then wait some time before checking the job again