fix: handle None task slot in update_task_info after executor lost (#23)
Merged
When an executor heartbeat times out, reset_tasks() sets task_infos[partition_id] to None. If the executor later reconnects and sends a late status update, update_task_info() would panic on .unwrap() of the None value. Now gracefully returns false (update rejected) with a warning log when the task slot is None, preventing the scheduler from crashing. Fixes spiceai/spiceai#9636
phillipleblanc added a commit to spiceai/spiceai that referenced this pull request on Mar 5, 2026

Two fixes for async query scheduler issues:

1. Update datafusion-ballista to fix scheduler panic (spiceai/datafusion-ballista#23): When an executor heartbeat times out, reset_tasks() sets task slots to None. Late status updates from the reconnected executor would panic on unwrap(). Now gracefully rejects stale updates with a warning.

2. Remove duplicate cancel_job calls from background task: Both executor.cancel() and the background task's select! cancel branch were calling job_store.cancel_job(), racing via OCC and producing 'Concurrent modification detected' HTTP 500 errors. The background task now only cancels the distributed query; state update is solely handled by executor.cancel().

Fixes #9636
ewgenius approved these changes on Mar 5, 2026
phillipleblanc added a commit to spiceai/spiceai that referenced this pull request on Mar 5, 2026

Two fixes for async query scheduler issues:

1. Update datafusion-ballista to fix scheduler panic (spiceai/datafusion-ballista#23): When an executor heartbeat times out, reset_tasks() sets task slots to None. Late status updates from the reconnected executor would panic on unwrap(). Now gracefully rejects stale updates with a warning.

2. Fix cancel race condition with Fibonacci backoff retry: cancel_job() could race with set_job_running() in the background task, both reading the same OCC version and one failing with ConcurrentModification (HTTP 500). Now cancel_job retries with Fibonacci backoff on conflict, re-reading state each time. Also removed duplicate cancel_job calls from the background task's select! cancel branch.

Fixes #9636
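The retry described in item 2 can be illustrated with a minimal sketch. It is not the spiceai implementation: `StoreError`, `fibonacci_delays`, and `retry_on_conflict` are hypothetical names, and the sketch uses a blocking `std::thread::sleep` where the real background task is presumably async. The closure passed in stands for "re-read the job state and attempt the versioned write again".

```rust
use std::time::Duration;

// Hypothetical error type; only the conflict variant triggers a retry.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum StoreError {
    ConcurrentModification,
    NotFound,
}

/// Returns `attempts` delays that grow as 1, 1, 2, 3, 5, ... times `base`.
fn fibonacci_delays(base: Duration, attempts: usize) -> Vec<Duration> {
    let (mut a, mut b) = (1u32, 1u32);
    let mut delays = Vec::with_capacity(attempts);
    for _ in 0..attempts {
        delays.push(base.saturating_mul(a));
        let next = a.saturating_add(b);
        a = b;
        b = next;
    }
    delays
}

/// Runs `op` and retries it up to `max_retries` times when it reports an
/// optimistic-concurrency conflict. `op` is expected to re-read the current
/// state on every call so each attempt uses a fresh version.
fn retry_on_conflict<T>(
    base: Duration,
    max_retries: usize,
    mut op: impl FnMut() -> Result<T, StoreError>,
) -> Result<T, StoreError> {
    for delay in fibonacci_delays(base, max_retries) {
        match op() {
            // Another writer won the race: wait, then try again.
            Err(StoreError::ConcurrentModification) => std::thread::sleep(delay),
            // Success or any other error is returned immediately.
            other => return other,
        }
    }
    // Final attempt after the last delay; its result is returned as-is.
    op()
}
```

A caller would wrap the read-then-write cancel step in the closure, so a conflict with a concurrent writer such as set_job_running() leads to a fresh read instead of an HTTP 500. Fibonacci growth backs off more gently than doubling, which suits short-lived conflicts like this one.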
github-merge-queue Bot pushed a commit to spiceai/spiceai that referenced this pull request on Mar 5, 2026

Two fixes for async query scheduler issues:

1. Update datafusion-ballista to fix scheduler panic (spiceai/datafusion-ballista#23): When an executor heartbeat times out, reset_tasks() sets task slots to None. Late status updates from the reconnected executor would panic on unwrap(). Now gracefully rejects stale updates with a warning.

2. Fix cancel race condition with Fibonacci backoff retry: cancel_job() could race with set_job_running() in the background task, both reading the same OCC version and one failing with ConcurrentModification (HTTP 500). Now cancel_job retries with Fibonacci backoff on conflict, re-reading state each time. Also removed duplicate cancel_job calls from the background task's select! cancel branch.

Fixes #9636
lukekim pushed a commit that referenced this pull request on Apr 22, 2026
Summary

When an executor heartbeat times out, reset_tasks() sets task_infos[partition_id] to None. If the executor later reconnects and sends a late status update, update_task_info() would panic on .unwrap() of the None value.

Now gracefully returns false (update rejected) with a warning log when the task slot is None, preventing the scheduler from crashing.

Changes

- Replace .unwrap() with let-else pattern that logs a warning and returns false
- None slot scenario

Testing

cargo test -p ballista-scheduler -- execution_stage::tests: 3/3 pass

Fixes spiceai/spiceai#9636 (bug 1: scheduler panic)
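
The let-else guard described in the summary can be sketched as follows. This is a simplified illustration, not the actual ballista-scheduler code: `TaskStatus`, `TaskInfo`, and `RunningStage` are hypothetical stand-ins, `reset_tasks()` here clears every slot instead of only those of the lost executor, and the warning goes to stderr instead of a logging framework.

```rust
// Hypothetical, simplified stand-ins for the scheduler's types.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum TaskStatus {
    Running,
    Successful,
}

#[derive(Debug, Clone)]
struct TaskInfo {
    status: TaskStatus,
}

struct RunningStage {
    // One slot per partition; None means no task is currently assigned.
    task_infos: Vec<Option<TaskInfo>>,
}

impl RunningStage {
    /// Simplified: clears every slot, as after an executor is declared lost.
    fn reset_tasks(&mut self) {
        for slot in self.task_infos.iter_mut() {
            *slot = None;
        }
    }

    /// Returns true if the update was applied, false if it was rejected.
    fn update_task_info(&mut self, partition_id: usize, status: TaskStatus) -> bool {
        // let-else: bind the task if the slot is occupied, otherwise warn and
        // reject the update instead of panicking on unwrap().
        let Some(task) = self
            .task_infos
            .get_mut(partition_id)
            .and_then(|slot| slot.as_mut())
        else {
            eprintln!("warning: ignoring status update for partition {partition_id}: task slot is empty");
            return false;
        };
        task.status = status;
        true
    }
}
```

In this sketch an out-of-range partition_id is rejected the same way as an empty slot; that is a choice made for the illustration, not something the pull request states.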