fix: handle None task slot in update_task_info after executor lost #23

Merged
phillipleblanc merged 1 commit into spiceai-52 from phillip/260304-fix-9636 on Mar 5, 2026

Conversation

@phillipleblanc

Summary

When an executor heartbeat times out, reset_tasks() sets task_infos[partition_id] to None. If the executor later reconnects and sends a late status update, update_task_info() would panic on .unwrap() of the None value.

update_task_info() now returns false (update rejected) and logs a warning when the task slot is None, so a late status update no longer crashes the scheduler.

Changes

  • Replace .unwrap() with let-else pattern that logs a warning and returns false
  • Add 3 regression tests covering the None slot scenario
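For readers unfamiliar with let-else, here is a minimal sketch of the pattern described above. It is not the actual diff: `RunningStage`, `TaskInfo`, `reset_task`, and the `&str` status are simplified stand-ins assumed for illustration, and `eprintln!` stands in for the real warning log.

```rust
/// Hypothetical, simplified stand-in for the scheduler's per-task record.
#[derive(Debug, Clone, PartialEq)]
struct TaskInfo {
    status: String,
}

/// Hypothetical stand-in for a running stage: one slot per partition.
struct RunningStage {
    /// `None` means the slot was cleared, e.g. after the executor was lost.
    task_infos: Vec<Option<TaskInfo>>,
}

impl RunningStage {
    /// Clears a slot, mirroring what the summary says reset_tasks() does.
    fn reset_task(&mut self, partition_id: usize) {
        self.task_infos[partition_id] = None;
    }

    /// Returns true if the update was applied, false if it was rejected.
    fn update_task_info(&mut self, partition_id: usize, new_status: &str) -> bool {
        // Before the fix this was effectively `.unwrap()`, which panics on None.
        let Some(info) = self.task_infos[partition_id].as_mut() else {
            // Stand-in for the warning log mentioned in the PR description.
            eprintln!(
                "warning: ignoring status update for partition {partition_id}: task slot is None"
            );
            return false;
        };
        info.status = new_status.to_string();
        true
    }
}
```

Returning false instead of panicking lets the caller treat the late update as stale and carry on.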

Testing

  • cargo test -p ballista-scheduler -- execution_stage::tests — 3/3 pass

Fixes spiceai/spiceai#9636 (bug 1: scheduler panic)

phillipleblanc added a commit to spiceai/spiceai that referenced this pull request Mar 5, 2026
Two fixes for async query scheduler issues:

1. Update datafusion-ballista to fix scheduler panic (spiceai/datafusion-ballista#23):
   When an executor heartbeat times out, reset_tasks() sets task slots to None.
   Late status updates from the reconnected executor would panic on unwrap().
   Now gracefully rejects stale updates with a warning.

2. Remove duplicate cancel_job calls from background task:
   Both executor.cancel() and the background task's select! cancel branch
   were calling job_store.cancel_job(), racing via OCC and producing
   'Concurrent modification detected' HTTP 500 errors. The background task
   now only cancels the distributed query; state update is solely handled
   by executor.cancel().

Fixes #9636
@phillipleblanc phillipleblanc self-assigned this Mar 5, 2026
@phillipleblanc phillipleblanc marked this pull request as ready for review March 5, 2026 07:42
@phillipleblanc phillipleblanc merged commit e1153d7 into spiceai-52 Mar 5, 2026
29 checks passed
@phillipleblanc phillipleblanc deleted the phillip/260304-fix-9636 branch March 5, 2026 07:50
phillipleblanc added a commit to spiceai/spiceai that referenced this pull request Mar 5, 2026
Two fixes for async query scheduler issues:

1. Update datafusion-ballista to fix scheduler panic (spiceai/datafusion-ballista#23):
   When an executor heartbeat times out, reset_tasks() sets task slots to None.
   Late status updates from the reconnected executor would panic on unwrap().
   Now gracefully rejects stale updates with a warning.

2. Fix cancel race condition with fibonacci backoff retry:
   cancel_job() could race with set_job_running() in the background task,
   both reading the same OCC version and one failing with
   ConcurrentModification (HTTP 500). Now cancel_job retries with fibonacci
   backoff on conflict, re-reading state each time. Also removed duplicate
   cancel_job calls from the background task's select! cancel branch.

Fixes #9636
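The retry described in item 2 can be sketched roughly as follows. This is an illustrative sketch only, not the code from the spiceai commit: `StoreError`, `retry_on_conflict`, and its parameters are hypothetical names. The closure is expected to re-read the current state and version on every call before attempting the write, which is what makes retrying after an OCC conflict safe.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical error type standing in for the job store's errors.
#[derive(Debug, PartialEq)]
enum StoreError {
    /// Another writer bumped the version between our read and our write.
    ConcurrentModification,
    /// Any other failure; never retried.
    NotFound,
}

/// Runs `attempt` until it returns anything other than a conflict, or until
/// `max_attempts` is reached. At least one attempt is always made.
/// Delays grow as base_delay * 1, 1, 2, 3, 5, ...
fn retry_on_conflict<T>(
    max_attempts: u32,
    base_delay: Duration,
    mut attempt: impl FnMut() -> Result<T, StoreError>,
) -> Result<T, StoreError> {
    let (mut prev, mut curr) = (0u32, 1u32);
    let mut tries = 0u32;
    loop {
        tries += 1;
        match attempt() {
            // Retry only on an OCC conflict, and only while attempts remain.
            Err(StoreError::ConcurrentModification) if tries < max_attempts => {
                sleep(base_delay * curr);
                let next = prev.saturating_add(curr);
                prev = curr;
                curr = next;
            }
            // Success, a non-conflict error, or the final conflict.
            other => return other,
        }
    }
}
```

Fibonacci delays grow more slowly than exponential ones, which may suit a short-lived race like this one; the attempt limit and base delay are tuning choices assumed here, not values from the commit.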
github-merge-queue Bot pushed a commit to spiceai/spiceai that referenced this pull request Mar 5, 2026