iris/worker: protect freshly-submitted tasks from StartTasks→PollTasks race (#5043)
## Summary
Fixes #5041.
When the iris controller dispatches a task via `StartTasks` and polls
the worker for state via `PollTasks` before its own view of
`expected_tasks` has caught up, the worker treated the just-submitted
task as "unexpected" and killed it. That kill rolled up the workers-pool
job to `JOB_STATE_KILLED` and cascaded the surviving tasks with
`error="Job was terminated."`, surfacing in zephyr as the misleading
`"Worker job terminated permanently… Workers likely crashed"` abort.
`handle_heartbeat` already guarded against this race by passing
`extra_expected_keys` for the tasks it had just submitted in that RPC.
`handle_poll_tasks` did not: because StartTasks is a separate RPC, the
PollTasks path has no "tasks_to_run" field, so freshly-submitted tasks
had no protection.
### Approach
- Track recent submissions on the worker: `submit_task` now records
`(task_id, attempt_id) -> monotonic_time` in `self._recent_submissions`.
- `_reconcile_expected_tasks` treats keys within a 30s grace window as
expected.
- Stale entries are pruned on each reconciliation so the dict stays
bounded.
- `_reset_worker_state` clears the tracking alongside `self._tasks`.
This fixes both `handle_poll_tasks` and the more general case where
heartbeat-submitted tasks still need race protection on the following
tick. The bespoke `extra_keys` set in `handle_heartbeat` is gone since
`submit_task` now populates `_recent_submissions` for all entry points.
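
A minimal sketch of the grace-window idea described above, not the actual iris worker code. The class `WorkerSketch`, the injectable `clock` parameter, the `GRACE_SECONDS` constant name, the set-based `self._tasks`, and the integer type of `attempt_id` are all assumptions made for illustration; only the method names, `_recent_submissions`, and the 30s window come from the description.

```python
import time
from typing import Callable, Dict, Iterable, List, Set, Tuple

TaskKey = Tuple[str, int]  # (task_id, attempt_id); the types are assumed
GRACE_SECONDS = 30.0       # grace window from the description; name is hypothetical


class WorkerSketch:
    """Hypothetical stand-in for the worker, reduced to the race protection."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so the sketch can be tested without sleeping
        self._tasks: Set[TaskKey] = set()
        self._recent_submissions: Dict[TaskKey, float] = {}

    def submit_task(self, task_id: str, attempt_id: int) -> None:
        key = (task_id, attempt_id)
        self._tasks.add(key)
        # Record the submission time for every entry point (StartTasks or heartbeat).
        self._recent_submissions[key] = self._clock()

    def _reconcile_expected_tasks(self, expected: Iterable[TaskKey]) -> List[TaskKey]:
        now = self._clock()
        # Prune stale entries on each reconciliation so the dict stays bounded.
        self._recent_submissions = {
            key: submitted_at
            for key, submitted_at in self._recent_submissions.items()
            if now - submitted_at < GRACE_SECONDS
        }
        # Keys still inside the grace window count as expected.
        protected = set(expected) | set(self._recent_submissions)
        unexpected = sorted(self._tasks - protected)
        for key in unexpected:
            self._tasks.discard(key)  # stand-in for killing the task
        return unexpected

    def _reset_worker_state(self) -> None:
        self._tasks.clear()
        self._recent_submissions.clear()
```

With a fake clock, a task submitted just before a poll that omits it survives reconciliation, while the same task is treated as unexpected once the window has elapsed and the controller still does not list it.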