iris: add TASK_STATE_MISSING; stop worker-reconciled KILLED from cascading to jobs #5046
… state Fixes #5041. A StartTasks→PollTasks race could cause the worker to kill a just-submitted task as "unexpected" and then echo KILLED back on the next poll, which the controller promoted to JOB_STATE_KILLED and cascaded to every sibling in the pool. Introduces TASK_STATE_MISSING as a worker-only wire signal; worker removes reconciliation-killed attempts from self._tasks so subsequent polls report MISSING via the not-found path. Controller maps MISSING to WORKER_FAILED so the task retries via its preemption budget instead of killing the parent job.
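The flow described above can be sketched as a toy model. This is only an illustration under stated assumptions: `TaskState`, `ToyWorker`, and `controller_translate` are hypothetical names invented here, not the Iris worker or controller code, and the real kill step is reduced to dropping the entry from `_tasks`.

```python
from enum import Enum, auto


class TaskState(Enum):
    # Hypothetical stand-in for the protobuf task states named in the PR.
    RUNNING = auto()
    KILLED = auto()
    WORKER_FAILED = auto()
    MISSING = auto()  # worker-only wire signal


class ToyWorker:
    """Toy worker: tracks attempts in _tasks and reconciles on every poll."""

    def __init__(self):
        self._tasks = {}

    def start(self, key):
        self._tasks[key] = TaskState.RUNNING

    def poll(self, expected):
        # Reconciliation: anything the controller does not expect is killed
        # and removed from tracking (the kill itself is elided in this toy).
        for key in list(self._tasks):
            if key not in expected:
                del self._tasks[key]
        # Not-found path: expected tasks the worker no longer tracks are
        # reported as MISSING rather than KILLED.
        return {key: self._tasks.get(key, TaskState.MISSING) for key in expected}


def controller_translate(reported):
    """Toy controller mapping: MISSING becomes WORKER_FAILED (retryable)."""
    if reported is TaskState.MISSING:
        return TaskState.WORKER_FAILED
    return reported
```

In this sketch, a poll whose expected set omits a just-started task drops it, and the following poll reports MISSING, which the controller side turns into WORKER_FAILED instead of a job-level KILLED.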
Claude finished @rjpower's task in 12m 26s. Code review: no high-signal issues found. Checked for bugs, correctness, and AGENTS.md compliance.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c307c67747
attempts_to_kill.append(task)
del self._tasks[key]
Delay removing unexpected attempts until kill is initiated
Deleting the attempt from self._tasks inside reconciliation before kill execution can misclassify successful work as missing. In the StartTasks→PollTasks race, an attempt can become terminal between this deletion and the later _kill_attempt(...) call; _kill_attempt then no-ops on terminal states, but the worker has already dropped tracking for that attempt. On the next poll (when the controller finally includes that task in expected_tasks), the worker reports MISSING, which the controller maps to WORKER_FAILED and retries, causing duplicate execution and potentially incorrect outcomes for short-lived tasks.
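The ordering hazard in this comment can be illustrated with a small simulation. It is a sketch under assumptions, not the worker's code: `Attempt`, `kill_attempt`, the two `reconcile_*` functions, and the `finish_in_between` hook (which forces an attempt to become terminal just before the kill) are all hypothetical, and states are plain strings.

```python
TERMINAL = {"SUCCEEDED", "FAILED", "KILLED"}


class Attempt:
    def __init__(self, key):
        self.key = key
        self.state = "RUNNING"


def kill_attempt(attempt):
    """Return True if a kill was initiated; no-op on terminal attempts."""
    if attempt.state in TERMINAL:
        return False
    attempt.state = "KILLED"
    return True


def reconcile_delete_first(tasks, expected, finish_in_between=()):
    # Ordering flagged in the review: drop tracking first, kill afterwards.
    to_kill = [tasks.pop(key) for key in list(tasks) if key not in expected]
    for attempt in to_kill:
        if attempt.key in finish_in_between:
            attempt.state = "SUCCEEDED"  # simulated race: finishes before the kill
        kill_attempt(attempt)  # no-op here, but tracking is already gone


def reconcile_delete_after_kill(tasks, expected, finish_in_between=()):
    # Suggested ordering: drop tracking only once a kill was actually initiated.
    for key in [k for k in tasks if k not in expected]:
        attempt = tasks[key]
        if key in finish_in_between:
            attempt.state = "SUCCEEDED"
        if kill_attempt(attempt):
            del tasks[key]


def report(tasks, expected):
    """Next poll: untracked expected tasks are reported as MISSING."""
    return {k: (tasks[k].state if k in tasks else "MISSING") for k in expected}
```

With the delete-first ordering, an attempt that succeeds inside the window is later reported as MISSING; with the delete-after-kill ordering, its SUCCEEDED state survives to the next poll.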
ravwojdyla left a comment
The transient/special TASK_STATE_MISSING feels a bit hacky - but I also don't have a better idea. Nice to have this improved!
    continue
new_state = entry.state
error = entry.error or None
if new_state == job_pb2.TASK_STATE_MISSING:
We should log here, no? To leave a trace for future debugging.
yes, good call, I should do a pass and add a lot more logging to Iris in general.
I should do a pass and add a lot more logging to Iris in general.
+1
@claude pls add logging as discussed in the comments.
I think MISSING will make more sense with a followup PR. That should move us to something like:
If something isn't in the "running" set, then it really is missing, and we can log it as such. Maybe the previous RPC got dropped, or there's a worker bug, but either way we can flag the discrepancy. This will still be better than our old heartbeat loop because task scheduling can happen at a different cadence, and on demand. Alternatively, we can just move the Poll RPC onto the scheduler thread, removing the race condition. I'll file an issue to follow up...
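The alternative mentioned here, polling from the scheduler thread, comes down to when the expected-task snapshot is taken relative to the StartTasks commit. A minimal sketch, assuming a hypothetical `ToyController` with `pending` and `committed` collections (none of these names come from Iris):

```python
class ToyController:
    """Toy controller state: tasks are pending until StartTasks commits them."""

    def __init__(self):
        self.pending = []
        self.committed = set()

    def start_tasks(self):
        # Commit everything scheduled in this iteration.
        self.committed.update(self.pending)
        self.pending.clear()

    def expected_tasks(self):
        # Snapshot of what the worker is expected to be running.
        return set(self.committed)


def racy_iteration(controller):
    # A separate poller may snapshot before this iteration's commit lands,
    # so a just-submitted task is absent from the expected set.
    expected = controller.expected_tasks()
    controller.start_tasks()
    return expected


def inline_iteration(controller):
    # Polling inline: the snapshot is always taken after the commit.
    controller.start_tasks()
    return controller.expected_tasks()
```

In the racy ordering, the worker would see the new task as unexpected on that poll; in the inline ordering, the snapshot already contains it.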
Claude finished @rjpower's task in 3m 38s. Added logging on both ends of the MISSING → WORKER_FAILED path.
Only touched logging; no behavior changes. The reconciliation-kill log at worker.py:948-952 ("Killing task … no longer in expected_tasks") was already present, which together with the two new logs gives a clean trace: worker reconciles → worker emits MISSING → controller translates to WORKER_FAILED.
Adds a warning when the worker emits TASK_STATE_MISSING for an expected task it can no longer find, and an info log when the controller translates that MISSING into WORKER_FAILED. Leaves a trace on both ends for future debugging of the StartTasks→PollTasks reconciliation race.
Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
I'm going to unify these 2 PRs into a cleaner approach...
🤖 Superseded by #5090, which removes the race itself rather than translating its downstream effects: PollTasks now runs inline in the scheduling loop, so a poll's expected_tasks snapshot is always taken after the same iteration's StartTasks commits. The worker-reported MISSING translation in this PR is no longer needed for #5041, though the |
A StartTasks to PollTasks race could cause the worker to kill a just-submitted task as unexpected and echo KILLED on the next poll, which the controller promoted to JOB_STATE_KILLED and cascaded across the whole pool. This adds TASK_STATE_MISSING as a worker-only wire signal, pops reconciliation-killed attempts from self._tasks so subsequent polls report MISSING, and maps MISSING to WORKER_FAILED in the controller so the task retries via its preemption budget.
Fixes #5041