iris: add TASK_STATE_MISSING; stop worker-reconciled KILLED from cascading to jobs by rjpower · Pull Request #5046 · marin-community/marin

rjpower · 2026-04-22T15:51:34Z

A StartTasks to PollTasks race could cause the worker to kill a just-submitted task as unexpected and echo KILLED on the next poll, which the controller promoted to JOB_STATE_KILLED and cascaded across the whole pool. This adds TASK_STATE_MISSING as a worker-only wire signal, pops reconciliation-killed attempts from self._tasks so subsequent polls report MISSING, and maps MISSING to WORKER_FAILED in the controller so the task retries via its preemption budget.

Fixes #5041

… state Fixes #5041. A StartTasks→PollTasks race could cause the worker to kill a just-submitted task as "unexpected" and then echo KILLED back on the next poll, which the controller promoted to JOB_STATE_KILLED and cascaded to every sibling in the pool. Introduces TASK_STATE_MISSING as a worker-only wire signal; worker removes reconciliation-killed attempts from self._tasks so subsequent polls report MISSING via the not-found path. Controller maps MISSING to WORKER_FAILED so the task retries via its preemption budget instead of killing the parent job.
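The controller-side half of the fix, translating the wire-only MISSING state into WORKER_FAILED, can be sketched as a small pure function. This is an illustration under stated assumptions, not the PR's code: `TaskState`, `TaskUpdate`, and `translate_update` are hypothetical stand-ins for the generated `job_pb2` enum and the logic the PR places in `task_updates_from_proto`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TaskState(Enum):
    # Hypothetical stand-in for the job_pb2 task-state enum.
    RUNNING = "RUNNING"
    KILLED = "KILLED"
    WORKER_FAILED = "WORKER_FAILED"
    MISSING = "MISSING"  # worker-only wire signal; never stored as a task state


@dataclass
class TaskUpdate:
    task_id: str
    attempt_id: int
    state: TaskState
    error: Optional[str] = None


def translate_update(update: TaskUpdate) -> TaskUpdate:
    """Map MISSING onto WORKER_FAILED; pass every other state through unchanged."""
    if update.state is TaskState.MISSING:
        return TaskUpdate(
            task_id=update.task_id,
            attempt_id=update.attempt_id,
            state=TaskState.WORKER_FAILED,
            error=update.error or "Task not found on worker",
        )
    return update
```

Because the translated state is a worker failure rather than KILLED, the retry decision is left to whatever budget the controller already applies to worker failures, instead of terminating the parent job.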

claude · 2026-04-22T15:51:50Z

Claude finished @rjpower's task in 12m 26s

Code review

No high-signal issues found. Checked for bugs, correctness, and AGENTS.md compliance.

What I verified

- Call-site signature changes are consistent:
  - _reconcile_expected_tasks return type changed from list[tuple[str, int]] to list[TaskAttempt]; both call sites (handle_heartbeat, handle_poll_tasks) updated.
  - _missing_task_status now requires error; the sole call site passes "Task not found on worker".
- del self._tasks[key] during iteration is safely wrapped with list(self._tasks.items()).
- New proto value TASK_STATE_MISSING = 11 is non-colliding and reflected in job_pb2.py and job_pb2.pyi.
- Controller-side translation (task_updates_from_proto) applies to both heartbeat and UpdateTaskStatus push paths (service.py:2642), so the MISSING→WORKER_FAILED mapping is uniform.
- Thread name uses attempt.task_id (a JobName); JobName.__str__ returns the canonical wire format, so the f-string produces the same value as before.
- The terminal-attempt guard in apply_task_updates (transitions.py:1823) drops late-arriving state updates on already-terminal attempts, which bounds the fix's exposure to push/heartbeat interleaving.
- test_heartbeat_without_adoption_reports_missing correctly exercises the not-found → MISSING path.
• Branch: fix/missing-task-state-5041
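The point about deleting from `self._tasks` while iterating can be illustrated in isolation. The sketch below uses hypothetical helper names and a plain dict in place of the worker's task table. It contrasts iterating over a `list(...)` snapshot with iterating over the live dict view, which raises `RuntimeError` in CPython once an entry has been deleted.

```python
def pop_unexpected(tasks: dict, expected: set) -> list:
    """Remove and return entries whose key is not in `expected`."""
    removed = []
    # list(...) copies the items up front, so deleting from `tasks` in the loop is safe.
    for key, attempt in list(tasks.items()):
        if key not in expected:
            removed.append(attempt)
            del tasks[key]
    return removed


def pop_unexpected_unsafe(tasks: dict, expected: set) -> list:
    """Same logic without the snapshot; fails once the dict changes size mid-iteration."""
    removed = []
    for key, attempt in tasks.items():
        if key not in expected:
            removed.append(attempt)
            del tasks[key]  # the next iteration step raises RuntimeError
    return removed
```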

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c307c67747


chatgpt-codex-connector · 2026-04-22T15:57:33Z

+                attempts_to_kill.append(task)
+                del self._tasks[key]


Delay removing unexpected attempts until kill is initiated

Deleting the attempt from self._tasks inside reconciliation before kill execution can misclassify successful work as missing. In the StartTasks→PollTasks race, an attempt can become terminal between this deletion and the later _kill_attempt(...) call; _kill_attempt then no-ops on terminal states, but the worker has already dropped tracking for that attempt. On the next poll (when the controller finally includes that task in expected_tasks), the worker reports MISSING, which the controller maps to WORKER_FAILED and retries, causing duplicate execution and potentially incorrect outcomes for short-lived tasks.
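One way to read this suggestion is to move the removal into the kill step itself, so that an attempt is only untracked when a kill is actually initiated. The sketch below is a simplified model under assumptions: `Worker`, `Attempt`, the string states, and the single lock are hypothetical and are not taken from the Iris worker.

```python
import threading
from dataclasses import dataclass

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "KILLED"}  # hypothetical state names


@dataclass
class Attempt:
    state: str


class Worker:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._tasks: dict = {}

    def _kill_attempt(self, key) -> bool:
        """Kill and untrack an attempt; keep tracking it if it is already terminal."""
        with self._lock:
            attempt = self._tasks.get(key)
            if attempt is None or attempt.state in TERMINAL_STATES:
                return False  # nothing killed, so nothing is dropped
            attempt.state = "KILLED"
            del self._tasks[key]  # removal happens only once the kill is initiated
            return True

    def reconcile(self, expected: set) -> list:
        """Kill attempts the controller does not expect; return the keys actually killed."""
        with self._lock:
            unexpected = [key for key in self._tasks if key not in expected]
        return [key for key in unexpected if self._kill_attempt(key)]
```

With this ordering, an attempt that finishes just before the kill stays in the table, so a later poll can report its real outcome rather than MISSING.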


ravwojdyla

The transient/special TASK_STATE_MISSING feels a bit hacky - but I also don't have a better idea. Nice to have this improved!

ravwojdyla · 2026-04-22T18:28:56Z

            continue
+        new_state = entry.state
+        error = entry.error or None
+        if new_state == job_pb2.TASK_STATE_MISSING:


We should log here, no? To leave trace for future debugging.

yes, good call, I should do a pass and add a lot more logging to Iris in general.

I should do a pass and add a lot more logging to Iris in general.

+1

rjpower · 2026-04-22T19:16:31Z

@claude pls add logging as discussed in the comments.

rjpower · 2026-04-22T19:16:51Z

I think MISSING will make more sense with a followup PR. That should move us to something like:

Remove the StartTask & StopTask
Poll takes { tasks-to-start, tasks-that-should-be-running, ~~tasks-to-stop~~ }
The worker does the same reconciliation we do now for heartbeats

If something isn't in the "running" set, then it really is missing, and we can log it as such. Maybe the previous RPC got dropped, or there's a worker bug, but either way we can flag the discrepancy. This will still be better than our old heartbeat loop because the task scheduling can happen at different cadence, and on demand.

Alternatively we can just move the Poll RPC onto the scheduler thread, removing the race condition. I'll file an issue to follow-up...
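The first option above, a single Poll carrying the tasks to start and the tasks that should be running, amounts to a set reconciliation on the worker. A rough sketch follows, assuming task identity can be treated as a hashable id; `PollPlan` and `plan_poll` are hypothetical names, and deriving kills from absence (rather than from an explicit stop list) is an assumption based on the struck-through tasks-to-stop item.

```python
from dataclasses import dataclass


@dataclass
class PollPlan:
    to_start: list  # requested and not yet present locally
    to_kill: list   # present locally but no longer wanted
    missing: list   # should be running, but the worker has no record of it


def plan_poll(local: set, tasks_to_start: set, should_be_running: set) -> PollPlan:
    """Compare the controller's view with the worker's local task table."""
    wanted = tasks_to_start | should_be_running
    return PollPlan(
        to_start=sorted(tasks_to_start - local),
        to_kill=sorted(local - wanted),
        missing=sorted(should_be_running - local - tasks_to_start),
    )
```

In this model a task is reported as missing only when the controller believes it is already running, so a just-submitted task never falls into the kill or missing sets.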

claude · 2026-04-22T19:17:06Z

Claude finished @rjpower's task in 3m 38s

Added logging on both ends of the MISSING → WORKER_FAILED path

- Read transitions.py context around line 207
- Reviewed worker.py reconciliation logging
- Added logger.info in task_updates_from_proto (transitions.py:210) on the MISSING → WORKER_FAILED translation, including task id, attempt id, and error
- Added logger.warning in _reconcile_expected_tasks (worker.py:856) when the worker emits MISSING because an expected task is not present locally, so we have a trace on the emit side too
- Ran ./infra/pre-commit.py --all-files --fix → OK
- Ran uv run pytest lib/iris/tests/cluster/worker/test_worker.py → 41 passed
- Ran uv run pytest lib/iris/tests/cluster/controller/test_transitions.py → 121 passed
- Committed and pushed as d789744d5

Only touched logging — no behavior changes. The reconciliation-kill log at worker.py:948-952 ("Killing task … no longer in expected_tasks") was already present, which together with the two new logs gives a clean trace: worker reconciles → worker emits MISSING → controller translates to WORKER_FAILED.
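The emit-side warning might look roughly like the following. This is a sketch with hypothetical names (`status_for_expected`, the logger name, string states), not the code added in the commit; only the "Task not found on worker" error text comes from the review above.

```python
import logging

logger = logging.getLogger("iris_sketch.worker")  # hypothetical logger name


def status_for_expected(task_id: str, attempt_id: int, local_tasks: dict) -> tuple:
    """Return (state, error) for a task the controller expects on this worker."""
    state = local_tasks.get((task_id, attempt_id))
    if state is None:
        # Leave a trace on the worker before the controller ever sees MISSING.
        logger.warning(
            "Expected task %s attempt %d not found locally; reporting MISSING",
            task_id,
            attempt_id,
        )
        return "MISSING", "Task not found on worker"
    return state, None
```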
• Branch: fix/missing-task-state-5041

Adds a warning when the worker emits TASK_STATE_MISSING for an expected task it can no longer find, and an info log when the controller translates that MISSING into WORKER_FAILED. Leaves a trace on both ends for future debugging of the StartTasks→PollTasks reconciliation race.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>

rjpower · 2026-04-22T22:37:25Z

I'm going to unify these 2 PRs into a cleaner approach...

rjpower · 2026-04-22T22:50:50Z

🤖 Superseded by #5090, which removes the race itself rather than translating its downstream effects: PollTasks now runs inline in the scheduling loop, so a poll's expected_tasks snapshot is always taken after the same iteration's StartTasks commits. The worker-reported MISSING translation in this PR is no longer needed for #5041, though the TASK_STATE_MISSING enum + translation could still be cherry-picked separately as a defensive signal for genuine worker state loss across restarts. Closing for now.
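The ordering argument (a poll's expected_tasks snapshot is taken after the same iteration's StartTasks commits) can be modeled with a toy loop. Everything in this sketch is hypothetical, including `FakeWorker` and `scheduling_iteration`; it only illustrates why running the poll inline removes the window in which a just-started task looks unexpected, and says nothing about how #5090 is actually implemented.

```python
class FakeWorker:
    """Hypothetical worker that kills anything missing from the expected set."""

    def __init__(self) -> None:
        self.running: set = set()
        self.killed: list = []

    def start_tasks(self, task_ids) -> None:
        self.running.update(task_ids)

    def poll_tasks(self, expected: set) -> list:
        unexpected = self.running - expected
        self.killed.extend(sorted(unexpected))
        self.running -= unexpected
        return sorted(self.running)


def scheduling_iteration(assigned: set, pending: list, worker: FakeWorker) -> list:
    """One loop iteration: commit the starts first, then poll with a fresh snapshot."""
    worker.start_tasks(pending)
    assigned.update(pending)
    pending.clear()
    expected = set(assigned)  # snapshot taken after this iteration's starts commit
    return worker.poll_tasks(expected)
```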

rjpower added the agent-generated label (Created by automation/agent) Apr 22, 2026

chatgpt-codex-connector Bot reviewed Apr 22, 2026


ravwojdyla approved these changes Apr 22, 2026


rjpower closed this Apr 22, 2026

rjpower deleted the fix/missing-task-state-5041 branch April 22, 2026 22:37

rjpower wants to merge 2 commits into main from fix/missing-task-state-5041
