Skip to content

iris: add TASK_STATE_MISSING; stop worker-reconciled KILLED from cascading to jobs#5046

Closed
rjpower wants to merge 2 commits intomainfrom
fix/missing-task-state-5041
Closed

iris: add TASK_STATE_MISSING; stop worker-reconciled KILLED from cascading to jobs#5046
rjpower wants to merge 2 commits intomainfrom
fix/missing-task-state-5041

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented Apr 22, 2026

A StartTasks to PollTasks race could cause the worker to kill a just-submitted task as unexpected and echo KILLED on the next poll, which the controller promoted to JOB_STATE_KILLED and cascaded across the whole pool. This adds TASK_STATE_MISSING as a worker-only wire signal, pops reconciliation-killed attempts from self._tasks so subsequent polls report MISSING, and maps MISSING to WORKER_FAILED in the controller so the task retries via its preemption budget.

Fixes #5041

… state

Fixes #5041. A StartTasks→PollTasks race could cause the worker to kill
a just-submitted task as "unexpected" and then echo KILLED back on the
next poll, which the controller promoted to JOB_STATE_KILLED and
cascaded to every sibling in the pool. Introduces TASK_STATE_MISSING
as a worker-only wire signal; worker removes reconciliation-killed
attempts from self._tasks so subsequent polls report MISSING via the
not-found path. Controller maps MISSING to WORKER_FAILED so the task
retries via its preemption budget instead of killing the parent job.
@rjpower rjpower added the agent-generated Created by automation/agent label Apr 22, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 22, 2026

Claude finished @rjpower's task in 12m 26s —— View job


Code review

No high-signal issues found. Checked for bugs, correctness, and AGENTS.md compliance.

What I verified

  • Call-site signature changes are consistent:
    • _reconcile_expected_tasks return type changed from list[tuple[str, int]] to list[TaskAttempt]; both call sites (handle_heartbeat, handle_poll_tasks) updated.
    • _missing_task_status now requires error; the sole call site passes "Task not found on worker".
  • del self._tasks[key] during iteration is safely wrapped with list(self._tasks.items()).
  • New proto value TASK_STATE_MISSING = 11 is non-colliding and reflected in job_pb2.py and job_pb2.pyi.
  • Controller-side translation (task_updates_from_proto) applies to both heartbeat and UpdateTaskStatus push paths (service.py:2642), so the MISSING→WORKER_FAILED mapping is uniform.
  • Thread name uses attempt.task_id (a JobName); JobName.__str__ returns the canonical wire format, so the f-string produces the same value as before.
  • The terminal-attempt guard in apply_task_updates (transitions.py:1823) drops late-arriving state updates on already-terminal attempts, which bounds the fix's exposure to push/heartbeat interleaving.
  • test_heartbeat_without_adoption_reports_missing correctly exercises the not-found → MISSING path.
    • Branch: fix/missing-task-state-5041

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c307c67747

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +863 to +864
attempts_to_kill.append(task)
del self._tasks[key]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Delay removing unexpected attempts until kill is initiated

Deleting the attempt from self._tasks inside reconciliation before kill execution can misclassify successful work as missing. In the StartTasks→PollTasks race, an attempt can become terminal between this deletion and the later _kill_attempt(...) call; _kill_attempt then no-ops on terminal states, but the worker has already dropped tracking for that attempt. On the next poll (when the controller finally includes that task in expected_tasks), the worker reports MISSING, which the controller maps to WORKER_FAILED and retries, causing duplicate execution and potentially incorrect outcomes for short-lived tasks.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@ravwojdyla ravwojdyla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The transient/special TASK_STATE_MISSING feels a bit hacky - but I also don't have a better idea. Nice to have this improved!

continue
new_state = entry.state
error = entry.error or None
if new_state == job_pb2.TASK_STATE_MISSING:
Copy link
Copy Markdown
Contributor

@ravwojdyla ravwojdyla Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should log here, no? To leave trace for future debugging.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, good call, I should do a pass and add a lot more logging to Iris in general.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should do a pass and add a lot more logging to Iris in general.

+1

@rjpower
Copy link
Copy Markdown
Collaborator Author

rjpower commented Apr 22, 2026

@claude pls add logging as discussed in the comments.

@rjpower
Copy link
Copy Markdown
Collaborator Author

rjpower commented Apr 22, 2026

I think MISSING will make more sense with a followup PR. That should move us to something like:

  • Remove the StartTask & StopTask
  • Poll takes { tasks-to-start, tasks-that-should-be-running, tasks-to-stop }
  • The worker does the same reconciliation we do now for heartbeats

If something isn't in the "running" set, then it really is missing, and we can log it as such. Maybe the previous RPC got dropped, or there's a worker bug, but either way we can flag the discrepancy. This will still be better than our old heartbeat loop because the task scheduling can happen at different cadence, and on demand.

Alternatively we can just move the Poll RPC onto the scheduler thread, removing the race condition. I'll file an issue to follow-up...

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 22, 2026

Claude finished @rjpower's task in 3m 38s —— View job


Added logging on both ends of the MISSING → WORKER_FAILED path

  • Read transitions.py context around line 207
  • Reviewed worker.py reconciliation logging
  • Added logger.info in task_updates_from_proto (transitions.py:210) on the MISSING → WORKER_FAILED translation, including task id, attempt id, and error
  • Added logger.warning in _reconcile_expected_tasks (worker.py:856) when the worker emits MISSING because an expected task is not present locally — so we have a trace on the emit side too
  • Ran ./infra/pre-commit.py --all-files --fixOK
  • Ran uv run pytest lib/iris/tests/cluster/worker/test_worker.py41 passed
  • Ran uv run pytest lib/iris/tests/cluster/controller/test_transitions.py121 passed
  • Committed and pushed as d789744d5

Only touched logging — no behavior changes. The reconciliation-kill log at worker.py:948-952 ("Killing task … no longer in expected_tasks") was already present, which together with the two new logs gives a clean trace: worker reconciles → worker emits MISSING → controller translates to WORKER_FAILED.
• Branch: fix/missing-task-state-5041

Adds a warning when the worker emits TASK_STATE_MISSING for an expected
task it can no longer find, and an info log when the controller
translates that MISSING into WORKER_FAILED. Leaves a trace on both ends
for future debugging of the StartTasks→PollTasks reconciliation race.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
@rjpower
Copy link
Copy Markdown
Collaborator Author

rjpower commented Apr 22, 2026

I'm going to unify these 2 PRs into a cleaner approach...

@rjpower rjpower closed this Apr 22, 2026
@rjpower rjpower deleted the fix/missing-task-state-5041 branch April 22, 2026 22:37
@rjpower
Copy link
Copy Markdown
Collaborator Author

rjpower commented Apr 22, 2026

🤖 Superseded by #5090, which removes the race itself rather than translating its downstream effects: PollTasks now runs inline in the scheduling loop, so a poll's expected_tasks snapshot is always taken after the same iteration's StartTasks commits. The worker-reported MISSING translation in this PR is no longer needed for #5041, though the TASK_STATE_MISSING enum + translation could still be cherry-picked separately as a defensive signal for genuine worker state loss across restarts. Closing for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[zephyr] _check_worker_group aborts on spurious is_done() during pool scale-up

2 participants