Skip to content

[iris] Controller-side grace window for StartTasks/PollTasks race#5054

Closed
rjpower wants to merge 1 commit intomainfrom
fix-controller-race
Closed

[iris] Controller-side grace window for StartTasks/PollTasks race#5054
rjpower wants to merge 1 commit intomainfrom
fix-controller-race

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented Apr 22, 2026

When a PollTasks RPC's DB snapshot is taken before a concurrent StartTasks' assignment commit, the poll omits the new task from expected_tasks and the worker kills it as unexpected (or reports WORKER_FAILED for a task it has not yet received). Track recent StartTasks dispatches in-memory for 30s, merge them into the per-worker expected set, and drop any updates for tasks still inside the grace window so the next poll settles them cleanly. Complements #5043.

@rjpower rjpower added the agent-generated Created by automation/agent label Apr 22, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 22, 2026

Claude finished @rjpower's task in 4m 44s —— View job


Code review

  • Check PR state and prior reviews
  • Identify relevant CLAUDE.md/AGENTS.md files
  • Read diff and surrounding controller context
  • Analyze for bugs and CLAUDE.md compliance
  • Post final review

No issues found. Checked for bugs and CLAUDE.md compliance.

Notes from the analysis (informational, not blocking):

  • The time.monotonic() use in controller.py is consistent with existing usage at controller.py:1431. The Iris AGENTS.md note about preferring rigging.timing is scoped to Timestamp/Duration/Deadline/Timer/ExponentialBackoff, not monotonic-clock grace windows.
  • Race ordering in _poll_all_workers looks correct: DB snapshot first, then _prune_and_snapshot_recent_dispatches(), then the merge into the per-worker expected set, then the poll_workers RPC. The suppression of poll-side updates is keyed off the same recent snapshot, so any task whose dispatch was visible at snapshot time is also exempt from the kill-as-unexpected report.
  • Stale entries left behind on StartTasks-failure or ack-rejection paths are harmless: they age out after _RECENT_DISPATCH_GRACE_SECONDS and any subsequent attempt uses a different attempt_id.
    | Branch: fix-controller-race

@rjpower rjpower requested a review from ravwojdyla April 22, 2026 17:40
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 22, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Copy link
Copy Markdown
Contributor

@ravwojdyla ravwojdyla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

# PollTasks race: a poll's DB snapshot taken before the assignment commit
# would otherwise omit the task, and the worker would kill it as
# unexpected. 30s comfortably exceeds any normal StartTasks RPC latency.
_RECENT_DISPATCH_GRACE_SECONDS = 30.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm scared of these magic numbers - but it makes sense!

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I want to get rid of this soon, we'll switch to a more sensible model without the race

@rjpower
Copy link
Copy Markdown
Collaborator Author

rjpower commented Apr 22, 2026

🤖 Superseded by #5090, which fixes the same race at the source by moving PollTasks inline in the scheduling loop (single thread now owns both the assignment commit and the running-tasks snapshot). The _recent_dispatches bookkeeping in this PR becomes unnecessary — closing.

@rjpower rjpower closed this Apr 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants