Skip to content

[iris] Reap TPU hosts that keep failing launches; fire on_stop on natural return#5038

Open
rjpower wants to merge 2 commits intomainfrom
rjpower/20260421-bump-worker-failures-on-tpu-assignment-bug
Open

[iris] Reap TPU hosts that keep failing launches; fire on_stop on natural return#5038
rjpower wants to merge 2 commits intomainfrom
rjpower/20260421-bump-worker-failures-on-tpu-assignment-bug

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented Apr 22, 2026

Three linked fixes for a TPU co-schedule loop. ASSIGNED->WORKER_FAILED now bumps the worker health tracker so a host that repeatedly fails to bring up a task (e.g. iommu/vfio group already held) gets reaped instead of looping forever. Excludes reservation-holder tasks from PollTasksRequest.expected_tasks, since holders are virtual and polling them produced bogus WORKER_FAILEDs that drained preemption budget. ManagedThread now fires on_stop on the natural-return path so docker kill+rm actually runs, releasing the TPU vfio group for the next task. Adds regression tests for each.

@rjpower rjpower added the agent-generated Created by automation/agent label Apr 22, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 22, 2026

Claude finished @rjpower's task in 1m 20s —— View job


Code review

No issues found. Checked for bugs and CLAUDE.md/AGENTS.md compliance.

What I verified

  • managed_thread.py: setting self._stop_event in the finally block correctly wakes the watcher on natural-return and exception paths. on_stop still fires exactly once — if stop() was already called, the watcher has already consumed the wait and the re-set is a no-op; the watcher.join(timeout=5.0) just waits for the single in-flight on_stop to finish. Bumping the timeout from 1.0s → 5.0s accommodates real docker kill/rm.
  • transitions.py ASSIGNED→WORKER_FAILED: the new build_failed bump mirrors the existing BUILDING→FAILED hook at line 1872, runs under the same DB transaction, and correctly guards on worker_id is not None. The preemption counter is still not bumped (since ASSIGNED isn't in EXECUTING_TASK_STATES), so the stated "retries without consuming preemption budget" property holds — verified by the new test at test_transitions.py:2228.
  • transitions.py poll filter: the added JOIN jobs ON j.job_id = t.job_id + j.is_reservation_holder = 0 is safe — tasks.job_id has a FK to jobs(job_id) and INNER JOIN can't drop real-task rows. The fix correctly matches the ping/poll feeding PollTasksRequest.expected_tasks.
  • Tests: three regressions added, one per fix. test_on_stop_runs_only_once exercises the no-double-fire invariant via concurrent stop+natural-return.
    · branch: rjpower/20260421-bump-worker-failures-on-tpu-assignment-bug

@rjpower rjpower requested a review from yonromai April 22, 2026 03:45
@rjpower rjpower enabled auto-merge (squash) April 22, 2026 03:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant