[iris] Reap TPU hosts that keep failing launches; fire on_stop on natural return by rjpower · Pull Request #5038 · marin-community/marin

rjpower · 2026-04-22T03:43:44Z

Three linked fixes for a TPU co-schedule loop. ASSIGNED->WORKER_FAILED now bumps the worker health tracker so a host that repeatedly fails to bring up a task (e.g. iommu/vfio group already held) gets reaped instead of looping forever. Excludes reservation-holder tasks from PollTasksRequest.expected_tasks, since holders are virtual and polling them produced bogus WORKER_FAILEDs that drained preemption budget. ManagedThread now fires on_stop on the natural-return path so docker kill+rm actually runs, releasing the TPU vfio group for the next task. Adds regression tests for each.

claude · 2026-04-22T03:44:08Z

Claude finished @rjpower's task in 1m 20s —— View job

Code review

No issues found. Checked for bugs and CLAUDE.md/AGENTS.md compliance.

What I verified

managed_thread.py: setting self._stop_event in the finally block correctly wakes the watcher on natural-return and exception paths. on_stop still fires exactly once — if stop() was already called, the watcher has already consumed the wait and the re-set is a no-op; the watcher.join(timeout=5.0) just waits for the single in-flight on_stop to finish. Bumping the timeout from 1.0s → 5.0s accommodates real docker kill/rm.
transitions.py ASSIGNED→WORKER_FAILED: the new build_failed bump mirrors the existing BUILDING→FAILED hook at line 1872, runs under the same DB transaction, and correctly guards on worker_id is not None. The preemption counter is still not bumped (since ASSIGNED isn't in EXECUTING_TASK_STATES), so the stated "retries without consuming preemption budget" property holds — verified by the new test at test_transitions.py:2228.
transitions.py poll filter: the added JOIN jobs ON j.job_id = t.job_id + j.is_reservation_holder = 0 is safe — tasks.job_id has a FK to jobs(job_id) and INNER JOIN can't drop real-task rows. The fix correctly matches the ping/poll feeding PollTasksRequest.expected_tasks.
Tests: three regressions added, one per fix. test_on_stop_runs_only_once exercises the no-double-fire invariant via concurrent stop+natural-return.
· branch: rjpower/20260421-bump-worker-failures-on-tpu-assignment-bug

rjpower added 2 commits April 21, 2026 20:13

Properly invoke the reaper when a task exits uncleanly.

c3e7f0f

Add tests.

0eb5aab

rjpower added the agent-generated Created by automation/agent label Apr 22, 2026

rjpower requested a review from yonromai April 22, 2026 03:45

rjpower enabled auto-merge (squash) April 22, 2026 03:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[iris] Reap TPU hosts that keep failing launches; fire on_stop on natural return#5038

[iris] Reap TPU hosts that keep failing launches; fire on_stop on natural return#5038
rjpower wants to merge 2 commits intomainfrom
rjpower/20260421-bump-worker-failures-on-tpu-assignment-bug

rjpower commented Apr 22, 2026

Uh oh!

claude Bot commented Apr 22, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rjpower commented Apr 22, 2026

Uh oh!

claude Bot commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented Apr 22, 2026 •

edited

Loading