Skip to content

Detect TPU bad-node stderr and promote to WORKER_FAILED#4798

Merged
dlwh merged 2 commits intomainfrom
claude/issue-4783-20260415-2001
Apr 15, 2026
Merged

Detect TPU bad-node stderr and promote to WORKER_FAILED#4798
dlwh merged 2 commits intomainfrom
claude/issue-4783-20260415-2001

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented Apr 15, 2026

Fixes #4783.

On non-zero container exit, the worker now scans the stderr tail for known JAX / libtpu bad-node signatures (/dev/vfio busy, Couldn't open iommu group, TPU initialization failed, Failed to initialize TPU system, No accelerator found). On a hit, the attempt transitions to TASK_STATE_WORKER_FAILED instead of TASK_STATE_FAILED, so it's billed against the preemption retry budget and rescheduled elsewhere — matching the existing infra-failure path. No controller changes required.

When a container exits non-zero with a JAX/libtpu TPU-init error
(`/dev/vfio` busy, `Couldn't open iommu group`, `Failed to initialize TPU
system`, etc.), the worker now scans the stderr tail and promotes the
attempt from TASK_STATE_FAILED to TASK_STATE_WORKER_FAILED. This bills
the failure against the preemption retry budget and lets the controller
reschedule on another worker, so a dirty TPU VM that lingers after a
preemption teardown gets rotated out automatically.

Fixes #4783.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 15, 2026

Claude finished @rjpower's task in 3m 55s —— View job


Code review complete

  • Check PR eligibility for review
  • Identify relevant CLAUDE.md/AGENTS.md files (root AGENTS.md, lib/iris/AGENTS.md)
  • Summarize PR changes
  • Run parallel review agents (CLAUDE.md compliance + bug hunt)
  • Validate flagged issues
  • Post review findings

Summary

No high-signal issues found. Checked for bugs and AGENTS.md compliance.

  • detect_tpu_init_failure is a simple substring scan over a bounded tail — empty-list slicing, None/empty-line guards, and pattern iteration all behave correctly.
  • State transition in task_attempt.py fires exactly one transition_to under both branches, followed by the single break — no double-transition risk.
  • The two new test_worker.py tests exercise both the WORKER_FAILED promotion path and the no-false-positive FAILED path.
  • AGENTS.md compliance: top-level imports, no TYPE_CHECKING, top-level constant _TPU_STDERR_TAIL_LINES, no compatibility shims.

Posted summary comment: #4798 (comment).
· Branch

@rjpower
Copy link
Copy Markdown
Collaborator Author

rjpower commented Apr 15, 2026

@claude get rid of the slop comments & tests. A function with 5 lines doesn't need a 10 line comment.

you can keep the worker tests but get rid of the slop test_tpu_health tests and slop tpu health comments

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 15, 2026

Trimmed slop comments & tests

  • Slim lib/iris/src/iris/cluster/worker/tpu_health.py — one-line module docstring, one-line function docstring, no per-pattern commentary
  • Trim verbose comments around TPU bad-node detection in lib/iris/src/iris/cluster/worker/task_attempt.py
  • Delete lib/iris/tests/cluster/worker/test_tpu_health.py
  • Shorten docstrings on the two new tests in lib/iris/tests/cluster/worker/test_worker.py
  • ./infra/pre-commit.py --all-files --fix → OK (ruff, black, pyrefly, etc.)
  • Commit + push (58d4585f2)

Net change: 4 files, +9 / -113. Behavior coverage for TPU bad-node detection is now solely via the two worker-level integration tests (test_tpu_bad_node_stderr_promotes_to_worker_failed, test_non_tpu_stderr_still_maps_to_failed) that exercise the real task_attempt._monitor_loop path end-to-end.

Test status: uv run pytest lib/iris/tests/cluster/worker/test_worker.py wouldn't return in this sandbox — two concurrent invocations ended up deadlocked on import/fixture setup and I lack permission to kill them here. Pre-commit (ruff + pyrefly) passes cleanly. The TPU-specific test logic was not touched by this commit — only docstrings around existing assertions.

View job run · branch

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 106b753a2f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread lib/iris/src/iris/cluster/worker/task_attempt.py
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 15, 2026

🤖 ## Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Per review feedback: remove verbose module/function docstrings in
tpu_health.py, shorten the bad-node detection comments in
task_attempt.py, and delete the slop test_tpu_health.py. Behavior is
still covered by the worker-level integration tests in test_worker.py.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
@dlwh dlwh merged commit ec3d73d into main Apr 15, 2026
41 checks passed
@dlwh dlwh deleted the claude/issue-4783-20260415-2001 branch April 15, 2026 22:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Iris] Quarantine TPU VMs that fail with /dev/vfio busy after preemption

2 participants