Detect TPU bad-node stderr and promote to WORKER_FAILED #4798
When a container exits non-zero with a JAX/libtpu TPU-init error (`/dev/vfio` busy, `Couldn't open iommu group`, `Failed to initialize TPU system`, etc.), the worker now scans the stderr tail and promotes the attempt from TASK_STATE_FAILED to TASK_STATE_WORKER_FAILED. This bills the failure against the preemption retry budget and lets the controller reschedule on another worker, so a dirty TPU VM that lingers after a preemption teardown gets rotated out automatically. Fixes #4783. Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Claude finished @rjpower's task in 3m 55s. Code review complete.
Summary: No high-signal issues found. Checked for bugs and AGENTS.md compliance.
@claude get rid of the slop comments & tests. A function with 5 lines doesn't need a 10 line comment. You can keep the worker tests but get rid of the slop test_tpu_health tests and slop tpu health comments.
Trimmed slop comments & tests
Net change: 4 files, +9 / -113. Behavior coverage for TPU bad-node detection is now solely via the two worker-level integration tests ( Test status: |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 106b753a2f
🤖 Code review: No issues found. Checked for bugs and CLAUDE.md compliance.
Per review feedback: remove verbose module/function docstrings in tpu_health.py, shorten the bad-node detection comments in task_attempt.py, and delete the slop test_tpu_health.py. Behavior is still covered by the worker-level integration tests in test_worker.py. Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Fixes #4783.
On non-zero container exit, the worker now scans the stderr tail for known JAX / libtpu bad-node signatures (`/dev/vfio` busy, `Couldn't open iommu group`, `TPU initialization failed`, `Failed to initialize TPU system`, `No accelerator found`). On a hit, the attempt transitions to `TASK_STATE_WORKER_FAILED` instead of `TASK_STATE_FAILED`, so it's billed against the preemption retry budget and rescheduled elsewhere, matching the existing infra-failure path. No controller changes required.
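
For readers skimming the thread, here is a minimal sketch of the detect-and-promote flow described above. It is not the code in `tpu_health.py` or `task_attempt.py`. The signature strings come from this description, but the function names, the `TaskState` enum, the 200-line tail window, and the regex used for the `/dev/vfio` busy case are assumptions made for illustration.

```python
import re
from enum import Enum
from typing import Optional


class TaskState(Enum):
    # Hypothetical stand-in for the real task-state enum.
    FAILED = "TASK_STATE_FAILED"
    WORKER_FAILED = "TASK_STATE_WORKER_FAILED"


# Signatures listed in the PR description. The "/dev/vfio ... busy" regex is
# an assumed way to express the "/dev/vfio busy" case on a single line.
_BAD_NODE_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"/dev/vfio.*busy",
        r"Couldn't open iommu group",
        r"TPU initialization failed",
        r"Failed to initialize TPU system",
        r"No accelerator found",
    )
]

TAIL_LINES = 200  # assumed tail size; the PR does not state one


def find_bad_node_signature(stderr: str, tail_lines: int = TAIL_LINES) -> Optional[str]:
    """Return the first matching pattern in the stderr tail, or None."""
    for line in stderr.splitlines()[-tail_lines:]:
        for pattern in _BAD_NODE_PATTERNS:
            if pattern.search(line):
                return pattern.pattern
    return None


def classify_exit(exit_code: int, stderr: str) -> Optional[TaskState]:
    """Map a container exit to a failure state; None means the task succeeded."""
    if exit_code == 0:
        return None  # stderr is only scanned on non-zero exit
    if find_bad_node_signature(stderr) is not None:
        return TaskState.WORKER_FAILED  # billed to the preemption retry budget
    return TaskState.FAILED
```

Scanning only the tail keeps the check cheap on long logs. The trade-off is that a signature printed early and then buried under later output would be missed.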