Detect TPU bad-node stderr and promote to WORKER_FAILED #4798
Merged
+137
−11
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,51 @@ | ||
| # Copyright The Marin Authors | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| """TPU-level bad-node failure detection. | ||
|
|
||
| When a task container exits with a non-zero status, the worker normally marks | ||
| the task as ``TASK_STATE_FAILED`` (user-code failure). Some failure signatures | ||
| are actually signs that the underlying TPU VM is dirty — typically after a | ||
| preemption / teardown where ``/dev/vfio`` is still claimed by a previous | ||
| process. We need to promote those to ``TASK_STATE_WORKER_FAILED`` so the | ||
| controller treats the attempt as an infra preemption and retries it elsewhere. | ||
|
|
||
| Patterns are hard-coded on purpose: these signatures are stable strings | ||
| emitted by JAX / libtpu during TPU init, and OPS.md already documents them as | ||
| the manual trigger list for bad-node triage. | ||
| """ | ||
|
|
||
| from collections.abc import Iterable | ||
|
|
||
| # Substrings matched against container stderr tail. A single hit promotes the | ||
| # attempt from FAILED to WORKER_FAILED. | ||
| # | ||
| # Keep this list in sync with ``lib/iris/OPS.md`` bad-node triggers. | ||
| TPU_INIT_FAILURE_PATTERNS: tuple[str, ...] = ( | ||
| # /dev/vfio/<n> busy after a dirty preemption — the canonical case from #4783. | ||
| "Couldn't open iommu group", | ||
| "open(/dev/vfio", | ||
| # libtpu / JAX surface when the device is held by another process. | ||
| "Failed to initialize TPU system", | ||
| "TPU initialization failed", | ||
| # Host has no visible accelerator at all (VM came up without TPU attached). | ||
| "No accelerator found", | ||
| ) | ||
|
|
||
|
|
||
| def detect_tpu_init_failure(stderr_lines: Iterable[str]) -> str | None: | ||
| """Return the first matching bad-node pattern found in ``stderr_lines``. | ||
|
|
||
| ``stderr_lines`` is any iterable of stderr strings (typically the tail of | ||
| the container log). Returns ``None`` if no pattern matches. | ||
|
|
||
| Callers should pass a bounded tail (not the full log) — these signatures | ||
| are emitted close to process exit, and scanning the full log wastes work. | ||
| """ | ||
| for line in stderr_lines: | ||
| if not line: | ||
| continue | ||
| for pattern in TPU_INIT_FAILURE_PATTERNS: | ||
| if pattern in line: | ||
| return pattern | ||
| return None |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,67 @@ | ||
| # Copyright The Marin Authors | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| """Tests for TPU bad-node stderr pattern detection.""" | ||
|
|
||
| import pytest | ||
|
|
||
| from iris.cluster.worker.tpu_health import ( | ||
| TPU_INIT_FAILURE_PATTERNS, | ||
| detect_tpu_init_failure, | ||
| ) | ||
|
|
||
|
|
||
| @pytest.mark.parametrize( | ||
| "line", | ||
| [ | ||
| # Exact failure from #4783: | ||
| "jax.errors.JaxRuntimeError: UNKNOWN: TPU initialization failed: " | ||
| "open(/dev/vfio/0): Device or resource busy: Device or resource busy; " | ||
| "Couldn't open iommu group /dev/vfio/0", | ||
| # libtpu's older init-failure wording | ||
| "Failed to initialize TPU system: some backend error", | ||
| # JAX surface when the VM booted without a TPU attached | ||
| "RuntimeError: No accelerator found on this host", | ||
| # vfio path-only hit | ||
| "libtpu: open(/dev/vfio/0) returned -1", | ||
| ], | ||
| ) | ||
| def test_detects_known_bad_node_signatures(line: str) -> None: | ||
| assert detect_tpu_init_failure([line]) is not None | ||
|
|
||
|
|
||
| def test_detects_from_mixed_log_tail() -> None: | ||
| tail = [ | ||
| "normal startup log line", | ||
| "another info line", | ||
| "Couldn't open iommu group /dev/vfio/0", | ||
| "subsequent error traceback frame", | ||
| ] | ||
| pattern = detect_tpu_init_failure(tail) | ||
| assert pattern == "Couldn't open iommu group" | ||
|
|
||
|
|
||
| def test_returns_none_on_unrelated_stderr() -> None: | ||
| tail = [ | ||
| "Traceback (most recent call last):", | ||
| 'ValueError: bad user config: expected "foo"', | ||
| "", | ||
| ] | ||
| assert detect_tpu_init_failure(tail) is None | ||
|
|
||
|
|
||
| def test_empty_input() -> None: | ||
| assert detect_tpu_init_failure([]) is None | ||
|
|
||
|
|
||
| def test_ignores_empty_lines() -> None: | ||
| # Empty strings should not be mistaken for matches and should not crash. | ||
| assert detect_tpu_init_failure(["", None or ""]) is None | ||
|
|
||
|
|
||
| def test_all_patterns_are_discoverable() -> None: | ||
| # Sanity: every declared pattern must be detected when it appears verbatim | ||
| # in a line. Guards against accidental pattern-list / detector drift. | ||
| for pattern in TPU_INIT_FAILURE_PATTERNS: | ||
| line = f"prefix noise {pattern} trailing noise" | ||
| assert detect_tpu_init_failure([line]) == pattern |