[iris] Extend TPU bootstrap timeouts #4792

Merged
dlwh merged 3 commits into main from codex/iris-bootstrap-timeouts
Apr 15, 2026

dlwh (Member) commented Apr 15, 2026

Scale TPU queued-resource and bootstrap waits by pod size and log the missing workers and probe errors when bootstrap stalls. This covers the 255/256 healthy provisioning path from base issue #4697.

Part of #4746

Large TPU pods were timing out while queued resources became ready and while the last workers finished bootstrap. Scale the wait budgets by pod size and log the missing workers and probe errors that remain when bootstrap stalls.
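
A minimal sketch of the wait loop this description implies, assuming a per-worker health probe that raises on failure. The names `wait_for_bootstrap`, `summarize_missing`, `probe`, `poll_interval`, and `log_interval` are hypothetical and are not taken from the PR. The sketch uses a monotonic deadline, throttles progress logs, and reports which workers are still missing together with their last probe error.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger(__name__)


def summarize_missing(expected: set[str], healthy: set[str], errors: dict[str, str]) -> str:
    """List workers that are not healthy yet, with the last probe error if any."""
    parts = []
    for worker in sorted(expected - healthy):
        reason = errors.get(worker, "no probe response")
        parts.append(f"{worker} ({reason})")
    return ", ".join(parts) or "none"


def wait_for_bootstrap(
    expected: set[str],
    probe: Callable[[str], None],  # hypothetical: raises if the worker is not ready
    timeout: float,
    poll_interval: float = 1.0,
    log_interval: float = 30.0,
) -> None:
    """Poll until every expected worker passes `probe`, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    last_log = time.monotonic()
    healthy: set[str] = set()
    errors: dict[str, str] = {}
    while True:
        for worker in sorted(expected - healthy):
            try:
                probe(worker)
            except Exception as e:
                # Keep only the most recent error per worker.
                errors[worker] = str(e).strip() or type(e).__name__
            else:
                healthy.add(worker)
                errors.pop(worker, None)
        if healthy == expected:
            return
        now = time.monotonic()
        missing = summarize_missing(expected, healthy, errors)
        if now >= deadline:
            raise TimeoutError(
                f"{len(healthy)}/{len(expected)} workers healthy after {timeout:.0f}s; "
                f"missing: {missing}"
            )
        if now - last_log >= log_interval:
            # Throttled progress line so a long wait does not flood the log.
            logger.info("bootstrap: %d/%d healthy; missing: %s", len(healthy), len(expected), missing)
            last_log = now
        time.sleep(poll_interval)
```

With a report like this, a stall at 255 of 256 workers would name the one straggler and its last probe error instead of only reporting a timeout.
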
dlwh added the agent-generated (Created by automation/agent) label Apr 15, 2026
claude bot (Contributor) commented Apr 15, 2026

Claude finished @dlwh's task in 4m 57s.


PR Review complete

  • Check PR status
  • Identify relevant CLAUDE.md/AGENTS.md files
  • Summarize PR changes
  • Run parallel review agents (2 compliance, 2 bug-hunting)
  • Validate flagged issues
  • Post review findings

Summary: No issues found. Checked for bugs and CLAUDE.md/AGENTS.md compliance.

Two candidate issues were raised by the compliance agents and validated:

  • Raw time.monotonic() for progress-log throttling in workers.py: not confirmed as a new violation; the file already contains multiple pre-existing time.sleep() / datetime.now() calls, so the PR extends an established pattern.
  • Inline 1800.0 / 900.0 / 256 / 64 thresholds in the three new _recommended_tpu_*_timeout helpers — borderline against the "top-level constants for magic numbers" rule, but the values are self-contained within tiny, clearly-named policy helpers. Below the HIGH SIGNAL bar for this review; a maintainer may still want to extract a shared constant if these thresholds get adjusted elsewhere.

Bug agents found no syntax, logic, timeout-semantic, or error-handling issues; tests exercise real boundary values (1 / 64 / 256).
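
For readers without the diff, here is a rough sketch of what a size-scaled timeout helper with extracted constants could look like. Only the numbers 1800.0, 900.0, 256, and 64 come from the review above. How the PR maps sizes to timeouts, whether the comparisons are inclusive, the unit of "pod size", and the 600.0 base value are all assumptions, and every name below is hypothetical.

```python
# Thresholds and budgets mentioned in the review, pulled out as top-level constants.
MEDIUM_POD_SIZE = 64
LARGE_POD_SIZE = 256
MEDIUM_POD_TIMEOUT = 900.0   # seconds
LARGE_POD_TIMEOUT = 1800.0   # seconds

# Assumed placeholder; the real default value is not shown in this conversation.
ASSUMED_BASE_TIMEOUT = 600.0  # seconds


def sketch_tpu_bootstrap_timeout(pod_size: int) -> float:
    """Return a wait budget in seconds that grows with pod size (illustrative only)."""
    if pod_size >= LARGE_POD_SIZE:
        return LARGE_POD_TIMEOUT
    if pod_size >= MEDIUM_POD_SIZE:
        return MEDIUM_POD_TIMEOUT
    return ASSUMED_BASE_TIMEOUT
```

Sharing the constants across helpers would address the "top-level constants for magic numbers" concern if the thresholds are ever adjusted in more than one place.
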

claude bot (Contributor) commented Apr 15, 2026

Code review

🤖 No issues found. Checked for bugs and CLAUDE.md/AGENTS.md compliance.

return DEFAULT_TPU_BOOTSTRAP_TIMEOUT


def _format_probe_error(error: BaseException) -> str:
Collaborator review comment:

this is just str(error) ?

# =============================================================================


def test_recommended_tpu_timeouts_scale_with_pod_size():
Collaborator review comment:

slop, remove

rjpower (Collaborator) commented Apr 15, 2026

@claude recommended.*timeout -> default.*timeout, fix up the slop as per PR comments

claude bot (Contributor) commented Apr 15, 2026

Claude finished @rjpower's task in 3m 43s.


Code changes

Already on the branch: the PR author pushed aa270d1d9 ("iris: address TPU bootstrap timeout review") while I was working, and that commit applies the requested changes:

  • Renamed _recommended_tpu_operation_timeout -> _default_tpu_operation_timeout in service.py
  • Renamed _recommended_tpu_cloud_ready_timeout / _recommended_tpu_bootstrap_timeout -> _default_tpu_* in workers.py (+ all call sites)
  • Removed _format_probe_error; probe errors now captured inline as str(e).strip() or type(e).__name__ at workers.py:861
  • Dropped the test_summarize_missing_workers_includes_probe_errors test and the _summarize_missing_workers test import in test_platform.py

I had staged an equivalent local commit and reset it in favor of the author's version (which picks a slightly more defensive fallback for empty exception messages). No push from me was needed.
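
To illustrate why the fallback matters, the expression quoted above can be tried in isolation. The wrapper name `describe_probe_error` is hypothetical and exists only for this demonstration; the PR inlines the expression instead of keeping a helper.

```python
def describe_probe_error(error: BaseException) -> str:
    """Use the exception message, or the class name when the message is empty."""
    return str(error).strip() or type(error).__name__


# An exception with a message keeps its message.
print(describe_probe_error(ConnectionError("connection refused")))  # connection refused

# An exception raised without arguments has an empty str(); plain str(error)
# would log nothing useful, so the class name is used instead.
print(describe_probe_error(TimeoutError()))  # TimeoutError
```
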

Validation

  • ./infra/pre-commit.py --all-files --fix: OK (ruff, black, pyrefly, etc. all green)
  • uv run --with pytest --with pytest-xdist pytest lib/iris/tests/cluster/providers/gcp/test_platform.py lib/iris/tests/cluster/providers/gcp/test_gcp_service.py: 94 passed in 3.18s

rjpower added a commit that referenced this pull request Apr 15, 2026
Per-task slow_log timers on submit_task (500ms) and synchronous kill
(2000ms) inside handle_heartbeat identify which task stalls a heartbeat.
Worker service heartbeat entrypoint gets an outer slow_log (1000ms) and
a DEBUG payload-size line to correlate with controller-side sync timing.
Slice ready/failed transitions log registered worker counts and ids to
expose partial bootstrap on large slices.

Complements #4792 and #4793.
rjpower added a commit that referenced this pull request Apr 15, 2026
Per-task slow_log timers on submit_task (500ms) and synchronous kill
(2000ms) inside handle_heartbeat identify which task stalls a heartbeat.
The worker service heartbeat entrypoint gets an outer slow_log (1000ms)
and a DEBUG payload-size line to correlate with controller-side sync
timing. Slice ready/failed transitions log registered worker counts and
ids to expose partial bootstrap on large slices.

Complements #4792 and #4793.
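
The `slow_log` helper named in these commit messages is not shown in this conversation. As a rough illustration of the idea only, a threshold-based timer could be sketched as a context manager. The signature below is an assumption, not the project's actual API, and `handle_heartbeat_sketch` is a hypothetical stand-in for the real handler.

```python
import contextlib
import logging
import time
from typing import Callable, Iterable, Iterator

logger = logging.getLogger(__name__)


@contextlib.contextmanager
def slow_log(name: str, threshold_ms: float) -> Iterator[None]:
    """Log a warning if the wrapped block runs longer than `threshold_ms`."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000.0
        if elapsed_ms > threshold_ms:
            logger.warning("%s took %.0fms (threshold %.0fms)", name, elapsed_ms, threshold_ms)


def handle_heartbeat_sketch(task_ids: Iterable[str], submit_task: Callable[[str], None]) -> None:
    """Nest an outer timer around per-task timers so a slow task can be identified."""
    with slow_log("heartbeat", 1000):
        for task_id in task_ids:
            with slow_log(f"submit_task {task_id}", 500):
                submit_task(task_id)
```

Per-task timers inside an outer timer show whether one task stalled the heartbeat or the time was spread across many tasks.
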
dlwh merged commit 913e579 into main Apr 15, 2026
37 of 38 checks passed
dlwh deleted the codex/iris-bootstrap-timeouts branch April 15, 2026 22:57