
[iris] Return visible TPUs before create LRO settles#4786

Open
dlwh wants to merge 1 commit into main from codex/tpu-create-visible-before-lro

Conversation

@dlwh
Member

@dlwh dlwh commented Apr 15, 2026

Stop standard TPU creation from blocking slice tracking on the full create LRO. Return a TPU as soon as it is visible, and add a regression test for the case where the node exists before the LRO reports an internal error.

Fixes #4776

@dlwh added the agent-generated (Created by automation/agent) label on Apr 15, 2026
@claude
Contributor

claude bot commented Apr 15, 2026

Claude finished @dlwh's task in 2m 2s


Code review

  • Read the PR diff
  • Identify relevant CLAUDE.md/AGENTS.md
  • Review for bugs and compliance
  • Post review

No issues found. Checked for bugs and AGENTS.md compliance.

Summary of the change

  • tpu_create no longer blocks on the full create LRO. It probes the LRO once, and:
    • if the LRO is already done with an error, checks whether the node is nonetheless visible and returns it (with a warning) rather than failing; only raises if the node isn't there.
    • otherwise (LRO not yet done, or done successfully) falls through to tpu_describe, returning a state="CREATING" placeholder TpuInfo if the node isn't visible yet.
  • Shared logic extracted into _get_tpu_operation and _raise_tpu_operation_error, preserving the existing RESOURCE_EXHAUSTED → QuotaExhaustedError mapping.
  • New integration test test_tpu_create_returns_visible_tpu_when_lro_reports_error covers the regression where the node is visible before the LRO reports an internal error.
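The flow described in the bullets above can be sketched as a toy Python model. The method names (tpu_create, tpu_describe, _get_tpu_operation, _raise_tpu_operation_error) and the TpuInfo/QuotaExhaustedError types come from the PR summary; the signatures and the dict-backed fake stores are assumptions made purely for illustration:

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class TpuInfo:
    name: str
    state: str


class QuotaExhaustedError(Exception):
    pass


class TpuService:
    """Toy stand-in for the GCP service wrapper; not the real iris code."""

    def __init__(self, operations: dict, nodes: dict):
        self._operations = operations  # op_name -> LRO dict
        self._nodes = nodes            # tpu name -> TpuInfo

    def _get_tpu_operation(self, op_name: str) -> dict:
        return self._operations[op_name]

    def _raise_tpu_operation_error(self, error: dict) -> None:
        # Preserves the RESOURCE_EXHAUSTED -> QuotaExhaustedError mapping.
        if error.get("status") == "RESOURCE_EXHAUSTED":
            raise QuotaExhaustedError(error.get("message", ""))
        raise RuntimeError(error.get("message", ""))

    def tpu_describe(self, name: str) -> Optional[TpuInfo]:
        return self._nodes.get(name)

    def tpu_create(self, name: str, op_name: str) -> TpuInfo:
        # Probe the create LRO once instead of blocking on it.
        op = self._get_tpu_operation(op_name)
        if op.get("done") and "error" in op:
            existing = self.tpu_describe(name)
            if existing is not None:
                # Node is visible despite the LRO error: return it with a warning.
                logger.warning("TPU %s visible despite LRO error", name)
                return existing
            self._raise_tpu_operation_error(op["error"])
        # LRO pending or succeeded: return the node if visible,
        # else a state="CREATING" placeholder.
        return self.tpu_describe(name) or TpuInfo(name=name, state="CREATING")
```

The regression case from the new integration test corresponds to the first branch: a done-with-error LRO whose node is nonetheless visible returns the node instead of raising.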

Minor note (non-blocking)

  • _wait_tpu_operation (lib/iris/src/iris/cluster/providers/gcp/service.py:497) now has no in-tree callers after this change. If it's not needed elsewhere, consider removing it in a follow-up per AGENTS.md ("Delete dead code"); left as a note only since keeping the helper may be intentional.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6678fd8225


Comment on lines +574 to +575
op = self._get_tpu_operation(op_name)
if op.get("done") and "error" in op:


P1: Keep polling TPU create LRO until terminal or node visible

tpu_create now polls the create operation only once, and only raises if that first read is already done with error. If the LRO is still running on that poll but later finishes with an error (for example RESOURCE_EXHAUSTED), this path falls through and returns a synthetic CREATING TPU instead of raising, so callers treat a failed create as success and skip the quota/backoff error handling that _wait_tpu_operation previously provided.
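The race can be made concrete with a toy snippet (all names are illustrative, not the actual iris code): the LRO is still pending on the single probe, later fails with RESOURCE_EXHAUSTED, but the create call has already returned a placeholder.

```python
def tpu_create_single_probe(get_op, describe):
    """Single-probe behavior under review: check the LRO once, then fall through."""
    op = get_op()
    if op.get("done") and "error" in op:
        raise RuntimeError(op["error"]["status"])
    # LRO still pending: return a synthetic CREATING node.
    return describe() or {"state": "CREATING"}


op_states = iter([
    {"done": False},                                             # state at probe time
    {"done": True, "error": {"status": "RESOURCE_EXHAUSTED"}},   # later outcome
])
result = tpu_create_single_probe(lambda: next(op_states), lambda: None)
# The caller sees a synthetic CREATING node; the quota failure that arrives
# after the probe is still sitting in the iterator, unobserved.
late = next(op_states)
```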


@rjpower
Collaborator

rjpower commented Apr 15, 2026

It feels like we don't want to just fall through and assume success if the TPU create isn't done on the first poll?

This might require a bit more thought into how we poll TPUs instead. Should we break apart the create request from the lifecycle polling:

  1. Create -> check immediate results, put TPU into monitoring thread
  2. Monitoring thread just calls tpu_describe in a loop until timeout or success
  3. Either (1) or (2) failing falls into the error -> deletion path

For queued resources, we might not know the TPU name at resource creation time, so we might need:

  1. Create queued resource
  2. Poll for resource materialization
  3. Trigger same monitoring process as above with the resource TPU name

This feels straightforward to handle from a shared workflow.
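A minimal sketch of the monitoring loop in steps (2)-(3), assuming an injected describe/delete pair rather than the real iris API (function and field names here are hypothetical):

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    state: str


def monitor_slice(describe: Callable[[], Optional[Node]],
                  delete: Callable[[], None],
                  timeout_s: float = 600.0,
                  poll_interval_s: float = 0.0) -> Optional[Node]:
    """Step 2: just call describe() in a loop until timeout or success;
    step 3: failure or timeout falls into the deletion path."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        node = describe()
        if node is not None and node.state == "READY":
            return node
        if node is not None and node.state == "FAILED":
            break
        time.sleep(poll_interval_s)
    delete()  # step 3: error/timeout -> best-effort delete
    return None
```

The create call in step (1) would then only check the immediate result and hand the slice to this loop, so both the create path and the monitor share one error-handling exit.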

@claude wdyt, is this a big lift?

@claude
Contributor

claude bot commented Apr 15, 2026

Claude finished @rjpower's task in 2m 43s


Is this a big lift?

Short version: It's a medium lift, and I agree with the direction — but there's a smaller fix that closes the Codex regression without the full refactor, and you probably want both.

Why the Codex concern is real

On the standard (non-queued) path, if the LRO finishes with an error after the first probe — e.g. RESOURCE_EXHAUSTED 5s into a zone stockout — this PR:

  1. First probe sees done=False → returns a synthetic CREATING TpuInfo.
  2. _run_tpu_bootstrap polls handle._describe_cloud(); tpu_describe keeps returning None → maps to CloudSliceState.UNKNOWN, not FAILED (handles.py:289-293).
  3. Thread times out at cloud_ready_timeout (600s) with a generic InfraError — the QuotaExhaustedError → autoscaler-backoff mapping in _raise_tpu_operation_error never fires.
  4. _create_standard_tpu_slice's try/except InfraError for _best_effort_delete_tpu also doesn't fire, because tpu_create returned normally.

And: _run_tpu_bootstrap is only spawned when worker_config is not None (workers.py:396-400). For any non-bootstrapping call site, nobody is watching the LRO at all.

Option A — the cheap fix (closes the regression, unblocks this PR)

Keep polling in tpu_create until either the LRO is terminal or tpu_describe returns a node, with a short-ish deadline. This preserves the "return as soon as visible" goal from #4776 and keeps the RESOURCE_EXHAUSTED → QuotaExhaustedError path intact:

deadline = time.monotonic() + _TPU_CREATE_VISIBILITY_TIMEOUT
while True:
    op = self._get_tpu_operation(op_name)
    existing = self.tpu_describe(request.name, request.zone)
    if existing is not None:
        if op.get("done") and "error" in op:
            logger.warning("TPU %s visible despite LRO error: %s", request.name, op["error"])
        return existing
    if op.get("done"):
        if "error" in op:
            self._raise_tpu_operation_error(op["error"])  # raises Quota/InfraError
        # done + no error + node not visible: very briefly retry describe, then raise
        for _ in range(3):
            time.sleep(_OPERATION_POLL_INTERVAL)
            existing = self.tpu_describe(request.name, request.zone)
            if existing is not None:
                return existing
        raise InfraError(f"TPU create op for {request.name} finished but node never became visible")
    if time.monotonic() >= deadline:
        # deadline hit with the LRO still pending: fall through to the
        # state="CREATING" placeholder
        break
    time.sleep(_OPERATION_POLL_INTERVAL)

This is ~15 lines of delta on top of what you already have and doesn't require plumbing op names through handles.

Option B — the refactor you sketched (right long-term shape)

The good news: most of the scaffolding already exists.

  • _run_tpu_bootstrap is already the monitoring thread you're describing. It already polls _describe_cloud() in a loop and transitions on FAILED/READY/DELETING.
  • Reserved/queued path already does the two-phase (_wait_for_queued_resource_activation → TPU-VM READY). That's exactly your step 1→2→3 for queued resources.

The gaps to turn this into what you described:

  1. Spawn monitoring unconditionally, not only when worker_config is not None. Today a non-bootstrapping create has no watcher. Factor the cloud-readiness phase out of _run_tpu_bootstrap into a standalone _monitor_slice_ready that always runs.
  2. Plumb the LRO name into the handle (e.g. GcpSliceHandle._pending_create_op) so the monitor can poll the op alongside tpu_describe and detect "LRO terminal with error AND node never appeared" → raise the right QuotaExhaustedError/InfraError. Drop the field once the node is visible.
  3. Move the delete-on-failure path out of _create_{standard,reserved}_tpu_slice's try/except (which now never trips, because tpu_create returns fast) into the monitor — i.e. "monitor observed FAILED → trigger same best-effort delete."
  4. Autoscaler signaling. tpu_create returning successfully + CREATING handle currently means "create accepted." Quota errors surface asynchronously via the monitor raising, but the autoscaler's backoff path today catches QuotaExhaustedError at the create call site. You'll need either a handle.wait_accepted() the autoscaler can call to get the LRO verdict quickly, or a way to push quota failures onto the same scheduling-failure path the bootstrap thread uses. Option A preserves the current behavior here for free; Option B requires a small redesign of how the autoscaler learns about quota hits.
  5. Queued resources are easier than you suggested: the QR name is known at create time (we generate the slice_id before queued_resource_create), so there's no "poll for materialization to discover the name" step. The monitor just adds a Phase 0 identical to the existing _wait_for_queued_resource_activation.
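Gap (2) above could look roughly like this, with injected callables standing in for the real handle and service (all names hypothetical; the real version would hang the pending op off the slice handle):

```python
class QuotaExhaustedError(Exception):
    pass


def monitor_slice_ready(describe, get_op, max_polls=100):
    """Monitor that polls the pending create LRO alongside tpu_describe,
    and drops the op once the node becomes visible."""
    pending_op = get_op  # analogue of GcpSliceHandle._pending_create_op
    for _ in range(max_polls):
        node = describe()
        if node is not None:
            pending_op = None  # node appeared: LRO errors no longer matter
            if node.state == "READY":
                return node
        elif pending_op is not None:
            op = pending_op()
            if op.get("done") and "error" in op:
                # LRO terminal with error AND node never appeared:
                # surface the right exception type to the autoscaler.
                err = op["error"]
                if err.get("status") == "RESOURCE_EXHAUSTED":
                    raise QuotaExhaustedError(err.get("message", ""))
                raise RuntimeError(err.get("message", ""))
    raise TimeoutError("slice never became ready")
```

This keeps the quota signal intact even though tpu_create returned early, at the cost of the autoscaler now learning about quota hits asynchronously, which is exactly the design decision called out in (4).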

Sizing: I'd call Option B a 1-2 day change, not a week — one meaningful design decision (how quota hits reach the autoscaler), plus mostly mechanical refactoring since the monitoring thread already exists. Most of the risk is in test coverage around the create-fail-before-visible transitions and the autoscaler integration, not the implementation itself.

My recommendation

Land Option A in this PR (keeps scope matching the title, closes #4776, doesn't regress quota handling), then do Option B as a follow-up with its own issue. Option A is a genuinely small patch; Option B touches the autoscaler contract and deserves a separate review.

If you want I can push the Option A change onto this branch.


Development

Successfully merging this pull request may close these issues.

[Iris] Fix restart_worker slice loss in cloud smoke
