
[iris] Return visible TPUs before create LRO settles#4786

Open
dlwh wants to merge 1 commit into main from codex/tpu-create-visible-before-lro

Conversation

@dlwh
Member

@dlwh dlwh commented Apr 15, 2026

Stop standard TPU creation from blocking slice tracking on the full create LRO. Return a TPU as soon as it is visible, and add a regression test for the case where the node exists before the LRO reports an internal error.

Fixes #4776

@dlwh added the agent-generated (Created by automation/agent) label on Apr 15, 2026
@claude
Contributor

claude bot commented Apr 15, 2026

Claude finished @dlwh's task in 2m 2s


Code review

  • Read the PR diff
  • Identify relevant CLAUDE.md/AGENTS.md
  • Review for bugs and compliance
  • Post review

No issues found. Checked for bugs and AGENTS.md compliance.

Summary of the change

  • tpu_create no longer blocks on the full create LRO. It probes the LRO once, and:
    • if the LRO is already done with an error, checks whether the node is nonetheless visible and returns it (with a warning) rather than failing; only raises if the node isn't there.
    • otherwise (LRO not yet done, or done successfully) falls through to tpu_describe, returning a state="CREATING" placeholder TpuInfo if the node isn't visible yet.
  • Shared logic extracted into _get_tpu_operation and _raise_tpu_operation_error, preserving the existing RESOURCE_EXHAUSTED → QuotaExhaustedError mapping.
  • New integration test test_tpu_create_returns_visible_tpu_when_lro_reports_error covers the regression where the node is visible before the LRO reports an internal error.
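The flow described in the bullets above can be sketched as a toy Python model. The method names (tpu_create, tpu_describe, _get_tpu_operation, _raise_tpu_operation_error) and the TpuInfo/QuotaExhaustedError types come from the PR summary; the signatures and the dict-backed fake stores are assumptions made purely for illustration:

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class TpuInfo:
    name: str
    state: str


class QuotaExhaustedError(Exception):
    pass


class TpuService:
    """Toy stand-in for the GCP service wrapper; not the real iris code."""

    def __init__(self, operations: dict, nodes: dict):
        self._operations = operations  # op_name -> LRO dict
        self._nodes = nodes            # tpu name -> TpuInfo

    def _get_tpu_operation(self, op_name: str) -> dict:
        return self._operations[op_name]

    def _raise_tpu_operation_error(self, error: dict) -> None:
        # Preserves the RESOURCE_EXHAUSTED -> QuotaExhaustedError mapping.
        if error.get("status") == "RESOURCE_EXHAUSTED":
            raise QuotaExhaustedError(error.get("message", ""))
        raise RuntimeError(error.get("message", ""))

    def tpu_describe(self, name: str) -> Optional[TpuInfo]:
        return self._nodes.get(name)

    def tpu_create(self, name: str, op_name: str) -> TpuInfo:
        # Probe the create LRO once instead of blocking on it.
        op = self._get_tpu_operation(op_name)
        if op.get("done") and "error" in op:
            existing = self.tpu_describe(name)
            if existing is not None:
                # Node is visible despite the LRO error: return it with a warning.
                logger.warning("TPU %s visible despite LRO error", name)
                return existing
            self._raise_tpu_operation_error(op["error"])
        # LRO pending or succeeded: return the node if visible,
        # else a state="CREATING" placeholder.
        return self.tpu_describe(name) or TpuInfo(name=name, state="CREATING")
```

The regression case from the new integration test corresponds to the first branch: a done-with-error LRO whose node is nonetheless visible returns the node instead of raising.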

Minor note (non-blocking)

  • _wait_tpu_operation (lib/iris/src/iris/cluster/providers/gcp/service.py:497) now has no in-tree callers after this change. If it's not needed elsewhere, consider removing it in a follow-up per AGENTS.md ("Delete dead code"); left as a note only since keeping the helper may be intentional.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6678fd8225


Comment on lines +574 to +575
op = self._get_tpu_operation(op_name)
if op.get("done") and "error" in op:


P1: Keep polling TPU create LRO until terminal or node visible

tpu_create now polls the create operation only once, and only raises if that first read is already done with error. If the LRO is still running on that poll but later finishes with an error (for example RESOURCE_EXHAUSTED), this path falls through and returns a synthetic CREATING TPU instead of raising, so callers treat a failed create as success and skip the quota/backoff error handling that _wait_tpu_operation previously provided.
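The race can be made concrete with a toy snippet (all names are illustrative, not the actual iris code): the LRO is still pending on the single probe, later fails with RESOURCE_EXHAUSTED, but the create call has already returned a placeholder.

```python
def tpu_create_single_probe(get_op, describe):
    """Single-probe behavior under review: check the LRO once, then fall through."""
    op = get_op()
    if op.get("done") and "error" in op:
        raise RuntimeError(op["error"]["status"])
    # LRO still pending: return a synthetic CREATING node.
    return describe() or {"state": "CREATING"}


op_states = iter([
    {"done": False},                                             # state at probe time
    {"done": True, "error": {"status": "RESOURCE_EXHAUSTED"}},   # later outcome
])
result = tpu_create_single_probe(lambda: next(op_states), lambda: None)
# The caller sees a synthetic CREATING node; the quota failure that arrives
# after the probe is still sitting in the iterator, unobserved.
late = next(op_states)
```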


@rjpower
Collaborator

rjpower commented Apr 15, 2026

It feels like we don't want to just fall through and assume success if the TPU create isn't done on the first poll?

This might require a bit more thought into how we poll TPUs instead. Should we break apart the create request from the lifecycle polling:

  1. Create -> check immediate results, put TPU into monitoring thread
  2. Monitoring thread just calls tpu_describe in a loop until timeout or success
  3. Either (1) or (2) failing falls into the error -> deletion path

For queued resources, we might not know the TPU name at resource creation time, so we might need:

  1. Create queued resource
  2. Poll for resource materialization
  3. Trigger same monitoring process as above with the resource TPU name

This feels straightforward to handle from a shared workflow.
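A minimal sketch of the monitoring loop in steps (2)-(3), assuming an injected describe/delete pair rather than the real iris API (function and field names here are hypothetical):

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    state: str


def monitor_slice(describe: Callable[[], Optional[Node]],
                  delete: Callable[[], None],
                  timeout_s: float = 600.0,
                  poll_interval_s: float = 0.0) -> Optional[Node]:
    """Step 2: just call describe() in a loop until timeout or success;
    step 3: failure or timeout falls into the deletion path."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        node = describe()
        if node is not None and node.state == "READY":
            return node
        if node is not None and node.state == "FAILED":
            break
        time.sleep(poll_interval_s)
    delete()  # step 3: error/timeout -> best-effort delete
    return None
```

The create call in step (1) would then only check the immediate result and hand the slice to this loop, so both the create path and the monitor share one error-handling exit.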

@claude wdyt, is this a big lift?

@claude
Contributor

claude bot commented Apr 15, 2026

Claude finished @rjpower's task in 2m 43s


Is this a big lift?

Short version: It's a medium lift, and I agree with the direction — but there's a smaller fix that closes the Codex regression without the full refactor, and you probably want both.

Why the Codex concern is real

On the standard (non-queued) path, if the LRO finishes with an error after the first probe — e.g. RESOURCE_EXHAUSTED 5s into a zone stockout — this PR:

  1. First probe sees done=False → returns a synthetic CREATING TpuInfo.
  2. _run_tpu_bootstrap polls handle._describe_cloud(); tpu_describe keeps returning None → maps to CloudSliceState.UNKNOWN, not FAILED (handles.py:289-293).
  3. Thread times out at cloud_ready_timeout (600s) with a generic InfraError — the QuotaExhaustedError → autoscaler-backoff mapping in _raise_tpu_operation_error never fires.
  4. _create_standard_tpu_slice's try/except InfraError for _best_effort_delete_tpu also doesn't fire, because tpu_create returned normally.

And: _run_tpu_bootstrap is only spawned when worker_config is not None (workers.py:396-400). For any non-bootstrapping call site, nobody is watching the LRO at all.

Option A — the cheap fix (closes the regression, unblocks this PR)

Keep polling in tpu_create until either the LRO is terminal or tpu_describe returns a node, with a short-ish deadline. This preserves the "return as soon as visible" goal from #4776 and keeps the RESOURCE_EXHAUSTED → QuotaExhaustedError path intact:

deadline = time.monotonic() + _TPU_CREATE_VISIBILITY_TIMEOUT
while True:
    op = self._get_tpu_operation(op_name)
    existing = self.tpu_describe(request.name, request.zone)
    if existing is not None:
        if op.get("done") and "error" in op:
            logger.warning("TPU %s visible despite LRO error: %s", request.name, op["error"])
        return existing
    if op.get("done"):
        if "error" in op:
            self._raise_tpu_operation_error(op["error"])  # raises Quota/InfraError
        # done + no error + node not visible: very briefly retry describe, then raise
        for _ in range(3):
            time.sleep(_OPERATION_POLL_INTERVAL)
            existing = self.tpu_describe(request.name, request.zone)
            if existing is not None:
                return existing
        raise InfraError(f"TPU create op for {request.name} finished but node never became visible")
    if time.monotonic() >= deadline:
        # deadline hit with the LRO still pending: fall through to the
        # state="CREATING" placeholder
        break
    time.sleep(_OPERATION_POLL_INTERVAL)

This is ~15 lines of delta on top of what you already have and doesn't require plumbing op names through handles.

Option B — the refactor you sketched (right long-term shape)

The good news: most of the scaffolding already exists.

  • _run_tpu_bootstrap is already the monitoring thread you're describing. It already polls _describe_cloud() in a loop and transitions on FAILED/READY/DELETING.
  • Reserved/queued path already does the two-phase (_wait_for_queued_resource_activation → TPU-VM READY). That's exactly your step 1→2→3 for queued resources.

The gaps to turn this into what you described:

  1. Spawn monitoring unconditionally, not only when worker_config is not None. Today a non-bootstrapping create has no watcher. Factor the cloud-readiness phase out of _run_tpu_bootstrap into a standalone _monitor_slice_ready that always runs.
  2. Plumb the LRO name into the handle (e.g. GcpSliceHandle._pending_create_op) so the monitor can poll the op alongside tpu_describe and detect "LRO terminal with error AND node never appeared" → raise the right QuotaExhaustedError/InfraError. Drop the field once the node is visible.
  3. Move the delete-on-failure path out of _create_{standard,reserved}_tpu_slice's try/except (which now never trips, because tpu_create returns fast) into the monitor — i.e. "monitor observed FAILED → trigger same best-effort delete."
  4. Autoscaler signaling. tpu_create returning successfully + CREATING handle currently means "create accepted." Quota errors surface asynchronously via the monitor raising, but the autoscaler's backoff path today catches QuotaExhaustedError at the create call site. You'll need either a handle.wait_accepted() the autoscaler can call to get the LRO verdict quickly, or a way to push quota failures onto the same scheduling-failure path the bootstrap thread uses. Option A preserves the current behavior here for free; Option B requires a small redesign of how the autoscaler learns about quota hits.
  5. Queued resources are easier than you suggested: the QR name is known at create time (we generate the slice_id before queued_resource_create), so there's no "poll for materialization to discover the name" step. The monitor just adds a Phase 0 identical to the existing _wait_for_queued_resource_activation.
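Gap (2) above could look roughly like this, with injected callables standing in for the real handle and service (all names hypothetical; the real version would hang the pending op off the slice handle):

```python
class QuotaExhaustedError(Exception):
    pass


def monitor_slice_ready(describe, get_op, max_polls=100):
    """Monitor that polls the pending create LRO alongside tpu_describe,
    and drops the op once the node becomes visible."""
    pending_op = get_op  # analogue of GcpSliceHandle._pending_create_op
    for _ in range(max_polls):
        node = describe()
        if node is not None:
            pending_op = None  # node appeared: LRO errors no longer matter
            if node.state == "READY":
                return node
        elif pending_op is not None:
            op = pending_op()
            if op.get("done") and "error" in op:
                # LRO terminal with error AND node never appeared:
                # surface the right exception type to the autoscaler.
                err = op["error"]
                if err.get("status") == "RESOURCE_EXHAUSTED":
                    raise QuotaExhaustedError(err.get("message", ""))
                raise RuntimeError(err.get("message", ""))
    raise TimeoutError("slice never became ready")
```

This keeps the quota signal intact even though tpu_create returned early, at the cost of the autoscaler now learning about quota hits asynchronously, which is exactly the design decision called out in (4).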

Sizing: I'd call Option B a 1-2 day change, not a week — one meaningful design decision (how quota hits reach the autoscaler), plus mostly mechanical refactoring since the monitoring thread already exists. Most of the risk is in test coverage around the create-fail-before-visible transitions and the autoscaler integration, not the implementation itself.

My recommendation

Land Option A in this PR (keeps scope matching the title, closes #4776, doesn't regress quota handling), then do Option B as a follow-up with its own issue. Option A is a genuinely small patch; Option B touches the autoscaler contract and deserves a separate review.

If you want I can push the Option A change onto this branch.


Development

Successfully merging this pull request may close these issues.

[Iris] Fix restart_worker slice loss in cloud smoke
