[iris] Return visible TPUs before create LRO settles #4786
Conversation
Claude finished @dlwh's task in 2m 2s — Code review
No issues found. Checked for bugs and AGENTS.md compliance.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6678fd8225
```python
op = self._get_tpu_operation(op_name)
if op.get("done") and "error" in op:
```
Keep polling TPU create LRO until terminal or node visible
tpu_create now polls the create operation only once, and raises only if that first read is already done with an error. If the LRO is still running on that poll but later finishes with an error (for example RESOURCE_EXHAUSTED), this path falls through and returns a synthetic CREATING TPU instead of raising, so callers treat a failed create as success and skip the quota/backoff error handling that _wait_tpu_operation previously provided.
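To make the failure mode concrete, here is a minimal, self-contained simulation of the single-poll path versus a polling loop. All names here are hypothetical stand-ins, not the actual iris client API:

```python
def make_fake_op_source():
    """A fake LRO: still running on the first poll, failed on the second."""
    polls = iter([
        {"done": False},                                           # what the one-shot poll sees
        {"done": True, "error": {"code": "RESOURCE_EXHAUSTED"}},   # what the LRO later settles to
    ])
    return lambda: next(polls)

def single_poll_create(get_op):
    """Mirrors the reviewed code: one poll, raise only if already done with error."""
    op = get_op()
    if op.get("done") and "error" in op:
        raise RuntimeError(op["error"]["code"])
    # Falls through: synthetic success, even though the create later fails.
    return {"state": "CREATING"}

def polling_create(get_op, max_polls=10):
    """Keep polling until the operation reaches a terminal state."""
    for _ in range(max_polls):
        op = get_op()
        if op.get("done"):
            if "error" in op:
                raise RuntimeError(op["error"]["code"])
            return {"state": "READY"}
    raise TimeoutError("operation did not settle")

result = single_poll_create(make_fake_op_source())  # synthetic CREATING, error never seen
try:
    polling_create(make_fake_op_source())
except RuntimeError as e:
    caught = str(e)  # the late RESOURCE_EXHAUSTED surfaces to the caller
```

The single-poll version reports success on exactly the operation sequence where the loop version raises, which is the regression being flagged.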
It feels like we don't want to just fall through and assume success if the TPU create isn't done on the first poll. This might require a bit more thought about how we poll TPUs. Should we break apart the create request from the lifecycle polling?
For queued resources, we might not know the TPU name at resource creation time, so we might need:
This feels straightforward to handle from a shared workflow. @claude wdyt, is this a big lift?
Claude finished @rjpower's task in 2m 43s — Is this a big lift?

Short version: it's a medium lift, and I agree with the direction — but there's a smaller fix that closes the Codex regression without the full refactor, and you probably want both.

Why the Codex concern is real

On the standard (non-queued) path, if the LRO finishes with an error after the first probe (e.g. RESOURCE_EXHAUSTED), tpu_create still returns a synthetic CREATING TPU, so the failed create is treated as success.
Option A — the cheap fix (closes the regression, unblocks this PR)

Keep polling in `tpu_create`:

```python
deadline = time.monotonic() + _TPU_CREATE_VISIBILITY_TIMEOUT
while True:
    op = self._get_tpu_operation(op_name)
    existing = self.tpu_describe(request.name, request.zone)
    if existing is not None:
        if op.get("done") and "error" in op:
            logger.warning("TPU %s visible despite LRO error: %s", ...)
        return existing
    if op.get("done"):
        if "error" in op:
            self._raise_tpu_operation_error(op["error"])  # raises Quota/InfraError
        # done + no error + node not visible: very briefly retry describe, then raise
        ...
    if time.monotonic() >= deadline:
        break
    time.sleep(_OPERATION_POLL_INTERVAL)
```

This is ~15 lines of delta on top of what you already have and doesn't require plumbing op names through handles.

Option B — the refactor you sketched (right long-term shape)

The good news: most of the scaffolding already exists.
The gaps to turn this into what you described:
Sizing: I'd call Option B a 1-2 day change, not a week — one meaningful design decision (how quota hits reach the autoscaler), plus mostly mechanical refactoring since the monitoring thread already exists. Most of the risk is in test coverage around the create-fail-before-visible transitions and the autoscaler integration, not the implementation itself.

My recommendation

Land Option A in this PR (keeps scope matching the title, closes #4776, doesn't regress quota handling), then do Option B as a follow-up with its own issue. Option A is a genuinely small patch; Option B touches the autoscaler contract and deserves a separate review. If you want I can push the Option A change onto this branch.
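For a sanity check on Option A's shape, the loop's two exit conditions (return as soon as the node is visible, raise on a terminal LRO error) can be exercised end to end with fakes. This is a sketch with hypothetical stand-ins, not the real iris client:

```python
import itertools

def create_and_wait(get_op, describe, max_polls=20):
    """Return the node as soon as it is visible; otherwise surface a terminal LRO error."""
    for _ in range(max_polls):
        op = get_op()
        existing = describe()
        if existing is not None:
            return existing          # visibility wins, even if the LRO hasn't settled
        if op.get("done") and "error" in op:
            raise RuntimeError(op["error"]["code"])
    raise TimeoutError("node never became visible")

# Case 1: node becomes visible while the LRO is still running -> returned early.
ops = itertools.repeat({"done": False})
nodes = iter([None, None, {"name": "tpu-0", "state": "CREATING"}])
node = create_and_wait(lambda: next(ops), lambda: next(nodes))

# Case 2: LRO settles with an error before the node is ever visible -> raised.
ops2 = iter([{"done": False}, {"done": True, "error": {"code": "RESOURCE_EXHAUSTED"}}])
try:
    create_and_wait(lambda: next(ops2), lambda: None)
except RuntimeError as e:
    err = str(e)  # quota-style errors still reach the caller
```

These are exactly the two transitions the review calls out as the risky ones: visible-before-done and fail-before-visible.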
Stop standard TPU creation from blocking slice tracking on the full create LRO. Return a TPU as soon as it is visible, and add a regression test for the case where the node exists before the LRO reports an internal error.
Fixes #4776