fix(iris): add --async to gcloud delete/reset to unblock heartbeat thread #3682
Synchronous gcloud subprocess calls (especially `tpu-vm delete`, 10-18s) were blocking the heartbeat thread, stalling all task dispatch cluster-wide for minutes at a time. Add the `--async` flag to all gcloud delete and reset commands so they return immediately after submitting the operation to GCP.

All callers already handle terminate() failures gracefully (catch + log), so async deletion is safe. Create and describe commands remain synchronous since their results are needed immediately.

Fixes #3678

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
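As a rough illustration of the change described above, the sketch below builds a `tpu-vm delete` command line with `--async` appended by default. The function name `build_tpu_delete_cmd` and its `wait` parameter are hypothetical and not taken from the PR; only the flags themselves appear in the diff.

```python
def build_tpu_delete_cmd(
    name: str, project_id: str, zone: str, wait: bool = False
) -> list[str]:
    """Build a gcloud TPU VM delete command (hypothetical helper, not from the PR)."""
    cmd = [
        "gcloud", "compute", "tpus", "tpu-vm", "delete", name,
        f"--project={project_id}",
        f"--zone={zone}",
        "--quiet",  # skip the interactive confirmation prompt
    ]
    if not wait:
        # Return once GCP has accepted the operation instead of
        # blocking the calling thread until the delete completes.
        cmd.append("--async")
    return cmd
```

Passing `wait=True` would keep the old blocking behavior, which might still be useful in a one-off cleanup script where nothing else depends on the calling thread.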
Claude finished @rjpower's task in 1m 42s
PR Review: fix(iris): add --async to gcloud delete/reset to unblock heartbeat thread
Review Summary
Overall: LGTM. This is a clean, well-scoped fix for a real operational problem. The change is mechanical and consistent across all 7 call sites.
What the PR does well:
One thing to watch (non-blocking):
Minor observations:
    f"--project={self._project_id}",
    f"--zone={self._zone}",
    "--quiet",
    "--async",
🤖 Potential race condition worth noting: In `vm_lifecycle.py:340`, an unhealthy controller VM is terminated then immediately recreated with the same deterministic name (`iris-controller-{label_prefix}`). With `--async`, the old VM may still be mid-deletion when `create_vm` runs, which could cause a 409 CONFLICT from GCP.
This is probably fine in practice (GCP typically allows recreation once the delete operation is accepted), but worth keeping an eye on in the field. If it surfaces, a short poll-until-gone or a unique suffix on the name would fix it.
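If the conflict does surface, the poll-until-gone option could look roughly like the sketch below. It is a minimal illustration, not code from the PR: `wait_until_gone` and its parameters are hypothetical, and the `exists` callable stands in for whatever describe-style check the codebase already has.

```python
import time
from typing import Callable


def wait_until_gone(
    exists: Callable[[], bool],
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until exists() reports False; return False if the timeout passes first."""
    deadline = clock() + timeout_s
    while True:
        if not exists():
            return True  # resource is gone, safe to recreate under the same name
        if clock() >= deadline:
            return False  # still present; caller decides whether to retry or fail
        sleep(interval_s)
```

This would belong on the recreate path rather than the heartbeat thread, since polling there would bring back the blocking the PR removes.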
dup: #3684