Skip to content

fix(iris): add --async to gcloud delete/reset to unblock heartbeat thread#3682

Merged
rjpower merged 1 commit intomainfrom
claude/issue-3678-20260314-2209
Mar 14, 2026
Merged

fix(iris): add --async to gcloud delete/reset to unblock heartbeat thread#3682
rjpower merged 1 commit intomainfrom
claude/issue-3678-20260314-2209

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented Mar 14, 2026

Synchronous gcloud subprocess calls (especially tpu-vm delete, 10-18s) were blocking the heartbeat thread, stalling all task dispatch cluster-wide for minutes at a time. Add --async flag to all gcloud delete and reset commands so they return immediately after submitting the operation to GCP.

Fixes #3678

…read

Synchronous gcloud subprocess calls (especially `tpu-vm delete`, 10-18s)
were blocking the heartbeat thread, stalling all task dispatch cluster-wide
for minutes at a time. Add `--async` flag to all gcloud delete and reset
commands so they return immediately after submitting the operation to GCP.

All callers already handle terminate() failures gracefully (catch + log),
so async deletion is safe. Create and describe commands remain synchronous
since their results are needed immediately.

Fixes #3678

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
@rjpower rjpower requested a review from ravwojdyla March 14, 2026 22:18
@rjpower rjpower enabled auto-merge (squash) March 14, 2026 22:18
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Mar 14, 2026

Claude finished @rjpower's task in 1m 42s —— View job


PR Review: fix(iris): add --async to gcloud delete/reset to unblock heartbeat thread

  • Gather context and read changed files
  • Review code changes
  • Post review feedback

Review Summary

Overall: LGTM — this is a clean, well-scoped fix for a real operational problem. The change is mechanical and consistent across all 7 call sites.

What the PR does well:

  • Correctly identifies that synchronous gcloud delete/reset calls (10-18s each) were blocking the heartbeat thread
  • Consistently applies --async to all gcloud delete and reset commands
  • Updates log messages to reflect the async nature — good for operational debugging
  • The _best_effort_delete_* methods are natural fits for async since they already swallow "not found" errors

One thing to watch (non-blocking):

  • Left an inline comment about a potential race in vm_lifecycle.py where a controller VM is terminated then immediately recreated with the same name. With --async, the old VM may still be deleting when create_vm fires. Likely fine in practice but worth monitoring.

Minor observations:

  • The error handling after subprocess.run still checks returncode and stderr, which is correct — --async returns 0 on successful submission, and errors (auth, bad args, etc.) still surface immediately
  • reboot() at line 380 uses check=True while all the terminate() methods do manual returncode checking. This is pre-existing inconsistency, not introduced by this PR

f"--project={self._project_id}",
f"--zone={self._zone}",
"--quiet",
"--async",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Potential race condition worth noting: In vm_lifecycle.py:340, an unhealthy controller VM is terminated then immediately recreated with the same deterministic name (iris-controller-{label_prefix}). With --async, the old VM may still be mid-deletion when create_vm runs, which could cause a 409 CONFLICT from GCP.

This is probably fine in practice (GCP typically allows recreation once the delete operation is accepted), but worth keeping an eye on in the field. If it surfaces, a short poll-until-gone or a unique suffix on the name would fix it.

@ravwojdyla
Copy link
Copy Markdown
Contributor

dup: #3684

@rjpower rjpower merged commit d1d7fdd into main Mar 14, 2026
23 checks passed
@rjpower rjpower deleted the claude/issue-3678-20260314-2209 branch March 14, 2026 22:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[iris] Scheduling loop bottlenecks: _building_counts (600ms) and DB lock contention

2 participants