Skip to content

iris: add --async to gcloud delete commands to unblock heartbeat thread#3684

Closed
ravwojdyla-agent wants to merge 2 commits intomainfrom
agent/20260314-fix-3678
Closed

iris: add --async to gcloud delete commands to unblock heartbeat thread#3684
ravwojdyla-agent wants to merge 2 commits intomainfrom
agent/20260314-fix-3678

Conversation

@ravwojdyla-agent
Copy link
Copy Markdown
Contributor

Summary

  • Adds --async flag to all 5 gcloud ... delete call sites in gcp.py (3 terminate() methods + 2 _best_effort_delete_* helpers)
  • Fixes item 3 from [iris] Scheduling loop bottlenecks: _building_counts (600ms) and DB lock contention #3678: synchronous gcloud tpu-vm delete was blocking heartbeat/autoscaler threads for 10-18s per call
  • Validation errors (not-found, auth) are still caught synchronously; only the wait for actual resource destruction is skipped
  • Does not change reboot() — only delete operations

Test plan

  • Two new tests verify --async is present in all delete commands (test_gcp_terminate_passes_async_flag, test_gcp_best_effort_delete_passes_async_flag)
  • All 57 platform tests pass
  • All 1085 iris tests pass (2 flaky chaos tests pass on re-run)
  • Pre-commit clean

Fixes #3678

🤖 Generated with Claude Code

Fixes #3678 (item 3). All five gcloud delete call sites in gcp.py were
synchronous, blocking the calling thread for 10-18s (TPU) or 3-5s (GCE).
The heartbeat thread was observed stuck on these calls, stalling task
dispatches. Adding --async tells gcloud to return immediately after
initiating the deletion; validation errors are still caught synchronously.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
assert listed_by_id[handle.slice_id].created_at.epoch_ms() == expected_epoch_ms


def test_gcp_terminate_passes_async_flag():
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove this test, we either added the flag or we didn't

assert "--async" in cmd, f"Missing --async in delete command: {cmd}"


def test_gcp_best_effort_delete_passes_async_flag():
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@rjpower
Copy link
Copy Markdown
Collaborator

rjpower commented Mar 14, 2026

@claude remove dumb tests

@ravwojdyla
Copy link
Copy Markdown
Contributor

@claude address the comments above please

Copy link
Copy Markdown
Collaborator

@rjpower rjpower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ravwojdyla you can pick :)

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Mar 14, 2026

Claude finished @rjpower's task in 2m 9s —— View job


Removing flagged tests

  • Read test file to identify the two tests to remove
  • Remove test_gcp_terminate_passes_async_flag and test_gcp_best_effort_delete_passes_async_flag
  • Remove recorded_commands field from FakeGcloud (no longer used)
  • Pre-commit clean ✅
  • 55 platform tests pass ✅ (55/55, down from 57)
  • Commit and push

Removed both --async assertion tests and the supporting recorded_commands tracking in FakeGcloud, per review feedback.
· Branch

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Mar 14, 2026

Claude finished @ravwojdyla's task in 2m 5s —— View job


Addressing review comments

  • Gather context
  • Remove test_gcp_terminate_passes_async_flag and test_gcp_best_effort_delete_passes_async_flag tests
  • Remove recorded_commands tracking from FakeGcloud (no longer used)
  • Pre-commit clean, all 55 platform tests pass
  • Changes already pushed

The two --async flag assertion tests and the recorded_commands field in FakeGcloud have been removed per @rjpower's review feedback. The branch already has these changes in commit 3d3172cc.

The --async flag is either present in the code or it isn't — dedicated
tests for its presence add no value.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ravwojdyla ravwojdyla enabled auto-merge (squash) March 14, 2026 22:30
@ravwojdyla ravwojdyla disabled auto-merge March 14, 2026 22:41
@ravwojdyla ravwojdyla closed this Mar 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[iris] Scheduling loop bottlenecks: _building_counts (600ms) and DB lock contention

3 participants