[iris] Reclaim dead cloud slices and purge orphan slice rows at boot #5720
Conversation
`list_all_slices` now returns (handle, state) pairs across every cloud state, and `restore_autoscaler_state` partitions on the state: live slices feed the autoscaler; dead ones are terminated asynchronously. Discarded checkpoint slices are also deleted from the slices table, so SQLite no longer accumulates ghost rows that the autoscaler cannot see.
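A minimal sketch of that recovery flow, assuming hypothetical `adopt` and `delete_slice_row` helpers and stand-in enum members (the real `CloudSliceState` and `ListedSlice` live in the Iris codebase):

```python
import threading
from dataclasses import dataclass
from enum import Enum, auto

class CloudSliceState(Enum):  # stand-in for the real Iris enum
    READY = auto()
    CREATING = auto()
    DELETING = auto()
    FAILED = auto()
    UNKNOWN = auto()

LIVE_STATES = {CloudSliceState.READY, CloudSliceState.CREATING}

@dataclass
class ListedSlice:  # the (handle, state) pair list_all_slices now returns
    handle: object
    state: CloudSliceState

def restore_autoscaler_state(provider, autoscaler, db) -> None:
    for listed in provider.list_all_slices():
        if listed.state in LIVE_STATES:
            autoscaler.adopt(listed.handle)  # live slices feed the autoscaler
        else:
            # Dead slices are terminated off the boot path...
            threading.Thread(target=listed.handle.terminate, daemon=True).start()
            # ...and their checkpoint rows are purged so SQLite does not keep
            # ghost rows the autoscaler can never see again.
            db.delete_slice_row(listed.handle.slice_id)
```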
Claude finished @rjpower's task in 7m 10s.
Code review: No issues found. Checked for bugs and CLAUDE.md compliance. Scope reviewed: 9 files, +120/-88, single commit.
Notes from the review (informational, not blocking):
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e78891c48c
```python
def _reclaim_dead_slice(handle: SliceHandle, state: CloudSliceState) -> None:
    logger.info("Reclaiming dead slice %s (state=%s, zone=%s)", handle.slice_id, state, handle.zone)
    handle.terminate()
```
Handle dead-slice termination failures gracefully
Startup recovery now calls terminate() for every non-live slice, but this call is uncaught; if the provider raises (for example on transient cloud API/auth/quota failures), restore_autoscaler_state aborts and controller boot fails. This turns best-effort cleanup into a hard startup dependency and can take the control plane down whenever one stale slice cannot be deleted. Please catch/log terminate errors here and continue restoring autoscaler state.
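A sketch of that fix, assuming the `InfraError` type mentioned in the follow-up commit (`SliceHandle`, `logger`, and `InfraError` come from the surrounding module; whether to also catch broader exceptions is a judgment call for the codebase):

```python
def _reclaim_dead_slice(handle: SliceHandle, state: CloudSliceState) -> None:
    logger.info("Reclaiming dead slice %s (state=%s, zone=%s)", handle.slice_id, state, handle.zone)
    try:
        handle.terminate()
    except InfraError:
        # Best-effort cleanup: one stale slice failing to delete (transient
        # cloud API/auth/quota error) must not abort controller boot.
        logger.exception("Failed to terminate dead slice %s; continuing recovery", handle.slice_id)
```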
yonromai left a comment:
🤖 Approved with two net-new inline findings. I did not repeat the existing Codex comment on handle.terminate() failures aborting restore. The remaining issues are both in the same boundary: some non-live GCP resources still never make it through list_all_slices(), so recovery can purge Iris state without reclaiming the cloud resource.
Generated with Codex.
```diff
-def list_all_slices(self) -> list[GcpSliceHandle | GcpVmSliceHandle]:
-    """List all autoscaler-managed slices for this cluster.
+def list_all_slices(self) -> list[ListedSlice]:
+    """List every autoscaler-managed slice for this cluster, regardless of cloud state.
```
🤖 P1: list_all_slices() still filters out stopped VM slices, so restore purges the DB row but never deletes the cloud VM.
The new contract says this method lists every managed slice regardless of cloud state, and restore_autoscaler_state() relies on that to call terminate() for dead slices. The TPU path now returns DELETING/UNKNOWN entries, but the VM path still drops anything outside PROVISIONING/STAGING/RUNNING at lines 731-733. For a GCE-backed slice whose instance is STOPPING or TERMINATED, boot recovery will treat the checkpoint row as missing and delete it from SQLite, but it never gets a GcpVmSliceHandle, so vm_delete() is never called and the managed VM is orphaned. The updated test_gcp_list_slices_skips_inactive_vm_instances currently locks in that old behavior. Please include inactive VM-backed slices in list_all_slices() with _VM_STATE_MAP state, while keeping list_slices() filtered for active discovery.
Generated with Codex.
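A sketch of the suggested shape, where `_list_managed_instances` and `_make_vm_handle` are hypothetical stand-ins for however the VM path enumerates instances and builds handles, and the live-state set in `list_slices()` is an assumption:

```python
_ACTIVE_VM_STATUSES = {"PROVISIONING", "STAGING", "RUNNING"}

def list_all_slices(self) -> list[ListedSlice]:
    listed = []
    for instance in self._list_managed_instances():
        # Old behavior: `continue` on anything outside _ACTIVE_VM_STATUSES,
        # which orphaned STOPPING/TERMINATED VMs. Instead, surface every
        # instance with its mapped state so _reclaim_dead_slice() can reach
        # vm_delete() for dead ones.
        state = _VM_STATE_MAP.get(instance.status, CloudSliceState.UNKNOWN)
        listed.append(ListedSlice(handle=self._make_vm_handle(instance), state=state))
    return listed

def list_slices(self) -> list[ListedSlice]:
    # Live discovery keeps the old filter (assumed live set shown here).
    live = {CloudSliceState.READY, CloudSliceState.CREATING}
    return [s for s in self.list_all_slices() if s.state in live]
```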
```python
                _ssh_config=self._ssh_config,
                _is_queued_resource=True,
            )
            listed.append(ListedSlice(handle=handle, state=CloudSliceState.CREATING))
```
🤖 P2: failed queued resources are skipped before the boot reconciler can reclaim them.
This branch still drops queued resources in FAILED, SUSPENDED, or DELETING, but those are exactly the non-live reserved TPU resources the new recovery path is meant to clean up. Because they are filtered out before a GcpSliceHandle(_is_queued_resource=True) is created, _reclaim_dead_slice() never calls queued_resource_delete(), so a failed reserved request can stay in GCP after restart while Iris forgets about it. Please return these queued resources as ListedSlice(..., state=FAILED/DELETING/UNKNOWN as appropriate) instead of continuing past them; the existing live QUEUED/PROVISIONING/WAITING_FOR_RESOURCES cases can remain CREATING.
Generated with Codex.
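A sketch of that branch after the fix, using the `_QR_STATE_MAP` name the follow-up commit introduces and assuming `qr` is the queued-resource object in the surrounding loop (the SUSPENDED-to-FAILED mapping is an assumed choice):

```python
# Assumed mapping from queued-resource states to reclaimable slice states.
_QR_STATE_MAP = {
    "FAILED": CloudSliceState.FAILED,
    "SUSPENDED": CloudSliceState.FAILED,
    "DELETING": CloudSliceState.DELETING,
}
_LIVE_QR_STATES = ("QUEUED", "PROVISIONING", "WAITING_FOR_RESOURCES")

if qr.state in _LIVE_QR_STATES:
    state = CloudSliceState.CREATING  # live cases keep the old behavior
else:
    # Previously skipped with `continue`: now surfaced so _reclaim_dead_slice()
    # can reach queued_resource_delete() for failed reserved requests.
    state = _QR_STATE_MAP.get(qr.state, CloudSliceState.UNKNOWN)
handle = GcpSliceHandle(
    # ...same constructor fields as in the diff above...
    _ssh_config=self._ssh_config,
    _is_queued_resource=True,
)
listed.append(ListedSlice(handle=handle, state=state))
```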
- `_reclaim_dead_slice` now runs `handle.terminate()` on a daemon thread and swallows InfraError, so a transient delete failure cannot abort boot recovery.
- `list_all_slices()` surfaces every managed VM and queued resource regardless of cloud state: stopped/terminated VMs flow through with `DELETING`, and failed/suspended/deleting queued resources are mapped via a new `_QR_STATE_MAP` so the boot reconciler can reclaim them. `list_slices()` (live discovery) still filters to active states.
- Flip `test_gcp_list_slices_skips_inactive_vm_instances` into `test_gcp_list_all_slices_includes_terminated_vm_instances` to lock in the new contract (sketched below).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
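A sketch of the flipped test's intent; the `fake_gcp` fixture and `add_vm_instance` helper are hypothetical, since the real test builds its GCP fakes however the existing suite does:

```python
def test_gcp_list_all_slices_includes_terminated_vm_instances(fake_gcp):
    # One managed, VM-backed slice whose instance has already been TERMINATED.
    fake_gcp.add_vm_instance(name="iris-slice-0", status="TERMINATED")

    listed = fake_gcp.platform.list_all_slices()
    # The dead VM is surfaced with its mapped state instead of being dropped...
    assert [s.state for s in listed] == [CloudSliceState.DELETING]
    # ...while live discovery still filters it out.
    assert fake_gcp.platform.list_slices() == []
```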