
[iris] Reclaim dead cloud slices and purge orphan slice rows at boot #5720

Merged
rjpower merged 2 commits into main from iris-purge-orphan-slice-rows-on-restore on May 14, 2026

@rjpower (Collaborator) commented May 14, 2026

list_all_slices now returns (handle, state) pairs across every cloud state, and restore_autoscaler_state partitions on it: live slices feed the autoscaler, dead ones are async-terminated. Discarded checkpoint slices are also deleted from the slices table so SQLite no longer accumulates ghost rows that the autoscaler cannot see.
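A minimal sketch of the partition described above, under stated assumptions: `FakeHandle`, `partition_listed`, and `orphan_row_ids` are hypothetical names, the enum is simplified, and treating "discarded" as "checkpoint rows with no live cloud slice" is an interpretation of this description rather than the actual Iris logic. The live set mirrors the `_LIVE_CLOUD_STATES` value quoted in the review notes on this PR.

```python
from dataclasses import dataclass
from enum import Enum, auto


class CloudSliceState(Enum):
    CREATING = auto()
    READY = auto()
    REPAIRING = auto()
    DELETING = auto()
    FAILED = auto()
    UNKNOWN = auto()


# States in which a slice is handed to the autoscaler.
_LIVE_CLOUD_STATES = frozenset(
    {CloudSliceState.CREATING, CloudSliceState.READY, CloudSliceState.REPAIRING}
)


@dataclass(frozen=True)
class FakeHandle:
    """Hypothetical stand-in for a provider slice handle."""

    slice_id: str


@dataclass(frozen=True)
class ListedSlice:
    handle: FakeHandle
    state: CloudSliceState


def partition_listed(listed):
    """Split listed slices into (live, dead) by cloud state."""
    live = [s for s in listed if s.state in _LIVE_CLOUD_STATES]
    dead = [s for s in listed if s.state not in _LIVE_CLOUD_STATES]
    return live, dead


def orphan_row_ids(checkpoint_ids, live):
    """Checkpoint slice IDs with no live cloud counterpart (assumed purge set)."""
    live_ids = {s.handle.slice_id for s in live}
    return sorted(set(checkpoint_ids) - live_ids)


listed = [
    ListedSlice(FakeHandle("slice-a"), CloudSliceState.READY),
    ListedSlice(FakeHandle("slice-b"), CloudSliceState.DELETING),
]
live, dead = partition_listed(listed)
orphans = orphan_row_ids(["slice-a", "slice-b", "slice-c"], live)
```

Here `slice-a` would feed the autoscaler, `slice-b` would be terminated, and the rows for `slice-b` and `slice-c` would be purged from the table.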

@rjpower rjpower added the agent-generated Created by automation/agent label May 14, 2026
@rjpower rjpower requested a review from yonromai May 14, 2026 00:26
@rjpower rjpower enabled auto-merge (squash) May 14, 2026 00:26
@claude (Bot, Contributor) commented May 14, 2026

Claude finished @rjpower's task in 7m 10s.


Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Scope reviewed: 9 files, +120/-88, single commit e78891c.

Notes from the review (informational, not blocking):

  • list_all_slices return-type change to list[ListedSlice] is propagated consistently through protocols.py, both providers, default_stop_all, and tests.
  • _LIVE_CLOUD_STATES = {CREATING, READY, REPAIRING} lines up with _TPU_STATE_MAP and _VM_STATE_MAP; VMs are still pre-filtered to active statuses in gcp/workers.py so dead VMs never reach the reclaim path (intentional — terminated VMs are detected via "missing from cloud" + purge_persisted_slice_rows).
  • _reclaim_dead_slice only runs from restore_autoscaler_state, which is called once at controller boot via runtime.restore_from_db. GcpSliceHandle.terminate() defaults to wait=False, matching the PR description's "async-terminated" wording.
  • ScalingGroupRestoreResult.discarded_count → discarded_slice_ids is updated at both call sites (recovery.py, test_snapshot_reconciliation.py).
  • purge_persisted_slice_rows deletes by slice_id.in_(...) without a scale_group filter. Slice IDs include a random suffix (generate_slice_suffix) so collisions across groups are not a realistic concern.
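To illustrate the last note, here is a small SQLite sketch of deleting rows by slice ID alone. The table layout, the `purge_rows` helper, and the sample IDs are hypothetical; the real `purge_persisted_slice_rows` and its schema are not shown in this thread.

```python
import sqlite3


def purge_rows(conn: sqlite3.Connection, slice_ids: list[str]) -> int:
    """Delete rows whose slice_id is in slice_ids; return the number deleted."""
    if not slice_ids:
        return 0  # nothing to delete
    placeholders = ",".join("?" for _ in slice_ids)
    # No scale_group filter: the slice ID alone identifies the row.
    cur = conn.execute(
        f"DELETE FROM slices WHERE slice_id IN ({placeholders})", slice_ids
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE slices (slice_id TEXT PRIMARY KEY, scale_group TEXT)")
conn.executemany(
    "INSERT INTO slices VALUES (?, ?)",
    [("tpu-a-1f3c", "group-a"), ("tpu-b-9e2d", "group-b"), ("tpu-a-77aa", "group-a")],
)
deleted = purge_rows(conn, ["tpu-a-1f3c", "tpu-b-9e2d"])
remaining = [row[0] for row in conn.execute("SELECT slice_id FROM slices")]
```

Because each ID carries its own random suffix, a single `IN (...)` delete can span scale groups without a second predicate.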


@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e78891c48c



def _reclaim_dead_slice(handle: SliceHandle, state: CloudSliceState) -> None:
    logger.info("Reclaiming dead slice %s (state=%s, zone=%s)", handle.slice_id, state, handle.zone)
    handle.terminate()

P1: Handle dead-slice termination failures gracefully

Startup recovery now calls terminate() for every non-live slice, but this call is uncaught; if the provider raises (for example on transient cloud API/auth/quota failures), restore_autoscaler_state aborts and controller boot fails. This turns best-effort cleanup into a hard startup dependency and can take the control plane down whenever one stale slice cannot be deleted. Please catch/log terminate errors here and continue restoring autoscaler state.
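One possible shape for the suggested fix, as a sketch only: `FlakyHandle` and `reclaim_all` are hypothetical, and the broad `except Exception` stands in for whatever error type the provider actually raises.

```python
import logging

logger = logging.getLogger("reclaim-sketch")


class FlakyHandle:
    """Hypothetical handle whose terminate() can be made to fail."""

    def __init__(self, slice_id, fail=False):
        self.slice_id = slice_id
        self.fail = fail
        self.terminated = False

    def terminate(self):
        if self.fail:
            raise RuntimeError("simulated transient cloud API failure")
        self.terminated = True


def reclaim_all(handles):
    """Try to terminate every handle; return the IDs whose termination failed."""
    failed = []
    for handle in handles:
        try:
            handle.terminate()
        except Exception:
            # Log and keep going so one stale slice cannot abort the restore.
            logger.exception("Failed to reclaim dead slice %s", handle.slice_id)
            failed.append(handle.slice_id)
    return failed


handles = [FlakyHandle("slice-a", fail=True), FlakyHandle("slice-b")]
failed = reclaim_all(handles)
```

The failure on `slice-a` is logged and recorded, and `slice-b` is still terminated.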


@rjpower rjpower disabled auto-merge May 14, 2026 00:37
@yonromai (Contributor) left a comment

🤖 Approved with two net-new inline findings. I did not repeat the existing Codex comment on handle.terminate() failures aborting restore. The remaining issues are both in the same boundary: some non-live GCP resources still never make it through list_all_slices(), so recovery can purge Iris state without reclaiming the cloud resource.

Generated with Codex.

-    def list_all_slices(self) -> list[GcpSliceHandle | GcpVmSliceHandle]:
-        """List all autoscaler-managed slices for this cluster.
+    def list_all_slices(self) -> list[ListedSlice]:
+        """List every autoscaler-managed slice for this cluster, regardless of cloud state.

🤖 P1: list_all_slices() still filters out stopped VM slices, so restore purges the DB row but never deletes the cloud VM.

The new contract says this method lists every managed slice regardless of cloud state, and restore_autoscaler_state() relies on that to call terminate() for dead slices. The TPU path now returns DELETING/UNKNOWN entries, but the VM path still drops anything outside PROVISIONING/STAGING/RUNNING at lines 731-733. For a GCE-backed slice whose instance is STOPPING or TERMINATED, boot recovery will treat the checkpoint row as missing and delete it from SQLite, but it never gets a GcpVmSliceHandle, so vm_delete() is never called and the managed VM is orphaned. The updated test_gcp_list_slices_skips_inactive_vm_instances currently locks in that old behavior. Please include inactive VM-backed slices in list_all_slices() with _VM_STATE_MAP state, while keeping list_slices() filtered for active discovery.
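The split being asked for could look roughly like the sketch below. The status-to-state mapping is an assumption (the real `_VM_STATE_MAP` is not shown in this thread), state names are plain strings for brevity, and both function names are hypothetical.

```python
# GCE instance statuses treated as active, per the filter described above.
_ACTIVE_VM_STATUSES = frozenset({"PROVISIONING", "STAGING", "RUNNING"})

# Assumed mapping from GCE instance status to a cloud slice state name.
_VM_STATE_MAP = {
    "PROVISIONING": "CREATING",
    "STAGING": "CREATING",
    "RUNNING": "READY",
    "STOPPING": "DELETING",
    "TERMINATED": "DELETING",
}


def list_all_vm_slices(instances):
    """Boot reconciliation view: every instance, with a mapped state."""
    return [
        (inst["name"], _VM_STATE_MAP.get(inst["status"], "UNKNOWN"))
        for inst in instances
    ]


def list_active_vm_slices(instances):
    """Live discovery view: only instances in an active status."""
    return [inst["name"] for inst in instances if inst["status"] in _ACTIVE_VM_STATUSES]


instances = [
    {"name": "vm-a", "status": "RUNNING"},
    {"name": "vm-b", "status": "TERMINATED"},
    {"name": "vm-c", "status": "SUSPENDED"},
]
all_slices = list_all_vm_slices(instances)
active = list_active_vm_slices(instances)
```

With this split, `vm-b` and `vm-c` stay invisible to live discovery but still reach the reclaim path with a non-live state.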

Generated with Codex.

        _ssh_config=self._ssh_config,
        _is_queued_resource=True,
    )
    listed.append(ListedSlice(handle=handle, state=CloudSliceState.CREATING))

🤖 P2: failed queued resources are skipped before the boot reconciler can reclaim them.

This branch still drops queued resources in FAILED, SUSPENDED, or DELETING, but those are exactly the non-live reserved TPU resources the new recovery path is meant to clean up. Because they are filtered out before a GcpSliceHandle(_is_queued_resource=True) is created, _reclaim_dead_slice() never calls queued_resource_delete(), so a failed reserved request can stay in GCP after restart while Iris forgets about it. Please return these queued resources as ListedSlice(..., state=FAILED/DELETING/UNKNOWN as appropriate) instead of continuing past them; the existing live QUEUED/PROVISIONING/WAITING_FOR_RESOURCES cases can remain CREATING.
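A sketch of what "return instead of continue" might look like. The map name and contents are guesses: the live states map to `CREATING` as stated above, while mapping `SUSPENDED` to `UNKNOWN` is purely an assumption. `classify_queued_resources` is a hypothetical helper and states are plain strings.

```python
# Assumed mapping from queued-resource state to a cloud slice state name.
_QR_STATE_MAP = {
    "QUEUED": "CREATING",
    "PROVISIONING": "CREATING",
    "WAITING_FOR_RESOURCES": "CREATING",
    "FAILED": "FAILED",
    "DELETING": "DELETING",
    "SUSPENDED": "UNKNOWN",  # assumption: no dedicated state for suspended requests
}


def classify_queued_resources(resources):
    """Return (name, state) for every queued resource, including non-live ones."""
    listed = []
    for name, qr_state in resources:
        # No early `continue` for FAILED/SUSPENDED/DELETING: they are listed
        # with a non-live state so the caller can reclaim them.
        listed.append((name, _QR_STATE_MAP.get(qr_state, "UNKNOWN")))
    return listed


resources = [
    ("qr-a", "WAITING_FOR_RESOURCES"),
    ("qr-b", "FAILED"),
    ("qr-c", "SUSPENDED"),
]
listed = classify_queued_resources(resources)
```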

Generated with Codex.

- `_reclaim_dead_slice` now runs `handle.terminate()` on a daemon thread and
  swallows InfraError so a transient delete failure cannot abort boot recovery.
- `list_all_slices()` surfaces every managed VM and queued resource regardless
  of cloud state: stopped/terminated VMs flow through with `DELETING`, and
  failed/suspended/deleting queued resources are mapped via a new
  `_QR_STATE_MAP` so the boot reconciler can reclaim them. `list_slices()`
  (live discovery) still filters to active states.
- Flip `test_gcp_list_slices_skips_inactive_vm_instances` into
  `test_gcp_list_all_slices_includes_terminated_vm_instances` to lock in the
  new contract.
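For readers skimming the thread, the first bullet can be sketched as follows. This is not the merged code: `InfraError` is redefined here as a stand-in for the project's exception type, and `FailingHandle` and `reclaim_in_background` are hypothetical names.

```python
import logging
import threading

logger = logging.getLogger("reclaim-sketch")


class InfraError(Exception):
    """Stand-in for the project's infrastructure error type."""


class FailingHandle:
    """Hypothetical handle whose delete always fails."""

    slice_id = "slice-a"

    def __init__(self):
        self.calls = 0

    def terminate(self):
        self.calls += 1
        raise InfraError("simulated delete failure")


def reclaim_in_background(handle) -> threading.Thread:
    """Run handle.terminate() on a daemon thread, logging InfraError instead of raising."""

    def _run():
        try:
            handle.terminate()
        except InfraError:
            logger.warning("Could not reclaim %s", handle.slice_id, exc_info=True)

    thread = threading.Thread(target=_run, name=f"reclaim-{handle.slice_id}", daemon=True)
    thread.start()
    return thread


handle = FailingHandle()
thread = reclaim_in_background(handle)
thread.join(timeout=5)  # joined only so the example is deterministic
```

A daemon thread does not block interpreter shutdown, so an in-flight delete can be cut short if the process exits before it completes.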

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@rjpower rjpower enabled auto-merge (squash) May 14, 2026 17:31
@rjpower rjpower merged commit 2f52e4c into main May 14, 2026
30 of 31 checks passed
@rjpower rjpower deleted the iris-purge-orphan-slice-rows-on-restore branch May 14, 2026 17:36
2 participants