[iris] Reclaim dead cloud slices and purge orphan slice rows at boot #5720
Conversation
`list_all_slices` now returns (handle, state) pairs across every cloud state, and `restore_autoscaler_state` partitions on the state: live slices feed the autoscaler; dead ones are terminated asynchronously. Discarded checkpoint slices are also deleted from the slices table, so SQLite no longer accumulates ghost rows that the autoscaler cannot see.
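A minimal sketch of that recovery flow, assuming hypothetical `adopt` and `delete_slice_row` helpers and stand-in enum members (the real `CloudSliceState` and `ListedSlice` live in the Iris codebase):

```python
import threading
from dataclasses import dataclass
from enum import Enum, auto

class CloudSliceState(Enum):  # stand-in for the real Iris enum
    READY = auto()
    CREATING = auto()
    DELETING = auto()
    FAILED = auto()
    UNKNOWN = auto()

LIVE_STATES = {CloudSliceState.READY, CloudSliceState.CREATING}

@dataclass
class ListedSlice:  # the (handle, state) pair list_all_slices now returns
    handle: object
    state: CloudSliceState

def restore_autoscaler_state(provider, autoscaler, db) -> None:
    for listed in provider.list_all_slices():
        if listed.state in LIVE_STATES:
            autoscaler.adopt(listed.handle)  # live slices feed the autoscaler
        else:
            # Dead slices are terminated off the boot path...
            threading.Thread(target=listed.handle.terminate, daemon=True).start()
            # ...and their checkpoint rows are purged so SQLite does not keep
            # ghost rows the autoscaler can never see again.
            db.delete_slice_row(listed.handle.slice_id)
```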
Claude finished @rjpower's task in 7m 10s.
Code review: No issues found. Checked for bugs and CLAUDE.md compliance. Scope reviewed: 9 files, +120/-88, single commit.
Notes from the review (informational, not blocking):
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e78891c48c
```python
def _reclaim_dead_slice(handle: SliceHandle, state: CloudSliceState) -> None:
    logger.info("Reclaiming dead slice %s (state=%s, zone=%s)", handle.slice_id, state, handle.zone)
    handle.terminate()
```
Handle dead-slice termination failures gracefully
Startup recovery now calls terminate() for every non-live slice, but this call is uncaught; if the provider raises (for example on transient cloud API/auth/quota failures), restore_autoscaler_state aborts and controller boot fails. This turns best-effort cleanup into a hard startup dependency and can take the control plane down whenever one stale slice cannot be deleted. Please catch/log terminate errors here and continue restoring autoscaler state.
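A sketch of that fix, assuming the `InfraError` type mentioned in the follow-up commit (`SliceHandle`, `logger`, and `InfraError` come from the surrounding module; whether to also catch broader exceptions is a judgment call for the codebase):

```python
def _reclaim_dead_slice(handle: SliceHandle, state: CloudSliceState) -> None:
    logger.info("Reclaiming dead slice %s (state=%s, zone=%s)", handle.slice_id, state, handle.zone)
    try:
        handle.terminate()
    except InfraError:
        # Best-effort cleanup: one stale slice failing to delete (transient
        # cloud API/auth/quota error) must not abort controller boot.
        logger.exception("Failed to terminate dead slice %s; continuing recovery", handle.slice_id)
```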
yonromai left a comment:
🤖 Approved with two net-new inline findings. I did not repeat the existing Codex comment on handle.terminate() failures aborting restore. The remaining issues are both in the same boundary: some non-live GCP resources still never make it through list_all_slices(), so recovery can purge Iris state without reclaiming the cloud resource.
Generated with Codex.
```diff
-def list_all_slices(self) -> list[GcpSliceHandle | GcpVmSliceHandle]:
-    """List all autoscaler-managed slices for this cluster.
+def list_all_slices(self) -> list[ListedSlice]:
+    """List every autoscaler-managed slice for this cluster, regardless of cloud state.
```
🤖 P1: list_all_slices() still filters out stopped VM slices, so restore purges the DB row but never deletes the cloud VM.
The new contract says this method lists every managed slice regardless of cloud state, and restore_autoscaler_state() relies on that to call terminate() for dead slices. The TPU path now returns DELETING/UNKNOWN entries, but the VM path still drops anything outside PROVISIONING/STAGING/RUNNING at lines 731-733. For a GCE-backed slice whose instance is STOPPING or TERMINATED, boot recovery will treat the checkpoint row as missing and delete it from SQLite, but it never gets a GcpVmSliceHandle, so vm_delete() is never called and the managed VM is orphaned. The updated test_gcp_list_slices_skips_inactive_vm_instances currently locks in that old behavior. Please include inactive VM-backed slices in list_all_slices() with _VM_STATE_MAP state, while keeping list_slices() filtered for active discovery.
Generated with Codex.
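A sketch of the suggested shape, where `_list_managed_instances` and `_make_vm_handle` are hypothetical stand-ins for however the VM path enumerates instances and builds handles, and the live-state set in `list_slices()` is an assumption:

```python
_ACTIVE_VM_STATUSES = {"PROVISIONING", "STAGING", "RUNNING"}

def list_all_slices(self) -> list[ListedSlice]:
    listed = []
    for instance in self._list_managed_instances():
        # Old behavior: `continue` on anything outside _ACTIVE_VM_STATUSES,
        # which orphaned STOPPING/TERMINATED VMs. Instead, surface every
        # instance with its mapped state so _reclaim_dead_slice() can reach
        # vm_delete() for dead ones.
        state = _VM_STATE_MAP.get(instance.status, CloudSliceState.UNKNOWN)
        listed.append(ListedSlice(handle=self._make_vm_handle(instance), state=state))
    return listed

def list_slices(self) -> list[ListedSlice]:
    # Live discovery keeps the old filter (assumed live set shown here).
    live = {CloudSliceState.READY, CloudSliceState.CREATING}
    return [s for s in self.list_all_slices() if s.state in live]
```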
```python
                _ssh_config=self._ssh_config,
                _is_queued_resource=True,
            )
            listed.append(ListedSlice(handle=handle, state=CloudSliceState.CREATING))
```
🤖 P2: failed queued resources are skipped before the boot reconciler can reclaim them.
This branch still drops queued resources in FAILED, SUSPENDED, or DELETING, but those are exactly the non-live reserved TPU resources the new recovery path is meant to clean up. Because they are filtered out before a GcpSliceHandle(_is_queued_resource=True) is created, _reclaim_dead_slice() never calls queued_resource_delete(), so a failed reserved request can stay in GCP after restart while Iris forgets about it. Please return these queued resources as ListedSlice(..., state=FAILED/DELETING/UNKNOWN as appropriate) instead of continuing past them; the existing live QUEUED/PROVISIONING/WAITING_FOR_RESOURCES cases can remain CREATING.
Generated with Codex.
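A sketch of that branch after the fix, using the `_QR_STATE_MAP` name the follow-up commit introduces and assuming `qr` is the queued-resource object in the surrounding loop (the SUSPENDED-to-FAILED mapping is an assumed choice):

```python
# Assumed mapping from queued-resource states to reclaimable slice states.
_QR_STATE_MAP = {
    "FAILED": CloudSliceState.FAILED,
    "SUSPENDED": CloudSliceState.FAILED,
    "DELETING": CloudSliceState.DELETING,
}
_LIVE_QR_STATES = ("QUEUED", "PROVISIONING", "WAITING_FOR_RESOURCES")

if qr.state in _LIVE_QR_STATES:
    state = CloudSliceState.CREATING  # live cases keep the old behavior
else:
    # Previously skipped with `continue`: now surfaced so _reclaim_dead_slice()
    # can reach queued_resource_delete() for failed reserved requests.
    state = _QR_STATE_MAP.get(qr.state, CloudSliceState.UNKNOWN)
handle = GcpSliceHandle(
    # ...same constructor fields as in the diff above...
    _ssh_config=self._ssh_config,
    _is_queued_resource=True,
)
listed.append(ListedSlice(handle=handle, state=state))
```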
- `_reclaim_dead_slice` now runs `handle.terminate()` on a daemon thread and swallows InfraError, so a transient delete failure cannot abort boot recovery.
- `list_all_slices()` surfaces every managed VM and queued resource regardless of cloud state: stopped/terminated VMs flow through with `DELETING`, and failed/suspended/deleting queued resources are mapped via a new `_QR_STATE_MAP` so the boot reconciler can reclaim them. `list_slices()` (live discovery) still filters to active states.
- Flip `test_gcp_list_slices_skips_inactive_vm_instances` into `test_gcp_list_all_slices_includes_terminated_vm_instances` to lock in the new contract (sketched below).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
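A sketch of the flipped test's intent; the `fake_gcp` fixture and `add_vm_instance` helper are hypothetical, since the real test builds its GCP fakes however the existing suite does:

```python
def test_gcp_list_all_slices_includes_terminated_vm_instances(fake_gcp):
    # One managed, VM-backed slice whose instance has already been TERMINATED.
    fake_gcp.add_vm_instance(name="iris-slice-0", status="TERMINATED")

    listed = fake_gcp.platform.list_all_slices()
    # The dead VM is surfaced with its mapped state instead of being dropped...
    assert [s.state for s in listed] == [CloudSliceState.DELETING]
    # ...while live discovery still filters it out.
    assert fake_gcp.platform.list_slices() == []
```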