[iris] Add slice lifecycle state machine for autoscaler transitions#4816
Conversation
Claude finished @rjpower's task in 3m 58s — Review of PR #4816: Slice lifecycle state machine
**Overall**

Good design — the issues below are the main things to resolve.

**Issues**

1. **Behavioral regression: bootstrap failures in `refresh()` no longer trigger backoff.** The old `refresh()` called `group.record_failure()` unconditionally on cloud failure; the new handler gates `RECORD_GROUP_FAILURE` on `is_short_lived`, which `refresh()` never passes. If this is intentional (e.g., only applying backoff for short-lived worker failures), document it. If not, `refresh()` needs to compute and pass `is_short_lived`.
2. **Inconsistent side-effect execution** — the worker-failure caller executes only some of the side effects the transition table returns.
3. **Unused events and transitions (P3)**
4. **Side effects computed outside the lock (P3)** — in `dispatch()`, side-effect functions run after the lock is released.

**Smaller notes**

**Summary**

The core state machine design is solid and the audit trail is a clear win for debuggability. The main concern is the behavioral change around backoff (#1 above) — please clarify whether it's intentional or needs a fix before merging.
```python
def _on_bootstrap_failed(state: SliceState, ctx: dict[str, Any]) -> list[SliceSideEffect]:
    effects: list[SliceSideEffect] = [
        SliceSideEffect(SliceSideEffectKind.DEREGISTER_WORKERS),
        SliceSideEffect(SliceSideEffectKind.TERMINATE_SLICE),
    ]
    if ctx.get("is_short_lived"):
        effects.append(SliceSideEffect(SliceSideEffectKind.RECORD_GROUP_FAILURE))
    return effects
```
🤖 Behavioral change in refresh() — bootstrap failures no longer trigger backoff
The old refresh() code called group.record_failure() unconditionally on CLOUD_STATE_FAILED and CLOUD_STATE_UNKNOWN_TIMEOUT. This function gates RECORD_GROUP_FAILURE on ctx.get("is_short_lived"), but refresh() never passes is_short_lived in its context dict:
```python
# runtime.py refresh() dispatches:
result = group.dispatch(
    slice_id,
    SliceEvent.CLOUD_STATE_FAILED,
    {"error_message": status.error_message},  # no is_short_lived!
    now=timestamp,
)
```

So bootstrap failures observed during `refresh()` no longer trigger backoff. If this is intentional (only applying backoff for short-lived worker failures from `operations.py`), it should be documented. If not, `refresh()` needs to compute and pass `is_short_lived` just like `operations.py` does.
```python
if result is not None:
    for effect in result.side_effects:
        if effect.kind == SliceSideEffectKind.RECORD_GROUP_FAILURE:
            group.record_failure(timestamp)
            log_action(
                "backoff_triggered",
                group.name,
                slice_id=slice_id,
                reason=f"short-lived slice (age={age_ms}ms)",
            )
```
🤖 Partial side-effect execution undermines the state machine contract
This loop only handles RECORD_GROUP_FAILURE, ignoring DEREGISTER_WORKERS and TERMINATE_SLICE that the transition table returns for WORKER_FAILURE_REPORTED. The actual deregistration and termination happen a few lines below via detach_slice + unregister_slice_workers.
This works correctly, but it means the caller is cherry-picking which side effects to honor rather than executing them all. The state machine says "deregister workers and terminate the slice," but the caller silently ignores that and does its own thing. If someone later adds a new side effect to _on_worker_failure, they'd expect it to run here — but it wouldn't.
Consider either:

- Using a shared `_execute_side_effects` helper (like `runtime.py` does), or
- Having the transition for `WORKER_FAILURE_REPORTED` only return the side effects that actually get executed here (just `RECORD_GROUP_FAILURE`), and documenting that `operations.py` handles teardown itself.

Also, `age_ms` on line 145 is only defined when `slice_handle` is not None (line 125). It's safe today because `is_short_lived=False` when the handle is None, so `RECORD_GROUP_FAILURE` won't fire, but it's fragile.
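The shared-helper option could look like this minimal sketch (the enum members come from the diff above, but the helper name and handler wiring are assumptions, not the PR's actual code):

```python
from enum import Enum, auto

class SliceSideEffectKind(Enum):
    # Stand-in for the PR's enum; member set taken from the diff above.
    RECORD_GROUP_FAILURE = auto()
    DEREGISTER_WORKERS = auto()
    TERMINATE_SLICE = auto()

def execute_side_effects(effects, handlers):
    # Run every effect the transition returned; fail loudly on kinds the
    # caller doesn't handle, so a newly added side effect can't be
    # silently dropped.
    for kind in effects:
        handler = handlers.get(kind)
        if handler is None:
            raise RuntimeError(f"unhandled side effect: {kind}")
        handler()
```

A caller then registers one callable per kind it knows how to execute, and anything unexpected raises instead of being cherry-picked away.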
```python
state.lifecycle = transition.to_state
```

```python
side_effects = transition.side_effects(state, context) if transition.side_effects else []
```
🤖 Side effects computed outside the lock
The state mutation (state.lifecycle = transition.to_state) happens under the lock, but transition.side_effects(state, context) runs after the lock is released. The current side-effect functions only read from context, not from state, so this is safe. But state is a mutable reference that other threads could modify. Worth a brief comment noting this invariant (side-effect functions must not read mutable fields from state), or compute side effects inside the lock to be safe.
```python
def _on_timeout(state: SliceState, ctx: dict[str, Any]) -> list[SliceSideEffect]:
    return _on_bootstrap_failed(state, ctx)
```
🤖 Nit: `_on_timeout` is a one-line delegation to `_on_bootstrap_failed` with identical behavior. Consider using `_on_bootstrap_failed` directly in the transition table entry for `CLOUD_STATE_UNKNOWN_TIMEOUT` and dropping this wrapper. If you want to keep them separate for future divergence, a comment noting the intent would help.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 73ae17bf52
```python
result = group.dispatch(
    slice_id,
    SliceEvent.CLOUD_STATE_FAILED,
    {"error_message": status.error_message},
    now=timestamp,
)
```
Preserve failure backoff on cloud bootstrap failures
refresh() now dispatches CLOUD_STATE_FAILED without any short-lived metadata, but the new transition handlers only emit RECORD_GROUP_FAILURE when is_short_lived is true. As a result, normal bootstrap failures (and similarly UNKNOWN-timeout failures) stop calling group.record_failure(), which removes exponential backoff and can cause repeated scale-up retries against a persistently failing provider.
```python
if effect.kind == SliceSideEffectKind.REGISTER_WORKERS:
    self._register_slice_workers(workers, slice_id, group.name)
```
Update slice state when registering READY workers
The REGISTER_WORKERS side effect only updates the worker registry and never writes worker IDs (or activity timestamp) into the ScalingGroup slice state. ScalingGroup.find_slice_for_worker() and idle-scale-down checks read from state.worker_ids, so slices that become READY via this path are no longer discoverable for worker-failure teardown and can fail idle verification.
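One possible shape for the fix this comment asks for, updating both the registry and the tracked slice state on registration (all field and parameter names here are assumptions for illustration):

```python
from types import SimpleNamespace

def register_workers(state, worker_registry, worker_ids, now_ms):
    # Hypothetical fix sketch: update the worker registry AND the tracked
    # slice state, so find_slice_for_worker() and idle-scale-down checks
    # keep working for slices that became READY via this path.
    for wid in worker_ids:
        worker_registry[wid] = state.slice_id
    state.worker_ids = list(worker_ids)  # slice stays discoverable by worker
    state.last_active = now_ms           # freshly-ready slice isn't "idle"
```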
🤖 ## Design review

I walked through the diff and checked out the branch.

**1. The PR description is stale in load-bearing ways**

Two claims in the description are now wrong. Please update the description before merge; otherwise reviewers who read only the summary get a materially wrong picture of what's shipping.

**2. Hidden behavior change: FAILED slices now disappear from tracking**

This is not described as a semantic change, but it is one:
```python
# asserts counts[FAILED] == 1 on main → asserts counts[FAILED] == 0 in PR
assert counts[SliceLifecycleState.FAILED] == 0
assert group.slice_count() == 0
```

Callers that relied on the FAILED state being visible (status UI, dashboards, snapshots) will silently see fewer slices. Given the stated motivation is debuggability, dropping FAILED visibility warrants an explicit callout.

**3. The abstraction is heavier than the 11 transitions warrant**

The original imperative code (…)

4. (…)
ravwojdyla
left a comment
Top level makes sense!
```python
    Note: These are slice-level aggregate states, not direct VM states.
    """

    REQUESTING = "requesting"
```
nit: you can use `enum.auto`
Introduce an explicit transition table for slice lifecycle state changes, replacing scattered mark_slice_ready/mark_slice_failed calls with a dispatch-based state machine. Each transition is validated against a (from_state, event) → to_state table, logged to a new slice_transitions audit table, and returns typed side effects for the caller to execute.

Key changes:

- New slice_lifecycle.py with SliceEvent enum, TRANSITIONS table, and dispatch_slice_event() entry point
- New 0033_slice_transitions migration for the audit log table
- ScalingGroup.dispatch() delegates to the state machine
- runtime.py refresh() uses dispatch instead of direct state mutations
- operations.py worker failure path uses dispatch for state tracking
- SliceLifecycleState and SliceState moved to models.py to break the circular import between scaling_group.py and slice_lifecycle.py

Existing mark_slice_ready/mark_slice_failed methods are preserved for backward compatibility with tests and the scale_down_if_idle path.

Refs: #4754
- FIX: dispatch path now sets last_active and worker_ids on READY transition inside the lock, preventing immediate scaledown of freshly-ready slices
- Remove SliceSideEffect wrapper dataclass — side_effects is now list[SliceSideEffectKind] directly, eliminating the dead .context field
- Remove InvalidTransitionError (unused), PLATFORM_CALL_STARTED and TEARDOWN_COMPLETE events (no transitions)
- Remove unused `state` param from side-effect functions
- Deduplicate _on_bootstrap_failed/_on_worker_failure into shared _on_failure_with_teardown
- Rewrite mark_slice_ready/mark_slice_failed as thin dispatch() wrappers
- Remove redundant comments from transition table
- Fix dict[str, object] vs dict[str, Any] inconsistency
Move restart_worker and terminate_slices_for_workers logic directly into the Autoscaler class, eliminating the operations.py module.

- restart_worker inlined as an Autoscaler method (was a thin wrapper)
- Worker-failure path uses dispatch() + _execute_side_effects() like refresh(), eliminating the duplicate side-effect if/elif chain
- TERMINATE_SLICE now always does async detach+terminate via _async_terminate_slice/_spawn_terminate, unifying the refresh and worker-failure termination paths
- find_slice_for_worker becomes a private Autoscaler method
- SliceTerminationRequest/SliceTerminationResult data types deleted
…hort_lived

- dispatch() returns NOOP (applied=False) instead of None, eliminating all `if result:` guards at call sites
- cloud_state_to_event() maps CloudSliceState → SliceEvent in one place, collapsing the 3-branch if/elif/elif in refresh() to a single dispatch + _execute_side_effects call
- is_short_lived computed inside _on_failure_with_teardown from state.handle.created_at — callers no longer pass it via context
- Delete mark_slice_ready/mark_slice_failed: all callers (tests and production) now use dispatch() directly
…oup failure

The slice machine and group machine now mutate together under one lock. This eliminates the SliceSideEffectKind enum and the cross-module duplication where slice_lifecycle.py emitted RECORD_GROUP_FAILURE and runtime.py interpreted it back into group.record_failure().

Changes:

- ScalingGroup.dispatch() now does the work inline: mutates slice state, detaches the slice on FAILED, and updates _consecutive_failures / _backoff_until atomically when a short-lived failure should trigger backoff. The dispatch logic is no longer split across slice_lifecycle.py and runtime.py.
- TransitionResult gains detached_handle, registered_workers, and triggered_backoff so callers know what to do without interpreting an enum: spawn an async terminate, register workers, log a backoff event.
- _execute_side_effects -> _handle_transition: a single 4-line method on Autoscaler that handles caller-side concerns only (worker registry, async terminate, dashboard log).
- BACKOFF_TRIGGERS frozenset replaces the per-event side-effect functions.
- record_failure stays public for the _do_scale_up exception path (the slice never made it into _slices, so no transition fires); it now delegates to _apply_failure_locked for the actual mutation.
- reset_backoff deleted (unused).
- FAILED is now terminal-and-detached: slices don't linger in _slices with lifecycle=FAILED. This is a cleaner invariant — the FAILED state is a brief intermediate during the dispatch transaction.

Tests rewritten to exercise the new interface through ScalingGroup.dispatch().
The DB audit table duplicated what the log server already captures. logger.info now emits all transition details (group, slice, prior/new state, event, triggered_backoff, error) as a single structured line that can be queried through the log server.

- Remove _log_transition helper and its SQL insert
- Remove 0033_slice_transitions migration
- Expand the existing logger.info line with structured fields
Replace per-transition tests with one parametrized test that exercises every TRANSITIONS entry. Keep explicit tests for things that aren't readable off the table:

- READY transition sets last_active and worker_ids (the bug fix)
- Cross-machine backoff cascade (short-lived vs. long-lived, parametrized over BACKOFF_TRIGGERS)
- IDLE_TIMEOUT explicitly does not trigger backoff
- FAILED is terminal-and-detached
- Exponential backoff progression
- complete_scale_up clears failure state
- Structural exhaustiveness (tracked states have outgoing transitions)

Writing the parametrized test exposed that PLATFORM_CALL_* events and their REQUESTING transitions were dead — the scale-up exception path calls group.record_failure() directly; no slice ever has lifecycle=REQUESTING (_pending_scale_ups is a counter, not a slice state). Removed both events and their transitions.

Test count: 16 → 25 (parametrize expands table sanity into 11 cases, plus 4 new parametrized backoff-trigger cases).
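The table-driven testing approach this commit describes can be sketched like this, with a toy stand-in table rather than the real TRANSITIONS (names here are illustrative only):

```python
# Toy transition table: (from_state, event) -> to_state.
TRANSITIONS = {
    ("requesting", "cloud_ready"): "ready",
    ("ready", "worker_failure"): "failed",
    ("ready", "idle_timeout"): "failed",
}

def apply_event(state: str, event: str):
    # Returns the new state, or None when the table has no entry
    # (a no-op dispatch).
    return TRANSITIONS.get((state, event))

def test_every_transition_entry():
    # One check per table entry: the plain-loop equivalent of
    # pytest.mark.parametrize over TRANSITIONS.items().
    for (frm, event), to in TRANSITIONS.items():
        assert apply_event(frm, event) == to
    # Structural sanity: unknown (state, event) pairs are no-ops.
    assert apply_event("failed", "cloud_ready") is None
```

The payoff the commit notes is real for this style: enumerating the table mechanically is what surfaces dead entries and unreachable states.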
The last remaining production path that bypassed the state machine — scale_down_if_idle called scale_down directly instead of dispatching IDLE_TIMEOUT. Now all slice teardowns (cloud failure, worker failure, idle timeout) flow through ScalingGroup.dispatch, which gives:

- One place that detaches the slice from _slices
- One place that emits the transition log line
- A single path for _handle_transition to pick up the detached handle and spawn async termination

scale_down_if_idle now returns list[TransitionResult] instead of list[SliceHandle], and runtime.py's refresh loop calls _handle_transition to unregister workers + async terminate.

Addresses a stale review comment on the original commit (IDLE_TIMEOUT declared in TRANSITIONS but never dispatched).
The old path (workers table → groups → _slices → describe → match worker handle) was fragile. If the autoscaler detached the slice (e.g., idle timeout, worker failure cascade, short-lived failure), _slices no longer has the entry and restart_worker rejects the RPC with "Slice X not found in group Y" — even though the cloud VM is still running and the worker handle is cached in the worker_registry.

The worker_registry is populated by the REGISTER_WORKERS side effect on the READY transition and carries the RemoteWorkerHandle directly. Use it as the source of truth: look up the TrackedWorker by worker_id, then call handle.restart_worker on the cached handle. This removes a dependency on _slices that didn't need to exist.

Fixes smoke test test_worker_restart_preserves_task failure.
…entry

There's a pre-existing race window between worker registration and _do_scale_up's `complete_scale_up`: workers can register via RPC and start serving tasks before the synchronous `platform.create_slice()` call returns. During that window, the slice exists on the cloud and is functional, but the autoscaler's _slices dict hasn't been populated yet.

The original restart_worker code went through _slices and failed in this window. My earlier fix routed through _worker_registry, which has the same problem (populated only when refresh observes the slice READY).

Fix: try _slices first (fast path), then fall back to platform.list_slices filtered by the scale-group label to find the slice handle. Slower but correct in both cases. restart_worker is a rare admin RPC so the extra cloud call doesn't matter.
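The two-step lookup this commit describes could be sketched as follows (all names are assumptions; `list_slices` stands in for the platform call):

```python
from types import SimpleNamespace

def find_slice_handle(tracked_slices, slice_id, list_slices, group_label):
    # Fast path: the autoscaler already tracks the slice.
    handle = tracked_slices.get(slice_id)
    if handle is not None:
        return handle
    # Slow path: ask the platform, filtered by the scale-group label.
    # Correct even during the create_slice window, when the slice exists
    # on the cloud but the tracked dict hasn't been populated yet.
    for h in list_slices(label=group_label):
        if h.slice_id == slice_id:
            return h
    return None
```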
The smoke test failure (test_worker_restart_preserves_task) is a
pre-existing race in the autoscaler architecture, not something my
state machine changes introduced:
- platform.create_slice() blocks until the gcloud TPU LRO completes
(often 5-8 minutes for v5e)
- Workers boot from the startup script and register via RPC long
before that, often within 3 minutes
- During that window, _do_scale_up is still blocked, complete_scale_up
hasn't run, and _slices is empty
- Any RPC that depends on _slices to find an active slice will fail
This race exists on main too — the test is flaky there for the same
reason; it just happened to land on the unlucky timing in our recent
runs (TPU create taking 7+ minutes).
My earlier "fix" (fall back to platform.list_slices) avoided the
_slices dependency but hit the next race: tpu_describe returns no
network endpoints during provisioning, so SSH targets an empty
hostname.
Both workarounds were treating symptoms. The actual fix would be to
either decouple slice tracking from the synchronous create_slice call
(insert into _slices immediately) or to make restart_worker wait for
the slice to be fully provisioned. That's out of scope for this PR.
Reverting to main's behavior so this PR isn't gated on a pre-existing
bug.
…h site

Addresses the heavier review feedback:

- Replace untyped `context: dict[str, Any]` payload with discriminated-union event dataclasses (CloudReady, CloudFailed, UnknownTimeout, WorkerFailure, IdleTimeout, CloudInitializing). Pyrefly now catches missing/misspelled fields at call sites that previously passed whatever dict shape they liked.
- Drop the BACKOFF_TRIGGERS frozenset in favor of a `counts_toward_backoff` class-level attribute on the event type; dispatch never branches on kind.
- TransitionResult (with optional detached_handle/registered_workers/triggered_backoff/applied fields) is replaced by a sum type NoOp | InternalTransition | BecameReady | BecameFailed. Callers consume via a single `match` statement instead of three separate `if new_state == …` switches scattered across runtime.py.
- Collapse runtime.py: _register/_unregister wrappers gone, refresh() loses its slice_ready/slice_failed duplication, terminate_slices_for_workers() no longer needs a fallback "slice untracked" reconciliation branch (NoOp handles the race cleanly).
- Document the semantic change on dispatch(): FAILED transitions remove the slice from _slices atomically. slice_state_counts()/to_status() will never observe FAILED slices — failure is surfaced via the outcome and the action log, not lingering entries in the tracked map.
- Add test_concurrent_failures_account_atomically covering the lock widening: N concurrent CloudFailed dispatches for N slices in the same group produce consecutive_failures == N with no lost updates.
Summary

- `(from_state, type(event)) → to_state` transition table, typed event dataclasses carrying exactly the payload each transition needs, and a sum-type outcome (`NoOp | InternalTransition | BecameReady | BecameFailed`) that callers consume via a single `match` statement.
- `ScalingGroup.dispatch()` folds the short-lived-failure → group-backoff cascade into the same lock as the slice mutation, so `consecutive_failures`/`backoff_until` stay consistent with `_slices` under concurrent failures (see `test_concurrent_failures_account_atomically`).
- `SliceLifecycleState` and `SliceState` move from `scaling_group.py` to `models.py` so the transition table can import them without creating a cycle.

Behavior change

FAILED transitions now atomically remove the slice from `_slices` and return the SliceHandle to the caller (via `BecameFailed.handle`). Callers unregister workers and async-terminate the handle. Consequences: `slice_state_counts()` and `to_status()` will never observe a FAILED slice for this group. Failure is surfaced via the returned outcome, a `slice_failed`/`worker_failed` entry in the action log, and the `slice_transition … → failed` structured log line.

Changes

- `slice_lifecycle.py` — `SliceEvent` hierarchy, `TRANSITIONS` table, `cloud_event()` helper, `TransitionOutcome` sum type. No StrEnum, no `context: dict[str, Any]`, no separate `BACKOFF_TRIGGERS` set (each event type carries `counts_toward_backoff`).
- `models.py` — `SliceLifecycleState` (StrEnum with `enum.auto()`) and `SliceState`.
- `scaling_group.py` — `dispatch(slice_id, event)` method. Short-lived-failure backoff cascade is atomic with the slice detach under `_slices_lock`.
- `runtime.py` — `_apply(outcome, group)` method `match`es on outcome type and owns all external side effects (worker registry, async terminate, action-log entries). `refresh()`/`terminate_slices_for_workers()` lose their scattered `if new_state == …` switches.
- `operations.py` — folded back into `Autoscaler`.
- `test_slice_lifecycle.py` — `test_concurrent_failures_account_atomically` covering the lock widening.

Test plan

- `tests/cluster/controller/test_slice_lifecycle.py` — 20+ cases including concurrent-failure atomicity
- `tests/cluster/controller/`, `tests/cluster/providers/` — 1359 tests pass
- `./infra/pre-commit.py --fix` — ruff, black, pyrefly, license headers clean

Refs: #4754