[iris] Add slice lifecycle state machine for autoscaler transitions#4816
Conversation
Claude finished @rjpower's task in 3m 58s — Review of PR #4816: Slice lifecycle state machine
**Overall**

Good design — the issues below are the main things to resolve.

**Issues**

1. **Behavioral regression: bootstrap failures in `refresh()` no longer trigger backoff.** The old `refresh()` called `group.record_failure()` unconditionally on cloud failure; the new handler gates `RECORD_GROUP_FAILURE` on `is_short_lived`, which `refresh()` never passes. If this is intentional (e.g., only applying backoff for short-lived worker failures), document it. If not, `refresh()` needs to compute and pass `is_short_lived`.
2. **Inconsistent side-effect execution** — the worker-failure caller executes only some of the side effects the transition table returns.
3. **Unused events and transitions (P3)**
4. **Side effects computed outside the lock (P3)** — in `dispatch()`, side-effect functions run after the lock is released.

**Smaller notes**

**Summary**

The core state machine design is solid and the audit trail is a clear win for debuggability. The main concern is the behavioral change around backoff (#1 above) — please clarify whether it's intentional or needs a fix before merging.
```python
def _on_bootstrap_failed(state: SliceState, ctx: dict[str, Any]) -> list[SliceSideEffect]:
    effects: list[SliceSideEffect] = [
        SliceSideEffect(SliceSideEffectKind.DEREGISTER_WORKERS),
        SliceSideEffect(SliceSideEffectKind.TERMINATE_SLICE),
    ]
    if ctx.get("is_short_lived"):
        effects.append(SliceSideEffect(SliceSideEffectKind.RECORD_GROUP_FAILURE))
    return effects
```
🤖 Behavioral change in refresh() — bootstrap failures no longer trigger backoff
The old refresh() code called group.record_failure() unconditionally on CLOUD_STATE_FAILED and CLOUD_STATE_UNKNOWN_TIMEOUT. This function gates RECORD_GROUP_FAILURE on ctx.get("is_short_lived"), but refresh() never passes is_short_lived in its context dict:
```python
# runtime.py refresh() dispatches:
result = group.dispatch(
    slice_id,
    SliceEvent.CLOUD_STATE_FAILED,
    {"error_message": status.error_message},  # no is_short_lived!
    now=timestamp,
)
```

So bootstrap failures observed during `refresh()` no longer trigger backoff. If this is intentional (only applying backoff for short-lived worker failures from `operations.py`), it should be documented. If not, `refresh()` needs to compute and pass `is_short_lived` just like `operations.py` does.
```python
if result is not None:
    for effect in result.side_effects:
        if effect.kind == SliceSideEffectKind.RECORD_GROUP_FAILURE:
            group.record_failure(timestamp)
            log_action(
                "backoff_triggered",
                group.name,
                slice_id=slice_id,
                reason=f"short-lived slice (age={age_ms}ms)",
            )
```
🤖 Partial side-effect execution undermines the state machine contract
This loop only handles RECORD_GROUP_FAILURE, ignoring DEREGISTER_WORKERS and TERMINATE_SLICE that the transition table returns for WORKER_FAILURE_REPORTED. The actual deregistration and termination happen a few lines below via detach_slice + unregister_slice_workers.
This works correctly, but it means the caller is cherry-picking which side effects to honor rather than executing them all. The state machine says "deregister workers and terminate the slice," but the caller silently ignores that and does its own thing. If someone later adds a new side effect to _on_worker_failure, they'd expect it to run here — but it wouldn't.
Consider either:

- Using a shared `_execute_side_effects` helper (like `runtime.py` does), or
- Having the transition for `WORKER_FAILURE_REPORTED` only return the side effects that actually get executed here (just `RECORD_GROUP_FAILURE`), and documenting that `operations.py` handles teardown itself.

Also, `age_ms` on line 145 is only defined when `slice_handle` is not None (line 125). It's safe today because `is_short_lived=False` when the handle is None, so `RECORD_GROUP_FAILURE` won't fire, but it's fragile.
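The shared-helper option could look like this minimal sketch (the enum members come from the diff above, but the helper name and handler wiring are assumptions, not the PR's actual code):

```python
from enum import Enum, auto

class SliceSideEffectKind(Enum):
    # Stand-in for the PR's enum; member set taken from the diff above.
    RECORD_GROUP_FAILURE = auto()
    DEREGISTER_WORKERS = auto()
    TERMINATE_SLICE = auto()

def execute_side_effects(effects, handlers):
    # Run every effect the transition returned; fail loudly on kinds the
    # caller doesn't handle, so a newly added side effect can't be
    # silently dropped.
    for kind in effects:
        handler = handlers.get(kind)
        if handler is None:
            raise RuntimeError(f"unhandled side effect: {kind}")
        handler()
```

A caller then registers one callable per kind it knows how to execute, and anything unexpected raises instead of being cherry-picked away.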
```python
state.lifecycle = transition.to_state
```

```python
side_effects = transition.side_effects(state, context) if transition.side_effects else []
```
🤖 Side effects computed outside the lock
The state mutation (state.lifecycle = transition.to_state) happens under the lock, but transition.side_effects(state, context) runs after the lock is released. The current side-effect functions only read from context, not from state, so this is safe. But state is a mutable reference that other threads could modify. Worth a brief comment noting this invariant (side-effect functions must not read mutable fields from state), or compute side effects inside the lock to be safe.
```python
def _on_timeout(state: SliceState, ctx: dict[str, Any]) -> list[SliceSideEffect]:
    return _on_bootstrap_failed(state, ctx)
```
🤖 Nit: `_on_timeout` is a one-line delegation to `_on_bootstrap_failed` with identical behavior. Consider using `_on_bootstrap_failed` directly in the transition table entry for `CLOUD_STATE_UNKNOWN_TIMEOUT` and dropping this wrapper. If you want to keep them separate for future divergence, a comment noting the intent would help.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 73ae17bf52
```python
result = group.dispatch(
    slice_id,
    SliceEvent.CLOUD_STATE_FAILED,
    {"error_message": status.error_message},
    now=timestamp,
)
```
Preserve failure backoff on cloud bootstrap failures
refresh() now dispatches CLOUD_STATE_FAILED without any short-lived metadata, but the new transition handlers only emit RECORD_GROUP_FAILURE when is_short_lived is true. As a result, normal bootstrap failures (and similarly UNKNOWN-timeout failures) stop calling group.record_failure(), which removes exponential backoff and can cause repeated scale-up retries against a persistently failing provider.
```python
if effect.kind == SliceSideEffectKind.REGISTER_WORKERS:
    self._register_slice_workers(workers, slice_id, group.name)
```
Update slice state when registering READY workers
The REGISTER_WORKERS side effect only updates the worker registry and never writes worker IDs (or activity timestamp) into the ScalingGroup slice state. ScalingGroup.find_slice_for_worker() and idle-scale-down checks read from state.worker_ids, so slices that become READY via this path are no longer discoverable for worker-failure teardown and can fail idle verification.
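One possible shape for the fix this comment asks for, updating both the registry and the tracked slice state on registration (all field and parameter names here are assumptions for illustration):

```python
from types import SimpleNamespace

def register_workers(state, worker_registry, worker_ids, now_ms):
    # Hypothetical fix sketch: update the worker registry AND the tracked
    # slice state, so find_slice_for_worker() and idle-scale-down checks
    # keep working for slices that became READY via this path.
    for wid in worker_ids:
        worker_registry[wid] = state.slice_id
    state.worker_ids = list(worker_ids)  # slice stays discoverable by worker
    state.last_active = now_ms           # freshly-ready slice isn't "idle"
```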
🤖 ## Design review

I walked through the diff and checked out the branch.

**1. The PR description is stale in load-bearing ways**

Two claims in the description are now wrong. Please update the description before merge; otherwise reviewers who read only the summary get a materially wrong picture of what's shipping.

**2. Hidden behavior change: FAILED slices now disappear from tracking**

This is not described as a semantic change, but it is one:
```python
# asserts counts[FAILED] == 1 on main → asserts counts[FAILED] == 0 in PR
assert counts[SliceLifecycleState.FAILED] == 0
assert group.slice_count() == 0
```

Callers that relied on the FAILED state being visible (status UI, dashboards, snapshots) will silently see fewer slices. Given the stated motivation is debuggability, dropping FAILED visibility warrants an explicit callout.

**3. The abstraction is heavier than the 11 transitions warrant**

The original imperative code (…)

4. (…)
ravwojdyla
left a comment
Top level makes sense!
```python
    Note: These are slice-level aggregate states, not direct VM states.
    """

    REQUESTING = "requesting"
```
nit: you can use `enum.auto`
Introduce an explicit transition table for slice lifecycle state changes, replacing scattered mark_slice_ready/mark_slice_failed calls with a dispatch-based state machine. Each transition is validated against a (from_state, event) → to_state table, logged to a new slice_transitions audit table, and returns typed side effects for the caller to execute.

Key changes:

- New slice_lifecycle.py with SliceEvent enum, TRANSITIONS table, and dispatch_slice_event() entry point
- New 0033_slice_transitions migration for the audit log table
- ScalingGroup.dispatch() delegates to the state machine
- runtime.py refresh() uses dispatch instead of direct state mutations
- operations.py worker failure path uses dispatch for state tracking
- SliceLifecycleState and SliceState moved to models.py to break the circular import between scaling_group.py and slice_lifecycle.py

Existing mark_slice_ready/mark_slice_failed methods are preserved for backward compatibility with tests and the scale_down_if_idle path.

Refs: #4754
- FIX: dispatch path now sets last_active and worker_ids on READY transition inside the lock, preventing immediate scaledown of freshly-ready slices
- Remove SliceSideEffect wrapper dataclass — side_effects is now list[SliceSideEffectKind] directly, eliminating the dead .context field
- Remove InvalidTransitionError (unused), PLATFORM_CALL_STARTED and TEARDOWN_COMPLETE events (no transitions)
- Remove unused `state` param from side-effect functions
- Deduplicate _on_bootstrap_failed/_on_worker_failure into shared _on_failure_with_teardown
- Rewrite mark_slice_ready/mark_slice_failed as thin dispatch() wrappers
- Remove redundant comments from transition table
- Fix dict[str, object] vs dict[str, Any] inconsistency
Move restart_worker and terminate_slices_for_workers logic directly into the Autoscaler class, eliminating the operations.py module.

- restart_worker inlined as an Autoscaler method (was a thin wrapper)
- Worker-failure path uses dispatch() + _execute_side_effects() like refresh(), eliminating the duplicate side-effect if/elif chain
- TERMINATE_SLICE now always does async detach+terminate via _async_terminate_slice/_spawn_terminate, unifying the refresh and worker-failure termination paths
- find_slice_for_worker becomes a private Autoscaler method
- SliceTerminationRequest/SliceTerminationResult data types deleted
…hort_lived

- dispatch() returns NOOP (applied=False) instead of None, eliminating all `if result:` guards at call sites
- cloud_state_to_event() maps CloudSliceState → SliceEvent in one place, collapsing the 3-branch if/elif/elif in refresh() to a single dispatch + _execute_side_effects call
- is_short_lived computed inside _on_failure_with_teardown from state.handle.created_at — callers no longer pass it via context
- Delete mark_slice_ready/mark_slice_failed: all callers (tests and production) now use dispatch() directly
…oup failure

The slice machine and group machine now mutate together under one lock. This eliminates the SliceSideEffectKind enum and the cross-module duplication where slice_lifecycle.py emitted RECORD_GROUP_FAILURE and runtime.py interpreted it back into group.record_failure().

Changes:

- ScalingGroup.dispatch() now does the work inline: mutates slice state, detaches the slice on FAILED, and updates _consecutive_failures / _backoff_until atomically when a short-lived failure should trigger backoff. The dispatch logic is no longer split across slice_lifecycle.py and runtime.py.
- TransitionResult gains detached_handle, registered_workers, and triggered_backoff so callers know what to do without interpreting an enum: spawn an async terminate, register workers, log a backoff event.
- _execute_side_effects -> _handle_transition: a single 4-line method on Autoscaler that handles caller-side concerns only (worker registry, async terminate, dashboard log).
- BACKOFF_TRIGGERS frozenset replaces the per-event side-effect functions.
- record_failure stays public for the _do_scale_up exception path (the slice never made it into _slices, so no transition fires); it now delegates to _apply_failure_locked for the actual mutation.
- reset_backoff deleted (unused).
- FAILED is now terminal-and-detached: slices don't linger in _slices with lifecycle=FAILED. This is a cleaner invariant — the FAILED state is a brief intermediate during the dispatch transaction.

Tests rewritten to exercise the new interface through ScalingGroup.dispatch().
The DB audit table duplicated what the log server already captures. logger.info now emits all transition details (group, slice, prior/new state, event, triggered_backoff, error) as a single structured line that can be queried through the log server.

- Remove _log_transition helper and its SQL insert
- Remove 0033_slice_transitions migration
- Expand the existing logger.info line with structured fields
Replace per-transition tests with one parametrized test that exercises every TRANSITIONS entry. Keep explicit tests for things that aren't readable off the table:

- READY transition sets last_active and worker_ids (the bug fix)
- Cross-machine backoff cascade (short-lived vs. long-lived, parametrized over BACKOFF_TRIGGERS)
- IDLE_TIMEOUT explicitly does not trigger backoff
- FAILED is terminal-and-detached
- Exponential backoff progression
- complete_scale_up clears failure state
- Structural exhaustiveness (tracked states have outgoing transitions)

Writing the parametrized test exposed that PLATFORM_CALL_* events and their REQUESTING transitions were dead — the scale-up exception path calls group.record_failure() directly; no slice ever has lifecycle=REQUESTING (_pending_scale_ups is a counter, not a slice state). Removed both events and their transitions.

Test count: 16 → 25 (parametrize expands table sanity into 11 cases, plus 4 new parametrized backoff-trigger cases).
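The table-driven testing approach this commit describes can be sketched like this, with a toy stand-in table rather than the real TRANSITIONS (names here are illustrative only):

```python
# Toy transition table: (from_state, event) -> to_state.
TRANSITIONS = {
    ("requesting", "cloud_ready"): "ready",
    ("ready", "worker_failure"): "failed",
    ("ready", "idle_timeout"): "failed",
}

def apply_event(state: str, event: str):
    # Returns the new state, or None when the table has no entry
    # (a no-op dispatch).
    return TRANSITIONS.get((state, event))

def test_every_transition_entry():
    # One check per table entry: the plain-loop equivalent of
    # pytest.mark.parametrize over TRANSITIONS.items().
    for (frm, event), to in TRANSITIONS.items():
        assert apply_event(frm, event) == to
    # Structural sanity: unknown (state, event) pairs are no-ops.
    assert apply_event("failed", "cloud_ready") is None
```

The payoff the commit notes is real for this style: enumerating the table mechanically is what surfaces dead entries and unreachable states.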
The last remaining production path that bypassed the state machine — scale_down_if_idle called scale_down directly instead of dispatching IDLE_TIMEOUT. Now all slice teardowns (cloud failure, worker failure, idle timeout) flow through ScalingGroup.dispatch, which gives:

- One place that detaches the slice from _slices
- One place that emits the transition log line
- A single path for _handle_transition to pick up the detached handle and spawn async termination

scale_down_if_idle now returns list[TransitionResult] instead of list[SliceHandle], and runtime.py's refresh loop calls _handle_transition to unregister workers + async terminate.

Addresses a stale review comment on the original commit (IDLE_TIMEOUT declared in TRANSITIONS but never dispatched).
The old path (workers table → groups → _slices → describe → match worker handle) was fragile. If the autoscaler detached the slice (e.g., idle timeout, worker failure cascade, short-lived failure), _slices no longer has the entry and restart_worker rejects the RPC with "Slice X not found in group Y" — even though the cloud VM is still running and the worker handle is cached in the worker_registry.

The worker_registry is populated by the REGISTER_WORKERS side effect on the READY transition and carries the RemoteWorkerHandle directly. Use it as the source of truth: look up the TrackedWorker by worker_id, then call handle.restart_worker on the cached handle. This removes a dependency on _slices that didn't need to exist.

Fixes smoke test test_worker_restart_preserves_task failure.
…entry

There's a pre-existing race window between worker registration and _do_scale_up's `complete_scale_up`: workers can register via RPC and start serving tasks before the synchronous `platform.create_slice()` call returns. During that window, the slice exists on the cloud and is functional, but the autoscaler's _slices dict hasn't been populated yet.

The original restart_worker code went through _slices and failed in this window. My earlier fix routed through _worker_registry, which has the same problem (populated only when refresh observes the slice READY).

Fix: try _slices first (fast path), then fall back to platform.list_slices filtered by the scale-group label to find the slice handle. Slower but correct in both cases. restart_worker is a rare admin RPC so the extra cloud call doesn't matter.
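The two-step lookup this commit describes could be sketched as follows (all names are assumptions; `list_slices` stands in for the platform call):

```python
from types import SimpleNamespace

def find_slice_handle(tracked_slices, slice_id, list_slices, group_label):
    # Fast path: the autoscaler already tracks the slice.
    handle = tracked_slices.get(slice_id)
    if handle is not None:
        return handle
    # Slow path: ask the platform, filtered by the scale-group label.
    # Correct even during the create_slice window, when the slice exists
    # on the cloud but the tracked dict hasn't been populated yet.
    for h in list_slices(label=group_label):
        if h.slice_id == slice_id:
            return h
    return None
```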
The smoke test failure (test_worker_restart_preserves_task) is a
pre-existing race in the autoscaler architecture, not something my
state machine changes introduced:
- platform.create_slice() blocks until the gcloud TPU LRO completes
(often 5-8 minutes for v5e)
- Workers boot from the startup script and register via RPC long
before that, often within 3 minutes
- During that window, _do_scale_up is still blocked, complete_scale_up
hasn't run, and _slices is empty
- Any RPC that depends on _slices to find an active slice will fail
This race exists on main too — the test is flaky there for the same
reason; it just happened to land on the unlucky timing in our recent
runs (TPU create taking 7+ minutes).
My earlier "fix" (fall back to platform.list_slices) avoided the
_slices dependency but hit the next race: tpu_describe returns no
network endpoints during provisioning, so SSH targets an empty
hostname.
Both workarounds were treating symptoms. The actual fix would be to
either decouple slice tracking from the synchronous create_slice call
(insert into _slices immediately) or to make restart_worker wait for
the slice to be fully provisioned. That's out of scope for this PR.
Reverting to main's behavior so this PR isn't gated on a pre-existing
bug.
…h site

Addresses the heavier review feedback:

- Replace untyped `context: dict[str, Any]` payload with discriminated-union event dataclasses (CloudReady, CloudFailed, UnknownTimeout, WorkerFailure, IdleTimeout, CloudInitializing). Pyrefly now catches missing/misspelled fields at call sites that previously passed whatever dict shape they liked.
- Drop the BACKOFF_TRIGGERS frozenset in favor of a `counts_toward_backoff` class-level attribute on the event type; dispatch never branches on kind.
- TransitionResult (with optional detached_handle/registered_workers/triggered_backoff/applied fields) is replaced by a sum type NoOp | InternalTransition | BecameReady | BecameFailed. Callers consume via a single `match` statement instead of three separate `if new_state == …` switches scattered across runtime.py.
- Collapse runtime.py: _register/_unregister wrappers gone, refresh() loses its slice_ready/slice_failed duplication, terminate_slices_for_workers() no longer needs a fallback "slice untracked" reconciliation branch (NoOp handles the race cleanly).
- Document the semantic change on dispatch(): FAILED transitions remove the slice from _slices atomically. slice_state_counts()/to_status() will never observe FAILED slices — failure is surfaced via the outcome and the action log, not lingering entries in the tracked map.
- Add test_concurrent_failures_account_atomically covering the lock widening: N concurrent CloudFailed dispatches for N slices in the same group produce consecutive_failures == N with no lost updates.
Summary

- `(from_state, type(event)) → to_state` transition table, typed event dataclasses carrying exactly the payload each transition needs, and a sum-type outcome (`NoOp | InternalTransition | BecameReady | BecameFailed`) that callers consume via a single `match` statement.
- `ScalingGroup.dispatch()` folds the short-lived-failure → group-backoff cascade into the same lock as the slice mutation, so `consecutive_failures`/`backoff_until` stay consistent with `_slices` under concurrent failures (see `test_concurrent_failures_account_atomically`).
- `SliceLifecycleState` and `SliceState` move from `scaling_group.py` to `models.py` so the transition table can import them without creating a cycle.

Behavior change

FAILED transitions now atomically remove the slice from `_slices` and return the SliceHandle to the caller (via `BecameFailed.handle`). Callers unregister workers and async-terminate the handle. Consequences: `slice_state_counts()` and `to_status()` will never observe a FAILED slice for this group. Failure is surfaced via the returned outcome, a `slice_failed`/`worker_failed` entry in the action log, and the `slice_transition … → failed` structured log line.

Changes

- `slice_lifecycle.py` — `SliceEvent` hierarchy, `TRANSITIONS` table, `cloud_event()` helper, `TransitionOutcome` sum type. No StrEnum, no `context: dict[str, Any]`, no separate `BACKOFF_TRIGGERS` set (each event type carries `counts_toward_backoff`).
- `models.py` — `SliceLifecycleState` (StrEnum with `enum.auto()`) and `SliceState`.
- `scaling_group.py` — `dispatch(slice_id, event)` method. Short-lived-failure backoff cascade is atomic with the slice detach under `_slices_lock`.
- `runtime.py` — `_apply(outcome, group)` method `match`es on outcome type and owns all external side effects (worker registry, async terminate, action-log entries). `refresh()`/`terminate_slices_for_workers()` lose their scattered `if new_state == …` switches.
- `operations.py` — folded back into `Autoscaler`.
- `test_slice_lifecycle.py` — `test_concurrent_failures_account_atomically` covering the lock widening.

Test plan

- `tests/cluster/controller/test_slice_lifecycle.py` — 20+ cases including concurrent-failure atomicity
- `tests/cluster/controller/`, `tests/cluster/providers/` — 1359 tests pass
- `./infra/pre-commit.py --fix` — ruff, black, pyrefly, license headers clean

Refs: #4754