
[iris] Add slice lifecycle state machine for autoscaler transitions#4816

Open
rjpower wants to merge 13 commits into main from claude/stoic-heyrovsky

Conversation

@rjpower
Collaborator

@rjpower rjpower commented Apr 16, 2026

Summary

  • Slice lifecycle in the autoscaler becomes an explicit state machine: a (from_state, type(event)) → to_state transition table, typed event dataclasses carrying exactly the payload each transition needs, and a sum-type outcome (NoOp | InternalTransition | BecameReady | BecameFailed) that callers consume via a single match statement.
  • ScalingGroup.dispatch() folds the short-lived-failure → group-backoff cascade into the same lock as the slice mutation, so consecutive_failures / backoff_until stay consistent with _slices under concurrent failures (see test_concurrent_failures_account_atomically).
  • SliceLifecycleState and SliceState move from scaling_group.py to models.py so the transition table can import them without creating a cycle.

Behavior change

FAILED transitions now atomically remove the slice from _slices and return the SliceHandle to the caller (via BecameFailed.handle). Callers unregister workers and async-terminate the handle. Consequences:

  • slice_state_counts() and to_status() will never observe a FAILED slice for this group. Failure is surfaced via the returned outcome, a slice_failed / worker_failed entry in the action log, and the slice_transition … → failed structured log line.
  • The old code left FAILED slices tracked until a separate cleanup pass; in practice both paths removed them within the same refresh cycle, so downstream consumers (dashboards, RPC status) observe no regression. Callers that depend on visible FAILED state must reach for the action log.

Changes

File: What
slice_lifecycle.py: Typed SliceEvent hierarchy, TRANSITIONS table, cloud_event() helper, TransitionOutcome sum type. No StrEnum, no context: dict[str, Any], no separate BACKOFF_TRIGGERS set (each event type carries counts_toward_backoff).
models.py: Owns SliceLifecycleState (StrEnum with enum.auto()) and SliceState.
scaling_group.py: New dispatch(slice_id, event) method. Short-lived-failure backoff cascade is atomic with the slice detach under _slices_lock.
runtime.py: Single _apply(outcome, group) method matches on outcome type and owns all external side effects (worker registry, async terminate, action-log entries). refresh() / terminate_slices_for_workers() lose their scattered if new_state == … switches. operations.py folded back into Autoscaler.
test_slice_lifecycle.py: Parametrized table sanity, READY/FAILED invariants, exponential backoff, exhaustiveness, and a new test_concurrent_failures_account_atomically covering the lock widening.

Test plan

  • tests/cluster/controller/test_slice_lifecycle.py — 20+ cases including concurrent-failure atomicity
  • tests/cluster/controller/ tests/cluster/providers/ — 1359 tests pass
  • ./infra/pre-commit.py --fix — ruff, black, pyrefly, license headers clean

Refs: #4754

@rjpower rjpower added the agent-generated Created by automation/agent label Apr 16, 2026
@claude
Contributor

claude Bot commented Apr 16, 2026

Claude finished @rjpower's task in 3m 58s.


Review of PR #4816: Slice lifecycle state machine

  • Read all changed files and gather context
  • Review the state machine design (slice_lifecycle.py)
  • Review migration (0033_slice_transitions.py)
  • Review integration points (runtime.py, operations.py, scaling_group.py)
  • Review model changes and import updates
  • Review tests (test_slice_lifecycle.py)
  • Post review feedback

Overall

Good design — the (from_state, event) → (to_state, side_effects) table is clean and testable, the audit trail will be valuable for debugging, and the model extraction to models.py is a sensible separation. The exhaustiveness test is a nice touch.

Issues

1. Behavioral regression: bootstrap failures in refresh() no longer trigger backoff (P1)

The old refresh() code called group.record_failure() unconditionally when CLOUD_STATE_FAILED or CLOUD_STATE_UNKNOWN_TIMEOUT was observed. The new code dispatches through the state machine, where _on_bootstrap_failed / _on_timeout only emit RECORD_GROUP_FAILURE if ctx.get("is_short_lived") — but refresh() never passes is_short_lived in context. Net effect: bootstrap failures observed during polling no longer trigger exponential backoff.

If this is intentional (e.g., only applying backoff for short-lived worker failures), document it. If not, refresh() needs to compute and pass is_short_lived the same way operations.py does. See inline comment.

2. Inconsistent side-effect execution in operations.py (P2)

terminate_slices_for_workers() dispatches WORKER_FAILURE_REPORTED but only handles the RECORD_GROUP_FAILURE side effect, ignoring DEREGISTER_WORKERS and TERMINATE_SLICE (which it does manually). This works today but undermines the state machine contract — new side effects added to _on_worker_failure won't execute here. See inline comment.

3. Unused events and transitions (P3)

PLATFORM_CALL_STARTED, TEARDOWN_COMPLETE, CLOUD_STATE_INITIALIZING, and IDLE_TIMEOUT are defined but never dispatched. Per repo conventions on dead code, consider removing until wired up or adding a comment referencing the plan. See inline comment.

4. Side effects computed outside the lock (P3)

In dispatch_slice_event(), line 191 calls transition.side_effects(state, context) after releasing _slices_lock. Current side-effect functions only read ctx so it's safe, but state is a mutable reference. Worth a comment or moving inside the lock. See inline comment.

Smaller notes

  • _on_timeout is a trivial delegation to _on_bootstrap_failed; could use the function directly in the transition table.
  • error_message is no longer set on SliceState through the dispatch path (old mark_slice_failed did this). Since failed slices get deleted by scale_down/detach_slice, this is likely fine — the error context lives in the audit trail instead.
  • Migration 0033 is clean. Consider adding a LIMIT-based retention policy later if the audit table grows unbounded.
  • Tests are thorough. The exhaustiveness check on line 239-249 is a good pattern.

Summary

The core state machine design is solid and the audit trail is a clear win for debuggability. The main concern is the behavioral change around backoff (#1 above) — please clarify whether it's intentional or needs a fix before merging.

Comment on lines +89 to +96
def _on_bootstrap_failed(state: SliceState, ctx: dict[str, Any]) -> list[SliceSideEffect]:
    effects: list[SliceSideEffect] = [
        SliceSideEffect(SliceSideEffectKind.DEREGISTER_WORKERS),
        SliceSideEffect(SliceSideEffectKind.TERMINATE_SLICE),
    ]
    if ctx.get("is_short_lived"):
        effects.append(SliceSideEffect(SliceSideEffectKind.RECORD_GROUP_FAILURE))
    return effects
Contributor

🤖 Behavioral change in refresh() — bootstrap failures no longer trigger backoff

The old refresh() code called group.record_failure() unconditionally on CLOUD_STATE_FAILED and CLOUD_STATE_UNKNOWN_TIMEOUT. This function gates RECORD_GROUP_FAILURE on ctx.get("is_short_lived"), but refresh() never passes is_short_lived in its context dict:

# runtime.py refresh() dispatches:
result = group.dispatch(
    slice_id,
    SliceEvent.CLOUD_STATE_FAILED,
    {"error_message": status.error_message},  # no is_short_lived!
    now=timestamp,
)

So bootstrap failures observed during refresh() no longer trigger backoff. If this is intentional (only applying backoff for short-lived worker failures from operations.py), it should be documented. If not, refresh() needs to compute and pass is_short_lived just like operations.py does.

Comment thread: lib/iris/src/iris/cluster/controller/autoscaler/slice_lifecycle.py (Outdated)
Comment on lines +136 to +145
if result is not None:
    for effect in result.side_effects:
        if effect.kind == SliceSideEffectKind.RECORD_GROUP_FAILURE:
            group.record_failure(timestamp)
            log_action(
                "backoff_triggered",
                group.name,
                slice_id=slice_id,
                reason=f"short-lived slice (age={age_ms}ms)",
            )
Contributor

🤖 Partial side-effect execution undermines the state machine contract

This loop only handles RECORD_GROUP_FAILURE, ignoring DEREGISTER_WORKERS and TERMINATE_SLICE that the transition table returns for WORKER_FAILURE_REPORTED. The actual deregistration and termination happen a few lines below via detach_slice + unregister_slice_workers.

This works correctly, but it means the caller is cherry-picking which side effects to honor rather than executing them all. The state machine says "deregister workers and terminate the slice," but the caller silently ignores that and does its own thing. If someone later adds a new side effect to _on_worker_failure, they'd expect it to run here — but it wouldn't.

Consider either:

  1. Using a shared _execute_side_effects helper (like runtime.py does), or
  2. Having the transition for WORKER_FAILURE_REPORTED only return the side effects that actually get executed here (just RECORD_GROUP_FAILURE), and documenting that operations.py handles teardown itself

Also, age_ms on line 145 is only defined when slice_handle is not None (line 125). It's safe today because is_short_lived=False when handle is None, so RECORD_GROUP_FAILURE won't fire, but it's fragile.

Comment on lines +188 to +191
state.lifecycle = transition.to_state

side_effects = transition.side_effects(state, context) if transition.side_effects else []

Contributor

🤖 Side effects computed outside the lock

The state mutation (state.lifecycle = transition.to_state) happens under the lock, but transition.side_effects(state, context) runs after the lock is released. The current side-effect functions only read from context, not from state, so this is safe. But state is a mutable reference that other threads could modify. Worth a brief comment noting this invariant (side-effect functions must not read mutable fields from state), or compute side effects inside the lock to be safe.
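One way to make the invariant structural rather than documented is to compute the side effects before releasing the lock. A simplified sketch, not the real dispatch_slice_event signature:

```python
import threading
from dataclasses import dataclass


@dataclass
class SliceState:
    lifecycle: str = "initializing"


class Group:
    def __init__(self) -> None:
        self._slices_lock = threading.Lock()
        self._slices = {"s0": SliceState()}

    def dispatch(self, slice_id: str, to_state: str, side_effect_fn):
        with self._slices_lock:
            state = self._slices[slice_id]
            state.lifecycle = to_state
            # Computed while still holding the lock, so the side-effect
            # function observes a stable snapshot of `state`.
            effects = side_effect_fn(state)
        return effects
```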

Comment on lines +99 to +100
def _on_timeout(state: SliceState, ctx: dict[str, Any]) -> list[SliceSideEffect]:
    return _on_bootstrap_failed(state, ctx)
Contributor

🤖 Nit: _on_timeout is a one-line delegation to _on_bootstrap_failed with identical behavior. Consider using _on_bootstrap_failed directly in the transition table entry for CLOUD_STATE_UNKNOWN_TIMEOUT and dropping this wrapper. If you want to keep them separate for future divergence, a comment noting the intent would help.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 73ae17bf52


Comment on lines +462 to +466
result = group.dispatch(
    slice_id,
    SliceEvent.CLOUD_STATE_FAILED,
    {"error_message": status.error_message},
    now=timestamp,
)

P1 Badge Preserve failure backoff on cloud bootstrap failures

refresh() now dispatches CLOUD_STATE_FAILED without any short-lived metadata, but the new transition handlers only emit RECORD_GROUP_FAILURE when is_short_lived is true. As a result, normal bootstrap failures (and similarly UNKNOWN-timeout failures) stop calling group.record_failure(), which removes exponential backoff and can cause repeated scale-up retries against a persistently failing provider.


Comment on lines +240 to +241
if effect.kind == SliceSideEffectKind.REGISTER_WORKERS:
self._register_slice_workers(workers, slice_id, group.name)

P1 Badge Update slice state when registering READY workers

The REGISTER_WORKERS side effect only updates the worker registry and never writes worker IDs (or activity timestamp) into the ScalingGroup slice state. ScalingGroup.find_slice_for_worker() and idle-scale-down checks read from state.worker_ids, so slices that become READY via this path are no longer discoverable for worker-failure teardown and can fail idle verification.


@rjpower rjpower requested a review from ravwojdyla April 16, 2026 19:14
@rjpower rjpower force-pushed the claude/stoic-heyrovsky branch from 3245230 to a93055d Compare April 20, 2026 00:19
@claude claude Bot mentioned this pull request Apr 22, 2026
@rjpower rjpower force-pushed the claude/stoic-heyrovsky branch 2 times, most recently from 1b567c0 to a25e63e Compare April 22, 2026 22:41
@ravwojdyla-agent
Contributor

🤖 Design review

I walked through the diff, checked out the branch (pr-4816), compared it to main, and traced call sites through runtime.py → scaling_group.py. Major concerns, prioritized.


1. The PR description is stale in load-bearing ways

Two claims in the description are now wrong:

  • "Adds a slice_transitions audit table (migration 0033) that logs every state change with timestamp, event, and context — enables answering 'why didn't group X scale up from 2:30–3:45pm?'"
    This is the headline motivation. It is not in the PR. Commit 2552b17dc ("Drop slice_transitions audit table: rely on structured log lines") removed it. No 0033_slice_transitions.py exists in the diff, and migration 0033 is already taken by 0033_worker_task_history_fk_cascade.py. What's left is a single logger.info("slice transition …") line — much less than an audit table; you can't SQL-query log lines retroactively.

  • "Existing mark_slice_ready/mark_slice_failed methods preserved for the scale_down_if_idle path and backward compat with tests"
    Not true — both methods are deleted. Every caller was rewritten to dispatch(…, SliceEvent.CLOUD_STATE_READY, {"worker_ids": …}).

Please update the description before merge, otherwise reviewers who read only the summary get a materially wrong picture of what's shipping.


2. Hidden behavior change: FAILED slices now disappear from tracking

Not described as a semantic change but it is one:

  • main: mark_slice_failed() sets state.lifecycle = FAILED, keeps the slice in _slices with error_message, and a separate scale_down() later removes it. slice_state_counts()[FAILED] is observable. to_status() renders FAILED slices with their error in the RPC proto.
  • PR: dispatch(…, _FAILED) atomically removes the slice from _slices and hands the handle to the caller. slice_state_counts()[FAILED] is now always 0. The test test_failed_takes_precedence was renamed to test_failed_slice_detached_atomically and its invariant inverted:
# asserts counts[FAILED] == 1 on main → asserts counts[FAILED] == 0 in PR
assert counts[SliceLifecycleState.FAILED] == 0
assert group.slice_count() == 0

Callers that relied on the FAILED state being visible (status UI, dashboards, snapshots) will silently see fewer slices. Given the stated motivation is debuggability, dropping FAILED visibility from to_status() and slice_state_counts() cuts against that motivation. Either (a) call out the change explicitly and confirm no downstream consumer cares, or (b) keep FAILED in tracking with a separate cleanup pass, matching main.


3. The abstraction is heavier than the 11 transitions warrant

TRANSITIONS is 11 entries, and every to_state is one of INITIALIZING / READY / FAILED. To consume this you stitch together:

  1. cloud_state_to_event() in slice_lifecycle.py — maps CloudSliceState → SliceEvent (with a hidden timeout check for UNKNOWN).
  2. TRANSITIONS[(state, event)] — tells you the new state.
  3. dispatch() in scaling_group.py — switches on to_state for side effects inside the lock (worker_ids, last_active, error_message, detach, _apply_failure_locked, pop, last_scale_down).
  4. _handle_transition() in runtime.py — switches on new_state again for side effects outside the lock (register/unregister workers, spawn terminate, log action).
  5. Each call site in refresh() / terminate_slices_for_workers() — switches on new_state a third time to pick log text.

The original imperative code (if status.state == READY: mark_ready(); register_workers()) required a single read. The state-machine version triples the number of if new_state == … branches and scatters them across three modules. For a state machine with 2 non-terminal states and 3 terminal outcomes, this is over-engineered — a couple of typed methods (mark_ready, mark_failed, mark_idle) with the transition check inlined would express the same invariants with less surface.


4. context: dict[str, Any] defeats typing at the API boundary

dispatch(slice_id, event, context: dict[str, Any] | None = None, *, now=…) pushes typed information through an untyped dict. Concrete consequences:

  • Inconsistent payload shape for the same event: production passes {"workers": [RemoteWorkerHandle, …]}, but tests and conftest pass {"worker_ids": ["10.0.0.1", …]}. dispatch() branches on ctx.get("worker_ids") or [w.worker_id for w in workers] to paper over this. A direct sign that the API boundary is wrong.
  • Dead payload: terminate_slices_for_workers passes {"failed_workers": sorted(…)}. dispatch() never reads it.
  • No static guarantee that FAILED events carry error_message, that READY events carry workers, etc. A typo silently yields "" or [].

Cleaner fix: discriminated event dataclass (CloudReady(workers=…), CloudFailed(error=…), IdleTimeout(), etc.) — or just typed methods given the small number of call shapes.

Pyrefly won't catch any of this today because of Any, and the repo's own guidance (AGENTS.md) calls out avoiding Any.


5. "Atomic cross-machine cascade" lock widening — real race?

Rationale for folding _apply_failure_locked inside _slices_lock: "short-lived slice failures bump consecutive_failures and extend backoff_until atomically with the FAILED transition." But:

  • _consecutive_failures / _backoff_until are read in availability(), which doesn't require consistency with _slices (readers already tolerate partial views across unrelated fields).
  • No test asserts the atomicity invariant; no race is cited in a commit message.
  • Holding _slices_lock while mutating _backoff_until widens lock scope and makes future lock-discipline reasoning harder.

If there's no concrete reader that observes the inconsistency as a bug, the invariant is aspirational and not worth the lock widening.


6. Minor: operations.py dissolution is a partial reversion

This PR deletes operations.py (the module that extracted restart_worker and terminate_slices_for_workers as free functions) and inlines the logic back into Autoscaler. The final commit (1b567c0f3) reverts restart_worker to main's behavior verbatim — net effect on restart_worker is ~30 lines moved from module to module with no behavior change. It would read more honestly split into its own PR or commit.


Recommendation

  1. Fix the PR description (no audit table, no preserved mark_slice_* methods).
  2. Acknowledge + justify the "FAILED disappears from _slices" semantic change, or restore main's behavior.
  3. Replace context: dict[str, Any] with typed methods or typed event payloads; drop the dead failed_workers key.
  4. Either drop the lock widening around _apply_failure_locked or add a test demonstrating the atomicity invariant matters.
  5. Split the operations.py → runtime.py move into its own PR (or at least its own commit with no behavior changes mixed in).

The underlying idea — turning slice lifecycle into a table so transitions are reviewable at a glance — is reasonable. The execution adds more indirection than the 11 transitions repay, and the description oversells what actually ships.

Contributor

@ravwojdyla ravwojdyla left a comment


Top level makes sense!

Note: These are slice-level aggregate states, not direct VM states.
"""

REQUESTING = "requesting"
Contributor

nit: you can use enum.auto

rjpower added 13 commits April 22, 2026 17:50
Introduce an explicit transition table for slice lifecycle state changes,
replacing scattered mark_slice_ready/mark_slice_failed calls with a
dispatch-based state machine. Each transition is validated against a
(from_state, event) → to_state table, logged to a new slice_transitions
audit table, and returns typed side effects for the caller to execute.

Key changes:
- New slice_lifecycle.py with SliceEvent enum, TRANSITIONS table, and
  dispatch_slice_event() entry point
- New 0033_slice_transitions migration for the audit log table
- ScalingGroup.dispatch() delegates to the state machine
- runtime.py refresh() uses dispatch instead of direct state mutations
- operations.py worker failure path uses dispatch for state tracking
- SliceLifecycleState and SliceState moved to models.py to break the
  circular import between scaling_group.py and slice_lifecycle.py

Existing mark_slice_ready/mark_slice_failed methods are preserved for
backward compatibility with tests and the scale_down_if_idle path.

Refs: #4754
- FIX: dispatch path now sets last_active and worker_ids on READY
  transition inside the lock, preventing immediate scaledown of
  freshly-ready slices
- Remove SliceSideEffect wrapper dataclass — side_effects is now
  list[SliceSideEffectKind] directly, eliminating dead .context field
- Remove InvalidTransitionError (unused), PLATFORM_CALL_STARTED and
  TEARDOWN_COMPLETE events (no transitions)
- Remove unused `state` param from side-effect functions
- Deduplicate _on_bootstrap_failed/_on_worker_failure into shared
  _on_failure_with_teardown
- Rewrite mark_slice_ready/mark_slice_failed as thin dispatch() wrappers
- Remove redundant comments from transition table
- Fix dict[str, object] vs dict[str, Any] inconsistency
Move restart_worker and terminate_slices_for_workers logic directly
into the Autoscaler class, eliminating the operations.py module.

- restart_worker inlined as Autoscaler method (was a thin wrapper)
- Worker-failure path uses dispatch() + _execute_side_effects() like
  refresh(), eliminating the duplicate side-effect if/elif chain
- TERMINATE_SLICE now always does async detach+terminate via
  _async_terminate_slice/_spawn_terminate, unifying the refresh
  and worker-failure termination paths
- find_slice_for_worker becomes a private Autoscaler method
- SliceTerminationRequest/SliceTerminationResult data types deleted
…hort_lived

- dispatch() returns NOOP (applied=False) instead of None, eliminating
  all `if result:` guards at call sites
- cloud_state_to_event() maps CloudSliceState → SliceEvent in one place,
  collapsing the 3-branch if/elif/elif in refresh() to a single
  dispatch + _execute_side_effects call
- is_short_lived computed inside _on_failure_with_teardown from
  state.handle.created_at — callers no longer pass it via context
- Delete mark_slice_ready/mark_slice_failed: all callers (tests and
  production) now use dispatch() directly
…oup failure

The slice machine and group machine now mutate together under one lock.
This eliminates the SliceSideEffectKind enum and the cross-module
duplication where slice_lifecycle.py emitted RECORD_GROUP_FAILURE and
runtime.py interpreted it back into group.record_failure().

Changes:
- ScalingGroup.dispatch() now does the work inline: mutates slice state,
  detaches the slice on FAILED, and updates _consecutive_failures /
  _backoff_until atomically when a short-lived failure should trigger
  backoff. The dispatch logic is no longer split across slice_lifecycle.py
  and runtime.py.
- TransitionResult gains detached_handle, registered_workers, and
  triggered_backoff so callers know what to do without interpreting an
  enum: spawn an async terminate, register workers, log a backoff event.
- _execute_side_effects -> _handle_transition: a single 4-line method
  on Autoscaler that handles caller-side concerns only (worker registry,
  async terminate, dashboard log).
- BACKOFF_TRIGGERS frozenset replaces the per-event side-effect functions.
- record_failure stays public for the _do_scale_up exception path (the
  slice never made it into _slices, so no transition fires); it now
  delegates to _apply_failure_locked for the actual mutation.
- reset_backoff deleted (unused).
- FAILED is now terminal-and-detached: slices don't linger in _slices
  with lifecycle=FAILED. This is a cleaner invariant — the FAILED state
  is a brief intermediate during the dispatch transaction.

Tests rewritten to exercise the new interface through ScalingGroup.dispatch().
The DB audit table duplicated what the log server already captures.
logger.info now emits all transition details (group, slice, prior/new
state, event, triggered_backoff, error) as a single structured line
that can be queried through the log server.

- Remove _log_transition helper and its SQL insert
- Remove 0033_slice_transitions migration
- Expand the existing logger.info line with structured fields
Replace per-transition tests with one parametrized test that exercises every
TRANSITIONS entry. Keep explicit tests for things that aren't readable off
the table:
- READY transition sets last_active and worker_ids (the bug fix)
- Cross-machine backoff cascade (short-lived vs. long-lived, parametrized
  over BACKOFF_TRIGGERS)
- IDLE_TIMEOUT explicitly does not trigger backoff
- FAILED is terminal-and-detached
- Exponential backoff progression
- complete_scale_up clears failure state
- Structural exhaustiveness (tracked states have outgoing transitions)

Writing the parametrized test exposed that PLATFORM_CALL_* events and their
REQUESTING transitions were dead — the scale-up exception path calls
group.record_failure() directly; no slice ever has lifecycle=REQUESTING
(_pending_scale_ups is a counter, not a slice state). Removed both events
and their transitions.

Test count: 16 → 25 (parametrize expands table sanity into 11 cases, plus
4 new parametrized backoff-trigger cases).
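The structural exhaustiveness check can be as small as this (a toy table stands in for the real TRANSITIONS; state and event names are illustrative):

```python
from enum import Enum, auto


class State(Enum):
    INITIALIZING = auto()
    READY = auto()
    FAILED = auto()  # terminal: no outgoing transitions expected


TRANSITIONS = {
    (State.INITIALIZING, "cloud_ready"): State.READY,
    (State.INITIALIZING, "cloud_failed"): State.FAILED,
    (State.READY, "worker_failure"): State.FAILED,
}

TERMINAL = {State.FAILED}


def states_missing_outgoing(transitions, terminal):
    """Every tracked, non-terminal state must have at least one outgoing edge."""
    tracked = set(State) - terminal
    return sorted(
        s.name for s in tracked if not any(frm is s for (frm, _event) in transitions)
    )
```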
The last remaining production path that bypassed the state machine —
scale_down_if_idle called scale_down directly instead of dispatching
IDLE_TIMEOUT. Now all slice teardowns (cloud failure, worker failure,
idle timeout) flow through ScalingGroup.dispatch, which gives:

- One place that detaches the slice from _slices
- One place that emits the transition log line
- Single path for _handle_transition to pick up the detached handle
  and spawn async termination

scale_down_if_idle now returns list[TransitionResult] instead of
list[SliceHandle], and runtime.py's refresh loop calls _handle_transition
to unregister workers + async terminate.

Addresses a stale review comment on the original commit (IDLE_TIMEOUT
declared in TRANSITIONS but never dispatched).
The old path (workers table → groups → _slices → describe → match worker
handle) was fragile. If the autoscaler detached the slice (e.g., idle
timeout, worker failure cascade, short-lived failure), _slices no longer
has the entry and restart_worker rejects the RPC with "Slice X not found
in group Y" — even though the cloud VM is still running and the worker
handle is cached in the worker_registry.

The worker_registry is populated by the REGISTER_WORKERS side effect on
the READY transition and carries the RemoteWorkerHandle directly. Use
it as the source of truth: look up the TrackedWorker by worker_id, then
call handle.restart_worker on the cached handle. This removes a
dependency on _slices that didn't need to exist.

Fixes smoke test test_worker_restart_preserves_task failure.
…entry

There's a pre-existing race window between worker registration and
_do_scale_up's `complete_scale_up`: workers can register via RPC and
start serving tasks before the synchronous `platform.create_slice()`
call returns. During that window, the slice exists on the cloud and is
functional, but the autoscaler's _slices dict hasn't been populated yet.

The original restart_worker code went through _slices and failed in this
window. My earlier fix routed through _worker_registry, which has the
same problem (populated only when refresh observes the slice READY).

Fix: try _slices first (fast path), then fall back to platform.list_slices
filtered by the scale-group label to find the slice handle. Slower but
correct in both cases. restart_worker is a rare admin RPC so the extra
cloud call doesn't matter.
The smoke test failure (test_worker_restart_preserves_task) is a
pre-existing race in the autoscaler architecture, not something my
state machine changes introduced:

  - platform.create_slice() blocks until the gcloud TPU LRO completes
    (often 5-8 minutes for v5e)
  - Workers boot from the startup script and register via RPC long
    before that, often within 3 minutes
  - During that window, _do_scale_up is still blocked, complete_scale_up
    hasn't run, and _slices is empty
  - Any RPC that depends on _slices to find an active slice will fail

This race exists on main too — the test is flaky there for the same
reason; it just happened to land on the unlucky timing in our recent
runs (TPU create taking 7+ minutes).

My earlier "fix" (fall back to platform.list_slices) avoided the
_slices dependency but hit the next race: tpu_describe returns no
network endpoints during provisioning, so SSH targets an empty
hostname.

Both workarounds were treating symptoms. The actual fix would be to
either decouple slice tracking from the synchronous create_slice call
(insert into _slices immediately) or to make restart_worker wait for
the slice to be fully provisioned. That's out of scope for this PR.
Reverting to main's behavior so this PR isn't gated on a pre-existing
bug.
…h site

Addresses the heavier review feedback:

- Replace untyped `context: dict[str, Any]` payload with discriminated-union
  event dataclasses (CloudReady, CloudFailed, UnknownTimeout, WorkerFailure,
  IdleTimeout, CloudInitializing). Pyrefly now catches missing/misspelled
  fields at call sites that previously passed whatever dict shape they liked.

- Drop BACKOFF_TRIGGERS frozenset in favor of a `counts_toward_backoff`
  class-level attribute on the event type; dispatch never branches on kind.

- TransitionResult (with optional detached_handle/registered_workers/
  triggered_backoff/applied fields) is replaced by a sum type NoOp |
  InternalTransition | BecameReady | BecameFailed. Callers consume via a
  single `match` statement instead of three separate `if new_state == …`
  switches scattered across runtime.py.

- Collapse runtime.py: _register/_unregister wrappers gone, refresh() loses
  its slice_ready/slice_failed duplication, terminate_slices_for_workers()
  no longer needs a fallback "slice untracked" reconciliation branch
  (NoOp handles the race cleanly).

- Document the semantic change on dispatch(): FAILED transitions remove the
  slice from _slices atomically. slice_state_counts()/to_status() will
  never observe FAILED slices — failure is surfaced via the outcome and the
  action log, not lingering entries in the tracked map.

- Add test_concurrent_failures_account_atomically covering the lock
  widening: N concurrent CloudFailed dispatches for N slices in the same
  group produce consecutive_failures == N with no lost updates.
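The invariant that test exercises can be reproduced with a toy group: detach and failure accounting mutate under one lock, so N concurrent failures produce N counted failures with no lost updates (names simplified from the real ScalingGroup):

```python
import threading


class Group:
    """Toy stand-in: slice detach and failure counter mutate under one lock."""

    def __init__(self, n_slices: int) -> None:
        self._slices_lock = threading.Lock()
        self._slices = {f"s{i}": object() for i in range(n_slices)}
        self.consecutive_failures = 0

    def dispatch_failed(self, slice_id: str) -> None:
        with self._slices_lock:
            self._slices.pop(slice_id)
            self.consecutive_failures += 1


group = Group(n_slices=8)
threads = [
    threading.Thread(target=group.dispatch_failed, args=(f"s{i}",)) for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All 8 failures accounted for; no FAILED slice lingers in the tracked map.
```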
@rjpower rjpower force-pushed the claude/stoic-heyrovsky branch from 3432e50 to d37ce2f Compare April 23, 2026 02:12