
Commit 1b567c0

Revert restart_worker to main's behavior; race is pre-existing
The smoke test failure (test_worker_restart_preserves_task) is a pre-existing race in the autoscaler architecture, not something my state machine changes introduced:

  • platform.create_slice() blocks until the gcloud TPU LRO completes (often 5-8 minutes for v5e)
  • Workers boot from the startup script and register via RPC long before that, often within 3 minutes
  • During that window, _do_scale_up is still blocked, complete_scale_up hasn't run, and _slices is empty
  • Any RPC that depends on _slices to find an active slice will fail

This race exists on main too — the test is flaky there for the same reason; it just happened to land on the unlucky timing in our recent runs (TPU creation taking 7+ minutes).

My earlier "fix" (falling back to platform.list_slices) avoided the _slices dependency but hit the next race: tpu_describe returns no network endpoints during provisioning, so SSH targets an empty hostname. Both workarounds were treating symptoms.

The actual fix would be to either decouple slice tracking from the synchronous create_slice call (insert into _slices immediately) or make restart_worker wait for the slice to be fully provisioned. That's out of scope for this PR. Reverting to main's behavior so this PR isn't gated on a pre-existing bug.
1 parent 0c6c9ac commit 1b567c0

1 file changed

Lines changed: 2 additions & 19 deletions


lib/iris/src/iris/cluster/controller/autoscaler/runtime.py

```diff
@@ -496,14 +496,7 @@ def get_tracked_worker(self, worker_id: str) -> TrackedWorker | None:
         return self._worker_registry.tracked_worker(worker_id)
 
     def restart_worker(self, worker_id: str) -> None:
-        """Restart a worker with a fresh bootstrap script using the latest image.
-
-        Looks up the slice/scale-group from the workers DB row, then asks the
-        platform directly for the slice handle. This avoids depending on
-        _slices (which may not yet contain the slice if `complete_scale_up`
-        hasn't run) or _worker_registry (which is only populated when refresh
-        observes the slice as READY).
-        """
+        """Restart a worker with a fresh bootstrap script using the latest image."""
         if self._db is None:
             raise ValueError("No DB configured — cannot look up worker")
 
@@ -520,19 +513,9 @@ def restart_worker(self, worker_id: str) -> None:
         if group is None:
             raise ValueError(f"Scale group {row.scale_group} not found for worker {worker_id}")
 
-        # Try _slices first (fast path); fall back to a platform query for
-        # slices created via _do_scale_up that haven't yet hit complete_scale_up().
         slice_handle = group.get_slice(row.slice_id)
         if slice_handle is None:
-            zone = group.zone
-            zones = [zone] if zone else []
-            labels = {group._labels.iris_scale_group: group.name}
-            for handle in self._platform.list_slices(zones, labels):
-                if handle.slice_id == row.slice_id:
-                    slice_handle = handle
-                    break
-            if slice_handle is None:
-                raise ValueError(f"Slice {row.slice_id} not found for worker {worker_id}")
+            raise ValueError(f"Slice {row.slice_id} not found in group {row.scale_group}")
 
         workers = slice_handle.describe().workers
         handle = next((w for w in workers if w.worker_id == worker_id), None)
```
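The second option from the commit message — making restart_worker wait until the slice is fully provisioned before SSHing — amounts to polling until every worker reports a network endpoint. A minimal sketch under assumed names (`describe_workers` and the snapshot shape are stand-ins, not the real platform API):

```python
# Sketch: poll a slice's worker descriptions until every worker has a
# non-empty hostname, avoiding the "SSH targets an empty hostname" race
# seen when tpu_describe runs mid-provisioning. Names are hypothetical.
import time


def wait_for_endpoints(describe_workers, timeout_s: float = 600.0,
                       poll_s: float = 1.0) -> list:
    """Poll until every worker reports a non-empty hostname, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        workers = describe_workers()
        if workers and all(w.get("hostname") for w in workers):
            return workers
        time.sleep(poll_s)
    raise TimeoutError("slice never became fully provisioned")


# Simulated platform: the first snapshot has no endpoint (mirroring
# tpu_describe during provisioning), later snapshots do.
responses = [
    [{"worker_id": "w0", "hostname": ""}],
    [{"worker_id": "w0", "hostname": "10.0.0.5"}],
]


def describe_workers():
    # Consume snapshots in order; repeat the last one forever.
    return responses.pop(0) if len(responses) > 1 else responses[0]


workers = wait_for_endpoints(describe_workers, timeout_s=5.0, poll_s=0.01)
# workers[0]["hostname"] is now "10.0.0.5", safe to SSH to.
```

A bounded timeout matters here: without one, a slice whose create LRO ultimately fails would leave restart_worker blocked indefinitely.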
