
Commit 1b567c0

Revert restart_worker to main's behavior; race is pre-existing
The smoke test failure (test_worker_restart_preserves_task) is a pre-existing race in the autoscaler architecture, not something my state machine changes introduced:

  • platform.create_slice() blocks until the gcloud TPU LRO completes (often 5-8 minutes for v5e)
  • Workers boot from the startup script and register via RPC long before that, often within 3 minutes
  • During that window, _do_scale_up is still blocked, complete_scale_up hasn't run, and _slices is empty
  • Any RPC that depends on _slices to find an active slice will fail

This race exists on main too — the test is flaky there for the same reason; it just happened to land on the unlucky timing in our recent runs (TPU creation taking 7+ minutes).

My earlier "fix" (falling back to platform.list_slices) avoided the _slices dependency but hit the next race: tpu_describe returns no network endpoints during provisioning, so SSH targets an empty hostname. Both workarounds were treating symptoms.

The actual fix would be to either decouple slice tracking from the synchronous create_slice call (insert into _slices immediately) or make restart_worker wait for the slice to be fully provisioned. That's out of scope for this PR. Reverting to main's behavior so this PR isn't gated on a pre-existing bug.
1 parent 0c6c9ac commit 1b567c0

1 file changed

Lines changed: 2 additions & 19 deletions


lib/iris/src/iris/cluster/controller/autoscaler/runtime.py

```diff
@@ -496,14 +496,7 @@ def get_tracked_worker(self, worker_id: str) -> TrackedWorker | None:
         return self._worker_registry.tracked_worker(worker_id)
 
     def restart_worker(self, worker_id: str) -> None:
-        """Restart a worker with a fresh bootstrap script using the latest image.
-
-        Looks up the slice/scale-group from the workers DB row, then asks the
-        platform directly for the slice handle. This avoids depending on
-        _slices (which may not yet contain the slice if `complete_scale_up`
-        hasn't run) or _worker_registry (which is only populated when refresh
-        observes the slice as READY).
-        """
+        """Restart a worker with a fresh bootstrap script using the latest image."""
         if self._db is None:
             raise ValueError("No DB configured — cannot look up worker")
 
@@ -520,19 +513,9 @@ def restart_worker(self, worker_id: str) -> None:
         if group is None:
             raise ValueError(f"Scale group {row.scale_group} not found for worker {worker_id}")
 
-        # Try _slices first (fast path); fall back to a platform query for
-        # slices created via _do_scale_up that haven't yet hit complete_scale_up().
         slice_handle = group.get_slice(row.slice_id)
         if slice_handle is None:
-            zone = group.zone
-            zones = [zone] if zone else []
-            labels = {group._labels.iris_scale_group: group.name}
-            for handle in self._platform.list_slices(zones, labels):
-                if handle.slice_id == row.slice_id:
-                    slice_handle = handle
-                    break
-            if slice_handle is None:
-                raise ValueError(f"Slice {row.slice_id} not found for worker {worker_id}")
+            raise ValueError(f"Slice {row.slice_id} not found in group {row.scale_group}")
 
         workers = slice_handle.describe().workers
         handle = next((w for w in workers if w.worker_id == worker_id), None)
```
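The second option from the commit message — making restart_worker wait until the slice is fully provisioned before SSHing — amounts to polling until every worker reports a network endpoint. A minimal sketch under assumed names (`describe_workers` and the snapshot shape are stand-ins, not the real platform API):

```python
# Sketch: poll a slice's worker descriptions until every worker has a
# non-empty hostname, avoiding the "SSH targets an empty hostname" race
# seen when tpu_describe runs mid-provisioning. Names are hypothetical.
import time


def wait_for_endpoints(describe_workers, timeout_s: float = 600.0,
                       poll_s: float = 1.0) -> list:
    """Poll until every worker reports a non-empty hostname, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        workers = describe_workers()
        if workers and all(w.get("hostname") for w in workers):
            return workers
        time.sleep(poll_s)
    raise TimeoutError("slice never became fully provisioned")


# Simulated platform: the first snapshot has no endpoint (mirroring
# tpu_describe during provisioning), later snapshots do.
responses = [
    [{"worker_id": "w0", "hostname": ""}],
    [{"worker_id": "w0", "hostname": "10.0.0.5"}],
]


def describe_workers():
    # Consume snapshots in order; repeat the last one forever.
    return responses.pop(0) if len(responses) > 1 else responses[0]


workers = wait_for_endpoints(describe_workers, timeout_s=5.0, poll_s=0.01)
# workers[0]["hostname"] is now "10.0.0.5", safe to SSH to.
```

A bounded timeout matters here: without one, a slice whose create LRO ultimately fails would leave restart_worker blocked indefinitely.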
