Commit a25e63e
Revert restart_worker to main's behavior; race is pre-existing
The smoke test failure (test_worker_restart_preserves_task) is a
pre-existing race in the autoscaler architecture, not something my
state machine changes introduced:
- platform.create_slice() blocks until the gcloud TPU LRO completes
  (often 5-8 minutes for v5e)
- Workers boot from the startup script and register via RPC long
  before that, often within 3 minutes
- During that window, _do_scale_up is still blocked, complete_scale_up
  hasn't run, and _slices is empty
- Any RPC that depends on _slices to find an active slice will fail
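The window described above can be reproduced in isolation. The following is a minimal sketch, not the real autoscaler: ToyAutoscaler, create_may_finish, and demo are hypothetical names, and a threading.Event stands in for the blocking TPU create so the timing is deterministic.

```python
import threading


class ToyAutoscaler:
    """Hypothetical stand-in for the autoscaler; only models the race."""

    def __init__(self):
        self._slices = {}  # populated only after the create call returns
        self.create_may_finish = threading.Event()

    def _create_slice(self, slice_id):
        # Stands in for platform.create_slice(): blocks until the LRO completes.
        self.create_may_finish.wait()
        return {"id": slice_id}

    def do_scale_up(self, slice_id):
        handle = self._create_slice(slice_id)  # blocked for minutes in practice
        self._slices[slice_id] = handle  # the complete_scale_up step

    def restart_worker(self, slice_id):
        # Any lookup through _slices fails while do_scale_up is still blocked.
        if slice_id not in self._slices:
            raise LookupError(f"no active slice {slice_id!r}")
        return "restarted"


def demo():
    scaler = ToyAutoscaler()
    scale_up = threading.Thread(target=scaler.do_scale_up, args=("slice-0",))
    scale_up.start()
    try:
        # The worker has registered, but the create call is still in flight.
        early = scaler.restart_worker("slice-0")
    except LookupError:
        early = "failed"
    scaler.create_may_finish.set()  # the LRO finally completes
    scale_up.join()
    late = scaler.restart_worker("slice-0")
    return early, late
```

Under these assumptions, the call made while the create is in flight fails and the same call after completion succeeds, which matches the flaky behavior of the smoke test.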
This race exists on main too — the test is flaky there for the same
reason; it just happened to land on the unlucky timing in our recent
runs (TPU create taking 7+ minutes).
My earlier "fix" (fall back to platform.list_slices) avoided the
_slices dependency but hit the next race: tpu_describe returns no
network endpoints during provisioning, so SSH targets an empty
hostname.
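For illustration, the empty-endpoint condition can at least be detected before any SSH attempt. This is a sketch under assumptions: first_endpoint_ip is a hypothetical helper, and the dictionary shape (a networkEndpoints list whose entries carry ipAddress) is assumed to mirror the describe output rather than taken from this repository. It only makes the symptom explicit; it does not remove the race.

```python
def first_endpoint_ip(describe_result):
    """Return the first usable IP address, or None while provisioning.

    Assumes describe_result is a dict with an optional "networkEndpoints"
    list of dicts that may each contain "ipAddress" (an assumed shape).
    """
    endpoints = describe_result.get("networkEndpoints") or []
    for endpoint in endpoints:
        ip = endpoint.get("ipAddress")
        if ip:
            return ip
    return None
```

A caller would treat None as "slice not ready yet" instead of building an SSH target from an empty hostname.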
Both workarounds were treating symptoms. The actual fix would be to
either decouple slice tracking from the synchronous create_slice call
(insert into _slices immediately) or to make restart_worker wait for
the slice to be fully provisioned. That's out of scope for this PR.
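A rough sketch of what those two options could look like together, for whoever picks this up later. All names here (SliceRecord, TrackedAutoscaler, the injected create_slice callable) are hypothetical and do not reflect the real classes; error handling for a failed create is deliberately left out.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class SliceRecord:
    """Hypothetical record tracked from the moment scale-up starts."""
    slice_id: str
    ready: threading.Event = field(default_factory=threading.Event)
    endpoints: list = field(default_factory=list)


class TrackedAutoscaler:
    def __init__(self, create_slice):
        # create_slice: assumed blocking callable returning a list of endpoints.
        self._create_slice = create_slice
        self._slices = {}
        self._lock = threading.Lock()

    def do_scale_up(self, slice_id):
        record = SliceRecord(slice_id)
        with self._lock:
            # Option 1: track the slice before the blocking create call.
            self._slices[slice_id] = record
        record.endpoints = self._create_slice(slice_id)
        record.ready.set()

    def restart_worker(self, slice_id, timeout=None):
        with self._lock:
            record = self._slices.get(slice_id)
        if record is None:
            raise LookupError(f"unknown slice {slice_id!r}")
        # Option 2: wait until the slice is fully provisioned.
        if not record.ready.wait(timeout):
            raise TimeoutError(f"slice {slice_id!r} is still provisioning")
        if not record.endpoints:
            raise RuntimeError(f"slice {slice_id!r} has no network endpoints")
        return record.endpoints[0]
```

With this shape, a worker that registers early gets a bounded wait or an explicit timeout rather than a lookup failure.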
Reverting to main's behavior so this PR isn't gated on a pre-existing
bug.
1 parent c508a62 commit a25e63e
1 file changed
Lines changed: 2 additions & 19 deletions
| Original file lines | Diff lines | Change |
|---|---|---|
| 499-506 | 499 | 8 lines removed, 1 line added |
| 523-524 | | 2 lines removed |
| 527-535 | 518 | 9 lines removed, 1 line added |