scripts: restart and wait on nodes in parallel #14302
Cursor Bugbot has reviewed your changes and found 1 potential issue.
ron-starkware
left a comment
@ron-starkware reviewed 7 files and all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on matanl-starkware).
Re: the parallel restart |
@ron-starkware — next in the stack 🙏 #14301 merged, so I rebased this onto |


For the non-interactive ALL_AT_ONCE strategy, restart every node's pod and then
run the post-restart health/metric waits concurrently instead of node-by-node,
which was the main source of slow rollouts. Add ServiceRestarter.restart_all
driving two parallel phases (restart, then wait) via run_in_parallel, gated by a
new --max-parallelism flag (default 16). ONE_BY_ONE and NO_RESTART stay
sequential since they prompt the user between nodes. Thread the flag through all
four entry scripts. Also fix _wait_for_pod_to_satisfy_condition to return True on
success (it returned None, so callers always logged a spurious failure).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
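
For readers skimming the description, here is a minimal sketch of the two-phase flow it describes. This is not the code in this PR: the signatures of `run_in_parallel` and `restart_all` below are assumptions, and `restart_node` / `wait_for_node` are hypothetical callbacks standing in for the real pod restart and health/metric waits. It also illustrates why the `return True` fix matters: a wait function that falls off the end returns `None`, which callers treat as failure.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_in_parallel(
    fn: Callable[[T], R], items: Sequence[T], max_parallelism: int = 16
) -> List[R]:
    """Hypothetical stand-in: apply fn to each item on a bounded thread pool.

    Results come back in input order; an exception raised by fn propagates.
    """
    if not items:
        return []
    workers = max(1, min(max_parallelism, len(items)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))


def restart_all(
    nodes: Sequence[str],
    restart_node: Callable[[str], None],
    wait_for_node: Callable[[str], Optional[bool]],
    max_parallelism: int = 16,
) -> bool:
    """Sketch of the ALL_AT_ONCE flow: restart everything, then wait on everything."""
    # Phase 1: trigger every restart. The pool is drained before we move on,
    # so no wait starts until all restarts have been issued.
    run_in_parallel(restart_node, nodes, max_parallelism)
    # Phase 2: run the post-restart waits concurrently instead of node-by-node.
    results = run_in_parallel(wait_for_node, nodes, max_parallelism)
    # A wait that returns None (no explicit `return True`) counts as a failure here.
    return all(results)
```

Bounding the pool by `max_parallelism` keeps the number of concurrent cluster calls predictable, and splitting the work into two phases means the slowest node's wait overlaps with all the others rather than adding to them.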