scripts: restart and wait on nodes in parallel #14302

Merged: matanl-starkware merged 1 commit into main from matanl/prod-parallel-restart on Jun 3, 2026
Conversation

@matanl-starkware

Collaborator

For the non-interactive ALL_AT_ONCE strategy, restart every node's pod and then
run the post-restart health/metric waits concurrently instead of node-by-node,
which was the main source of slow rollouts. Add ServiceRestarter.restart_all
driving two parallel phases (restart, then wait) via run_in_parallel, gated by a
new --max-parallelism flag (default 16). ONE_BY_ONE and NO_RESTART stay
sequential since they prompt the user between nodes. Thread the flag through all
four entry scripts. Also fix _wait_for_pod_to_satisfy_condition to return True on
success (it returned None, so callers always logged a spurious failure).

Co-Authored-By: Claude Opus 4.8 (1M context) noreply@anthropic.com
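
The two-phase flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/prod/restarter_lib.py: the signature of run_in_parallel, the free-function form of restart_all, and the restart/wait callables are all assumptions. Only the names, the -p/--max-parallelism flag, and its default of 16 come from this PR.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_in_parallel(fn: Callable[[T], R], items: Sequence[T], max_parallelism: int) -> List[R]:
    """Hypothetical stand-in for the PR's helper: apply fn to every item using a
    bounded thread pool and return the results in input order."""
    if not items:
        return []
    workers = max(1, min(max_parallelism, len(items)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves input order; leaving the with-block waits for all workers.
        return list(pool.map(fn, items))


def restart_all(
    nodes: Sequence[str],
    restart: Callable[[str], bool],
    wait: Callable[[str], bool],
    max_parallelism: int = 16,
) -> bool:
    """Phase 1 restarts every node; phase 2 (the waits) starts only after phase 1 is done."""
    restarted = run_in_parallel(restart, nodes, max_parallelism)
    if not all(restarted):
        # Assumption for this sketch: skip the waits if any restart reported failure.
        return False
    return all(run_in_parallel(wait, nodes, max_parallelism))


def build_parser() -> argparse.ArgumentParser:
    """Flag name, short form, and default follow the PR text; the rest is illustrative."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-p", "--max-parallelism", type=int, default=16)
    return parser
```

Splitting the work into two bounded phases means no node's health wait begins until every pod delete has been issued, so total rollout time is governed by the slowest node rather than the sum over all nodes.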

@cursor

cursor Bot commented Jun 3, 2026


PR Summary

Medium Risk
Touches production rollout scripts that delete pods and gate on metrics across many nodes at once; mis-tuned parallelism or signal handling could complicate incident response, though interactive strategies are unchanged.

Overview
For ALL_AT_ONCE, prod restart scripts now restart every node’s pod in parallel, then run post-restart waits (pod ready + metric gates) in parallel via new ServiceRestarter.restart_all and run_in_parallel, instead of finishing one node before starting the next.

A --max-parallelism / -p flag (default 16) caps concurrent restarts/waits; ONE_BY_ONE and NO_RESTART stay sequential. The flag is threaded through update_config_and_restart_nodes and the revert, observer, config-update, and proposal-wait entry scripts. Kubectl output is captured so logs stay readable under parallelism; Ctrl-C on metric waits calls terminate_all_port_forwards. _wait_for_pod_to_satisfy_condition now returns True on success (fixing spurious failure logs). Unit tests in test_parallel_restart.py assert concurrency behavior.
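
The return-value fix mentioned above is easy to reproduce in isolation. The sketch below uses hypothetical names (wait_buggy, wait_fixed, report) rather than the real _wait_for_pod_to_satisfy_condition, and only shows why a function that falls off the end makes a truthiness check in the caller report failure.

```python
from typing import Callable, Optional


def wait_buggy(check: Callable[[], bool]) -> Optional[bool]:
    """Illustrative 'before' shape: success path has no return statement."""
    if check():
        pass  # Condition satisfied, but the function falls through and returns None.
    return None


def wait_fixed(check: Callable[[], bool]) -> bool:
    """Illustrative 'after' shape: success is reported explicitly."""
    if check():
        return True
    return False


def report(ok: Optional[bool]) -> str:
    """Mimics a caller that treats any falsy result as a failure."""
    return "ok" if ok else "FAILED"


print(report(wait_buggy(lambda: True)))  # None is falsy, so the caller logs a failure
print(report(wait_fixed(lambda: True)))
```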

Reviewed by Cursor Bugbot for commit 852f831.

@reviewable-StarkWare


This change is Reviewable

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 244baa3.

Comment thread scripts/prod/restarter_lib.py

@ron-starkware ron-starkware left a comment

Contributor

@ron-starkware reviewed 7 files and all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on matanl-starkware).

@matanl-starkware matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from 878e911 to fcc2da4 Compare June 3, 2026 08:04
@matanl-starkware matanl-starkware force-pushed the matanl/prod-parallel-restart branch from 244baa3 to 773b823 Compare June 3, 2026 08:04
@matanl-starkware

Collaborator Author

Re: the parallel-restart sys.exit note: this is intended for the ALL_AT_ONCE flow. Pod deletes are dispatched concurrently, and a failure in any worker is collected by run_in_parallel and surfaced as a non-zero exit once the in-flight operations settle (it is not silently swallowed). Aborting mid-flight isn't meaningful once deletes are already dispatched in parallel; the sequential ONE_BY_ONE strategy retains the immediate-abort-on-failure behavior. (KeyboardInterrupt now propagates, per the helper PR.)
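
For readers following along, the "collect failures, then exit non-zero" behavior can be sketched like this. It is an assumption-laden illustration rather than the actual helper: collect_failures and exit_code_for are hypothetical names, and the real run_in_parallel may report errors differently.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Sequence


def collect_failures(
    fn: Callable[[str], None],
    nodes: Sequence[str],
    max_parallelism: int = 16,
) -> Dict[str, Exception]:
    """Run fn for every node, let all in-flight work settle, and return the
    exceptions keyed by node instead of aborting on the first one."""
    failures: Dict[str, Exception] = {}
    if not nodes:
        return failures
    workers = max(1, min(max_parallelism, len(nodes)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fn, node): node for node in nodes}
        for future in as_completed(futures):
            try:
                future.result()
            except Exception as exc:
                # KeyboardInterrupt derives from BaseException, not Exception,
                # so this handler does not swallow it.
                failures[futures[future]] = exc
    return failures


def exit_code_for(failures: Dict[str, Exception]) -> int:
    """Map collected failures to a process exit status (non-zero on any failure)."""
    return 1 if failures else 0
```

A caller would pass the result to sys.exit(exit_code_for(failures)) only after the pool has drained, which matches the point made above that aborting mid-flight has little value once the deletes are already dispatched.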

@matanl-starkware matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from fcc2da4 to 8d5a423 Compare June 3, 2026 08:29
@matanl-starkware matanl-starkware force-pushed the matanl/prod-parallel-restart branch from 773b823 to 85f43e8 Compare June 3, 2026 08:29
@matanl-starkware matanl-starkware force-pushed the matanl/prod-parallel-restart branch from 85f43e8 to 85fcb3f Compare June 3, 2026 08:36
@matanl-starkware matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch 2 times, most recently from 41e48cc to 699ce08 Compare June 3, 2026 11:56
@matanl-starkware matanl-starkware force-pushed the matanl/prod-parallel-restart branch 2 times, most recently from 3223758 to 6ce3eed Compare June 3, 2026 12:19
@matanl-starkware matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from 699ce08 to 32948e4 Compare June 3, 2026 12:19
@matanl-starkware matanl-starkware force-pushed the matanl/prod-parallel-restart branch from 6ce3eed to 9b72c31 Compare June 3, 2026 12:22
@matanl-starkware matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from 32948e4 to 137d975 Compare June 3, 2026 12:22
@matanl-starkware matanl-starkware force-pushed the matanl/prod-parallel-restart branch from 9b72c31 to 8f348f3 Compare June 3, 2026 12:33
@matanl-starkware matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch 2 times, most recently from 0270f55 to ed98ef3 Compare June 3, 2026 13:50
@matanl-starkware matanl-starkware force-pushed the matanl/prod-parallel-restart branch from 8f348f3 to 94d2f12 Compare June 3, 2026 13:50
@matanl-starkware matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from ed98ef3 to fc1a2f9 Compare June 3, 2026 14:00
@matanl-starkware matanl-starkware force-pushed the matanl/prod-parallel-restart branch from 94d2f12 to a769e94 Compare June 3, 2026 14:00
@matanl-starkware matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from fc1a2f9 to a69f19c Compare June 3, 2026 14:16
@matanl-starkware matanl-starkware force-pushed the matanl/prod-parallel-restart branch from a769e94 to 179d2e9 Compare June 3, 2026 14:16
@matanl-starkware matanl-starkware changed the base branch from matanl/prod-metrics-thread-safe to main June 3, 2026 15:11
@matanl-starkware matanl-starkware force-pushed the matanl/prod-parallel-restart branch from 179d2e9 to 852f831 Compare June 3, 2026 17:46
@matanl-starkware

Collaborator Author

@ron-starkware — next in the stack 🙏 #14301 merged, so I rebased this onto main and its Reviewable check reset. Could you re-approve this revision when you have a moment? (No code changes from your prior approval — just the rebase.)

@matanl-starkware matanl-starkware added this pull request to the merge queue Jun 3, 2026
Merged via the queue into main with commit 287e678 Jun 3, 2026
16 checks passed