scripts: restart and wait on nodes in parallel by matanl-starkware · Pull Request #14302 · starkware-libs/sequencer

matanl-starkware · 2026-06-03T06:29:57Z

For the non-interactive ALL_AT_ONCE strategy, restart every node's pod and then
run the post-restart health/metric waits concurrently instead of node-by-node,
which was the main source of slow rollouts. Add ServiceRestarter.restart_all
driving two parallel phases (restart, then wait) via run_in_parallel, gated by a
new --max-parallelism flag (default 16). ONE_BY_ONE and NO_RESTART stay
sequential since they prompt the user between nodes. Thread the flag through all
four entry scripts. Also fix _wait_for_pod_to_satisfy_condition to return True on
success (it returned None, so callers always logged a spurious failure).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
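To illustrate the `_wait_for_pod_to_satisfy_condition` fix described above, here is a minimal sketch of how a missing `return True` turns every success into a logged failure. The function names `wait_buggy`, `wait_fixed`, and `report`, the retry count, and the `check` callback are all hypothetical; the real function's signature and polling logic are not shown in this PR description.

```python
from typing import Callable, Optional


def wait_buggy(check: Callable[[], bool], attempts: int = 3) -> Optional[bool]:
    """Assumed pre-fix shape: the success path returns None implicitly."""
    for _ in range(attempts):
        if check():
            return  # bare return yields None, which is falsy
    return False


def wait_fixed(check: Callable[[], bool], attempts: int = 3) -> bool:
    """Assumed post-fix shape: success is reported explicitly."""
    for _ in range(attempts):
        if check():
            return True
    return False


def report(wait_fn: Callable[..., Optional[bool]], check: Callable[[], bool]) -> str:
    """Hypothetical caller that treats any falsy result as a failure."""
    if not wait_fn(check):
        return "failed"
    return "ok"
```

With a condition that is satisfied immediately, `report(wait_buggy, ...)` still yields `"failed"` because `not None` is true, while `report(wait_fixed, ...)` yields `"ok"`.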

cursor · 2026-06-03T06:30:04Z

PR Summary

Medium Risk
Touches production rollout scripts that delete pods and gate on metrics across many nodes at once; mis-tuned parallelism or signal handling could complicate incident response, though interactive strategies are unchanged.

Overview
For ALL_AT_ONCE, prod restart scripts now restart every node’s pod in parallel, then run post-restart waits (pod ready + metric gates) in parallel via new ServiceRestarter.restart_all and run_in_parallel, instead of finishing one node before starting the next.
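The two-phase flow summarized above could look roughly like the following sketch. It is not the code from `restarter_lib.py`: `run_in_parallel_sketch` and `restart_all_sketch` are hypothetical stand-ins for `run_in_parallel` and `ServiceRestarter.restart_all`, the `restart` and `wait` callbacks are assumed to return a bool, and skipping the wait for nodes whose restart failed is an assumption rather than confirmed behavior.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_in_parallel_sketch(
    fn: Callable[[str], bool], items: List[str], max_parallelism: int = 16
) -> Dict[str, bool]:
    """Run fn over items with at most max_parallelism concurrent workers."""
    if max_parallelism < 1:
        raise ValueError("max_parallelism must be at least 1")
    with ThreadPoolExecutor(max_workers=max_parallelism) as pool:
        # pool.map preserves input order, so results line up with items.
        results = list(pool.map(fn, items))
    return dict(zip(items, results))


def restart_all_sketch(
    nodes: List[str],
    restart: Callable[[str], bool],
    wait: Callable[[str], bool],
    max_parallelism: int = 16,
) -> Dict[str, bool]:
    """Phase 1 restarts every node; phase 2 starts only after phase 1 ends."""
    restarted = run_in_parallel_sketch(restart, nodes, max_parallelism)
    # Assumption: only wait on nodes whose restart reported success.
    to_wait = [node for node in nodes if restarted[node]]
    waited = run_in_parallel_sketch(wait, to_wait, max_parallelism)
    return {node: restarted[node] and waited.get(node, False) for node in nodes}
```

Because the executor's `with` block does not exit until every restart has finished, no wait begins while a restart is still in flight, which matches the "restart, then wait" ordering in the description.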

A --max-parallelism / -p flag (default 16) caps concurrent restarts/waits; ONE_BY_ONE and NO_RESTART stay sequential. The flag is threaded through update_config_and_restart_nodes and the revert, observer, config-update, and proposal-wait entry scripts. Kubectl output is captured so logs stay readable under parallelism; Ctrl-C on metric waits calls terminate_all_port_forwards. _wait_for_pod_to_satisfy_condition now returns True on success (fixing spurious failure logs). Unit tests in test_parallel_restart.py assert concurrency behavior.
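As a rough illustration of the flag described above, the sketch below wires `--max-parallelism` / `-p` with a default of 16 using `argparse`. The helper names `positive_int` and `build_parser` are hypothetical, and rejecting values below 1 is an assumption; the summary only states the flag names and the default.

```python
import argparse


def positive_int(value: str) -> int:
    """Parse a CLI value and reject anything below 1 (assumed validation)."""
    number = int(value)
    if number < 1:
        raise argparse.ArgumentTypeError("max parallelism must be at least 1")
    return number


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing only the parallelism cap, for illustration."""
    parser = argparse.ArgumentParser(description="parallel restart sketch")
    parser.add_argument(
        "-p",
        "--max-parallelism",
        type=positive_int,
        default=16,
        help="maximum number of concurrent restarts/waits",
    )
    return parser
```

An entry script would then pass `args.max_parallelism` down to whatever performs the restarts, which is the "threading" the summary refers to.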

Reviewed by Cursor Bugbot for commit 852f831. Bugbot is set up for automated code reviews on this repo.

reviewable-StarkWare · 2026-06-03T06:30:09Z

This change is

matanl-starkware · 2026-06-03T06:30:11Z

This stack of pull requests is managed by Graphite.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 244baa3.

ron-starkware

@ron-starkware reviewed 7 files and all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on matanl-starkware).

matanl-starkware · 2026-06-03T08:06:08Z

Re: the parallel restart sys.exit note — this is intended for the ALL_AT_ONCE flow: pod deletes are dispatched concurrently, and a failure in any worker is collected by run_in_parallel and surfaced as a non-zero exit once the in-flight operations settle (it is not silently swallowed). Aborting mid-flight isn't meaningful once deletes are already dispatched in parallel; the sequential one_by_one strategy retains the immediate-abort-on-failure behavior. (KeyboardInterrupt now propagates, per the helper PR.)
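The collect-then-exit behavior described in this reply can be sketched as follows. This is not the actual `run_in_parallel` helper: `run_and_collect` and `exit_code_for` are hypothetical names, and the thread-pool approach is an assumption. The point is that a failing worker does not stop the others, failures are gathered, and the caller exits non-zero once everything has settled.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def run_and_collect(
    fn: Callable[[str], None], items: List[str], max_parallelism: int = 16
) -> Dict[str, Exception]:
    """Run fn on every item and return a mapping of item -> raised exception."""
    failures: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_parallelism) as pool:
        futures = {pool.submit(fn, item): item for item in items}
        for future in as_completed(futures):
            try:
                future.result()
            except Exception as exc:
                # Exception excludes KeyboardInterrupt, so Ctrl-C in the
                # main thread is not swallowed by this handler.
                failures[futures[future]] = exc
    return failures


def exit_code_for(failures: Dict[str, Exception]) -> int:
    """Non-zero when any worker failed, zero otherwise."""
    return 1 if failures else 0
```

A caller would typically finish with `sys.exit(exit_code_for(failures))` after logging each entry, so one failed node is reported without cancelling operations already dispatched for the others.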

matanl-starkware · 2026-06-03T17:46:57Z

@ron-starkware — next in the stack 🙏 #14301 merged, so I rebased this onto main and its Reviewable check reset. Could you re-approve this revision when you have a moment? (No code changes from your prior approval — just the rebase.)

matanl-starkware requested a review from ron-starkware June 3, 2026 06:29

cursor Bot reviewed Jun 3, 2026

Comment thread scripts/prod/restarter_lib.py

ron-starkware approved these changes Jun 3, 2026

matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from 878e911 to fcc2da4 Compare June 3, 2026 08:04

matanl-starkware force-pushed the matanl/prod-parallel-restart branch from 244baa3 to 773b823 Compare June 3, 2026 08:04

matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from fcc2da4 to 8d5a423 Compare June 3, 2026 08:29

matanl-starkware force-pushed the matanl/prod-parallel-restart branch from 773b823 to 85f43e8 Compare June 3, 2026 08:29

matanl-starkware mentioned this pull request Jun 3, 2026

ci: exclude top-level scripts/prod files from hybrid system test #14311

Closed

matanl-starkware force-pushed the matanl/prod-parallel-restart branch from 85f43e8 to 85fcb3f Compare June 3, 2026 08:36

matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch 2 times, most recently from 41e48cc to 699ce08 Compare June 3, 2026 11:56

matanl-starkware force-pushed the matanl/prod-parallel-restart branch 2 times, most recently from 3223758 to 6ce3eed Compare June 3, 2026 12:19

matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from 699ce08 to 32948e4 Compare June 3, 2026 12:19

matanl-starkware mentioned this pull request Jun 3, 2026

ci: fix hybrid system test for prod-script changes and k3d install #14317

Merged

matanl-starkware force-pushed the matanl/prod-parallel-restart branch from 6ce3eed to 9b72c31 Compare June 3, 2026 12:22

matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from 32948e4 to 137d975 Compare June 3, 2026 12:22

matanl-starkware force-pushed the matanl/prod-parallel-restart branch from 9b72c31 to 8f348f3 Compare June 3, 2026 12:33

matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch 2 times, most recently from 0270f55 to ed98ef3 Compare June 3, 2026 13:50

matanl-starkware force-pushed the matanl/prod-parallel-restart branch from 8f348f3 to 94d2f12 Compare June 3, 2026 13:50

matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from ed98ef3 to fc1a2f9 Compare June 3, 2026 14:00

matanl-starkware force-pushed the matanl/prod-parallel-restart branch from 94d2f12 to a769e94 Compare June 3, 2026 14:00

matanl-starkware force-pushed the matanl/prod-metrics-thread-safe branch from fc1a2f9 to a69f19c Compare June 3, 2026 14:16

matanl-starkware force-pushed the matanl/prod-parallel-restart branch from a769e94 to 179d2e9 Compare June 3, 2026 14:16

matanl-starkware changed the base branch from matanl/prod-metrics-thread-safe to main June 3, 2026 15:11

matanl-starkware force-pushed the matanl/prod-parallel-restart branch from 179d2e9 to 852f831 Compare June 3, 2026 17:46

matanl-starkware added this pull request to the merge queue Jun 3, 2026

Merged via the queue into main with commit 287e678 Jun 3, 2026
16 checks passed

matanl-starkware merged 1 commit into main from matanl/prod-parallel-restart