[iris] Adaptive rolling worker restart with observation window#4635
Claude finished @rjpower's task in 2m 44s
PR Review: Adaptive Rolling Worker Restart
Nice, clean PR. The progressive batch-sizing strategy (1 → 2 → 4 → ... → max) with an observation window is a solid approach for safe rolling restarts. The helper extraction ( One bug found:
Two suggestions:
pyproject.toml change — removing the unrecognized Overall the logic is correct and the code is well-structured. The |
ca121c9 to
a4cfe27
    click.echo(f" Batch OK ({succeeded}/{total} complete)")

    # Double batch size for next round, capped at max_batch
    batch_size = min(batch_size * 2, max_batch)
Bug: --max-batch 0 causes an infinite loop.
After the first batch (size 1), batch_size becomes min(2, 0) = 0. Subsequent iterations slice an empty batch, offset never advances, and the loop spins forever.
Consider adding an early guard:
    - batch_size = min(batch_size * 2, max_batch)
    + batch_size = min(batch_size * 2, max(max_batch, 1))
Or validate the option at the top of the function with if max_batch < 1: raise click.BadParameter(...).
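To make the failure mode and the fix concrete, here is a minimal sketch of the batch schedule with the validation applied up front. `plan_batches` is a hypothetical helper, not a function from this PR, and it raises `ValueError` where the CLI would raise `click.BadParameter`.

```python
def plan_batches(total: int, max_batch: int) -> list[int]:
    """Return batch sizes 1, 2, 4, ... capped at max_batch, covering `total` workers."""
    if max_batch < 1:
        # Fail fast: otherwise batch_size becomes min(2, 0) = 0 after the first
        # batch, the slice is empty, and offset never advances.
        raise ValueError(f"max_batch must be >= 1, got {max_batch}")

    sizes: list[int] = []
    offset = 0
    batch_size = 1
    while offset < total:
        size = min(batch_size, total - offset)  # the last batch may be short
        sizes.append(size)
        offset += size
        # Double batch size for next round, capped at max_batch
        batch_size = min(batch_size * 2, max_batch)
    return sizes
```

With 10 workers and `max_batch=4`, this yields batches of 1, 2, 4, and 3 workers.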
    raise SystemExit(1)

    click.echo(f" All {len(batch)} worker(s) healthy. Observing for {observation_window}s...")
    time.sleep(observation_window)
Nit: The observation window sleeps the full duration then checks health once at the end. A worker that crashes 1 second in won't be detected until the full window elapses.
Consider polling at intervals during the window (e.g., every 5-10s) and aborting early on failure — that would make the command feel more responsive on large clusters.
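A rough sketch of the polling variant, assuming the health check can be wrapped in a zero-argument callable. The name `observe`, its parameters, and the injectable `sleep`/`clock` hooks are illustrative and not part of the PR.

```python
import time
from typing import Callable


def observe(
    all_healthy: Callable[[], bool],
    window_s: float,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll all_healthy() during the window; return False at the first failure."""
    deadline = clock() + window_s
    while True:
        if not all_healthy():
            return False  # abort early instead of waiting out the window
        remaining = deadline - clock()
        if remaining <= 0:
            return True  # window elapsed with every check passing
        # Never sleep past the end of the window.
        sleep(min(poll_interval_s, remaining))
```

Injecting `sleep` and `clock` keeps the helper testable without real delays; the defaults behave like the original `time.sleep` call, but with a health check every `poll_interval_s` seconds.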
    pass
    return len(remaining) == 0

    backoff.wait_until(_all_healthy, timeout=Duration.from_seconds(timeout))
Ultra-minor: the return value of wait_until (bool) is silently discarded here. The function works correctly because remaining is mutated by the closure, but assigning to _ would make the intent explicit:
    - backoff.wait_until(_all_healthy, timeout=Duration.from_seconds(timeout))
    + _timed_out = backoff.wait_until(_all_healthy, timeout=Duration.from_seconds(timeout))
(Or just _ = ... — either way signals "I know this returns something and I'm ignoring it on purpose.")
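An alternative to discarding the result is to act on it. The sketch below uses a hypothetical stand-in `wait_until` (the real `backoff.wait_until` signature is not shown in this thread and may differ) together with a closure that mutates `remaining`, as in the code under review.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout_s: float, interval_s: float = 0.01) -> bool:
    """Hypothetical stand-in: True if predicate() became true before the timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def wait_for_workers(
    worker_ids: list[str],
    is_healthy: Callable[[str], bool],
    timeout_s: float,
) -> set[str]:
    """Return the workers still unhealthy when the wait ends (empty set on success)."""
    remaining = set(worker_ids)

    def _all_healthy() -> bool:
        # Drop workers as they report healthy; the closure mutates `remaining`.
        remaining.difference_update({w for w in remaining if is_healthy(w)})
        return len(remaining) == 0

    all_ok = wait_until(_all_healthy, timeout_s)
    # Use the return value explicitly rather than relying only on the side effect.
    return set() if all_ok else remaining
```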
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a4cfe27b0d
    click.echo(f" Batch OK ({succeeded}/{total} complete)")

    # Double batch size for next round, capped at max_batch
    batch_size = min(batch_size * 2, max_batch)
Validate --max-batch is at least 1
If --max-batch is 0 or negative, batch_size = min(batch_size * 2, max_batch) eventually makes batch_size non-positive, so the next slice is empty and neither succeeded nor offset increases. At that point the while offset < total loop never progresses and the command can hang indefinitely (sleeping each observation cycle) instead of failing fast on invalid input.
Replace the simple one-at-a-time worker-restart with progressive batch sizing (1, 2, 4, ... up to --max-batch). Each batch waits for workers to become healthy, then observes for --observation-window seconds checking for heartbeat failures before advancing. Aborts immediately on any failure. --worker-id is now repeatable to target specific workers. Also removes the unrecognized disable-project-excludes-heuristics key from pyrefly config.
@claude pls rebase to main and address PR feedback
a4cfe27 to
690bacc
Claude finished @rjpower's task in 4m 53s
Rebase and Address PR Feedback
All three review items addressed in |
c6048cc to
c5cb813
Replace the simple one-at-a-time worker-restart with progressive batch sizing
(1, 2, 4, ... up to --max-batch). Each batch waits for all workers to become
healthy, then observes for --observation-window seconds checking for heartbeat
failures before advancing. Aborts immediately on any failure. --worker-id is
now repeatable to target specific workers.
Also removes the unrecognized disable-project-excludes-heuristics key from
pyrefly config.
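
The control flow described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `rolling_restart` and its callback parameters (`restart`, `wait_healthy`, `observe`) are hypothetical names standing in for the controller calls, the health wait, and the observation window.

```python
from typing import Callable, Sequence


def rolling_restart(
    worker_ids: Sequence[str],
    restart: Callable[[str], None],
    wait_healthy: Callable[[Sequence[str]], bool],
    observe: Callable[[Sequence[str]], bool],
    max_batch: int,
) -> int:
    """Restart workers in batches of 1, 2, 4, ... up to max_batch.

    Returns the number of workers restarted; raises RuntimeError on the
    first batch that fails to become healthy or fails during observation.
    """
    if max_batch < 1:
        raise ValueError("max_batch must be >= 1")

    offset, batch_size = 0, 1
    while offset < len(worker_ids):
        batch = worker_ids[offset : offset + batch_size]
        for worker_id in batch:
            restart(worker_id)
        if not wait_healthy(batch):
            raise RuntimeError(f"batch {list(batch)} did not become healthy")
        if not observe(batch):
            raise RuntimeError(f"batch {list(batch)} failed during observation")
        offset += len(batch)
        # Grow the batch only after a fully successful round.
        batch_size = min(batch_size * 2, max_batch)
    return offset
```

Starting at a batch of one limits the blast radius of a bad rollout to a single worker, and doubling only after a clean observation window lets a healthy rollout speed up.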