
[iris] Adaptive rolling worker restart with observation window #4635

Merged
rjpower merged 2 commits into main from iris/adaptive-worker-restart
Apr 10, 2026
Conversation

rjpower (Collaborator) commented Apr 10, 2026

Replace the simple one-at-a-time worker-restart with progressive batch sizing
(1, 2, 4, ... up to --max-batch). Each batch waits for all workers to become
healthy, then observes for --observation-window seconds checking for heartbeat
failures before advancing. Aborts immediately on any failure. --worker-id is
now repeatable to target specific workers.

Also removes the unrecognized disable-project-excludes-heuristics key from
pyrefly config.
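The progressive sizing is simple enough to sketch. A minimal Python illustration of the batching schedule described above (the function name and shape are hypothetical, not the PR's actual code):

```python
def batch_sizes(total: int, max_batch: int) -> list[int]:
    """Progressive batch sizes: 1, 2, 4, ... capped at max_batch.

    The final batch is trimmed to however many workers remain.
    """
    if max_batch < 1:
        raise ValueError("max_batch must be >= 1")
    sizes: list[int] = []
    size, offset = 1, 0
    while offset < total:
        take = min(size, total - offset)  # don't overrun the worker list
        sizes.append(take)
        offset += take
        size = min(size * 2, max_batch)  # double, capped at max_batch
    return sizes

print(batch_sizes(10, 4))  # [1, 2, 4, 3]
```

In the real command, each batch additionally waits for health and an observation window before the next batch begins.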

rjpower added the agent-generated label Apr 10, 2026
claude Bot (Contributor) commented Apr 10, 2026

Claude finished @rjpower's task in 2m 44s.


PR Review: Adaptive Rolling Worker Restart

  • Read changed files and understand the diff
  • Review logic correctness and edge cases
  • Check code style and adherence to repo guidelines
  • Post review feedback

Nice, clean PR. The progressive batch-sizing strategy (1 → 2 → 4 → ... → max) with an observation window is a solid approach for safe rolling restarts. The helper extraction (_wait_for_workers_healthy, _check_worker_health, _print_summary) keeps the main function readable, and the abort-on-failure semantics are correct throughout.

One bug found:

  • --max-batch 0 causes an infinite loop — after the first batch, batch_size becomes min(2, 0) = 0, empty batches never advance offset. See inline comment for fix suggestions.

Two suggestions:

  • The observation window sleeps the full duration then checks once. Polling periodically during the window would detect failures faster on large clusters. (inline)
  • wait_until return value is silently discarded in _wait_for_workers_healthy — assigning to _ would clarify intent. (inline)

pyproject.toml change — removing the unrecognized disable-project-excludes-heuristics key is a good cleanup.

Overall the logic is correct and the code is well-structured. The max_batch validation is the only thing I'd want addressed before merge.

rjpower force-pushed the iris/adaptive-worker-restart branch from ca121c9 to a4cfe27 on April 10, 2026 at 19:49
click.echo(f" Batch OK ({succeeded}/{total} complete)")

# Double batch size for next round, capped at max_batch
batch_size = min(batch_size * 2, max_batch)

Bug: --max-batch 0 causes an infinite loop.

After the first batch (size 1), batch_size becomes min(2, 0) = 0. Subsequent iterations slice an empty batch, offset never advances, and the loop spins forever.

Consider adding an early guard:

Suggested change:
- batch_size = min(batch_size * 2, max_batch)
+ batch_size = min(batch_size * 2, max(max_batch, 1))

Or validate the option at the top of the function with if max_batch < 1: raise click.BadParameter(...).
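A minimal sketch of the up-front validation approach, assuming a click-based command (the command and option names here are illustrative, not the PR's actual code):

```python
import click

@click.command()
@click.option("--max-batch", type=int, default=8)
def restart(max_batch: int) -> None:
    # Fail fast on invalid input instead of spinning in an empty-batch loop.
    if max_batch < 1:
        raise click.BadParameter("--max-batch must be >= 1")
    click.echo(f"restarting with max batch size {max_batch}")
```

click turns BadParameter into a usage error with a non-zero exit code, so the command fails immediately rather than hanging.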

raise SystemExit(1)

click.echo(f" All {len(batch)} worker(s) healthy. Observing for {observation_window}s...")
time.sleep(observation_window)

Nit: The observation window sleeps the full duration then checks health once at the end. A worker that crashes 1 second in won't be detected until the full window elapses.

Consider polling at intervals during the window (e.g., every 5-10s) and aborting early on failure — that would make the command feel more responsive on large clusters.
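A minimal sketch of the polling variant, assuming a health-check callable (check_healthy and the default interval are hypothetical):

```python
import time

def observe(check_healthy, window_s: float, poll_s: float = 10.0) -> bool:
    """Poll health repeatedly during the observation window.

    Returns False as soon as a check fails, rather than sleeping the
    whole window and checking only once at the end.
    """
    deadline = time.monotonic() + window_s
    while time.monotonic() < deadline:
        if not check_healthy():
            return False  # abort early on the first failure
        time.sleep(min(poll_s, max(deadline - time.monotonic(), 0.0)))
    return check_healthy()  # one last check at the window boundary
```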

pass
return len(remaining) == 0

backoff.wait_until(_all_healthy, timeout=Duration.from_seconds(timeout))

Ultra-minor: the return value of wait_until (bool) is silently discarded here. The function works correctly because remaining is mutated by the closure, but assigning to _ would make the intent explicit:

Suggested change:
- backoff.wait_until(_all_healthy, timeout=Duration.from_seconds(timeout))
+ _timed_out = backoff.wait_until(_all_healthy, timeout=Duration.from_seconds(timeout))

(Or just _ = ... — either way signals "I know this returns something and I'm ignoring it on purpose.")


chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a4cfe27b0d


click.echo(f" Batch OK ({succeeded}/{total} complete)")

# Double batch size for next round, capped at max_batch
batch_size = min(batch_size * 2, max_batch)

P2: Validate --max-batch is at least 1

If --max-batch is 0 or negative, batch_size = min(batch_size * 2, max_batch) eventually makes batch_size non-positive, so the next slice is empty and neither succeeded nor offset increases. At that point the while offset < total loop never progresses and the command can hang indefinitely (sleeping each observation cycle) instead of failing fast on invalid input.


rjpower (Collaborator, Author) commented Apr 10, 2026

@claude pls rebase to main and address PR feedback

rjpower force-pushed the iris/adaptive-worker-restart branch from a4cfe27 to 690bacc on April 10, 2026 at 20:01
claude Bot (Contributor) commented Apr 10, 2026

Claude finished @rjpower's task in 4m 53s.


Rebase and Address PR Feedback

  • Rebase branch onto main — already up to date
  • Fix bug: validate --max-batch >= 1 — added click.BadParameter guard at top of function
  • Improve observation window to poll periodically — polls every 10s (or less if window < 10s) and aborts early on failure
  • Fix nit: assign wait_until return to _
  • Run pre-commit — all checks pass
  • Run tests — 1490 passed, 1 skipped (iris/tests/cluster/)
  • Commit and push

All three review items addressed in c6048cc.

rjpower force-pushed the iris/adaptive-worker-restart branch from c6048cc to c5cb813 on April 10, 2026 at 21:24
rjpower requested a review from yonromai April 10, 2026 21:46
rjpower enabled auto-merge (squash) April 10, 2026 21:46
rjpower disabled auto-merge April 10, 2026 22:08
rjpower merged commit 429d4b5 into main Apr 10, 2026
40 checks passed
rjpower deleted the iris/adaptive-worker-restart branch April 10, 2026 22:08