
[iris] Adaptive rolling worker restart with observation window #4635

Merged
rjpower merged 2 commits into main from iris/adaptive-worker-restart
Apr 10, 2026
Conversation

rjpower (Collaborator) commented Apr 10, 2026

Replace the simple one-at-a-time worker-restart with progressive batch sizing
(1, 2, 4, ... up to --max-batch). Each batch waits for all workers to become
healthy, then observes for --observation-window seconds checking for heartbeat
failures before advancing. Aborts immediately on any failure. --worker-id is
now repeatable to target specific workers.

Also removes the unrecognized disable-project-excludes-heuristics key from
pyrefly config.
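The progressive sizing is simple enough to sketch. A minimal Python illustration of the batching schedule described above (the function name and shape are hypothetical, not the PR's actual code):

```python
def batch_sizes(total: int, max_batch: int) -> list[int]:
    """Progressive batch sizes: 1, 2, 4, ... capped at max_batch.

    The final batch is trimmed to however many workers remain.
    """
    if max_batch < 1:
        raise ValueError("max_batch must be >= 1")
    sizes: list[int] = []
    size, offset = 1, 0
    while offset < total:
        take = min(size, total - offset)  # don't overrun the worker list
        sizes.append(take)
        offset += take
        size = min(size * 2, max_batch)  # double, capped at max_batch
    return sizes

print(batch_sizes(10, 4))  # [1, 2, 4, 3]
```

In the real command, each batch additionally waits for health and an observation window before the next batch begins.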

rjpower added the agent-generated label Apr 10, 2026
claude Bot (Contributor) commented Apr 10, 2026

Claude finished @rjpower's task in 2m 44s.


PR Review: Adaptive Rolling Worker Restart

  • Read changed files and understand the diff
  • Review logic correctness and edge cases
  • Check code style and adherence to repo guidelines
  • Post review feedback

Nice, clean PR. The progressive batch-sizing strategy (1 → 2 → 4 → ... → max) with an observation window is a solid approach for safe rolling restarts. The helper extraction (_wait_for_workers_healthy, _check_worker_health, _print_summary) keeps the main function readable, and the abort-on-failure semantics are correct throughout.

One bug found:

  • --max-batch 0 causes an infinite loop — after the first batch, batch_size becomes min(2, 0) = 0, empty batches never advance offset. See inline comment for fix suggestions.

Two suggestions:

  • The observation window sleeps the full duration then checks once. Polling periodically during the window would detect failures faster on large clusters. (inline)
  • wait_until return value is silently discarded in _wait_for_workers_healthy — assigning to _ would clarify intent. (inline)

pyproject.toml change — removing the unrecognized disable-project-excludes-heuristics key is a good cleanup.

Overall the logic is correct and the code is well-structured. The max_batch validation is the only thing I'd want addressed before merge.

rjpower force-pushed the iris/adaptive-worker-restart branch from ca121c9 to a4cfe27 on April 10, 2026 at 19:49
click.echo(f" Batch OK ({succeeded}/{total} complete)")

# Double batch size for next round, capped at max_batch
batch_size = min(batch_size * 2, max_batch)

Bug: --max-batch 0 causes an infinite loop.

After the first batch (size 1), batch_size becomes min(2, 0) = 0. Subsequent iterations slice an empty batch, offset never advances, and the loop spins forever.

Consider adding an early guard:

Suggested change:
- batch_size = min(batch_size * 2, max_batch)
+ batch_size = min(batch_size * 2, max(max_batch, 1))

Or validate the option at the top of the function with if max_batch < 1: raise click.BadParameter(...).
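A minimal sketch of the up-front validation approach, assuming a click-based command (the command and option names here are illustrative, not the PR's actual code):

```python
import click

@click.command()
@click.option("--max-batch", type=int, default=8)
def restart(max_batch: int) -> None:
    # Fail fast on invalid input instead of spinning in an empty-batch loop.
    if max_batch < 1:
        raise click.BadParameter("--max-batch must be >= 1")
    click.echo(f"restarting with max batch size {max_batch}")
```

click turns BadParameter into a usage error with a non-zero exit code, so the command fails immediately rather than hanging.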

raise SystemExit(1)

click.echo(f" All {len(batch)} worker(s) healthy. Observing for {observation_window}s...")
time.sleep(observation_window)

Nit: The observation window sleeps the full duration then checks health once at the end. A worker that crashes 1 second in won't be detected until the full window elapses.

Consider polling at intervals during the window (e.g., every 5-10s) and aborting early on failure — that would make the command feel more responsive on large clusters.
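A minimal sketch of the polling variant, assuming a health-check callable (check_healthy and the default interval are hypothetical):

```python
import time

def observe(check_healthy, window_s: float, poll_s: float = 10.0) -> bool:
    """Poll health repeatedly during the observation window.

    Returns False as soon as a check fails, rather than sleeping the
    whole window and checking only once at the end.
    """
    deadline = time.monotonic() + window_s
    while time.monotonic() < deadline:
        if not check_healthy():
            return False  # abort early on the first failure
        time.sleep(min(poll_s, max(deadline - time.monotonic(), 0.0)))
    return check_healthy()  # one last check at the window boundary
```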

pass
return len(remaining) == 0

backoff.wait_until(_all_healthy, timeout=Duration.from_seconds(timeout))

Ultra-minor: the return value of wait_until (bool) is silently discarded here. The function works correctly because remaining is mutated by the closure, but assigning to _ would make the intent explicit:

Suggested change:
- backoff.wait_until(_all_healthy, timeout=Duration.from_seconds(timeout))
+ _timed_out = backoff.wait_until(_all_healthy, timeout=Duration.from_seconds(timeout))

(Or just _ = ... — either way signals "I know this returns something and I'm ignoring it on purpose.")


chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a4cfe27b0d


click.echo(f" Batch OK ({succeeded}/{total} complete)")

# Double batch size for next round, capped at max_batch
batch_size = min(batch_size * 2, max_batch)

P2: Validate --max-batch is at least 1

If --max-batch is 0 or negative, batch_size = min(batch_size * 2, max_batch) eventually makes batch_size non-positive, so the next slice is empty and neither succeeded nor offset increases. At that point the while offset < total loop never progresses and the command can hang indefinitely (sleeping each observation cycle) instead of failing fast on invalid input.


rjpower (Collaborator, Author) commented Apr 10, 2026

@claude pls rebase to main and address PR feedback

rjpower force-pushed the iris/adaptive-worker-restart branch from a4cfe27 to 690bacc on April 10, 2026 at 20:01
claude Bot (Contributor) commented Apr 10, 2026

Claude finished @rjpower's task in 4m 53s.


Rebase and Address PR Feedback

  • Rebase branch onto main — already up to date
  • Fix bug: validate --max-batch >= 1 — added click.BadParameter guard at top of function
  • Improve observation window to poll periodically — polls every 10s (or less if window < 10s) and aborts early on failure
  • Fix nit: assign wait_until return to _
  • Run pre-commit — all checks pass
  • Run tests — 1490 passed, 1 skipped (iris/tests/cluster/)
  • Commit and push

All three review items addressed in c6048cc.

rjpower force-pushed the iris/adaptive-worker-restart branch from c6048cc to c5cb813 on April 10, 2026 at 21:24
rjpower requested a review from yonromai April 10, 2026 21:46
rjpower enabled auto-merge (squash) April 10, 2026 21:46
rjpower disabled auto-merge April 10, 2026 22:08
rjpower merged commit 429d4b5 into main Apr 10, 2026
40 checks passed
rjpower deleted the iris/adaptive-worker-restart branch April 10, 2026 22:08