Tolerate controller unavailability for up to 1h in job monitoring #4176

Merged
rjpower merged 4 commits into main from work/P0nyDNQN on Mar 26, 2026

Conversation


@rjpower rjpower commented Mar 26, 2026

  • call_with_retry gains a max_elapsed parameter for time-based retry cutoff (in addition to max_attempts)
  • wait_for_job and wait_for_job_with_streaming now catch retryable RPC errors from get_job_status and back off with exponential backoff (1s → 60s) for up to 1 hour before treating controller unavailability as a job error
  • Log fetch failures in wait_for_job_with_streaming are now fully non-fatal — they log a warning but never abort monitoring (previously 5 consecutive failures = crash)

The job keeps running server-side regardless; this only affects the client's ability to poll. When the controller comes back, the unavailability timer resets and monitoring resumes normally.
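
The dual cutoff described above — attempts-based plus time-based — can be sketched minimally as follows. Names, defaults, and the broad `except Exception` are illustrative assumptions only; the real `call_with_retry` in iris's retry library catches specific retryable RPC errors and may differ in signature:

```python
import random
import time


def call_with_retry(fn, *, max_attempts=5, max_elapsed=None,
                    base_delay=1.0, max_delay=60.0):
    """Retry fn() with jittered exponential backoff.

    Gives up after max_attempts tries, or once max_elapsed seconds would
    be exceeded by the next backoff sleep, whichever comes first.
    """
    start = time.monotonic()
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            elapsed = time.monotonic() - start
            out_of_time = max_elapsed is not None and elapsed + delay > max_elapsed
            if attempt == max_attempts or out_of_time:
                raise  # exhausted attempts or time budget: surface the error
            time.sleep(delay * (0.5 + random.random()))  # jittered backoff
            delay = min(delay * 2, max_delay)
```

With `max_elapsed=None` this behaves like a plain attempt-counted retry; setting it bounds total wall-clock time regardless of how many attempts remain.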

rjpower added 2 commits March 25, 2026 11:02
Analyze controller startup flow, scheduling loop, worker sync,
autoscaler, and checkpoint systems. Identify all side-effectful
operations that need gating for a --dry-run flag.
The monitoring loops (wait_for_job, wait_for_job_with_streaming) now catch
retryable RPC errors and back off for up to 1 hour before giving up, instead
of treating a temporarily offline controller as a job failure.

Changes:
- call_with_retry: add max_elapsed parameter for time-based retry cutoff
- wait_for_job: track controller unavailability with 1h tolerance
- wait_for_job_with_streaming: same, plus log fetch failures are non-fatal
@rjpower rjpower added the `agent-generated` (Created by automation/agent) label on Mar 26, 2026

claude Bot commented Mar 26, 2026

Claude finished @rjpower's task in 1m 21s


Review

Adds time-based retry tolerance for controller unavailability in wait_for_job and wait_for_job_with_streaming, plus a max_elapsed parameter to call_with_retry. The implementation matches the PR description and the changes are correct.

  • Read changed files and diff
  • Check for correctness and regressions
  • Post review

Findings:

  1. Duplicated unavailability handling (remote_client.py:161-206 and remote_client.py:245-283): The retry-with-backoff-on-unavailable block is copy-pasted between wait_for_job and wait_for_job_with_streaming — ~40 lines of identical logic (try/except around get_job_status, unavailable timer, backoff sleep, reset on recovery). This is a maintenance risk since the two copies could drift. Consider extracting a helper like _poll_status_with_tolerance(job_id, deadline, unavailable_backoff, unavailable_since) that returns (status, unavailable_since) on success or raises after tolerance is exceeded.


  2. iris-controller-dry-run-analysis.md in repo root (iris-controller-dry-run-analysis.md): This research document is committed to the repository root. It reads like a working note / analysis artifact rather than something that belongs in the tree. Should this be in docs/ or removed from this PR?

  3. Interaction between job timeout and unavailability tolerance: During controller unavailability, the job deadline can expire but the loop continues retrying until CONTROLLER_UNAVAILABLE_TOLERANCE (1h). This means a job with timeout=300 could block the client for up to 1 hour if the controller is down. This appears intentional per the PR description but is worth calling out — callers relying on timeout as a hard upper bound on wall-clock time will see different behavior during outages.

No bugs or correctness issues found. The call_with_retry changes are clean and well-tested. The log-fetch simplification (removing the consecutive failure counter) is a good change — log failures should not crash monitoring.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7eb1027ed7


Comment on lines +185 to +186
time.sleep(min(interval, deadline.remaining_seconds()))
continue

P1: Honor wait timeout during controller unavailability

This retry branch continues polling even after the caller's timeout has expired, because it sleeps for remaining_seconds() (which becomes 0) and immediately loops without raising TimeoutError. In outage scenarios, that means wait_for_job can run far longer than the documented timeout (it keeps invoking get_job_status, which itself retries), so a call like timeout=60 may block for many minutes or up to the new 1-hour tolerance; the same pattern is also present in wait_for_job_with_streaming.

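The failure mode Codex flags can be made concrete with a small sketch. `Deadline` and `sleep_or_timeout` below are hypothetical stand-ins (the real iris `Deadline` type is not shown in this thread); the point is that the backoff sleep must raise once the deadline is spent, rather than sleeping for 0s and looping:

```python
import time


class Deadline:
    """Minimal deadline helper; a stand-in for iris's real Deadline type."""

    def __init__(self, timeout_s):
        self._expires = time.monotonic() + timeout_s

    def remaining_seconds(self):
        return max(0.0, self._expires - time.monotonic())

    def expired(self):
        return self.remaining_seconds() <= 0


def sleep_or_timeout(deadline, interval):
    """Back off without outliving the caller's timeout.

    The flagged code slept for min(interval, remaining) and looped; once
    remaining hits 0 that sleep takes 0s and the loop spins past the
    deadline. Raising here keeps timeout= a hard wall-clock bound.
    """
    if deadline.expired():
        raise TimeoutError("controller unavailable past caller deadline")
    time.sleep(min(interval, deadline.remaining_seconds()))
```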

@rjpower rjpower requested a review from yonromai March 26, 2026 16:50

@yonromai yonromai left a comment


:shipit:


rjpower commented Mar 26, 2026

@claude review PR feedback and clean this up.

We should respect the user's timeouts. The main thing we want is for the lib/marin/.../executor code to not panic and terminate jobs if the controller is temporarily unavailable.

The retry-with-backoff-on-unavailable block is copy-pasted between wait_for_job and wait_for_job_with_streaming — ~40 lines of identical logic (try/except around get_job_status, unavailable timer, backoff sleep, reset on recovery).

this feels like it should be part of the errors.py retry library?


claude Bot commented Mar 26, 2026

Claude finished @rjpower's task in 5m 22s


Cleaned up PR based on feedback

  • Read current code (remote_client.py, errors.py, tests)
  • Move unavailability retry logic into errors.py retry library
  • Respect user timeouts (don't let unavailability tolerance override them)
  • Remove duplicated code from wait_for_job / wait_for_job_with_streaming
  • Check for stray files (iris-controller-dry-run-analysis.md) — already deleted in prior commit
  • Run pre-commit and tests
  • Commit and push

What changed:

  1. New poll_with_retries() in errors.py — encapsulates the "poll an RPC, tolerate transient unavailability with backoff, respect a deadline" pattern. This lives alongside call_with_retry in the retry library where it belongs.

  2. Deadline is now respected during unavailability — if the caller's timeout expires while the controller is down, poll_with_retries raises TimeoutError instead of continuing to retry for the full 1-hour tolerance. This addresses the Codex review feedback: timeout=60 will now actually time out within ~60s even during outages.

  3. Deduplicated ~40 lines — both wait_for_job and wait_for_job_with_streaming now call poll_with_retries() instead of duplicating the unavailability tracking, backoff, and recovery logging.

  4. 5 new tests for poll_with_retries: immediate success, retry-then-succeed, deadline expiry during unavailability, tolerance expiry, and non-retryable error passthrough. All 15 tests in test_errors.py pass.
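
Based on the description above, a `poll_with_retries`-style helper might look like the sketch below. The actual signature, error taxonomy, and constants in iris's errors.py are not shown in this thread, so everything here (parameter names, `ConnectionError` as the stand-in retryable error, the poll interval) is an assumption for illustration:

```python
import time


def poll_with_retries(poll_fn, *, deadline_s, tolerance_s=3600.0,
                      base_backoff_s=1.0, max_backoff_s=60.0,
                      is_retryable=lambda e: isinstance(e, ConnectionError)):
    """Poll poll_fn until it returns non-None, tolerating transient errors.

    Retryable errors are absorbed for up to tolerance_s (the outage timer
    resets whenever a poll succeeds), with capped exponential backoff
    between attempts. The caller's deadline always wins: TimeoutError is
    raised as soon as deadline_s elapses, even mid-outage.
    """
    start = time.monotonic()
    unavailable_since = None
    backoff = base_backoff_s
    while True:
        if time.monotonic() - start >= deadline_s:
            raise TimeoutError("deadline exceeded while polling")
        try:
            result = poll_fn()
        except Exception as e:
            if not is_retryable(e):
                raise
            now = time.monotonic()
            if unavailable_since is None:
                unavailable_since = now
            if now - unavailable_since >= tolerance_s:
                raise  # controller down longer than we tolerate
            # never sleep past the caller's deadline
            time.sleep(min(backoff, max(0.0, start + deadline_s - now)))
            backoff = min(backoff * 2, max_backoff_s)
            continue
        unavailable_since = None  # recovered: reset the outage timer
        backoff = base_backoff_s
        if result is not None:
            return result
        time.sleep(0.01)  # normal poll interval (tiny here for illustration)
```

Both monitoring loops can then delegate to this one function, which is what eliminates the ~40 duplicated lines while keeping `timeout=` a hard bound.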

Testing: ./infra/pre-commit.py --all-files --fix passes ✅ | uv run --package iris pytest lib/iris/tests/rpc/test_errors.py -v — 15/15 pass ✅

Move the retry-with-backoff-on-unavailable logic from wait_for_job and
wait_for_job_with_streaming into a shared poll_with_retries() function in
the errors.py retry library. The new function respects the caller's
deadline — if timeout expires during controller unavailability, it raises
TimeoutError instead of continuing to retry for the full tolerance window.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rjpower rjpower merged commit 9499121 into main Mar 26, 2026
40 of 41 checks passed
@rjpower rjpower deleted the work/P0nyDNQN branch March 26, 2026 17:41
ravwojdyla pushed a commit that referenced this pull request Mar 26, 2026
Helw150 pushed a commit that referenced this pull request Apr 8, 2026