Tolerate controller unavailability for up to 1h in job monitoring #4176

Merged
rjpower merged 4 commits into main from work/P0nyDNQN on Mar 26, 2026

Conversation


@rjpower rjpower commented Mar 26, 2026

  • call_with_retry gains a max_elapsed parameter for time-based retry cutoff (in addition to max_attempts)
  • wait_for_job and wait_for_job_with_streaming now catch retryable RPC errors from get_job_status and back off with exponential backoff (1s → 60s) for up to 1 hour before treating controller unavailability as a job error
  • Log fetch failures in wait_for_job_with_streaming are now fully non-fatal — they log a warning but never abort monitoring (previously 5 consecutive failures = crash)

The job keeps running server-side regardless; this only affects the client's ability to poll. When the controller comes back, the unavailability timer resets and monitoring resumes normally.
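
The dual cutoff described above — attempts-based plus time-based — can be sketched minimally as follows. Names, defaults, and the broad `except Exception` are illustrative assumptions only; the real `call_with_retry` in iris's retry library catches specific retryable RPC errors and may differ in signature:

```python
import random
import time


def call_with_retry(fn, *, max_attempts=5, max_elapsed=None,
                    base_delay=1.0, max_delay=60.0):
    """Retry fn() with jittered exponential backoff.

    Gives up after max_attempts tries, or once max_elapsed seconds would
    be exceeded by the next backoff sleep, whichever comes first.
    """
    start = time.monotonic()
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            elapsed = time.monotonic() - start
            out_of_time = max_elapsed is not None and elapsed + delay > max_elapsed
            if attempt == max_attempts or out_of_time:
                raise  # exhausted attempts or time budget: surface the error
            time.sleep(delay * (0.5 + random.random()))  # jittered backoff
            delay = min(delay * 2, max_delay)
```

With `max_elapsed=None` this behaves like a plain attempt-counted retry; setting it bounds total wall-clock time regardless of how many attempts remain.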

rjpower added 2 commits March 25, 2026 11:02
Analyze controller startup flow, scheduling loop, worker sync,
autoscaler, and checkpoint systems. Identify all side-effectful
operations that need gating for a --dry-run flag.
The monitoring loops (wait_for_job, wait_for_job_with_streaming) now catch
retryable RPC errors and back off for up to 1 hour before giving up, instead
of treating a temporarily offline controller as a job failure.

Changes:
- call_with_retry: add max_elapsed parameter for time-based retry cutoff
- wait_for_job: track controller unavailability with 1h tolerance
- wait_for_job_with_streaming: same, plus log fetch failures are non-fatal
@rjpower rjpower added the `agent-generated` (Created by automation/agent) label on Mar 26, 2026

claude Bot commented Mar 26, 2026

Claude finished @rjpower's task in 1m 21s


Review

Adds time-based retry tolerance for controller unavailability in wait_for_job and wait_for_job_with_streaming, plus a max_elapsed parameter to call_with_retry. The implementation matches the PR description and the changes are correct.

  • Read changed files and diff
  • Check for correctness and regressions
  • Post review

Findings:

  1. Duplicated unavailability handling (remote_client.py:161-206 and remote_client.py:245-283): The retry-with-backoff-on-unavailable block is copy-pasted between wait_for_job and wait_for_job_with_streaming — ~40 lines of identical logic (try/except around get_job_status, unavailable timer, backoff sleep, reset on recovery). This is a maintenance risk since the two copies could drift. Consider extracting a helper like _poll_status_with_tolerance(job_id, deadline, unavailable_backoff, unavailable_since) that returns (status, unavailable_since) on success or raises after tolerance is exceeded.


  2. iris-controller-dry-run-analysis.md in repo root (iris-controller-dry-run-analysis.md): This research document is committed to the repository root. It reads like a working note / analysis artifact rather than something that belongs in the tree. Should this be in docs/ or removed from this PR?

  3. Interaction between job timeout and unavailability tolerance: During controller unavailability, the job deadline can expire but the loop continues retrying until CONTROLLER_UNAVAILABLE_TOLERANCE (1h). This means a job with timeout=300 could block the client for up to 1 hour if the controller is down. This appears intentional per the PR description but is worth calling out — callers relying on timeout as a hard upper bound on wall-clock time will see different behavior during outages.

No bugs or correctness issues found. The call_with_retry changes are clean and well-tested. The log-fetch simplification (removing the consecutive failure counter) is a good change — log failures should not crash monitoring.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7eb1027ed7


Comment on lines +185 to +186
time.sleep(min(interval, deadline.remaining_seconds()))
continue

P1: Honor wait timeout during controller unavailability

This retry branch continues polling even after the caller's timeout has expired, because it sleeps for remaining_seconds() (which becomes 0) and immediately loops without raising TimeoutError. In outage scenarios, that means wait_for_job can run far longer than the documented timeout (it keeps invoking get_job_status, which itself retries), so a call like timeout=60 may block for many minutes or up to the new 1-hour tolerance; the same pattern is also present in wait_for_job_with_streaming.

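The failure mode Codex flags can be made concrete with a small sketch. `Deadline` and `sleep_or_timeout` below are hypothetical stand-ins (the real iris `Deadline` type is not shown in this thread); the point is that the backoff sleep must raise once the deadline is spent, rather than sleeping for 0s and looping:

```python
import time


class Deadline:
    """Minimal deadline helper; a stand-in for iris's real Deadline type."""

    def __init__(self, timeout_s):
        self._expires = time.monotonic() + timeout_s

    def remaining_seconds(self):
        return max(0.0, self._expires - time.monotonic())

    def expired(self):
        return self.remaining_seconds() <= 0


def sleep_or_timeout(deadline, interval):
    """Back off without outliving the caller's timeout.

    The flagged code slept for min(interval, remaining) and looped; once
    remaining hits 0 that sleep takes 0s and the loop spins past the
    deadline. Raising here keeps timeout= a hard wall-clock bound.
    """
    if deadline.expired():
        raise TimeoutError("controller unavailable past caller deadline")
    time.sleep(min(interval, deadline.remaining_seconds()))
```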

@rjpower rjpower requested a review from yonromai March 26, 2026 16:50

@yonromai yonromai left a comment


:shipit:


rjpower commented Mar 26, 2026

@claude review PR feedback and clean this up.

We should respect the user's timeouts. The main thing we want is for the lib/marin/.../executor code to not panic and terminate jobs if the controller is temporarily unavailable.

The retry-with-backoff-on-unavailable block is copy-pasted between wait_for_job and wait_for_job_with_streaming — ~40 lines of identical logic (try/except around get_job_status, unavailable timer, backoff sleep, reset on recovery).

this feels like it should be part of the errors.py retry library?


claude Bot commented Mar 26, 2026

Claude finished @rjpower's task in 5m 22s


Cleaned up PR based on feedback

  • Read current code (remote_client.py, errors.py, tests)
  • Move unavailability retry logic into errors.py retry library
  • Respect user timeouts (don't let unavailability tolerance override them)
  • Remove duplicated code from wait_for_job / wait_for_job_with_streaming
  • Check for stray files (iris-controller-dry-run-analysis.md) — already deleted in prior commit
  • Run pre-commit and tests
  • Commit and push

What changed:

  1. New poll_with_retries() in errors.py — encapsulates the "poll an RPC, tolerate transient unavailability with backoff, respect a deadline" pattern. This lives alongside call_with_retry in the retry library where it belongs.

  2. Deadline is now respected during unavailability — if the caller's timeout expires while the controller is down, poll_with_retries raises TimeoutError instead of continuing to retry for the full 1-hour tolerance. This addresses the Codex review feedback: timeout=60 will now actually time out within ~60s even during outages.

  3. Deduplicated ~40 lines — both wait_for_job and wait_for_job_with_streaming now call poll_with_retries() instead of duplicating the unavailability tracking, backoff, and recovery logging.

  4. 5 new tests for poll_with_retries: immediate success, retry-then-succeed, deadline expiry during unavailability, tolerance expiry, and non-retryable error passthrough. All 15 tests in test_errors.py pass.
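
Based on the description above, a `poll_with_retries`-style helper might look like the sketch below. The actual signature, error taxonomy, and constants in iris's errors.py are not shown in this thread, so everything here (parameter names, `ConnectionError` as the stand-in retryable error, the poll interval) is an assumption for illustration:

```python
import time


def poll_with_retries(poll_fn, *, deadline_s, tolerance_s=3600.0,
                      base_backoff_s=1.0, max_backoff_s=60.0,
                      is_retryable=lambda e: isinstance(e, ConnectionError)):
    """Poll poll_fn until it returns non-None, tolerating transient errors.

    Retryable errors are absorbed for up to tolerance_s (the outage timer
    resets whenever a poll succeeds), with capped exponential backoff
    between attempts. The caller's deadline always wins: TimeoutError is
    raised as soon as deadline_s elapses, even mid-outage.
    """
    start = time.monotonic()
    unavailable_since = None
    backoff = base_backoff_s
    while True:
        if time.monotonic() - start >= deadline_s:
            raise TimeoutError("deadline exceeded while polling")
        try:
            result = poll_fn()
        except Exception as e:
            if not is_retryable(e):
                raise
            now = time.monotonic()
            if unavailable_since is None:
                unavailable_since = now
            if now - unavailable_since >= tolerance_s:
                raise  # controller down longer than we tolerate
            # never sleep past the caller's deadline
            time.sleep(min(backoff, max(0.0, start + deadline_s - now)))
            backoff = min(backoff * 2, max_backoff_s)
            continue
        unavailable_since = None  # recovered: reset the outage timer
        backoff = base_backoff_s
        if result is not None:
            return result
        time.sleep(0.01)  # normal poll interval (tiny here for illustration)
```

Both monitoring loops can then delegate to this one function, which is what eliminates the ~40 duplicated lines while keeping `timeout=` a hard bound.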

Testing: ./infra/pre-commit.py --all-files --fix passes ✅ | uv run --package iris pytest lib/iris/tests/rpc/test_errors.py -v — 15/15 pass ✅

Move the retry-with-backoff-on-unavailable logic from wait_for_job and
wait_for_job_with_streaming into a shared poll_with_retries() function in
the errors.py retry library. The new function respects the caller's
deadline — if timeout expires during controller unavailability, it raises
TimeoutError instead of continuing to retry for the full tolerance window.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rjpower rjpower merged commit 9499121 into main Mar 26, 2026
40 of 41 checks passed
@rjpower rjpower deleted the work/P0nyDNQN branch March 26, 2026 17:41
ravwojdyla pushed a commit that referenced this pull request Mar 26, 2026
Helw150 pushed a commit that referenced this pull request Apr 8, 2026