Tolerate controller unavailability for up to 1h in job monitoring #4176
Conversation
The monitoring loops (`wait_for_job`, `wait_for_job_with_streaming`) now catch retryable RPC errors and back off for up to 1 hour before giving up, instead of treating a temporarily offline controller as a job failure.

Changes:
- `call_with_retry`: add a `max_elapsed` parameter for a time-based retry cutoff
- `wait_for_job`: track controller unavailability with a 1h tolerance
- `wait_for_job_with_streaming`: same, plus log fetch failures are non-fatal
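To make the time-based cutoff concrete, here is a minimal sketch of what a `call_with_retry` with a `max_elapsed` parameter could look like. The signature, parameter names, and backoff constants are assumptions for illustration, not the PR's exact code:

```python
import time
import random

def call_with_retry(fn, *, max_attempts=5, max_elapsed=None,
                    base_delay=1.0, max_delay=60.0,
                    retryable=(ConnectionError, TimeoutError)):
    """Call fn(), retrying on retryable errors.

    Gives up after max_attempts tries, or (hypothetically, per this PR)
    once max_elapsed seconds have passed since the first attempt.
    """
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise  # attempt-based cutoff
            if max_elapsed is not None and time.monotonic() - start >= max_elapsed:
                raise  # time-based cutoff added alongside max_attempts
            # Exponential backoff with jitter, capped at max_delay.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            time.sleep(delay * random.uniform(0.5, 1.0))
```

With `max_elapsed=3600`, a caller tolerates roughly an hour of controller downtime before the error propagates, regardless of how many attempts that takes.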
Claude finished @rjpower's task in 1m 21s — Review: Adds time-based retry tolerance for controller unavailability in job monitoring.
Findings:
No bugs or correctness issues found.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7eb1027ed7
```python
time.sleep(min(interval, deadline.remaining_seconds()))
continue
```
Honor wait timeout during controller unavailability
This retry branch continues polling even after the caller's timeout has expired, because it sleeps for remaining_seconds() (which becomes 0) and immediately loops without raising TimeoutError. In outage scenarios, that means wait_for_job can run far longer than the documented timeout (it keeps invoking get_job_status, which itself retries), so a call like timeout=60 may block for many minutes or up to the new 1-hour tolerance; the same pattern is also present in wait_for_job_with_streaming.
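The bug Codex describes can be sketched as follows: once the deadline is exhausted, `remaining_seconds()` returns 0, so the loop sleeps for 0 seconds and spins instead of raising. The fix is to check for expiry before sleeping. The `Deadline` helper below is a minimal illustration, not the repo's actual class:

```python
import time

class Deadline:
    """Minimal deadline helper for illustration only."""
    def __init__(self, timeout):
        self._end = time.monotonic() + timeout

    def remaining_seconds(self):
        return max(0.0, self._end - time.monotonic())

    def expired(self):
        return self.remaining_seconds() <= 0

def sleep_or_timeout(deadline, interval):
    # The fix: check expiry *before* sleeping, so a zero remaining
    # budget raises TimeoutError instead of looping via time.sleep(0).
    if deadline.expired():
        raise TimeoutError("wait timeout expired while controller unavailable")
    time.sleep(min(interval, deadline.remaining_seconds()))
```

Without the `expired()` check, a `timeout=60` call that hits a controller outage keeps polling until the separate 1-hour unavailability tolerance runs out, exactly as the review comment warns.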
@claude review PR feedback and clean this up. We should respect the user's timeouts. The main thing we want is for the lib/marin/.../executor code to not panic and terminate jobs if the controller is temporarily unavailable.

This feels like it should be part of the errors.py retry library?
Claude finished @rjpower's task in 5m 22s — Cleaned up PR based on feedback.
Move the retry-with-backoff-on-unavailable logic from wait_for_job and wait_for_job_with_streaming into a shared poll_with_retries() function in the errors.py retry library. The new function respects the caller's deadline — if timeout expires during controller unavailability, it raises TimeoutError instead of continuing to retry for the full tolerance window. Co-authored-by: Russell Power <rjpower@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- `call_with_retry` gains a `max_elapsed` parameter for time-based retry cutoff (in addition to `max_attempts`)
- `wait_for_job` and `wait_for_job_with_streaming` now catch retryable RPC errors from `get_job_status` and back off with exponential backoff (1s → 60s) for up to 1 hour before treating controller unavailability as a job error
- Log fetch failures in `wait_for_job_with_streaming` are now fully non-fatal: they log a warning but never abort monitoring (previously 5 consecutive failures meant a crash)

The job keeps running server-side regardless; this only affects the client's ability to poll. When the controller comes back, the unavailability timer resets and monitoring resumes normally.