[iris] Warn on log push RPC failures instead of silently swallowing#4709
[iris] Warn on log push RPC failures instead of silently swallowing#4709claude[bot] wants to merge 5 commits intomainfrom
Conversation
RPC failures in LogPusher._send() and task_attempt log helpers were logged at DEBUG level, making server outages invisible. Promote to WARNING with rate limiting (first failure + every 60s) and log recovery when sends resume. Adds tests for warn, rate-limit, and recovery behavior. Fixes #4707
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 995dce86a3
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| logger.warning( | ||
| "Failed to push %d log entries for key %s (%d consecutive failures)", | ||
| len(entries), | ||
| key, | ||
| self._consecutive_failures, | ||
| exc_info=True, | ||
| ) |
There was a problem hiding this comment.
Avoid re-entrant warning logs while holding _send_lock
In LogPusher._send, the new logger.warning(...) runs while _send_lock is held. In worker mode, RemoteLogHandler is attached to the root logger, so this warning is fed back into the same LogPusher.push() path; if that handler key buffer is at batch_size - 1 (or batch_size=1), push() immediately re-enters _send() and blocks on the same non-reentrant lock. This creates a deadlock path during log-server RPC failures, which is exactly when this branch executes.
Useful? React with 👍 / 👎.
| except Exception: | ||
| logger.debug("Failed to push %d log entries for key %s", len(entries), key, exc_info=True) | ||
| self._consecutive_failures += 1 | ||
| now = time.monotonic() |
There was a problem hiding this comment.
Use Iris timing primitives instead of raw time.monotonic
lib/iris/AGENTS.md (scoped to this file) requires Iris code to use rigging.timing utilities for time-related logic, but this commit introduces direct time.monotonic() usage for warning throttling. This is an explicit project-rule violation in the touched path and should be converted to the shared timing abstractions used elsewhere in Iris.
Useful? React with 👍 / 👎.
|
@claude fix the issues |
|
Claude finished @rjpower's task in 5m 17s —— View job Fixing Codex Review Issues
SummaryFixed both P1 issues from the Codex review: 1. Deadlock prevention — 2. Use Also cleaned up dead test code (unused Testing: |
Move logger.warning()/logger.info() calls outside _send_lock to prevent re-entrant deadlock when RemoteLogHandler routes warnings back through LogPusher.push(). Replace raw time.monotonic() tracking with rigging.timing.RateLimiter per Iris coding conventions. Clean up dead test code (unused FailingPusher subclass). Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
|
@claude can we not use a special logger for the log pusher itself which bypasses the re-entrant issue? that seems safer than this dance. If we fail to push for a batch, can we retain the batch to try pushing again later? |
|
Claude finished @rjpower's task in 8m 16s —— View job Implementing LogPusher improvements
SummaryImplemented both suggestions: 1. Dedicated non-propagating logger — Created 2. Retry failed batches — Testing: |
Replace the warn_args/warn_exc_info dance with a dedicated _pusher_logger that has propagate=False and writes directly to stderr. This structurally prevents the re-entrant deadlock via RemoteLogHandler without needing to log outside the lock. On RPC failure, _send() now calls _requeue() to put entries back in the buffer for retry on the next flush cycle, capped at _MAX_BUFFER_PER_KEY (10,000) per key to prevent unbounded memory growth. Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
RPC failures in LogPusher._send() and the task_attempt log helpers were logged at
DEBUG level, making log-server outages completely invisible to operators. Changed
to WARNING with rate limiting (warn on first failure, then at most every 60s) so
persistent outages don't spam. Logs an INFO recovery message when sends resume
after a failure streak. Also promotes the three debug-level catches in
task_attempt.py (_push_logs, _stream_logs, _cleanup) to warning.
Fixes #4707