Skip to content

[iris] Warn on log push RPC failures instead of silently swallowing#4709

Open
claude[bot] wants to merge 5 commits intomainfrom
agent/20260413-fix-4707
Open

[iris] Warn on log push RPC failures instead of silently swallowing#4709
claude[bot] wants to merge 5 commits intomainfrom
agent/20260413-fix-4707

Conversation

@claude
Copy link
Copy Markdown
Contributor

@claude claude bot commented Apr 14, 2026

RPC failures in LogPusher._send() and the task_attempt log helpers were logged at
DEBUG level, making log-server outages completely invisible to operators. Changed
to WARNING with rate limiting (warn on first failure, then at most every 60s) so
persistent outages don't spam. Logs an INFO recovery message when sends resume
after a failure streak. Also promotes the three debug-level catches in
task_attempt.py (_push_logs, _stream_logs, _cleanup) to warning.

Fixes #4707

RPC failures in LogPusher._send() and task_attempt log helpers were
logged at DEBUG level, making server outages invisible. Promote to
WARNING with rate limiting (first failure + every 60s) and log
recovery when sends resume. Adds tests for warn, rate-limit, and
recovery behavior.

Fixes #4707
@claude claude bot added the agent-generated Created by automation/agent label Apr 14, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 995dce86a3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread lib/iris/src/iris/log_server/client.py Outdated
Comment on lines +104 to +110
logger.warning(
"Failed to push %d log entries for key %s (%d consecutive failures)",
len(entries),
key,
self._consecutive_failures,
exc_info=True,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid re-entrant warning logs while holding _send_lock

In LogPusher._send, the new logger.warning(...) runs while _send_lock is held. In worker mode, RemoteLogHandler is attached to the root logger, so this warning is fed back into the same LogPusher.push() path; if that handler key buffer is at batch_size - 1 (or batch_size=1), push() immediately re-enters _send() and blocks on the same non-reentrant lock. This creates a deadlock path during log-server RPC failures, which is exactly when this branch executes.

Useful? React with 👍 / 👎.

Comment thread lib/iris/src/iris/log_server/client.py Outdated
except Exception:
logger.debug("Failed to push %d log entries for key %s", len(entries), key, exc_info=True)
self._consecutive_failures += 1
now = time.monotonic()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Use Iris timing primitives instead of raw time.monotonic

lib/iris/AGENTS.md (scoped to this file) requires Iris code to use rigging.timing utilities for time-related logic, but this commit introduces direct time.monotonic() usage for warning throttling. This is an explicit project-rule violation in the touched path and should be converted to the shared timing abstractions used elsewhere in Iris.

Useful? React with 👍 / 👎.

@rjpower
Copy link
Copy Markdown
Collaborator

rjpower commented Apr 14, 2026

@claude fix the issues

@claude
Copy link
Copy Markdown
Contributor Author

claude bot commented Apr 14, 2026

Claude finished @rjpower's task in 5m 17s —— View job


Fixing Codex Review Issues

  • Gather context: read changed files and Iris timing primitives
  • Fix P1: Avoid re-entrant warning logs while holding _send_lock (deadlock risk)
  • Fix P1: Use rigging.timing.RateLimiter instead of raw time.monotonic
  • Update tests to match new implementation
  • Run pre-commit and tests
  • Commit and push

Summary

Fixed both P1 issues from the Codex review:

1. Deadlock preventionlogger.warning() and logger.info() in LogPusher._send() were called while holding _send_lock. Since RemoteLogHandler is attached to the root logger, this warning could re-enter push()_send() and block on the same non-reentrant lock. Fix: capture exception info and failure counts inside the lock, then log after releasing it using try/except/else control flow.

2. Use rigging.timing.RateLimiter — Replaced raw time.monotonic() + _last_fail_warn_time with rigging.timing.RateLimiter(interval_seconds=_WARN_INTERVAL), per Iris's coding convention to use rigging.timing for all time-related operations. The RateLimiter.should_run() naturally handles both first-failure (returns True when no prior run) and rate-limited subsequent failures.

Also cleaned up dead test code (unused FailingPusher subclass and sent_count variable).

Testing: ./infra/pre-commit.py --all-files --fix passes. All 9 tests in lib/iris/tests/test_remote_log_handler.py pass ✅
·

Move logger.warning()/logger.info() calls outside _send_lock to prevent
re-entrant deadlock when RemoteLogHandler routes warnings back through
LogPusher.push(). Replace raw time.monotonic() tracking with
rigging.timing.RateLimiter per Iris coding conventions. Clean up dead
test code (unused FailingPusher subclass).

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
@rjpower rjpower requested a review from yonromai April 14, 2026 00:56
@rjpower
Copy link
Copy Markdown
Collaborator

rjpower commented Apr 14, 2026

@claude can we not use a special logger for the log pusher itself which bypasses the re-entrant issue? that seems safer than this dance.

If we fail to push for a batch, can we retain the batch to try pushing again later?

@claude
Copy link
Copy Markdown
Contributor Author

claude bot commented Apr 14, 2026

Claude finished @rjpower's task in 8m 16s —— View job


Implementing LogPusher improvements

  • Read current implementation (client.py, task_attempt.py, tests)
  • Use a dedicated non-propagating logger for LogPusher diagnostics
  • Retain failed batches for retry on next flush (with buffer cap)
  • Update tests
  • Run pre-commit and tests
  • Commit and push

Summary

Implemented both suggestions:

1. Dedicated non-propagating logger — Created _pusher_logger (iris.log_server.client._internal) with propagate=False and a StreamHandler to stderr. This structurally prevents the re-entrant deadlock: since _pusher_logger never reaches the root logger, a RemoteLogHandler attached there can't feed warnings back into push()_send(). This replaces the warn_args/warn_exc_info dance of capturing state inside the lock and logging outside it — now _send() just logs directly.

2. Retry failed batches_send() now calls _requeue(key, entries) on failure, putting entries back into the buffer for the next flush cycle. Buffer is capped at _MAX_BUFFER_PER_KEY = 10,000 entries per key to prevent unbounded memory growth during extended outages; oldest entries are dropped when the cap is exceeded.

Testing: ./infra/pre-commit.py --all-files --fix passes ✅ | All 11 tests in test_remote_log_handler.py pass ✅ (including 2 new tests: test_failed_entries_retried_on_next_flush and test_requeue_caps_buffer_size)
· Branch

github-actions bot and others added 2 commits April 14, 2026 02:06
Replace the warn_args/warn_exc_info dance with a dedicated _pusher_logger
that has propagate=False and writes directly to stderr.  This structurally
prevents the re-entrant deadlock via RemoteLogHandler without needing to
log outside the lock.

On RPC failure, _send() now calls _requeue() to put entries back in the
buffer for retry on the next flush cycle, capped at _MAX_BUFFER_PER_KEY
(10,000) per key to prevent unbounded memory growth.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
@rjpower rjpower enabled auto-merge (squash) April 14, 2026 20:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

iris: log pusher should warn on RPC failures/slow-down

2 participants