[iris] Warn on log push RPC failures instead of silently swallowing by claude[bot] · Pull Request #4709 · marin-community/marin

claude · 2026-04-14T00:00:39Z

RPC failures in LogPusher._send() and the task_attempt log helpers were logged at
DEBUG level, making log-server outages completely invisible to operators. Changed
to WARNING with rate limiting (warn on first failure, then at most every 60s) so
persistent outages don't spam. Logs an INFO recovery message when sends resume
after a failure streak. Also promotes the three debug-level catches in
task_attempt.py (_push_logs, _stream_logs, _cleanup) to warning.

Fixes #4707

RPC failures in LogPusher._send() and task_attempt log helpers were logged at DEBUG level, making server outages invisible. Promote to WARNING with rate limiting (first failure + every 60s) and log recovery when sends resume. Adds tests for warn, rate-limit, and recovery behavior. Fixes #4707

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 995dce86a3


chatgpt-codex-connector · 2026-04-14T00:06:07Z

+                    logger.warning(
+                        "Failed to push %d log entries for key %s (%d consecutive failures)",
+                        len(entries),
+                        key,
+                        self._consecutive_failures,
+                        exc_info=True,
+                    )


Avoid re-entrant warning logs while holding _send_lock

In LogPusher._send, the new logger.warning(...) runs while _send_lock is held. In worker mode, RemoteLogHandler is attached to the root logger, so this warning is fed back into the same LogPusher.push() path; if that handler key buffer is at batch_size - 1 (or batch_size=1), push() immediately re-enters _send() and blocks on the same non-reentrant lock. This creates a deadlock path during log-server RPC failures, which is exactly when this branch executes.
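The hazard described above can be reproduced in miniature. The sketch below is not the Iris code: `TinyPusher` and `FeedbackHandler` are hypothetical stand-ins, the handler is attached to a demo logger rather than the root logger, and the inner `acquire` uses a timeout so the demo reports the blocked re-entry instead of hanging.

```python
import logging
import threading

demo_logger = logging.getLogger("demo.reentrant")


class TinyPusher:
    """Hypothetical stand-in for LogPusher, reduced to the locking behavior."""

    def __init__(self):
        self._send_lock = threading.Lock()  # non-reentrant, as in the review comment
        self.reentry_blocked = False

    def push(self, message):
        # Assume the buffer is already full, so every push triggers a send.
        self._send()

    def _send(self):
        # A timeout is used only so this demo can observe the would-be deadlock.
        if not self._send_lock.acquire(timeout=0.1):
            self.reentry_blocked = True
            return
        try:
            try:
                raise ConnectionError("log server unavailable")
            except ConnectionError:
                # Logging while the lock is held: the handler below calls push().
                demo_logger.warning("Failed to push log entries")
        finally:
            self._send_lock.release()


class FeedbackHandler(logging.Handler):
    """Hypothetical handler that routes records back into the pusher."""

    def __init__(self, pusher):
        super().__init__()
        self._pusher = pusher

    def emit(self, record):
        self._pusher.push(record.getMessage())


pusher = TinyPusher()
demo_logger.addHandler(FeedbackHandler(pusher))
pusher._send()
print(pusher.reentry_blocked)  # True: the nested _send() could not take the lock
```

With a plain blocking `acquire()` in place of the timeout, the nested call would wait forever on a lock its own thread already holds.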


chatgpt-codex-connector · 2026-04-14T00:06:07Z

            except Exception:
-                logger.debug("Failed to push %d log entries for key %s", len(entries), key, exc_info=True)
+                self._consecutive_failures += 1
+                now = time.monotonic()


Use Iris timing primitives instead of raw time.monotonic

lib/iris/AGENTS.md (scoped to this file) requires Iris code to use rigging.timing utilities for time-related logic, but this commit introduces direct time.monotonic() usage for warning throttling. This is an explicit project-rule violation in the touched path and should be converted to the shared timing abstractions used elsewhere in Iris.


rjpower · 2026-04-14T00:49:12Z

@claude fix the issues

claude · 2026-04-14T00:49:44Z

Claude finished @rjpower's task in 5m 17s

Fixing Codex Review Issues

Gather context: read changed files and Iris timing primitives
Fix P1: Avoid re-entrant warning logs while holding _send_lock (deadlock risk)
Fix P1: Use rigging.timing.RateLimiter instead of raw time.monotonic
Update tests to match new implementation
Run pre-commit and tests
Commit and push

Summary

Fixed both P1 issues from the Codex review:

1. Deadlock prevention — logger.warning() and logger.info() in LogPusher._send() were called while holding _send_lock. Since RemoteLogHandler is attached to the root logger, this warning could re-enter push() → _send() and block on the same non-reentrant lock. Fix: capture exception info and failure counts inside the lock, then log after releasing it using try/except/else control flow.

2. Use rigging.timing.RateLimiter — Replaced raw time.monotonic() + _last_fail_warn_time with rigging.timing.RateLimiter(interval_seconds=_WARN_INTERVAL), per Iris's coding convention to use rigging.timing for all time-related operations. The RateLimiter.should_run() naturally handles both first-failure (returns True when no prior run) and rate-limited subsequent failures.
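The control flow in item 1 can be sketched roughly as follows. This is an illustration under assumptions, not the committed code: `SketchPusher`, the `rpc` callable, and the recovery message wording are hypothetical, and rate limiting is left out to keep the lock handling visible.

```python
import logging
import threading

logger = logging.getLogger("demo.log_outside_lock")


class SketchPusher:
    """Hypothetical stand-in; only the lock/log ordering mirrors the description."""

    def __init__(self, rpc):
        self._rpc = rpc  # callable(key, entries) that may raise
        self._send_lock = threading.Lock()
        self._consecutive_failures = 0

    def _send(self, key, entries):
        warn_args = None
        recovered_after = 0
        with self._send_lock:
            try:
                self._rpc(key, entries)
            except Exception as exc:
                # Capture what the log call needs, but do not log yet.
                self._consecutive_failures += 1
                warn_args = (len(entries), key, self._consecutive_failures, exc)
            else:
                recovered_after = self._consecutive_failures
                self._consecutive_failures = 0
        # The lock is released here, so a handler that re-enters _send() cannot block.
        if warn_args is not None:
            count, failed_key, failures, exc = warn_args
            logger.warning(
                "Failed to push %d log entries for key %s (%d consecutive failures)",
                count,
                failed_key,
                failures,
                exc_info=exc,
            )
        elif recovered_after:
            logger.info("Log push recovered after %d consecutive failures", recovered_after)


def failing_rpc(key, entries):
    raise ConnectionError("log server unavailable")


pusher = SketchPusher(failing_rpc)
pusher._send("task-0", ["line 1", "line 2"])
```

The cost of this approach is the extra bookkeeping between the locked section and the log calls, which is what the later review comment in this thread objects to.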

Also cleaned up dead test code (unused FailingPusher subclass and sent_count variable).

Testing: ./infra/pre-commit.py --all-files --fix passes. All 9 tests in lib/iris/tests/test_remote_log_handler.py pass ✅

Move logger.warning()/logger.info() calls outside _send_lock to prevent re-entrant deadlock when RemoteLogHandler routes warnings back through LogPusher.push(). Replace raw time.monotonic() tracking with rigging.timing.RateLimiter per Iris coding conventions. Clean up dead test code (unused FailingPusher subclass). Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
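The "first failure, then at most once per interval" behavior attributed to `RateLimiter.should_run()` can be illustrated with a small stand-in. The real `rigging.timing.RateLimiter` is not shown in this thread, so `IntervalLimiter` below is a hypothetical class written for this example; its injectable `clock` parameter is also an assumption, added so the behavior can be checked without sleeping.

```python
import time


class IntervalLimiter:
    """Hypothetical stand-in for a should_run()-style limiter (not rigging.timing)."""

    def __init__(self, interval_seconds, clock=time.monotonic):
        self._interval = interval_seconds
        self._clock = clock
        self._last_run = None  # None means "never ran", so the first call passes

    def should_run(self):
        now = self._clock()
        if self._last_run is None or now - self._last_run >= self._interval:
            self._last_run = now
            return True
        return False


# Drive the limiter with a fake clock to show which failures would be logged.
fake_now = [0.0]
limiter = IntervalLimiter(60.0, clock=lambda: fake_now[0])

results = []
for t in (0.0, 10.0, 59.9, 60.0, 61.0, 120.0):
    fake_now[0] = t
    results.append(limiter.should_run())

print(results)  # [True, False, False, True, False, True]
```

Only the calls at 0 s, 60 s, and 120 s pass, which matches the "warn on first failure, then at most every 60s" rule from the PR description.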

rjpower · 2026-04-14T01:57:39Z

@claude can we not use a special logger for the log pusher itself which bypasses the re-entrant issue? that seems safer than this dance.

If we fail to push for a batch, can we retain the batch to try pushing again later?

claude · 2026-04-14T01:58:16Z

Claude finished @rjpower's task in 8m 16s

Implementing LogPusher improvements

Read current implementation (client.py, task_attempt.py, tests)
Use a dedicated non-propagating logger for LogPusher diagnostics
Retain failed batches for retry on next flush (with buffer cap)
Update tests
Run pre-commit and tests
Commit and push

Summary

Implemented both suggestions:

1. Dedicated non-propagating logger — Created _pusher_logger (iris.log_server.client._internal) with propagate=False and a StreamHandler to stderr. This structurally prevents the re-entrant deadlock: since _pusher_logger never reaches the root logger, a RemoteLogHandler attached there can't feed warnings back into push() → _send(). This replaces the warn_args/warn_exc_info dance of capturing state inside the lock and logging outside it — now _send() just logs directly.

2. Retry failed batches — _send() now calls _requeue(key, entries) on failure, putting entries back into the buffer for the next flush cycle. Buffer is capped at _MAX_BUFFER_PER_KEY = 10,000 entries per key to prevent unbounded memory growth during extended outages; oldest entries are dropped when the cap is exceeded.
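The effect of `propagate=False` in item 1 can be checked with the standard `logging` module alone. In this sketch the logger name `demo.pusher._internal` and the `RootProbe` handler are hypothetical, and the `StreamHandler` writes to an in-memory buffer rather than stderr so the output can be inspected.

```python
import io
import logging

root_seen = []


class RootProbe(logging.Handler):
    """Hypothetical root-level handler, standing in for RemoteLogHandler."""

    def emit(self, record):
        root_seen.append(record.getMessage())


# Dedicated diagnostics logger: records stop here and never reach the root logger.
internal = logging.getLogger("demo.pusher._internal")
internal.propagate = False
stream = io.StringIO()  # the PR description uses stderr; a buffer is used for the demo
internal.addHandler(logging.StreamHandler(stream))

root = logging.getLogger()
probe = RootProbe()
root.addHandler(probe)
try:
    internal.warning("Failed to push log entries")
finally:
    root.removeHandler(probe)

print(root_seen)               # []
print(repr(stream.getvalue())) # 'Failed to push log entries\n'
```

Because the record never reaches a root-level handler, there is no path from the warning back into `push()`, regardless of whether a lock is held at the time.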

Testing: ./infra/pre-commit.py --all-files --fix passes ✅ | All 11 tests in test_remote_log_handler.py pass ✅ (including 2 new tests: test_failed_entries_retried_on_next_flush and test_requeue_caps_buffer_size)

Replace the warn_args/warn_exc_info dance with a dedicated _pusher_logger that has propagate=False and writes directly to stderr. This structurally prevents the re-entrant deadlock via RemoteLogHandler without needing to log outside the lock. On RPC failure, _send() now calls _requeue() to put entries back in the buffer for retry on the next flush cycle, capped at _MAX_BUFFER_PER_KEY (10,000) per key to prevent unbounded memory growth. Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
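A requeue with a per-key cap might look like the sketch below. `RetryBuffer` and its method names are hypothetical, the cap is shrunk to 5 for readability (the PR states 10,000), and placing the failed batch ahead of newer entries is an assumption about ordering that the thread does not spell out.

```python
from collections import defaultdict

_MAX_BUFFER_PER_KEY = 5  # demo value; the PR description uses 10,000


class RetryBuffer:
    """Hypothetical per-key buffer that keeps failed batches for a later flush."""

    def __init__(self):
        self._buffers = defaultdict(list)

    def push(self, key, entry):
        self._buffers[key].append(entry)

    def take(self, key):
        # Hand the whole batch to the sender and start a fresh buffer.
        return self._buffers.pop(key, [])

    def requeue(self, key, entries):
        # Failed entries are older than anything buffered since, so they go first.
        merged = list(entries) + self._buffers[key]
        # Keep only the newest entries when over the cap (oldest are dropped).
        self._buffers[key] = merged[-_MAX_BUFFER_PER_KEY:]


buf = RetryBuffer()
for n in (1, 2, 3, 4):
    buf.push("task-0", n)

batch = buf.take("task-0")   # [1, 2, 3, 4] goes out, and the send fails
buf.push("task-0", 5)        # new entries arrive while the send is failing
buf.push("task-0", 6)
buf.requeue("task-0", batch)

print(buf._buffers["task-0"])  # [2, 3, 4, 5, 6]
```

Entry 1 is dropped because the merged list exceeds the cap, which bounds memory during a long outage at the price of losing the oldest log lines.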

claude bot added the agent-generated Created by automation/agent label Apr 14, 2026

claude bot mentioned this pull request Apr 14, 2026

iris: log pusher should warn on RPC failures/slow-down #4707

Open

chatgpt-codex-connector bot reviewed Apr 14, 2026


rjpower requested a review from yonromai April 14, 2026 00:56

yonromai approved these changes Apr 14, 2026


github-actions bot and others added 2 commits April 14, 2026 02:06

Merge branch 'main' into agent/20260413-fix-4707

3afd23d

rjpower enabled auto-merge (squash) April 14, 2026 20:08

Merge branch 'main' into agent/20260413-fix-4707

615d54b

claude[bot] wants to merge 5 commits into main from agent/20260413-fix-4707
