[iris] Attach worker log handler before register so bootstrap logs ship by claude[bot] · Pull Request #4795 · marin-community/marin

claude · 2026-04-15T22:02:26Z

Workers created the RemoteLogHandler only after _register() returned, so container bring-up, disk probes, and registration rejections left no searchable remote logs. Resolve /system/log-server via ListEndpoints before register and attach the handler under a bootstrap_log_key keyed by slice_id + ip; rekey to the canonical worker_log_key once the controller assigns a worker_id. Tolerate transient resolve failures without crashing the lifecycle thread.

Fixes #4794
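A minimal sketch of the rekeying idea described above, for illustration only. The key formats, the handler class, and the logger name are assumptions and not the real iris implementation; only the names bootstrap_log_key and worker_log_key come from this PR.

```python
import logging


def bootstrap_log_key(slice_id: str, ip: str) -> str:
    # Hypothetical format; the real key layout is not shown in this PR.
    return f"bootstrap/{slice_id}/{ip}"


def worker_log_key(worker_id: str) -> str:
    # Hypothetical format for the canonical per-worker key.
    return f"worker/{worker_id}"


class KeyedBufferHandler(logging.Handler):
    """Stand-in for RemoteLogHandler: tags each record with the current key."""

    def __init__(self, key: str) -> None:
        super().__init__()
        self.key = key
        self.shipped: list[tuple[str, str]] = []

    def emit(self, record: logging.LogRecord) -> None:
        # A real handler would push to the log server; this one only records.
        self.shipped.append((self.key, record.getMessage()))


logger = logging.getLogger("iris.worker.rekey_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False

# Attach before registration, under a key built from locally known facts.
handler = KeyedBufferHandler(bootstrap_log_key("slice-0", "10.0.0.5"))
logger.addHandler(handler)
logger.info("probing disks")

# Once the controller assigns an id, switch to the canonical key.
handler.key = worker_log_key("worker-42")
logger.info("serving")
```

Records emitted before registration land under the bootstrap key, and later records land under the worker key, so nothing from bring-up is lost.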

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ccd843a321

chatgpt-codex-connector · 2026-04-15T22:05:17Z

+        if self._log_handler is not None:
+            self._log_handler.key = self._current_log_key()
            return


Recreate log pusher when a handler already exists

This early return keeps the existing LogPusher forever and only rewrites the key, so lifecycle re-registration no longer refreshes the /system/log-server endpoint. If the log service address changes after a controller/log-server restart or failover, the worker continues sending to a stale destination and remote process logs stop arriving. The previous behavior detached/recreated the handler on each registration cycle, so this introduces a regression in long-running workers.
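The failure mode can be reproduced with a small stand-in. This is a sketch under stated assumptions: FakePusher, FakeHandler, SketchWorker, and the resolve callable are hypothetical and only mimic the shape of the flagged code.

```python
class FakePusher:
    """Stand-in for LogPusher; remembers the address it was built with."""

    def __init__(self, address: str) -> None:
        self.address = address


class FakeHandler:
    """Stand-in for the remote log handler; holds a pusher and a key."""

    def __init__(self, pusher: FakePusher, key: str) -> None:
        self.pusher = pusher
        self.key = key


class SketchWorker:
    def __init__(self, resolve) -> None:
        self._resolve = resolve  # returns the current log-server address
        self._worker_id = "worker-0"
        self._log_handler = None

    def attach_with_early_return(self) -> None:
        # Mirrors the flagged pattern: an existing handler is only rekeyed,
        # so the endpoint is never resolved again.
        if self._log_handler is not None:
            self._log_handler.key = f"worker/{self._worker_id}"
            return
        pusher = FakePusher(self._resolve())
        self._log_handler = FakeHandler(pusher, f"worker/{self._worker_id}")


# The log server moves between the first and second registration cycle.
addresses = iter(["10.0.0.1:9000", "10.0.0.2:9000"])
worker = SketchWorker(resolve=lambda: next(addresses))
worker.attach_with_early_return()  # first registration cycle
worker.attach_with_early_return()  # re-registration after a failover
stale_address = worker._log_handler.pusher.address
```

After the second call the pusher still points at the first address, because the resolver was never consulted again.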

rjpower · 2026-04-15T22:28:14Z

@claude don't we know our worker id already from the environment information when we booted up? Do we need to wait for the controller for this? The review correctly notes that we don't recreate the LogPusher which seems like a bad call.

Also the tests are mocking everything and add a lot of complexity for little lift.

claude · 2026-04-15T22:28:43Z

Claude finished @rjpower's task in 8m 18s

Done — simplified per your feedback

- Confirm worker_id is available at boot (it is: IRIS_WORKER_ID → slice_id + TPU index → TPU metadata → GCE instance name, all resolved in Worker.__init__)
- Fix the LogPusher early-return regression Codex flagged
- Strip the complexity out of the tests
- Pre-commit + tests green
- Commit & push → b3e545c3b
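For reference, the boot-time fallback chain in the first item could look roughly like the sketch below. The function name, its parameters, the id format for slice_id + TPU index, and the two metadata probes are all assumptions; only the order of the sources and the IRIS_WORKER_ID variable come from this thread.

```python
from typing import Callable, Mapping, Optional


def resolve_worker_id(
    env: Mapping[str, str],
    slice_id: Optional[str],
    tpu_index: Optional[int],
    tpu_metadata_name: Callable[[], Optional[str]],
    gce_instance_name: Callable[[], Optional[str]],
) -> Optional[str]:
    """Return the first worker id that can be determined locally, else None."""
    # 1. An explicit override from the environment wins.
    explicit = env.get("IRIS_WORKER_ID")
    if explicit:
        return explicit
    # 2. Derive an id from the slice and TPU index (hypothetical format).
    if slice_id is not None and tpu_index is not None:
        return f"{slice_id}-{tpu_index}"
    # 3. and 4. Fall back to metadata probes, in order.
    for probe in (tpu_metadata_name, gce_instance_name):
        name = probe()
        if name:
            return name
    # Nothing local: the controller would have to assign the id.
    return None


def no_name() -> Optional[str]:
    return None


from_env = resolve_worker_id({"IRIS_WORKER_ID": "w-7"}, "slice-a", 3, no_name, no_name)
from_slice = resolve_worker_id({}, "slice-a", 3, no_name, no_name)
from_gce = resolve_worker_id({}, None, None, no_name, lambda: "gce-vm-1")
unknown = resolve_worker_id({}, None, None, no_name, no_name)
```

Only the last case, where every local source comes back empty, would need the controller to name the worker.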

What changed

lib/iris/src/iris/cluster/worker/worker.py

- Dropped bootstrap_log_key / _current_log_key / _rekey_log_handler_for_worker.
- _attach_log_handler now always detaches + rebuilds the LogPusher, re-resolving /system/log-server each call. Fixes the Codex-flagged regression where a log-server failover would be invisible to long-running workers.

Attach pre-register when self._worker_id is known locally (the common case). In the rare case where the controller assigns the id, _run_lifecycle re-attaches under the canonical key only if the id changed:

self._attach_log_handler()
worker_id = self._register(stop_event)
if worker_id is None:
    break
if worker_id != self._worker_id:
    self._worker_id = worker_id
    self._attach_log_handler()
self._serve(stop_event)
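The snippet above relies on _attach_log_handler being safe to call repeatedly. A self-contained sketch of that detach-and-rebuild behavior follows; SketchPusher, SketchHandler, SketchWorker, the key format, and the resolve callable are hypothetical stand-ins, and dropping the handler when the resolve fails is an assumption about how the failure is tolerated.

```python
import logging
from typing import Callable, Optional

log = logging.getLogger("iris.worker.attach_sketch")
log.propagate = False
log.addHandler(logging.NullHandler())  # keeps the sketch quiet on stderr


class SketchPusher:
    """Stand-in for LogPusher: bound to one address for its whole life."""

    def __init__(self, address: str) -> None:
        self.address = address
        self.closed = False

    def close(self) -> None:
        self.closed = True


class SketchHandler(logging.Handler):
    """Stand-in for RemoteLogHandler: owns a pusher and a log key."""

    def __init__(self, pusher: SketchPusher, key: str) -> None:
        super().__init__()
        self.pusher = pusher
        self.key = key

    def emit(self, record: logging.LogRecord) -> None:
        pass  # a real handler would hand the record to the pusher


class SketchWorker:
    def __init__(self, worker_id: str, resolve: Callable[[], str]) -> None:
        self._worker_id = worker_id
        self._resolve = resolve  # stands in for the ListEndpoints lookup
        self._log_handler: Optional[SketchHandler] = None

    def _detach_log_handler(self) -> None:
        if self._log_handler is None:
            return
        log.removeHandler(self._log_handler)
        self._log_handler.pusher.close()
        self._log_handler = None

    def _attach_log_handler(self) -> None:
        # Always start from a clean slate so a new address is picked up.
        self._detach_log_handler()
        try:
            address = self._resolve()
        except Exception:
            # Transient failure: keep the lifecycle going without remote logs.
            log.warning("log-server resolve failed", exc_info=True)
            return
        handler = SketchHandler(SketchPusher(address), f"worker/{self._worker_id}")
        log.addHandler(handler)
        self._log_handler = handler
```

Because every call resolves the endpoint again, a second attach after a failover builds a pusher for the new address, and a failed resolve leaves the worker running with no remote handler until the next cycle.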

lib/iris/src/iris/cluster/log_store/{__init__.py,_types.py}

Removed bootstrap_log_key — no longer needed.

lib/iris/tests/cluster/worker/test_worker.py

Replaced the two long tests (~125 lines) with three focused ones (~95 lines): _log_server_endpoints helper + a recording_log_pusher fixture do the mocking once.
- test_attach_log_handler_uses_worker_log_key_before_register — attach before register uses worker_log_key(self._worker_id).
- test_attach_log_handler_tolerates_resolve_failure — ListEndpoints failures don't crash.
- test_attach_log_handler_rebuilds_pusher_on_reattach — second attach picks up a new log-server address (regression guard for the Codex issue).

Net diff: +104 / −142 (shrinks the PR rather than growing it).

How I tested

./infra/pre-commit.py --all-files --fix → OK
uv run pytest lib/iris/tests/cluster/worker/test_worker.py → 36 passed in 29.94s
• agent/20260415-fix-4794 · commit b3e545c3b

…er each cycle

Addresses review feedback on #4795:

- Worker._worker_id is resolved at boot from IRIS_WORKER_ID / slice_id + TPU index / TPU metadata / GCE instance name, so we almost never need to wait for the controller to name us. Drop the bootstrap_log_key + rekey dance: attach pre-register under worker_log_key when the id is known locally; only re-attach post-register when the controller assigned a different id (the rare fallback path).
- Fix the regression Codex flagged: _attach_log_handler now always detaches and rebuilds the LogPusher, so each lifecycle cycle re-resolves /system/log-server and picks up log-server failover instead of sending to a stale address forever.
- Collapse the two mock-heavy tests into three tight ones: attach uses worker_log_key, resolve failure doesn't crash, repeated attach rebuilds the pusher.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>

claude bot added the agent-generated label (Created by automation/agent) Apr 15, 2026

claude bot mentioned this pull request Apr 15, 2026

[iris] Let workers send bootstrap logs before registration #4794

Closed

chatgpt-codex-connector bot reviewed Apr 15, 2026

rjpower approved these changes Apr 15, 2026

dlwh merged commit 15e98d6 into main Apr 15, 2026
42 checks passed

dlwh deleted the agent/20260415-fix-4794 branch April 15, 2026 22:53

dlwh merged 2 commits into main from agent/20260415-fix-4794
