[iris] Attach worker log handler before register so bootstrap logs ship #4795

Merged
dlwh merged 2 commits into main from agent/20260415-fix-4794 on Apr 15, 2026
@claude claude bot commented Apr 15, 2026

Workers created the RemoteLogHandler only after _register() returned, so container bring-up, disk probes, and registration rejections left no searchable remote logs. Resolve /system/log-server via ListEndpoints before register and attach the handler under a bootstrap_log_key keyed by slice_id + ip; rekey to the canonical worker_log_key once the controller assigns a worker_id. Tolerate transient resolve failures without crashing the lifecycle thread.

Fixes #4794
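
The ordering described above can be sketched with the standard `logging` module. This is a minimal illustration under stated assumptions, not the iris implementation: `RecordingHandler`, `attach_before_register`, `make_logger`, and the key strings are hypothetical stand-ins, and `resolve` stands in for the `ListEndpoints` lookup of `/system/log-server`.

```python
import logging
from typing import Callable, List, Optional, Tuple


class RecordingHandler(logging.Handler):
    """Hypothetical stand-in for RemoteLogHandler: keeps (key, message) pairs in memory."""

    def __init__(self, address: str, key: str) -> None:
        super().__init__()
        self.address = address
        self.key = key
        self.shipped: List[Tuple[str, str]] = []

    def emit(self, record: logging.LogRecord) -> None:
        # A real handler would push to the log server; here we only record.
        self.shipped.append((self.key, record.getMessage()))


def make_logger(name: str) -> logging.Logger:
    """Build an isolated INFO-level logger so the sketch does not touch the root logger."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def attach_before_register(
    logger: logging.Logger, resolve: Callable[[], str], key: str
) -> Optional[RecordingHandler]:
    """Attach a handler before registration; a failed resolve is logged, not raised."""
    try:
        address = resolve()
    except Exception as exc:  # assumed transient: keep the lifecycle thread alive
        logger.warning("log-server resolve failed, continuing without remote logs: %s", exc)
        return None
    handler = RecordingHandler(address, key)
    logger.addHandler(handler)
    return handler
```

With a handler attached this way, messages logged during bring-up are captured under the bootstrap key, and assigning `handler.key` after registration switches later messages to the canonical key.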

@claude claude bot added the agent-generated Created by automation/agent label Apr 15, 2026
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ccd843a321

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 494 to 496
if self._log_handler is not None:
    self._log_handler.key = self._current_log_key()
    return
[P2] Recreate log pusher when a handler already exists

This early return keeps the existing LogPusher forever and only rewrites the key, so lifecycle re-registration no longer refreshes the /system/log-server endpoint. If the log service address changes after a controller/log-server restart or failover, the worker continues sending to a stale destination and remote process logs stop arriving. The previous behavior detached/recreated the handler on each registration cycle, so this introduces a regression in long-running workers.
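
The difference between the two behaviors can be seen in a small, self-contained sketch. It assumes nothing about the real `LogPusher` beyond what the comment says: `FakePusher`, `WorkerSketch`, and both method names are hypothetical, and `resolve` stands in for re-resolving `/system/log-server`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FakePusher:
    """Hypothetical stand-in for LogPusher: only remembers its target address."""

    address: str
    closed: bool = False

    def close(self) -> None:
        self.closed = True


class WorkerSketch:
    def __init__(self, resolve: Callable[[], str]) -> None:
        self._resolve = resolve
        self._pusher: Optional[FakePusher] = None

    def attach_keep_existing(self) -> None:
        # Mirrors the flagged early return: an existing pusher is never replaced,
        # so the endpoint is resolved only once per process.
        if self._pusher is not None:
            return
        self._pusher = FakePusher(self._resolve())

    def attach_rebuild(self) -> None:
        # Alternative: close the old pusher and resolve the endpoint again.
        if self._pusher is not None:
            self._pusher.close()
        self._pusher = FakePusher(self._resolve())


def address_after_failover(method_name: str) -> str:
    """Attach twice while the resolved address changes in between."""
    addresses = iter(["10.0.0.1:9000", "10.0.0.2:9000"])
    worker = WorkerSketch(lambda: next(addresses))
    getattr(worker, method_name)()
    getattr(worker, method_name)()
    assert worker._pusher is not None
    return worker._pusher.address
```

In this sketch `address_after_failover("attach_keep_existing")` still points at the first address after the simulated failover, while `address_after_failover("attach_rebuild")` points at the second.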

Useful? React with 👍 / 👎.

rjpower commented Apr 15, 2026

@claude don't we know our worker id already from the environment information when we booted up? Do we need to wait for the controller for this? The review correctly notes that we don't recreate the LogPusher which seems like a bad call.

Also the tests are mocking everything and add a lot of complexity for little lift.

claude bot commented Apr 15, 2026

Claude finished @rjpower's task in 8m 18s


Done — simplified per your feedback

  • Confirm worker_id is available at boot (it is: IRIS_WORKER_ID → slice_id + TPU index → TPU metadata → GCE instance name, all resolved in Worker.__init__)
  • Fix the LogPusher early-return regression Codex flagged
  • Strip the complexity out of the tests
  • Pre-commit + tests green
  • Commit & push → b3e545c3b
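
The fallback order in the first checklist item can be illustrated as a first-match chain. This is only a sketch of the ordering: `first_resolved` and `resolve_worker_id` are hypothetical helpers, the `f"{slice_id}-{tpu_index}"` format is an assumption, and the TPU metadata and GCE lookups are passed in as plain callables rather than real metadata-server calls.

```python
from typing import Callable, Mapping, Optional, Sequence

Resolver = Callable[[], Optional[str]]


def first_resolved(resolvers: Sequence[Resolver]) -> Optional[str]:
    """Return the first non-empty value, trying resolvers in order."""
    for resolve in resolvers:
        value = resolve()
        if value:
            return value
    return None


def resolve_worker_id(
    env: Mapping[str, str],
    slice_id: Optional[str],
    tpu_index: Optional[int],
    tpu_metadata_id: Resolver,
    gce_instance_name: Resolver,
) -> Optional[str]:
    """Hypothetical boot-time resolution, in the order listed in the checklist."""

    def from_slice() -> Optional[str]:
        # Assumed format; the real combination of slice_id and TPU index may differ.
        if slice_id and tpu_index is not None:
            return f"{slice_id}-{tpu_index}"
        return None

    return first_resolved(
        [
            lambda: env.get("IRIS_WORKER_ID"),
            from_slice,
            tpu_metadata_id,
            gce_instance_name,
        ]
    )
```

If every source comes back empty the function returns `None`, which corresponds to the fallback path where the controller assigns the id.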

What changed

lib/iris/src/iris/cluster/worker/worker.py

  • Dropped bootstrap_log_key / _current_log_key / _rekey_log_handler_for_worker.
  • _attach_log_handler now always detaches + rebuilds the LogPusher, re-resolving /system/log-server each call. Fixes the Codex-flagged regression where a log-server failover would be invisible to long-running workers.
  • Attach pre-register when self._worker_id is known locally (the common case). In the rare case where the controller assigns the id, _run_lifecycle re-attaches under the canonical key only if the id changed:
    self._attach_log_handler()
    worker_id = self._register(stop_event)
    if worker_id is None:
        break
    if worker_id != self._worker_id:
        self._worker_id = worker_id
        self._attach_log_handler()
    self._serve(stop_event)
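
To make the control flow of that excerpt easy to trace, here is a runnable sketch with stubbed methods. The `while True` loop, the `LifecycleSketch` class, the scripted `_register` results, and the choice to skip attaching while the id is unknown are all assumptions for illustration; only the body of the loop follows the excerpt.

```python
from typing import List, Optional


class LifecycleSketch:
    """Hypothetical harness that records which worker id each attach call used."""

    def __init__(self, local_id: Optional[str], assigned_ids: List[Optional[str]]) -> None:
        self._worker_id = local_id
        self._assigned = list(assigned_ids)  # scripted results for _register()
        self.attached_ids: List[str] = []

    def _attach_log_handler(self) -> None:
        # Assumption: nothing to attach under until an id is known.
        if self._worker_id is None:
            return
        self.attached_ids.append(self._worker_id)

    def _register(self) -> Optional[str]:
        # None stands for "stop requested" and ends the loop.
        return self._assigned.pop(0) if self._assigned else None

    def _serve(self) -> None:
        pass  # the real worker would block here until it must re-register

    def run(self) -> None:
        while True:
            self._attach_log_handler()
            worker_id = self._register()
            if worker_id is None:
                break
            if worker_id != self._worker_id:
                self._worker_id = worker_id
                self._attach_log_handler()
            self._serve()
```

With a locally known id that the controller confirms, each cycle attaches exactly once before registering. When the id is unknown at boot, the first attach happens only after the controller assigns one.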

lib/iris/src/iris/cluster/log_store/{__init__.py,_types.py}

  • Removed bootstrap_log_key — no longer needed.

lib/iris/tests/cluster/worker/test_worker.py

  • Replaced the two long tests (~125 lines) with three focused ones (~95 lines): _log_server_endpoints helper + a recording_log_pusher fixture do the mocking once.
    • test_attach_log_handler_uses_worker_log_key_before_register: attach before register uses worker_log_key(self._worker_id).
    • test_attach_log_handler_tolerates_resolve_failure: ListEndpoints failures don't crash.
    • test_attach_log_handler_rebuilds_pusher_on_reattach: second attach picks up a new log-server address (regression guard for the Codex issue).

Net diff: +104 / −142 (shrinks the PR rather than growing it).

How I tested

  • ./infra/pre-commit.py --all-files --fix → OK
  • uv run pytest lib/iris/tests/cluster/worker/test_worker.py → 36 passed in 29.94s
    agent/20260415-fix-4794 · commit b3e545c3b

…er each cycle

Addresses review feedback on #4795:

- Worker._worker_id is resolved at boot from IRIS_WORKER_ID / slice_id +
  TPU index / TPU metadata / GCE instance name, so we almost never need
  to wait for the controller to name us. Drop the bootstrap_log_key +
  rekey dance: attach pre-register under worker_log_key when the id is
  known locally; only re-attach post-register when the controller
  assigned a different id (the rare fallback path).
- Fix the regression Codex flagged: _attach_log_handler now always
  detaches and rebuilds the LogPusher, so each lifecycle cycle
  re-resolves /system/log-server and picks up log-server failover
  instead of sending to a stale address forever.
- Collapse the two mock-heavy tests into three tight ones: attach uses
  worker_log_key, resolve failure doesn't crash, repeated attach rebuilds
  the pusher.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
@dlwh dlwh merged commit 15e98d6 into main Apr 15, 2026
42 checks passed
@dlwh dlwh deleted the agent/20260415-fix-4794 branch April 15, 2026 22:53
Labels

agent-generated Created by automation/agent

Development

Successfully merging this pull request may close these issues.

[iris] Let workers send bootstrap logs before registration