[iris] Attach worker log handler before register so bootstrap logs ship#4795
[iris] Attach worker log handler before register so bootstrap logs ship#4795
Conversation
Workers previously created the RemoteLogHandler only after _register() returned, so container bring-up, disk probes, and registration rejections left no searchable remote logs. Resolve /system/log-server via ListEndpoints before register and attach the handler under a bootstrap_log_key keyed by slice_id + ip; rekey to the canonical worker_log_key once the controller assigns a worker_id. Tolerate a transient resolve failure without crashing the lifecycle thread. Fixes #4794
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ccd843a321
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if self._log_handler is not None: | ||
| self._log_handler.key = self._current_log_key() | ||
| return |
There was a problem hiding this comment.
Recreate log pusher when a handler already exists
This early return keeps the existing LogPusher forever and only rewrites the key, so lifecycle re-registration no longer refreshes the /system/log-server endpoint. If the log service address changes after a controller/log-server restart or failover, the worker continues sending to a stale destination and remote process logs stop arriving. The previous behavior detached/recreated the handler on each registration cycle, so this introduces a regression in long-running workers.
Useful? React with 👍 / 👎.
|
@claude don't we know our worker id already from the environment information when we booted up? Do we need to wait for the controller for this? The review correctly notes that we don't recreate the LogPusher which seems like a bad call. Also the tests are mocking everything and add a lot of complexity for little lift. |
|
Claude finished @rjpower's task in 8m 18s —— View job Done — simplified per your feedback
What changed
Net diff: +104 / −142 (shrinks the PR rather than growing it). How I tested
|
…er each cycle Addresses review feedback on #4795: - Worker._worker_id is resolved at boot from IRIS_WORKER_ID / slice_id + TPU index / TPU metadata / GCE instance name, so we almost never need to wait for the controller to name us. Drop the bootstrap_log_key + rekey dance: attach pre-register under worker_log_key when the id is known locally; only re-attach post-register when the controller assigned a different id (the rare fallback path). - Fix the regression Codex flagged: _attach_log_handler now always detaches and rebuilds the LogPusher, so each lifecycle cycle re-resolves /system/log-server and picks up log-server failover instead of sending to a stale address forever. - Collapse the two mock-heavy tests into three tight ones: attach uses worker_log_key, resolve failure doesn't crash, repeated attach rebuilds the pusher. Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Workers created the RemoteLogHandler only after _register() returned, so container bring-up, disk probes, and registration rejections left no searchable remote logs. Resolve /system/log-server via ListEndpoints before register and attach the handler under a bootstrap_log_key keyed by slice_id + ip; rekey to the canonical worker_log_key once the controller assigns a worker_id. Tolerate transient resolve failures without crashing the lifecycle thread.
Fixes #4794