fix(zephyr): add lifecycle logging to coordinator thread#4006
fix(zephyr): add lifecycle logging to coordinator thread#4006
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 21af0845fb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| # Checked after completion so a clean shutdown racing the final | ||
| # task can never false-positive — only true crashes reach here. | ||
| if not self._coordinator_thread.is_alive(): |
There was a problem hiding this comment.
Guard dead-thread check for coordinators not started via initialize()
_wait_for_stage() now assumes initialize() has already created _coordinator_thread, but this class still exposes the legacy direct-stage flow (start_stage()/pull_task()/report_result()) without starting the background loop. In that path self._coordinator_thread is still None, so calling _wait_for_stage() now crashes with AttributeError here instead of using the existing no-workers timeout/recovery behavior. That regresses the direct ZephyrCoordinator API surface that is still exercised in lib/zephyr/tests/test_execution.py and by any caller using the legacy compat methods.
Useful? React with 👍 / 👎.
1dacc01 to
f9f2da9
Compare
The coordinator thread had no entry, exit, or error logging, making production hangs impossible to diagnose. Wrap the loop in try/except with full traceback logging, and have _wait_for_stage fail fast when the coordinator thread is dead instead of spinning forever. Closes #4004 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests that construct ZephyrCoordinator without initialize() leave _coordinator_thread as None. The guard is needed for that path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
f9f2da9 to
6a8cc6d
Compare
Summary
Fixes #4004 — the coordinator thread (
_coordinator_loop) had no lifecycle logging, making production hangs undiagnosable.Changes
_coordinator_loopnow logs on entry and on clean exittry/except; unhandled exceptions are logged at ERROR with full traceback (exc_info=True) and propagated to the main thread via_fatal_error_wait_for_stagechecks_coordinator_thread.is_alive()after the completion check each poll iteration and raisesZephyrWorkerErrorimmediately if the thread is gone, instead of spinning foreverWhat this would have changed in the #3996 incident
ERROR Coordinator loop crashed with unhandled exception+ traceback_wait_for_stagespun forever at N-1/NZephyrWorkerError: Coordinator thread is no longer alivekubectl exec+threading.enumerate()to diagnoseTest plan
🤖 Generated with Claude Code