fix(zephyr): add lifecycle logging to coordinator thread #4006

Closed
yonromai wants to merge 2 commits into main from fix/coordinator-lifecycle-logging

@yonromai
Contributor

Summary

Fixes #4004 — the coordinator thread (_coordinator_loop) had no lifecycle logging, making production hangs undiagnosable.

Changes

  • Start/exit logging: _coordinator_loop now logs on entry and on clean exit
  • Crash handling: Wrapped the loop body in try/except; unhandled exceptions are logged at ERROR with full traceback (exc_info=True) and propagated to the main thread via _fatal_error
  • Dead-thread detection: _wait_for_stage checks _coordinator_thread.is_alive() after the completion check each poll iteration and raises ZephyrWorkerError immediately if the thread is gone, instead of spinning forever
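The first two changes can be sketched roughly as follows. This is a minimal illustration, not the code in execution.py: `CoordinatorSketch`, the `step` callable, and the start/exit log messages are assumptions made for the example, and `_fatal_error` is modelled as a plain attribute because the PR does not show how the main thread consumes it.

```python
import logging
import threading

logger = logging.getLogger("zephyr.sketch")  # hypothetical logger name


class CoordinatorSketch:
    """Hypothetical stand-in for the coordinator, not the real ZephyrCoordinator."""

    def __init__(self, step):
        # `step` is an assumed callable that runs one loop iteration and
        # returns False when the loop should stop.
        self._step = step
        self._fatal_error = None
        self._coordinator_thread = None

    def _coordinator_loop(self):
        logger.info("Coordinator loop started")
        try:
            while self._step():
                pass
        except Exception as exc:
            # Log at ERROR with the full traceback, then record the error
            # so the main thread can see why the loop stopped.
            logger.error(
                "Coordinator loop crashed with unhandled exception",
                exc_info=True,
            )
            self._fatal_error = exc
        else:
            logger.info("Coordinator loop exited cleanly")

    def initialize(self):
        self._coordinator_thread = threading.Thread(
            target=self._coordinator_loop, name="coordinator", daemon=True
        )
        self._coordinator_thread.start()
```

In this sketch a crash leaves both a log record and a non-None `_fatal_error`, while a clean exit leaves `_fatal_error` as None, so the two cases stay distinguishable after the thread is gone.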

What this would have changed in the #3996 incident

| Before | After |
|---|---|
| No log when coordinator thread died | ERROR Coordinator loop crashed with unhandled exception + traceback |
| _wait_for_stage spun forever at N-1/N | Immediate ZephyrWorkerError: Coordinator thread is no longer alive |
| Required kubectl exec + threading.enumerate() to diagnose | Root cause visible in pod logs |

Test plan

  • Manual smoke test: verified start log, exit log, crash log + traceback, and dead-thread detection all produce expected output
  • CI green

🤖 Generated with Claude Code

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 21af0845fb

Comment thread lib/zephyr/src/zephyr/execution.py Outdated

# Checked after completion so a clean shutdown racing the final
# task can never false-positive — only true crashes reach here.
if not self._coordinator_thread.is_alive():

P2: Guard dead-thread check for coordinators not started via initialize()

_wait_for_stage() now assumes initialize() has already created _coordinator_thread, but this class still exposes the legacy direct-stage flow (start_stage()/pull_task()/report_result()) without starting the background loop. In that path self._coordinator_thread is still None, so calling _wait_for_stage() now crashes with AttributeError here instead of using the existing no-workers timeout/recovery behavior. That regresses the direct ZephyrCoordinator API surface that is still exercised in lib/zephyr/tests/test_execution.py and by any caller using the legacy compat methods.
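One way to picture the suggested guard is the sketch below. It assumes a simplified polling loop: `wait_for_stage`, `is_complete`, `poll_interval`, and `timeout` are hypothetical names for this example, the local `ZephyrWorkerError` stands in for the real exception class, and the `TimeoutError` stands in for whatever no-workers timeout behavior the real method has.

```python
import threading
import time


class ZephyrWorkerError(RuntimeError):
    """Local stand-in for the error type named in the PR."""


def wait_for_stage(is_complete, coordinator_thread, poll_interval=0.01, timeout=5.0):
    """Poll until is_complete() returns True; fail fast if the thread died."""
    deadline = time.monotonic() + timeout
    while True:
        # Completion is checked first, so a clean shutdown that races the
        # final task is never reported as a crash.
        if is_complete():
            return
        # Guard: in the direct-stage flow no thread was ever started, so
        # coordinator_thread is None and the liveness check is skipped.
        if coordinator_thread is not None and not coordinator_thread.is_alive():
            raise ZephyrWorkerError("Coordinator thread is no longer alive")
        # Assumed stand-in for the existing timeout/recovery path.
        if time.monotonic() >= deadline:
            raise TimeoutError("stage did not complete in time")
        time.sleep(poll_interval)
```

With the `is not None` check in place, a coordinator that never called initialize() falls through to the timeout path instead of raising AttributeError.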

yoblin and others added 2 commits March 23, 2026 20:05
The coordinator thread had no entry, exit, or error logging, making
production hangs impossible to diagnose. Wrap the loop in try/except
with full traceback logging, and have _wait_for_stage fail fast when
the coordinator thread is dead instead of spinning forever.

Closes #4004

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Tests that construct ZephyrCoordinator without initialize() leave
_coordinator_thread as None. The guard is needed for that path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yonromai force-pushed the fix/coordinator-lifecycle-logging branch from f9f2da9 to 6a8cc6d on March 23, 2026 20:05
yonromai closed this on Mar 23, 2026

Development

Successfully merging this pull request may close these issues.

[zephyr] Coordinator thread has no lifecycle logging — hangs are undiagnosable
