fix(zephyr): add lifecycle logging to coordinator thread by yonromai · Pull Request #4006 · marin-community/marin

yonromai · 2026-03-23T19:16:36Z

Summary

Fixes #4004 — the coordinator thread (_coordinator_loop) had no lifecycle logging, making production hangs undiagnosable.

Changes

Start/exit logging: _coordinator_loop now logs on entry and on clean exit
Crash handling: Wrapped the loop body in try/except; unhandled exceptions are logged at ERROR with full traceback (exc_info=True) and propagated to the main thread via _fatal_error
Dead-thread detection: _wait_for_stage checks _coordinator_thread.is_alive() after the completion check each poll iteration and raises ZephyrWorkerError immediately if the thread is gone, instead of spinning forever
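The three changes above can be sketched in isolation. This is a minimal illustration, not the actual Zephyr code: `CoordinatorSketch`, `_step`, `_shutdown`, `_stage_done`, and the local `ZephyrWorkerError` class are hypothetical stand-ins, and only the names `_coordinator_loop`, `_wait_for_stage`, `_coordinator_thread`, and `_fatal_error` come from the PR description.

```python
import logging
import threading
import time

logger = logging.getLogger("zephyr.coordinator.sketch")


class ZephyrWorkerError(RuntimeError):
    """Stand-in for the real exception class named in the PR."""


class CoordinatorSketch:
    """Hypothetical, stripped-down coordinator; not the real ZephyrCoordinator."""

    def __init__(self, step):
        self._step = step  # hypothetical: one unit of coordinator work
        self._shutdown = threading.Event()
        self._stage_done = threading.Event()
        self._fatal_error = None
        self._coordinator_thread = None

    def initialize(self):
        self._coordinator_thread = threading.Thread(
            target=self._coordinator_loop, name="coordinator", daemon=True
        )
        self._coordinator_thread.start()

    def _coordinator_loop(self):
        logger.info("Coordinator loop started")
        try:
            while not self._shutdown.is_set():
                self._step()
                self._shutdown.wait(0.01)
        except Exception as exc:
            # Log the full traceback, then record the error for the main thread.
            logger.error(
                "Coordinator loop crashed with unhandled exception", exc_info=True
            )
            self._fatal_error = exc
            return
        logger.info("Coordinator loop exited cleanly")

    def _wait_for_stage(self, poll_interval=0.01):
        while True:
            # Completion is checked first, so a clean shutdown that races the
            # final task is not reported as a crash.
            if self._stage_done.is_set():
                return
            thread = self._coordinator_thread
            # The None guard covers callers that never ran initialize().
            if thread is not None and not thread.is_alive():
                raise ZephyrWorkerError(
                    "Coordinator thread is no longer alive"
                ) from self._fatal_error
            time.sleep(poll_interval)
```

In this sketch, `_fatal_error` is assigned before the thread exits, so by the time `is_alive()` returns `False` the original exception is available to chain as the cause.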

What this would have changed in the #3996 incident

| Before | After |
| --- | --- |
| No log when coordinator thread died | `ERROR Coordinator loop crashed with unhandled exception` + traceback |
| `_wait_for_stage` spun forever at N-1/N | Immediate `ZephyrWorkerError: Coordinator thread is no longer alive` |
| Required `kubectl exec` + `threading.enumerate()` to diagnose | Root cause visible in pod logs |

Test plan

Manual smoke test: verified start log, exit log, crash log + traceback, and dead-thread detection all produce expected output
CI green

🤖 Generated with Claude Code

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 21af0845fb

chatgpt-codex-connector · 2026-03-23T19:19:48Z


+            # Checked after completion so a clean shutdown racing the final
+            # task can never false-positive — only true crashes reach here.
+            if not self._coordinator_thread.is_alive():


Guard dead-thread check for coordinators not started via initialize()

_wait_for_stage() now assumes initialize() has already created _coordinator_thread, but this class still exposes the legacy direct-stage flow (start_stage()/pull_task()/report_result()) without starting the background loop. In that path self._coordinator_thread is still None, so calling _wait_for_stage() now crashes with AttributeError here instead of using the existing no-workers timeout/recovery behavior. That regresses the direct ZephyrCoordinator API surface that is still exercised in lib/zephyr/tests/test_execution.py and by any caller using the legacy compat methods.
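A minimal sketch of the guard this comment asks for, assuming the liveness check can be factored into a helper. The function name `coordinator_thread_died` is hypothetical and does not appear in the PR; it only illustrates why `None` has to be handled before calling `is_alive()`.

```python
import threading
from typing import Optional


def coordinator_thread_died(thread: Optional[threading.Thread]) -> bool:
    """Return True when a thread object exists but is not running.

    A value of None means the background loop was never created (the legacy
    direct-stage path), so there is nothing that could have died.
    """
    return thread is not None and not thread.is_alive()


# Legacy path: no coordinator thread, so the check must not raise or trip.
assert coordinator_thread_died(None) is False

# Without the guard, the same check fails on None:
try:
    None.is_alive()  # mirrors self._coordinator_thread.is_alive() when unset
except AttributeError:
    pass
```

With a guard like this, callers on the legacy path fall through to whatever timeout or recovery behavior `_wait_for_stage()` already had.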

The coordinator thread had no entry, exit, or error logging, making production hangs impossible to diagnose. Wrap the loop in try/except with full traceback logging, and have _wait_for_stage fail fast when the coordinator thread is dead instead of spinning forever.

Closes #4004

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector Bot reviewed Mar 23, 2026

yonromai force-pushed the fix/coordinator-lifecycle-logging branch from 1dacc01 to f9f2da9 · March 23, 2026 19:33

yonromai mentioned this pull request Mar 23, 2026

[zephyr] Fix coordinator loop crash causing silent pipeline hangs #4008

Merged

2 tasks

yoblin and others added 2 commits March 23, 2026 20:05

fix: restore None guard for legacy direct-stage callers

6a8cc6d

Tests that construct ZephyrCoordinator without initialize() leave _coordinator_thread as None. The guard is needed for that path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

yonromai force-pushed the fix/coordinator-lifecycle-logging branch from f9f2da9 to 6a8cc6d · March 23, 2026 20:05

yonromai closed this Mar 23, 2026

yonromai wants to merge 2 commits into main from fix/coordinator-lifecycle-logging