[zephyr] Fix coordinator loop crash causing silent pipeline hangs by yonromai · Pull Request #4008 · marin-community/marin

yonromai · 2026-03-23T19:40:18Z

Summary

Fixes #3996. The coordinator daemon thread could crash from an unhandled RuntimeError (dict mutation during iteration), killing heartbeat checking and causing pipelines to hang at N-1/N with no error.

_log_status(): snapshot _worker_states under self._lock before iterating
_coordinator_loop(): wrap loop body in try/except that calls abort(), so any crash sets _fatal_error and unblocks _wait_for_stage immediately

Test plan

test_coordinator_loop_crash_aborts_pipeline — verified red before patch, green after: injected crash sets _fatal_error instead of silently killing the thread
All 29 local execution tests pass

🤖 Generated with Claude Code

yonromai · 2026-03-23T19:44:06Z

    def _log_status(self) -> None:
-        alive = sum(1 for s in self._worker_states.values() if s in {WorkerState.READY, WorkerState.BUSY})
-        dead = sum(1 for s in self._worker_states.values() if s in {WorkerState.FAILED, WorkerState.DEAD})
+        with self._lock:


FYI @rjpower : Claude told me this ^ race condition killed my tokenize job, and after I told it that it's very unlikely to trigger, it accepted that the cause of the tokenization job failing was likely something else (hence additional logging in #4006 ). It still insisted this race condition can occur (see test) and bullied me into opening this PR.

seems legit. if we have 2000 workers, and log_status happens to trigger around when a worker pings?

rjpower · 2026-03-23T19:50:13Z

    def _log_status(self) -> None:
-        alive = sum(1 for s in self._worker_states.values() if s in {WorkerState.READY, WorkerState.BUSY})
-        dead = sum(1 for s in self._worker_states.values() if s in {WorkerState.FAILED, WorkerState.DEAD})
+        with self._lock:


seems legit. if we have 2000 workers, and log_status happens to trigger around when a worker pings?

) _log_status() iterated _worker_states without the lock, racing with register_worker() on RPC threads. The resulting RuntimeError killed the coordinator daemon thread silently — no abort, no _fatal_error — so _wait_for_stage() hung forever on the last in-flight shard. Fix: snapshot shared state under the lock in _log_status(), and wrap the coordinator loop body in try/except that routes crashes through abort() so pipelines fail fast instead of hanging. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

) ## Summary Fixes #3996. The coordinator daemon thread could crash from an unhandled `RuntimeError` (dict mutation during iteration), killing heartbeat checking and causing pipelines to hang at N-1/N with no error. - **`_log_status()`**: snapshot `_worker_states` under `self._lock` before iterating - **`_coordinator_loop()`**: wrap loop body in `try/except` that calls `abort()`, so any crash sets `_fatal_error` and unblocks `_wait_for_stage` immediately ## Test plan - [x] `test_coordinator_loop_crash_aborts_pipeline` — verified red before patch, green after: injected crash sets `_fatal_error` instead of silently killing the thread - [x] All 29 local execution tests pass 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: yoblin <268258002+yoblin@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

yonromai added the bug Something isn't working label Mar 23, 2026

yonromai commented Mar 23, 2026

View reviewed changes

yonromai requested a review from rjpower March 23, 2026 19:44

yonromai enabled auto-merge (squash) March 23, 2026 19:44

rjpower approved these changes Mar 23, 2026

View reviewed changes

yonromai force-pushed the fix/zephyr-coordinator-crash branch from 82802c0 to cdb72c8 Compare March 23, 2026 20:04

yonromai merged commit 3552abc into main Mar 23, 2026
38 checks passed

yonromai deleted the fix/zephyr-coordinator-crash branch March 23, 2026 20:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[zephyr] Fix coordinator loop crash causing silent pipeline hangs#4008

[zephyr] Fix coordinator loop crash causing silent pipeline hangs#4008
yonromai merged 1 commit intomainfrom
fix/zephyr-coordinator-crash

yonromai commented Mar 23, 2026

Uh oh!

yonromai Mar 23, 2026

Uh oh!

rjpower Mar 23, 2026

Uh oh!

rjpower Mar 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

yonromai commented Mar 23, 2026

Summary

Test plan

Uh oh!

yonromai Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

rjpower Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

rjpower Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants