Skip to content

[zephyr] Fix coordinator loop crash causing silent pipeline hangs#4008

Merged
yonromai merged 1 commit intomainfrom
fix/zephyr-coordinator-crash
Mar 23, 2026
Merged

[zephyr] Fix coordinator loop crash causing silent pipeline hangs#4008
yonromai merged 1 commit intomainfrom
fix/zephyr-coordinator-crash

Conversation

@yonromai
Copy link
Copy Markdown
Contributor

Summary

Fixes #3996. The coordinator daemon thread could crash from an unhandled RuntimeError (dict mutation during iteration), killing heartbeat checking and causing pipelines to hang at N-1/N with no error.

  • _log_status(): snapshot _worker_states under self._lock before iterating
  • _coordinator_loop(): wrap loop body in try/except that calls abort(), so any crash sets _fatal_error and unblocks _wait_for_stage immediately

Test plan

  • test_coordinator_loop_crash_aborts_pipeline — verified red before patch, green after: injected crash sets _fatal_error instead of silently killing the thread
  • All 29 local execution tests pass

🤖 Generated with Claude Code

@yonromai yonromai added the bug Something isn't working label Mar 23, 2026
def _log_status(self) -> None:
alive = sum(1 for s in self._worker_states.values() if s in {WorkerState.READY, WorkerState.BUSY})
dead = sum(1 for s in self._worker_states.values() if s in {WorkerState.FAILED, WorkerState.DEAD})
with self._lock:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @rjpower : Claude told me this ^ race condition killed my tokenize job, and after I told it that it's very unlikely to trigger, it accepted that the cause of the tokenization job failing was likely something else (hence additional logging in #4006 ). It still insisted this race condition can occur (see test) and bullied me into opening this PR.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems legit. if we have 2000 workers, and log_status happens to trigger around when a worker pings?

@yonromai yonromai requested a review from rjpower March 23, 2026 19:44
@yonromai yonromai enabled auto-merge (squash) March 23, 2026 19:44
def _log_status(self) -> None:
alive = sum(1 for s in self._worker_states.values() if s in {WorkerState.READY, WorkerState.BUSY})
dead = sum(1 for s in self._worker_states.values() if s in {WorkerState.FAILED, WorkerState.DEAD})
with self._lock:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems legit. if we have 2000 workers, and log_status happens to trigger around when a worker pings?

)

_log_status() iterated _worker_states without the lock, racing with
register_worker() on RPC threads. The resulting RuntimeError killed
the coordinator daemon thread silently — no abort, no _fatal_error —
so _wait_for_stage() hung forever on the last in-flight shard.

Fix: snapshot shared state under the lock in _log_status(), and wrap
the coordinator loop body in try/except that routes crashes through
abort() so pipelines fail fast instead of hanging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yonromai yonromai force-pushed the fix/zephyr-coordinator-crash branch from 82802c0 to cdb72c8 Compare March 23, 2026 20:04
@yonromai yonromai merged commit 3552abc into main Mar 23, 2026
38 checks passed
@yonromai yonromai deleted the fix/zephyr-coordinator-crash branch March 23, 2026 20:07
Helw150 pushed a commit that referenced this pull request Apr 8, 2026
)

## Summary

Fixes #3996. The coordinator daemon thread could crash from an unhandled
`RuntimeError` (dict mutation during iteration), killing heartbeat
checking and causing pipelines to hang at N-1/N with no error.

- **`_log_status()`**: snapshot `_worker_states` under `self._lock`
before iterating
- **`_coordinator_loop()`**: wrap loop body in `try/except` that calls
`abort()`, so any crash sets `_fatal_error` and unblocks
`_wait_for_stage` immediately

## Test plan

- [x] `test_coordinator_loop_crash_aborts_pipeline` — verified red
before patch, green after: injected crash sets `_fatal_error` instead of
silently killing the thread
- [x] All 29 local execution tests pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: yoblin <268258002+yoblin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[zephyr] Coordinator thread crashes silently, causing pipelines to hang at N-1/N

3 participants