
[zephyr] Fix dead threading.Event in _wait_for_stage#5048

Open
hsuhanooi wants to merge 1 commit into marin-community:main from hsuhanooi:zephyr-stage-event-fix

Conversation

@hsuhanooi
Contributor

@hsuhanooi hsuhanooi commented Apr 22, 2026

`_wait_for_stage` created a local `stage_done = threading.Event()` and called `wait()` on it, but nothing ever called `set()` on that local object. Stage transitions were driven by a `time.sleep(backoff.next_interval())` loop, so the coordinator waited up to 1 second after each shard completed instead of waking immediately.

Add `self._stage_event` on the coordinator, call `set()` in `report_result`, `report_error`, `abort`, and `register_worker`, and replace the sleep loop with `self._stage_event.wait(timeout)` + `clear()`. Stage transitions now propagate in microseconds.

The bug was introduced in the refactor in #3923, which replaced `sleep()` with `event.wait()` but never wired up the `set()` calls.
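A minimal sketch of the pattern described above. The PR names `self._stage_event`, `report_result`, and `_wait_for_stage`; the surrounding `Coordinator` class and its `_lock`, `_results`, and `_expected` fields are invented here for illustration and need not match the repo's code.

```python
import threading

class Coordinator:
    def __init__(self, expected_shards):
        self._lock = threading.Lock()
        self._results = []
        self._expected = expected_shards
        # Shared event: set() by every method that changes stage-relevant
        # state, so _wait_for_stage wakes immediately instead of polling.
        self._stage_event = threading.Event()

    def report_result(self, result):
        with self._lock:
            self._results.append(result)
        self._stage_event.set()  # wake the waiting coordinator right away

    def _wait_for_stage(self, backstop=1.0):
        # Before the fix, a *local* Event was created here and waited on;
        # nothing ever called set() on it, so wait() degenerated into a
        # sleep of up to `backstop` seconds per iteration.
        while True:
            with self._lock:
                if len(self._results) >= self._expected:
                    return
            # Timeout retained only as a backstop (liveness checks, logs).
            self._stage_event.wait(backstop)
            self._stage_event.clear()
```

Note the ordering: the condition is re-checked after every `clear()`, so a `set()` that races with the `clear()` is never lost; any state change made before the `clear()` is visible at the next check, and one made after it leaves the event set.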

…g with event-driven wakeup

_wait_for_stage created a local threading.Event that was never signaled,
making its wait() call a pure sleep of up to 1 second per iteration.
Each stage transition (scatter→reduce→fold) paid this latency needlessly.

Replace it with self._stage_event, signaled by every coordinator method
that changes stage-relevant state: report_result, report_error, abort,
and register_worker. _start_stage clears the event so signals from the
previous stage don't bleed over.

The backoff timeout is retained as a backstop for the alive-worker check
and periodic log lines. In the normal (no-failure) path, stage transitions
now complete within microseconds of the last shard result arriving.

Benchmark (8 shards, 3-stage group_by pipeline, 70MB synthetic data):
  Before: 14.9s / 14.6s / 17.1s  (avg ~15.6s, high variance)
  After:  13.4s / 13.7s / 13.4s  (avg ~13.5s, low variance)
  ~13% faster; ~1.5s saved from eliminating poll-interval latency at
  each of the 3 stage boundaries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
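An illustrative sketch (not the repo's code) of the clear/wait discipline the commit message describes: clearing the event at stage start so a `set()` from the previous stage cannot wake the new stage spuriously, while keeping a timeout as a backstop for alive-worker checks and periodic logging. The names `start_stage` and `wait_until` are hypothetical.

```python
import threading

stage_event = threading.Event()

def start_stage():
    # Discard any signal left over from the previous stage so it
    # cannot bleed into this stage's wait.
    stage_event.clear()

def wait_until(done, backstop=1.0):
    """Block until done() is true, waking on stage_event or the backstop."""
    while not done():
        signaled = stage_event.wait(backstop)
        stage_event.clear()
        if not signaled:
            # Backstop fired without a signal: a real coordinator would
            # run its alive-worker check and periodic logging here.
            pass
```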
@hsuhanooi force-pushed the zephyr-stage-event-fix branch from 5341e44 to 365f9b8 on April 22, 2026 at 17:01
@hsuhanooi changed the title from "[zephyr] Fix dead threading.Event in _wait_for_stage" to "[zephyr] Fix dead threading.Event and reduce coordinator/worker polling overhead" on Apr 22, 2026
@hsuhanooi force-pushed the zephyr-stage-event-fix branch from 365f9b8 to f98e020 on April 22, 2026 at 17:03
@hsuhanooi changed the title from "[zephyr] Fix dead threading.Event and reduce coordinator/worker polling overhead" back to "[zephyr] Fix dead threading.Event in _wait_for_stage" on Apr 22, 2026
@wmoss
Contributor

wmoss commented Apr 22, 2026

Good catch, definitely a bug. The new implementation looks right.

Controversial opinion 🙈: 1 second isn't that long to wait and we could just strip out all the signaling code (including the old broken code, of course) and just rely on the 1 second poll interval (which I guess is what happened before #3923)

@hsuhanooi
Contributor Author

Yeah, that’s reasonable too. Although maybe it’s useful to fix this here and then test removing the signaling in a different PR?
