
[zephyr] Fix dead threading.Event in _wait_for_stage#5048

Open
hsuhanooi wants to merge 1 commit into marin-community:main from hsuhanooi:zephyr-stage-event-fix

Conversation

@hsuhanooi
Contributor

@hsuhanooi hsuhanooi commented Apr 22, 2026

`_wait_for_stage` created a local `stage_done = threading.Event()` and called `wait()` on it, but nothing ever called `set()` on that local object. Stage transitions were driven by a `time.sleep(backoff.next_interval())` loop, so the coordinator waited up to 1 second after each shard completed instead of waking immediately.

Add `self._stage_event` on the coordinator, call `set()` in `report_result`, `report_error`, `abort`, and `register_worker`, and replace the sleep loop with `self._stage_event.wait(timeout)` + `clear()`. Stage transitions now propagate in microseconds.

The bug was introduced in the refactor in #3923, which replaced `sleep()` with `event.wait()` but never wired up the `set()` calls.
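A minimal sketch of the pattern described above. The PR names `self._stage_event`, `report_result`, and `_wait_for_stage`; the surrounding `Coordinator` class and its `_lock`, `_results`, and `_expected` fields are invented here for illustration and need not match the repo's code.

```python
import threading

class Coordinator:
    def __init__(self, expected_shards):
        self._lock = threading.Lock()
        self._results = []
        self._expected = expected_shards
        # Shared event: set() by every method that changes stage-relevant
        # state, so _wait_for_stage wakes immediately instead of polling.
        self._stage_event = threading.Event()

    def report_result(self, result):
        with self._lock:
            self._results.append(result)
        self._stage_event.set()  # wake the waiting coordinator right away

    def _wait_for_stage(self, backstop=1.0):
        # Before the fix, a *local* Event was created here and waited on;
        # nothing ever called set() on it, so wait() degenerated into a
        # sleep of up to `backstop` seconds per iteration.
        while True:
            with self._lock:
                if len(self._results) >= self._expected:
                    return
            # Timeout retained only as a backstop (liveness checks, logs).
            self._stage_event.wait(backstop)
            self._stage_event.clear()
```

Note the ordering: the condition is re-checked after every `clear()`, so a `set()` that races with the `clear()` is never lost; any state change made before the `clear()` is visible at the next check, and one made after it leaves the event set.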

…g with event-driven wakeup

_wait_for_stage created a local threading.Event that was never signaled,
making its wait() call a pure sleep of up to 1 second per iteration.
Each stage transition (scatter→reduce→fold) paid this latency needlessly.

Replace it with self._stage_event, signaled by every coordinator method
that changes stage-relevant state: report_result, report_error, abort,
and register_worker. _start_stage clears the event so signals from the
previous stage don't bleed over.

The backoff timeout is retained as a backstop for the alive-worker check
and periodic log lines. In the normal (no-failure) path, stage transitions
now complete within microseconds of the last shard result arriving.

Benchmark (8 shards, 3-stage group_by pipeline, 70MB synthetic data):
  Before: 14.9s / 14.6s / 17.1s  (avg ~15.6s, high variance)
  After:  13.4s / 13.7s / 13.4s  (avg ~13.5s, low variance)
  ~13% faster; ~1.5s saved from eliminating poll-interval latency at
  each of the 3 stage boundaries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
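An illustrative sketch (not the repo's code) of the clear/wait discipline the commit message describes: clearing the event at stage start so a `set()` from the previous stage cannot wake the new stage spuriously, while keeping a timeout as a backstop for alive-worker checks and periodic logging. The names `start_stage` and `wait_until` are hypothetical.

```python
import threading

stage_event = threading.Event()

def start_stage():
    # Discard any signal left over from the previous stage so it
    # cannot bleed into this stage's wait.
    stage_event.clear()

def wait_until(done, backstop=1.0):
    """Block until done() is true, waking on stage_event or the backstop."""
    while not done():
        signaled = stage_event.wait(backstop)
        stage_event.clear()
        if not signaled:
            # Backstop fired without a signal: a real coordinator would
            # run its alive-worker check and periodic logging here.
            pass
```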
@hsuhanooi force-pushed the zephyr-stage-event-fix branch from 5341e44 to 365f9b8 on April 22, 2026 at 17:01
@hsuhanooi changed the title from "[zephyr] Fix dead threading.Event in _wait_for_stage" to "[zephyr] Fix dead threading.Event and reduce coordinator/worker polling overhead" on Apr 22, 2026
@hsuhanooi force-pushed the zephyr-stage-event-fix branch from 365f9b8 to f98e020 on April 22, 2026 at 17:03
@hsuhanooi changed the title from "[zephyr] Fix dead threading.Event and reduce coordinator/worker polling overhead" back to "[zephyr] Fix dead threading.Event in _wait_for_stage" on Apr 22, 2026
@wmoss
Contributor

wmoss commented Apr 22, 2026

Good catch, definitely a bug. The new implementation looks right.

Controversial opinion 🙈: 1 second isn't that long to wait and we could just strip out all the signaling code (including the old broken code, of course) and just rely on the 1 second poll interval (which I guess is what happened before #3923)

@hsuhanooi
Contributor Author

Yeah, that’s reasonable too. Although maybe it’s useful to fix this here and then test removing the signaling in a different PR?
