zephyr: fix dead threading.Event in _wait_for_stage, replacing polling with event-driven wakeup
_wait_for_stage created a local threading.Event that was never signaled,
making its wait() call a pure sleep of up to 1 second per iteration.
Each stage transition (scatter→reduce→fold) paid this latency needlessly.
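The failure mode can be reproduced in isolation. This is a hypothetical reconstruction, not the actual zephyr code: `buggy_wait_for_stage` and `measure` are illustrative names, and the "stage done" condition is simulated with a timer. Because the Event is created inside the loop, no other thread can reach it to call `set()`, so every `wait(timeout)` runs to completion.

```python
import threading
import time


def buggy_wait_for_stage(stage_done, timeout):
    # Hypothetical reconstruction of the bug: the Event is local to the loop,
    # so nothing else holds a reference to it and it is never set().
    while not stage_done():
        wake = threading.Event()
        wake.wait(timeout)  # always times out: equivalent to time.sleep(timeout)


def measure(finish_after, timeout):
    """Return how long the buggy waiter takes when the stage actually
    finishes `finish_after` seconds in."""
    done = threading.Event()  # stands in for "all shard results arrived"
    threading.Timer(finish_after, done.set).start()
    start = time.monotonic()
    buggy_wait_for_stage(done.is_set, timeout)
    return time.monotonic() - start


if __name__ == "__main__":
    # The stage finishes after 0.05 s, but the waiter only notices it
    # after a full poll interval has elapsed.
    print(f"{measure(0.05, 0.3):.2f}s")
```

The waiter's latency is set by the poll interval, not by when the work finishes.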
Replace it with self._stage_event, signaled by every coordinator method
that changes stage-relevant state: report_result, report_error, abort,
and register_worker. _start_stage clears the event so signals from the
previous stage don't bleed over.
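A minimal sketch of the event-driven pattern, assuming a lock-protected coordinator that tracks a set of pending shard ids. The names `_stage_event`, `_start_stage`, `_wait_for_stage`, `report_result`, `report_error`, `abort`, and `register_worker` come from the message above; their signatures, the `_pending`/`_results`/`_errors` fields, and the clear-after-wakeup inside the wait loop are assumptions (the commit only states that `_start_stage` clears the event).

```python
import threading
import time


class Coordinator:
    """Hypothetical sketch; only the method/attribute names listed in the
    lead-in come from the commit, everything else is illustrative."""

    def __init__(self):
        self._lock = threading.Lock()
        self._stage_event = threading.Event()
        self._pending = set()  # shard ids still outstanding (hypothetical)
        self._results = {}
        self._errors = {}
        self._aborted = False
        self._workers = set()

    def _start_stage(self, shard_ids):
        with self._lock:
            self._pending = set(shard_ids)
            self._results.clear()
            self._errors.clear()
            # Drop any set() left over from the previous stage.
            self._stage_event.clear()

    def report_result(self, shard_id, value):
        with self._lock:
            self._results[shard_id] = value
            self._pending.discard(shard_id)
        self._stage_event.set()  # state changed: wake the waiter

    def report_error(self, shard_id, exc):
        with self._lock:
            self._errors[shard_id] = exc
            self._pending.discard(shard_id)
        self._stage_event.set()

    def abort(self):
        with self._lock:
            self._aborted = True
        self._stage_event.set()

    def register_worker(self, worker_id):
        with self._lock:
            self._workers.add(worker_id)
        self._stage_event.set()

    def _wait_for_stage(self, timeout=1.0):
        while True:
            with self._lock:
                if self._aborted:
                    raise RuntimeError("aborted")
                if self._errors:
                    raise RuntimeError(f"shard errors: {sorted(self._errors)}")
                if not self._pending:
                    return dict(self._results)
            # Returns as soon as any method above calls set(); the timeout
            # is only a backstop.
            self._stage_event.wait(timeout)
            # Assumption: clear after each wakeup so the loop does not spin.
            self._stage_event.clear()


if __name__ == "__main__":
    coord = Coordinator()
    coord._start_stage(range(4))
    for i in range(4):
        threading.Timer(0.01 * (i + 1), coord.report_result, args=(i, i * i)).start()
    t0 = time.monotonic()
    print(coord._wait_for_stage(timeout=1.0), f"{time.monotonic() - t0:.3f}s")
```

State is always mutated under the lock before `set()` is called, and the waiter re-checks state after every wakeup. A signal that lands between `wait()` returning and `clear()` is therefore never lost: the next check observes the state change directly.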
The backoff timeout is retained as a backstop for the alive-worker check
and periodic log lines. In the normal (no-failure) path, stage transitions
now complete within microseconds of the last shard result arriving.
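The backstop role of the timeout can be sketched on its own. In this hypothetical helper, a signaled wakeup re-checks state immediately, while a timeout (no signal arrived) runs the backstop work, an alive-worker check and a progress log line, then doubles the interval up to a cap. The function name, its parameters, the logger name, and the doubling schedule are all assumptions; the commit does not describe the actual backoff schedule.

```python
import logging
import threading
import time

log = logging.getLogger("zephyr.sketch")  # hypothetical logger name


def wait_with_backstop(event, is_done, check_alive, initial=0.05, maximum=1.0):
    """Wait until is_done() is true, waking immediately on event.set().

    Returns the number of backstop (timeout) rounds that ran."""
    timeout = initial
    backstop_rounds = 0
    while not is_done():
        signaled = event.wait(timeout)
        event.clear()
        if signaled:
            continue  # state changed: re-check now, keep the current timeout
        # Timed out with no signal: do the periodic backstop work.
        backstop_rounds += 1
        if not check_alive():
            raise RuntimeError("no alive workers")
        timeout = min(timeout * 2, maximum)
        log.info("still waiting; next backstop check in %.2fs", timeout)
    return backstop_rounds
```

In the no-failure path every wakeup is a signal, so the backstop branch never runs and the return value is 0.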
Benchmark (8 shards, 3-stage group_by pipeline, 70MB synthetic data):
Before: 14.9s / 14.6s / 17.1s (avg ~15.5s, high variance)
After: 13.4s / 13.7s / 13.4s (avg ~13.5s, low variance)
~13% faster (~2s saved on average); roughly 1.5s of that in total, about
0.5s per boundary, comes from eliminating poll-interval latency at each
of the 3 stage boundaries.
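The benchmark arithmetic, worked through. The 0.5s per-boundary figure rests on a modelling assumption, not on a measurement from the commit: the stage is taken to finish at a uniformly random point inside a full 1s poll sleep, so the expected wasted wait is half the interval.

```python
from statistics import mean

before = [14.9, 14.6, 17.1]  # seconds, from the benchmark above
after = [13.4, 13.7, 13.4]

avg_before = mean(before)  # about 15.53
avg_after = mean(after)    # 13.5
saved = avg_before - avg_after
speedup_pct = 100 * saved / avg_before

# Assumption: completion time is uniform within a 1 s poll sleep, so the
# expected latency per stage boundary is poll_interval / 2.
poll_interval = 1.0
boundaries = 3
expected_poll_cost = boundaries * poll_interval / 2

print(f"avg before {avg_before:.2f}s, after {avg_after:.2f}s, "
      f"saved {saved:.2f}s ({speedup_pct:.1f}%), "
      f"modelled poll cost {expected_poll_cost:.1f}s")
```

The model's 1.5s accounts for most, though not all, of the ~2.0s measured difference.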
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>