[zephyr] Raise worker idle poll backoff cap from 1.0s to 5.0s (#5051)

hsuhanooi · web-flow · commit 0c4926637e6d · 2026-04-22T11:33:36.000-07:00
`_poll_loop` backed off up to 1.0s between `pull_task` calls when no task was available. Each task now runs in a fresh subprocess taking roughly 1s, so re-polling every second caused busy-waiting between subprocess launches. The cap was set before subprocess-per-shard isolation landed in #4522 and was never revisited; 5.0s matches the typical subprocess task duration.

Each `pull_task` RPC that returns None still has to go through the full coordinator path: RPC deserialization, lock acquisition, dict lookups, lock release, serialization. With 64 idle workers polling every 1.0s, that is 64 wasted RPCs/second. At a 5.0s cap it drops to ~13/second. The coordinator also receives ~13 heartbeat RPCs/second from those same 64 workers (one per worker per 5s heartbeat interval), so idle polling at a 1.0s cap generated roughly five times as much traffic as the heartbeats themselves. Raising the cap brings the two to about the same rate.

Whether this is perceptible depends on worker count. With 16 workers it is noise either way. With 128+ idle workers in a straggler tail it could show up as a few percent of coordinator CPU. The coordinator is provisioned small (2g RAM, 1 CPU by default from 6c0b22c), so any reduction in unnecessary RPC handling there is useful.
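
For reference, the RPC-rate arithmetic above can be checked with a short sketch. The helper names `idle_poll_rate` and `heartbeat_rate` are hypothetical and not part of zephyr; the sketch assumes every idle worker has already backed off to the cap and ignores RPC latency and the ramp-up from the 0.1s initial delay.

```python
def idle_poll_rate(idle_workers: int, backoff_cap_s: float) -> float:
    """Steady-state pull_task RPCs/second once every idle worker sits at the backoff cap.

    Hypothetical helper for illustration; assumes one poll per worker per cap interval.
    """
    return idle_workers / backoff_cap_s


def heartbeat_rate(workers: int, heartbeat_interval_s: float = 5.0) -> float:
    """Heartbeat RPCs/second, assuming one heartbeat per worker per interval (5s per the commit message)."""
    return workers / heartbeat_interval_s


if __name__ == "__main__":
    for cap in (1.0, 5.0):
        print(f"cap={cap}s: {idle_poll_rate(64, cap):.1f} idle polls/s, "
              f"{heartbeat_rate(64):.1f} heartbeats/s")
```

With 64 workers this gives 64 idle polls/second at the old cap and 12.8 at the new one, against 12.8 heartbeats/second in both cases.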
diff --git a/lib/zephyr/src/zephyr/execution.py b/lib/zephyr/src/zephyr/execution.py
@@ -1122,7 +1122,7 @@ def _heartbeat_loop(
     def _poll_loop(self, coordinator: ActorHandle) -> None:
         """Pure polling loop. Exits on SHUTDOWN signal, coordinator death, or shutdown event."""
         task_count = 0
-        backoff = ExponentialBackoff(initial=0.1, maximum=1.0)
+        backoff = ExponentialBackoff(initial=0.1, maximum=5.0)
 
         future: ActorFuture | None = None
         future_start = 0.0