Commit 0c49266
[zephyr] Raise worker idle poll backoff cap from 1.0s to 5.0s (#5051)
`_poll_loop` backed off up to 1.0s between pull_task calls when no task
was available.
Each task now runs in a fresh subprocess taking roughly 1s, so
re-polling every second caused busy-waiting between subprocess launches.
The cap was set before subprocess-per-shard isolation landed in #4522
and was never revisited; 5.0s matches the typical subprocess task
duration.
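
A minimal sketch of the kind of capped idle backoff described above. This is not the actual `_poll_loop` code: the initial delay, the doubling factor, the reset-on-task behaviour, and every name other than `pull_task` are assumptions made for illustration. Only the cap change (1.0s to 5.0s) comes from this commit.

```python
import time
from typing import Any, Callable, Optional


def next_backoff(current: float, cap: float, factor: float = 2.0) -> float:
    """Grow the idle delay geometrically, never exceeding `cap`."""
    return min(current * factor, cap)


def poll_loop(
    pull_task: Callable[[], Optional[Any]],
    run_task: Callable[[Any], None],
    should_stop: Callable[[], bool],
    *,
    initial: float = 0.1,  # hypothetical starting delay
    cap: float = 5.0,      # the value this commit raises from 1.0
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll for work, backing off while idle and resetting once a task arrives."""
    delay = initial
    while not should_stop():
        task = pull_task()
        if task is None:
            # Nothing to do: wait, then lengthen the next wait up to the cap.
            sleep(delay)
            delay = next_backoff(delay, cap)
        else:
            run_task(task)
            delay = initial  # assumed: poll promptly again after real work
```

Under these assumptions an idle worker settles at one `pull_task` call per `cap` seconds, which is the steady-state rate the rest of this message reasons about.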
Each pull_task RPC that returns None still has to go through the full
coordinator path: RPC deserialization, lock acquisition, dict lookups,
lock release, serialization. With 64 idle workers each polling every 1.0s
the coordinator handles 64 wasted RPCs/second. At a 5.0s cap that drops
to ~13/second (64 / 5 = 12.8).
The coordinator is also getting ~13 heartbeat RPCs/second from those
same 64 workers (one per worker per 5s heartbeat interval), so the idle
polling at 1.0s was actually more traffic than the heartbeats
themselves. Raising the cap brings the two closer to the same rate.
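
The arithmetic behind those figures can be checked with a short sketch. It assumes every idle worker has already backed off to the cap, so it polls exactly once per cap interval; the function name is hypothetical.

```python
def steady_state_rpc_rate(workers: int, interval_s: float) -> float:
    """RPCs per second when each worker sends one RPC per `interval_s` seconds."""
    return workers / interval_s


workers = 64
old_poll_rate = steady_state_rpc_rate(workers, 1.0)   # idle polling at the old 1.0s cap
new_poll_rate = steady_state_rpc_rate(workers, 5.0)   # idle polling at the new 5.0s cap
heartbeat_rate = steady_state_rpc_rate(workers, 5.0)  # one heartbeat per worker per 5s

print(f"old idle polling: {old_poll_rate:.1f} RPC/s")
print(f"new idle polling: {new_poll_rate:.1f} RPC/s")
print(f"heartbeats:       {heartbeat_rate:.1f} RPC/s")
```

At the old cap, idle polling was roughly five times the heartbeat traffic. At the new cap the two are equal once all workers have reached the cap.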
Whether this is perceptible depends on worker count. With 16 workers
it's noise either way. With 128+ idle workers in a straggler tail it
could show up as a few percent of coordinator CPU. The coordinator is
provisioned small (2g RAM, 1 CPU by default from 6c0b22c) so any
reduction in unnecessary RPC handling there is genuinely useful.
1 parent ff9b2a9 commit 0c49266
1 file changed
Lines changed: 1 addition & 1 deletion
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 1122 | 1122 | |
| 1123 | 1123 | |
| 1124 | 1124 | |
| 1125 | | - |
| | 1125 | + |
| 1126 | 1126 | |
| 1127 | 1127 | |
| 1128 | 1128 | |