Increase default work-stealing interval by 10x#8997
Conversation
e5c2909 to
241ab33
Compare
jacobtomlinson
left a comment
There was a problem hiding this comment.
This seems reasonable. I wonder what impact this might have on other deployments like HPC, are things still the same? Or is communication faster there so this is less noticible?
cc @guillaumeeb
|
I think this change should be generally beneficial. IIRC, our current advice is that tasks should take at least 100 ms to avoid overhead from becoming too large. With the current default that would mean that we balance tasks after every iteration. This seems like overkill also given that balancing isn't cheap. |
|
FWIW, the impact is somewhat hard to establish because of a bug in the balancing logic that I will address in another PR. |
|
To expand on this a bit, I think an appropriate lower bound for the stealing interval is something like this
Especially when very busy, the server latency can be quite high (hundreds ms? scheduler and workers may be busy for different reasons). 1s sounds like a reasonable default. |
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 27 files ±0 27 suites ±0 11h 39m 22s ⏱️ + 7m 46s For more details on these failures and errors, see this check. Results for commit 7133528. ± Comparison against base commit fd3722d. This pull request removes 1 and adds 1 tests. Note that renamed tests count towards both.♻️ This comment has been updated with latest results. |
jacobtomlinson
left a comment
There was a problem hiding this comment.
Thanks for diving deeper here, sounds good to me!
Just answering as I've been tagged: even on HPC, Dask is generally using TCP over IB, which means high bandwith, but not the latency you could have with real IB protocol. Nevertheless, I thing that 1s between work stealing is really short enough for all workflows intended for Dask! |
On a normal cloud setup, the staling interval is barely large enough to accomodate the roundtrips required for moving the tasks, not to mention fetching dependencies or performing actual work. I'm increasing the interval to one second (10x) which gives a little more time for actual progress to be made.
pre-commit run --all-files