Description
What happened:
I have an embarrassingly parallel workload of 30 tasks (no dependencies), each taking ~20 minutes. My cluster autoscales between 10 and 50 workers. When I submit the task graph, the 30 tasks land three apiece on the 10 initial workers. Eventually the cluster scales up to 30 workers, but sometimes the tasks are never redistributed.
Even after some tasks finish and their runtimes are known to the scheduler, I have seen the 7 remaining tasks stay distributed 2-2-2-1 across 4 workers while plenty of workers sit idle.
What you expected to happen:
As new workers come online, I expect them to steal queued tasks from the existing, busy workers.
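For reference, I have not disabled work stealing. A quick way to confirm the setting (the config key below is what I believe controls stealing; treat it as an assumption):

```python
# Hedged check: work stealing should be on by default. The key
# "distributed.scheduler.work-stealing" is the setting I believe governs it.
import dask

print(dask.config.get("distributed.scheduler.work-stealing"))
```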
Minimal Complete Verifiable Example:
I tried to reproduce this on a local cluster but could not; there, things work as expected and tasks are immediately redistributed to new workers. A sketch of what I ran is below.
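This is roughly the local reproduction attempt (a minimal sketch; the function name, sleep duration, and worker counts are placeholders for the real ~20-minute jobs and the autoscaling setup):

```python
import time

from dask.distributed import Client, LocalCluster


def slow_task(i):
    # Stand-in for the real ~20 minute job; shortened for a local test.
    time.sleep(60)
    return i


if __name__ == "__main__":
    cluster = LocalCluster(n_workers=10, threads_per_worker=1)
    client = Client(cluster)

    # Submit 30 independent tasks; they land three apiece on the 10 initial workers.
    futures = client.map(slow_task, range(30))

    # Scale up while the tasks are queued, mimicking the autoscaler.
    cluster.scale(30)

    # On a local cluster the queued tasks are stolen onto the new workers
    # almost immediately; on the real deployment they sometimes are not.
    print(client.gather(futures))
```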
Anything else we need to know?:
I'm attaching debug scheduler/worker logs per @fjetter's script, captured at the point where 30 workers are available and 10 workers each hold 3 tasks.
Environment:
- Dask version: 2021.12.0
- Python version: 3.8.12
- Operating System: Ubuntu 20
- Install method (conda, pip, source): conda