workload not balancing during scale up on dask-gateway #5599

Open
@chrisroat

What happened:

I have a fully parallel workload of 30 tasks (no dependencies), each of which takes ~20 minutes. My cluster autoscales between 10 and 50 workers. When I start the task graph, the 30 tasks are distributed 3 apiece across the 10 initial workers. Eventually the cluster scales to 30 workers, but sometimes the tasks are not redistributed.

Even after some tasks had finished, so that task durations should be known, I have seen 7 remaining tasks distributed 2-2-2-1 across 4 workers while plenty of other workers sat empty.
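To put rough numbers on the cost of the imbalance, here is a small back-of-the-envelope sketch in plain Python. It is not Dask code. It assumes each worker runs one task at a time, every task takes exactly 20 minutes, and scale-up is instantaneous; none of these are stated in the report, and the helper names are made up for illustration.

```python
def balanced(n_tasks, n_workers):
    """Spread n_tasks as evenly as possible over n_workers (hypothetical helper)."""
    base, extra = divmod(n_tasks, n_workers)
    return [base + 1 if i < extra else base for i in range(n_workers)]


def makespan_minutes(tasks_per_worker, task_minutes=20):
    """Wall-clock time if each worker runs its tasks one after another (assumption)."""
    return max(tasks_per_worker) * task_minutes


# Initial placement from the report: 30 tasks on 10 workers, 3 apiece.
initial = balanced(30, 10)

# Scale-up to 30 workers without redistribution: 20 workers stay empty.
stuck = initial + [0] * 20

# Scale-up to 30 workers with ideal redistribution: one task per worker.
ideal = balanced(30, 30)

# The later observation: 7 tasks held 2-2-2-1 by 4 workers.
tail = [2, 2, 2, 1]

print(makespan_minutes(stuck))  # 60
print(makespan_minutes(ideal))  # 20
print(makespan_minutes(tail))   # 40
```

Under these assumptions the stuck placement takes three times as long as the balanced one, and the 2-2-2-1 tail takes twice as long as it would with one task per worker.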

What you expected to happen:

As new workers come online, they steal tasks from existing workers.

Minimal Complete Verifiable Example:

I tried to replicate on a local cluster, but could not. Things seem to work as expected, and tasks are immediately redistributed.

Anything else we need to know?:

I'm attaching scheduler and worker debug logs collected with @fjetter's script, taken at the point where 30 workers are available and 10 workers each hold 3 tasks.

Environment:

  • Dask version: 2021.12.0
  • Python version: 3.8.12
  • Operating System: Ubuntu 20
  • Install method (conda, pip, source): conda

scheduler_20211214.pkl.gz

worker_20211214.pkl.gz

Metadata

Labels

  • adaptive: All things relating to adaptive scaling
  • needs info: Needs further information from the user
  • stealing
