Use cases for work stealing

Work stealing is a fairly complex mechanism intended to redistribute tasks on a cluster to achieve homogeneous occupancy, i.e. all workers are busy for approximately the same amount of time.

I'm currently aware of two use cases for it:

A) Adaptive scaling, or upscaling scenarios in general: adding more workers to a cluster requires some kind of load balancing. Without work stealing, a newly added worker would sit idle until the scheduler _might_ assign it a task, which is not even guaranteed to happen or might work very poorly (e.g. https://github.com/dask/distributed/issues/4471).
With work stealing enabled, we automatically ensure that any newly added worker is able to start working, since it gets tasks assigned via the stealing mechanism. However, this is known to not work well (list not exhaustive):

- https://github.com/dask/distributed/issues/4471
- https://github.com/dask/distributed/issues/5599

B) Another application would be a workload with vastly different runtimes within a TaskGroup. This is particularly concerning if there are few tasks in the task group, or if the runtime distribution is asymmetrical such that, even after running many tasks, the runtime differences do not cancel out. We would then end up with a few workers holding very large queues, effectively extending the overall runtime with a long tail in the computation.
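
To make B a bit more concrete, here is a toy simulation in plain Python. It is only a sketch under strong assumptions and does not model the actual scheduler or stealing algorithm: `makespan_static` and `makespan_rebalanced` are hypothetical helper names, tasks are assigned round-robin in the static case, and "rebalancing" is idealized as the first free worker always taking the next task, with no transfer or handshake cost.

```python
import heapq
import random


def makespan_static(runtimes, n_workers):
    """Assign tasks round-robin up front and never rebalance afterwards."""
    queues = [0.0] * n_workers
    for i, runtime in enumerate(runtimes):
        queues[i % n_workers] += runtime
    # The computation finishes when the most loaded worker finishes.
    return max(queues)


def makespan_rebalanced(runtimes, n_workers):
    """Idealized rebalancing: the worker that frees up first takes the next task."""
    finish_times = [0.0] * n_workers
    heapq.heapify(finish_times)
    for runtime in runtimes:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + runtime)
    return max(finish_times)


rng = random.Random(0)
# Asymmetric runtimes: mostly short tasks, occasionally a very long one.
runtimes = [rng.paretovariate(1.5) for _ in range(200)]
n_workers = 8

static = makespan_static(runtimes, n_workers)
rebalanced = makespan_rebalanced(runtimes, n_workers)
# No schedule can beat the perfectly even split or the single longest task.
lower_bound = max(sum(runtimes) / n_workers, max(runtimes))
print(f"static: {static:.1f}  rebalanced: {rebalanced:.1f}  lower bound: {lower_bound:.1f}")
```

The gap between the static number and the lower bound is the "tail" described above. How much of that gap a real stealing implementation can recover depends on how accurate the runtime estimates are and on the cost of moving tasks, neither of which this sketch accounts for.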


**I am not entirely sure whether this use case is actually very relevant and would appreciate some additional information about it**. If it is indeed relevant, we may benefit from improved runtime tracking, e.g. with error measurement (e.g. https://github.com/dask/distributed/pull/4028), in combination with a simpler, more selective algorithm.


The current work stealing algorithm has a couple of issues. Currently open issues can be filtered by the label [stealing](https://github.com/dask/distributed/issues?q=is%3Aopen+is%3Aissue+label%3Astealing)

Stealing is also known to be a trigger for deadlocks (at least four have been reported and fixed by now), since it requires a handshake that can cause timing issues (see e.g. https://github.com/dask/distributed/pulls?q=is%3Apr+is%3Aclosed+stealing+label%3Astealing+label%3Adeadlock).


There are even cases where work stealing is known to cause harm by reverting smart scheduler decisions, e.g. https://github.com/dask/distributed/issues/6573

I'm currently trying to assess whether we should keep pursuing work stealing and try to make it robust, or abandon this extension in favor of a less general but more robust solution for A and possibly B.

Thoughts?

cc @mrocklin @crusaderky @gjoseph92

Use cases for work stealing #6600

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Use cases for work stealing #6600

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions