Use cases for work stealing #6600

Open
@fjetter

Work stealing is a fairly complex mechanism intended to redistribute tasks across a cluster to achieve homogeneous occupancy, i.e. all workers should be busy for approximately the same amount of time.

I'm currently aware of two use cases for it:

A) Adaptive scaling, or upscaling scenarios in general: adding more workers to a cluster requires some kind of load balancing. Without work stealing, a newly added worker would sit idle until the scheduler assigns it a task, which is not guaranteed to happen and may work very poorly when it does (e.g. #4471).
With work stealing enabled, we automatically ensure that any newly added worker is able to start working, since it gets tasks assigned via the stealing mechanism. However, this is known to not work well (list not exhaustive)
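
To make use case A concrete, here is a toy sketch of what "balancing occupancy after a scale-up" means. It is not the algorithm used in distributed: occupancy is modeled simply as the sum of estimated runtimes queued on a worker, and the `occupancy` and `rebalance` helpers are hypothetical names invented for this illustration. Transfer costs and in-flight tasks are ignored.

```python
def occupancy(queue):
    """Estimated seconds of work queued on one worker."""
    return sum(queue)


def rebalance(queues):
    """Greedily move tasks from the busiest worker to the least busy one.

    `queues` maps a worker name to a list of estimated task runtimes
    (seconds, assumed to be strictly positive). Toy model only.
    """
    queues = {w: list(q) for w, q in queues.items()}  # do not mutate the input
    while True:
        busiest = max(queues, key=lambda w: occupancy(queues[w]))
        idlest = min(queues, key=lambda w: occupancy(queues[w]))
        if not queues[busiest]:
            return queues
        gap = occupancy(queues[busiest]) - occupancy(queues[idlest])
        task = min(queues[busiest])  # cheapest task to move
        if task >= gap:
            # Moving the task would not narrow the gap, so stop here.
            return queues
        queues[busiest].remove(task)
        queues[idlest].append(task)


# Two loaded workers, then a third one joins with an empty queue.
before = {"a": [1.0] * 6, "b": [1.0] * 6, "new": []}
after = rebalance(before)
print({w: occupancy(q) for w, q in after.items()})
# {'a': 4.0, 'b': 4.0, 'new': 4.0}
```

Without the rebalancing step the new worker contributes nothing and the cluster still needs 6 seconds; with it, every worker is busy for 4 seconds.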

B) Another application is a workload with vastly different runtimes within a TaskGroup. This is particularly concerning if the task group has few tasks, or if the runtime distribution is so asymmetric that the runtime differences do not cancel out even after running many tasks. We would then end up with a few workers holding very large queues, which extends overall runtime through a long tail in the computation.
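
A small, deterministic sketch of the tail effect in use case B. It compares a fixed up-front round-robin assignment with an idealized schedule in which the next task always goes to whichever worker frees up first. The latter is not work stealing itself, only the bound that perfect rebalancing would approach. The function names and the runtimes are made up for illustration.

```python
import heapq


def static_makespan(runtimes, n_workers):
    """Assign tasks round-robin up front and never move them."""
    finish = [0.0] * n_workers
    for i, runtime in enumerate(runtimes):
        finish[i % n_workers] += runtime
    return max(finish)


def dynamic_makespan(runtimes, n_workers):
    """Idealized balancing: each task goes to the worker that is free first."""
    free_at = [0.0] * n_workers
    heapq.heapify(free_at)
    for runtime in runtimes:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + runtime)
    return max(free_at)


# One slow task among seven fast ones in the same group.
runtimes = [10.0] + [1.0] * 7
print(static_makespan(runtimes, 2))   # 13.0
print(dynamic_makespan(runtimes, 2))  # 10.0
```

With the static assignment, the worker that received the slow task also keeps three fast ones queued behind it, while the other worker finishes after 4 seconds and idles.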

I am not entirely sure whether this use case is actually relevant and would appreciate additional information about it. If it is indeed relevant, we may benefit from improved runtime tracking, e.g. with error measurement (e.g. #4028), in combination with a simpler, more selective algorithm.
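
One possible shape for "runtime tracking with error measurement", sketched under my own assumptions and not taken from #4028: keep a running mean and standard error per task group using Welford's online method, and only act on estimates whose relative error is small. The `RuntimeStats` class, the `is_reliable` rule and its `rel_tol` threshold are all hypothetical.

```python
import math


class RuntimeStats:
    """Running mean and standard error of observed task runtimes (Welford)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, runtime):
        """Fold one observed runtime (seconds) into the statistics."""
        self.n += 1
        delta = runtime - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (runtime - self.mean)

    @property
    def stderr(self):
        """Standard error of the mean; infinite until two samples are seen."""
        if self.n < 2:
            return math.inf
        variance = self._m2 / (self.n - 1)
        return math.sqrt(variance / self.n)

    def is_reliable(self, rel_tol=0.25):
        """Hypothetical rule: trust the mean only if its relative error is small."""
        return self.mean > 0 and self.stderr <= rel_tol * self.mean


steady, skewed = RuntimeStats(), RuntimeStats()
for r in [1.0, 1.0, 1.0, 1.0]:
    steady.update(r)
for r in [1.0, 100.0]:
    skewed.update(r)

print(steady.is_reliable())  # True
print(skewed.is_reliable())  # False
```

A more selective algorithm could then skip task groups whose estimate is not yet reliable instead of moving tasks based on a noisy average.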

The current work stealing algorithm has a couple of issues. Currently open issues can be filtered by the label "stealing".

Stealing is also known to trigger deadlocks (at least four have been reported and fixed by now), since it requires a handshake that can cause timing issues (see e.g. https://github.com/dask/distributed/pulls?q=is%3Apr+is%3Aclosed+stealing+label%3Astealing+label%3Adeadlock).
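
To illustrate why a handshake is needed at all, here is a deliberately simplified toy model. It does not reproduce distributed's actual messages or state names: `ToyWorker`, `try_steal` and the state strings are hypothetical. The point is only that the scheduler's view of a task can be stale, so it has to ask the victim before reassigning.

```python
class ToyWorker:
    """Hypothetical worker that tracks one state string per task key."""

    def __init__(self):
        self.state = {}

    def handle_steal_request(self, key):
        """Release the task only if it has not started; report the state seen."""
        seen = self.state.get(key, "forgotten")
        if seen == "ready":
            del self.state[key]
        return seen


def try_steal(key, victim, thief):
    """Scheduler side: reassign only after the victim confirms the release."""
    seen = victim.handle_steal_request(key)
    if seen != "ready":
        # The victim already started (or finished) the task; moving it now
        # would run it twice or lose track of it.
        return False
    thief.state[key] = "ready"
    return True


victim, thief = ToyWorker(), ToyWorker()
victim.state.update({"x": "ready", "y": "executing"})

print(try_steal("x", victim, thief))  # True
print(try_steal("y", victim, thief))  # False
```

In a real cluster the request and the reply are asynchronous messages, so other events can arrive between them. That window is where the timing issues come from.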

There are even cases where work stealing is known to cause harm by reverting smart scheduler decisions, e.g. #6573

I'm currently trying to estimate whether we should pursue work stealing and try to make it robust, or abandon this extension in favor of a less general but more robust solution for A and possibly B.

Thoughts?

cc @mrocklin @crusaderky @gjoseph92

Labels: adaptive, discussion, performance, scheduling, stability, stealing
