A failsafe for hung adaptive clusters #6825

https://github.com/dask/distributed/issues/5999#issuecomment-1177457188 describes a scenario in which a cluster stays somewhat active yet makes no progress whatsoever.

Such a scenario is extremely expensive for adaptive clusters; e.g. someone might start a 2-hour run on Friday night, go home for the weekend, and find on Monday morning that the whole cluster stayed up the entire time, costing $$$.

# Proposed design
Implement a new, fairly long timeout (e.g. 1h by default) in the scheduler, which:
- starts when *any* task becomes pending or executing 
- stops when *no* tasks are pending or executing
- is reset when *any* task completes

When that timeout expires, *all* pending or executing tasks are marked as failed. This in turn must release any in-memory dependent tasks and let the cluster shrink down.
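
The rules above can be modelled as a small state machine. The sketch below is only an illustration of the proposal, not scheduler code: the class `NoProgressTimer`, its method names, and the explicit `now` argument are all hypothetical and do not exist in `distributed`. It also assumes that "starts" means the timer begins only when the set of pending or executing tasks goes from empty to non-empty.

```python
from __future__ import annotations


class NoProgressTimer:
    """Hypothetical model of the proposed timeout; not part of distributed."""

    def __init__(self, timeout: float = 3600.0) -> None:
        self.timeout = timeout                # seconds; 1h default as proposed
        self.active: set[str] = set()         # keys of pending or executing tasks
        self.failed: set[str] = set()         # keys failed by an expired timeout
        self.started_at: float | None = None  # None means the timer is stopped

    def task_submitted(self, key: str, now: float) -> None:
        # Start: the first task becomes pending or executing.
        if not self.active:
            self.started_at = now
        self.active.add(key)

    def task_completed(self, key: str, now: float) -> None:
        self.active.discard(key)
        # Reset on any completion; stop when nothing is pending or executing.
        self.started_at = now if self.active else None

    def check(self, now: float) -> set[str]:
        """Return the keys failed by this check (empty if not expired)."""
        if self.started_at is None or now - self.started_at < self.timeout:
            return set()
        # Expired: fail *all* pending or executing tasks and stop the timer.
        expired = set(self.active)
        self.failed |= expired
        self.active.clear()
        self.started_at = None
        return expired


# Usage: two tasks, one finishes at t=9 (resetting the timer), one hangs.
timer = NoProgressTimer(timeout=10)
timer.task_submitted("a", now=0)
timer.task_submitted("b", now=1)
timer.task_completed("a", now=9)
hung = timer.check(now=19)  # 10s since the last completion, so "b" is failed
```

Passing `now` explicitly keeps the sketch deterministic and easy to test; a real implementation would presumably read the scheduler's own clock and run the check periodically. Releasing in-memory tasks and shrinking the cluster are outside the scope of this model.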

Note that this design will also kill off runs that are blocked because no worker offers the specific resources they require.
