A failsafe for hung adaptive clusters #6825

Open
@crusaderky

#5999 (comment) identified a use case where a cluster stays somewhat active yet makes no progress whatsoever.

Such a use case is extremely expensive for adaptive clusters; e.g. someone might start a 2-hour run on Friday night, go home for the weekend, and find on Monday morning that the whole cluster stayed up the entire time, costing $$$.

Proposed design

Implement a new, fairly long (e.g. 1h by default) timeout in the scheduler, which

  • starts when any task becomes pending or executing
  • stops when no tasks are pending or executing
  • is reset when any task completes

When that timeout expires, all pending or executing tasks are marked as failed. This in turn must release any in-memory dependent tasks and let the cluster shrink down.
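To make the timer semantics concrete, here is a minimal sketch of the start / stop / reset rules above. It is not scheduler code: `NoProgressWatchdog`, its method names, and the injectable `clock` are all hypothetical, and a real implementation would hook into the scheduler's task transitions instead.

```python
import time


class NoProgressWatchdog:
    """Hypothetical illustration of the proposed timeout; not a distributed API."""

    def __init__(self, timeout=3600.0, clock=time.monotonic):
        self.timeout = timeout  # seconds; 1h mirrors the suggested default
        self.clock = clock      # injectable so the logic can be tested
        self.unfinished = set() # keys of tasks that are pending or executing
        self.deadline = None    # None means the timer is stopped

    def on_pending_or_executing(self, key):
        # The timer starts when the first task becomes pending or executing.
        self.unfinished.add(key)
        if self.deadline is None:
            self.deadline = self.clock() + self.timeout

    def on_completed(self, key):
        # Any completion counts as progress: reset the timer,
        # or stop it if nothing is pending or executing anymore.
        self.unfinished.discard(key)
        if self.unfinished:
            self.deadline = self.clock() + self.timeout
        else:
            self.deadline = None

    def expired_keys(self):
        # Return the keys that should be marked as failed, or an empty set
        # if the timer is stopped or has not expired yet.
        if self.deadline is None or self.clock() < self.deadline:
            return set()
        failed = set(self.unfinished)
        self.unfinished.clear()
        self.deadline = None
        return failed
```

Calling `expired_keys()` periodically would yield the tasks to fail; releasing their in-memory dependencies and letting the cluster scale down is left out of this sketch.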

Note that this design will also kill off runs that are blocked due to missing a worker with specific resources.

Labels: adaptive (All things relating to adaptive scaling)
