Description
#5999 (comment) identified a use case where a cluster, while somewhat active, makes no progress whatsoever.
Such a use case is extremely expensive for adaptive clusters; e.g. someone might start a 2-hour run on Friday night, go home for the weekend, and find on Monday morning that the whole cluster stayed up the entire time, costing $$$.
Proposed design
Implement a new, fairly long (e.g. 1h by default) timeout in the scheduler, which
- starts when any task becomes pending or executing
- stops when no tasks are pending or executing
- is reset when any task completes
When that timeout expires, all pending or executing tasks are marked as failed. This in turn must release any in-memory dependent tasks and let the cluster shrink down.
Note that this design will also kill off runs that are blocked because no worker provides the specific resources they require.
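To make the start/stop/reset rules above concrete, here is a minimal sketch of the timer as a standalone state machine. It is not scheduler code: `NoProgressTimer`, `task_started`, `task_completed`, `expired` and `fail_all` are hypothetical names invented for illustration, and the sketch only counts tasks rather than tracking real task states.

```python
import time
from typing import Callable, Optional


class NoProgressTimer:
    """Hypothetical model of the proposed timeout; not an existing API."""

    def __init__(
        self,
        timeout: float = 3600.0,  # 1h default, as proposed above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout = timeout
        self.clock = clock
        self.active = 0  # number of tasks currently pending or executing
        self.deadline: Optional[float] = None  # None means the timer is stopped

    def task_started(self) -> None:
        # Start the timer when the first task becomes pending or executing.
        self.active += 1
        if self.deadline is None:
            self.deadline = self.clock() + self.timeout

    def task_completed(self) -> None:
        # A completion is progress: reset the timer, or stop it if nothing is left.
        self.active -= 1
        self.deadline = self.clock() + self.timeout if self.active else None

    def expired(self) -> bool:
        return self.deadline is not None and self.clock() >= self.deadline

    def fail_all(self) -> int:
        # On expiry, every pending or executing task would be marked as failed.
        # Here we only return how many tasks that affects and stop the timer.
        failed, self.active, self.deadline = self.active, 0, None
        return failed
```

With an injected fake clock, a completion at t=3000s pushes the deadline from 3600s to 6600s, so `expired()` is still false at t=3700s and becomes true at t=6600s. A real implementation would presumably hook into the scheduler's task state transitions instead of explicit method calls.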