Description
There was some discussion on Gitter about adaptive clusters and whether the user should be able to request a target number of workers.
Context
The rationale for wanting this kind of behavior is that you could be in a scenario where you know you'll need a lot of workers at the beginning of a task, so you want to scale up manually, but then let workers be killed off as they finish their work.
From @d-v-b: This describes most reductions (e.g., min of a big array) and I found the adaptive cluster performs terribly for these workloads.
From @lesteve:
As I mentioned during my Dask-Jobqueue talk at the Dask workshop, my feeling is that a lot of use cases for Adaptive in an HPC context are embarrassingly parallel and have essentially two phases:
- scale up as fast as your HPC cluster lets you
- once the computation nears its end, Dask workers that have no tasks left should be killed quickly to free up resources.
I was planning to look at trying to do that manually at one point but never got around to it. If someone makes some progress in this direction, let me know!
Also, from user feedback (not first-hand experience), it feels like Adaptive has some problems, and @d-v-b confirmed my impression at the workshop.
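For reference, a minimal sketch of doing this manually today with existing APIs (assuming dask-jobqueue's `SLURMCluster`; the resource numbers, file paths, and `process` function are placeholders):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster  # any cluster class with scale()/adapt() would work

# Placeholder resources; tune for the actual HPC system.
cluster = SLURMCluster(cores=4, memory="16GB")
client = Client(cluster)

# Phase 1: scale up as fast as the HPC cluster allows and wait for the workers.
cluster.scale(100)
client.wait_for_workers(100)

def process(path):
    """Placeholder for the per-file work in an embarrassingly parallel job."""
    return path

futures = client.map(process, [f"data/{i}.nc" for i in range(10_000)])

# Phase 2: once tasks are queued, hand control to adaptive with minimum=0 so
# workers that run out of tasks near the end are retired quickly, freeing
# HPC resources instead of idling until the whole run finishes.
cluster.adapt(minimum=0, maximum=100)

results = client.gather(futures)
```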
Proposal
I was originally thinking about this from the scheduler perspective (interacted with via the client object). I was imagining that you could set `target` on the scheduler and that it would be used to convey the intention to the cluster (similar to the `adaptive_target` pattern). But @martindurant pointed out that you'd need some concept of how long workers should hang around before adaptive starts trying to tear them down. Perhaps the right approach would be to not start spinning up workers until some work is submitted (similar to Adaptive), but to wait to start the work until `target` workers (or more likely `min(maximum, target)`) are available.
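To make the proposal concrete, the user-facing flow might look roughly like the sketch below. The `target=` keyword is entirely hypothetical; only `Client.wait_for_workers` exists today, and the numbers are placeholders:

```python
from dask.distributed import Client

client = Client(cluster)  # reusing a cluster object like the one in the sketch above

maximum, target = 200, 100  # illustrative numbers only

# Hypothetical: tell the scheduler/cluster how many workers we would like,
# in the spirit of the adaptive_target pattern. The `target=` keyword does
# not exist today.
cluster.adapt(minimum=0, maximum=maximum, target=target)

# Submitting work is what triggers workers to start spinning up (as with Adaptive).
futures = client.map(lambda i: i ** 2, range(100_000))

# Hold the work back until min(maximum, target) workers are actually available.
# Client.wait_for_workers is real; gating the start of work on it is the proposal.
client.wait_for_workers(min(maximum, target))

results = client.gather(futures)
# As tasks run out, adaptive retires idle workers down to minimum=0.
```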
ref #3526