Open
Description
Active Memory Manager (AMM) data transfers run on hardcoded priority 1:
distributed/distributed/worker_state_machine.py
Lines 2822 to 2826 in 9148770
This means that if there is a network-heavy workload going on on the default priority 0, to the point that it saturates the network, then AMM will yield and slow down in order not to hamper the workload.
This is generally desirable for general purpose rebalancing and replication. There are two use cases however where it risks being a poor idea:
- Graceful worker retirement (AMM RetireWorker) happens chiefly in three cases:
- whenever a watchdog has intel that the worker is going to die soon. Namely, on AWS you get a 2 minutes warning when an instance is going to be forcefully shut down by Amazon. In this case, graceful retirement is time sensitive, and should be prioritised over computations.
- on an adaptive cluster, when the workload has dwindled to the point where it can't saturate the cluster anymore. In this case we should expect modest data transfers from the computation, so it shouldn't hurt to raise AMM priority anyway.
- Graceful worker retirement will try pushing all the unique data out of a worker at once and hang indefinitely as soon as there's no more capacity anywhere else on the cluster, e.g. the retirement causes all surviving workers to go beyond 80% memory and get paused. If there's a hard shutdown incoming after a certain time, this will mean losing any remaining data and having to recompute it. However, not all data on a worker is equal: task outputs can be recomputed somewhere else; scattered data can't and will cause all computations that rely on it to fall over.
Proposed design
- In the AMM framework, offer a hook to the policies to specify a priority for
replicate
suggestions and default to 1 if omitted - Add a
{key: priority}
attribute toAcquireReplicasEvent
- AMM RetireWorker should replicate with priority -2 for scattered data and -1 for all other data (both are higher than default compute() calls)