Description
Corollary to #6038. In that issue, I described a situation where workers thought a key (which most tasks depended on) had 82 replicas, but in reality it only had 1.
This issue is about the fact that `ReduceReplicas` maybe shouldn't try to delete copies of that critical key so aggressively.
```
*  *  *  *  *  *
 \  \  \ /  /  /
      x    y
```
In this case `x` and `y` are going to be reused by every task, so they will end up having replicas on most workers. Constantly deleting them is inefficient: as soon as you delete a replica, the next task that wants to run on that worker is going to have to transfer it back again.
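To make the graph shape concrete, here's a minimal sketch of how a graph like the one above might be built with `dask.delayed` (the `load`/`process` functions are illustrative, not from the issue):

```python
import dask

@dask.delayed
def load(name):
    return name  # stand-in for an expensive-to-transfer dependency

@dask.delayed
def process(a, b, i):
    return (a, b, i)  # each "*" task needs both x and y as inputs

x = load("x")
y = load("y")
stars = [process(x, y, i) for i in range(6)]
# Every worker that runs one of the "*" tasks must hold a replica of
# both x and y, so replicas naturally spread across the cluster.
dask.compute(*stars)
```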
(Of course, once most of the `*` tasks are done, then you should start reducing replicas. But while the cluster is fully saturated with `*` tasks, there's no benefit to doing this.)
I'm not sure what metric to use for this. Ideas explored in #4967, #5325, #5326 could be interesting here.
Really, this issue is just about how to calculate a smarter target for this `desired_replicas` count automatically, based on the task's `waiters`, the number of current workers, etc.
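As a rough sketch of what such a heuristic could look like (this is hypothetical, not the actual AMM code; the only pieces taken from the issue are `desired_replicas`, `waiters`, and the worker count):

```python
from types import SimpleNamespace

def desired_replicas(ts, n_workers: int) -> int:
    """Hypothetical heuristic: while tasks are still waiting on this key,
    tolerate up to one replica per worker that could run them; once
    nothing is waiting, fall back to a single replica.

    ``ts`` is assumed to expose a ``waiters`` set, as the scheduler-side
    TaskState does.
    """
    n_waiters = len(ts.waiters)
    if n_waiters == 0:
        return 1  # fan-out is done; safe to reduce aggressively
    # While many tasks still depend on the key, one replica per worker
    # that is likely to run a waiter avoids needless re-transfers.
    return min(n_waiters, n_workers)

# Toy usage: a key with 6 waiting "*" tasks on a 4-worker cluster
print(desired_replicas(SimpleNamespace(waiters={f"star-{i}" for i in range(6)}), 4))  # -> 4
print(desired_replicas(SimpleNamespace(waiters=set()), 4))  # -> 1
```

With something along these lines, `ReduceReplicas` would leave `x` and `y` alone while the `*` tasks are still running, then collapse them back to a single replica once their waiters drain.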