Making AMM ReduceReplicas less aggressive towards widely-shared dependencies #6056

Open
@gjoseph92

Corollary to #6038. In that issue, I described a situation where workers thought a key (which most tasks depended on) had 82 replicas, but in reality it only had 1.

This issue is about the fact that ReduceReplicas maybe shouldn't try to delete copies of that critical key so aggressively.

* * * * * *
\ \ \ / / /
    x y

In this case x and y are going to be reused by every task, so they will end up having replicas on most workers. Constantly deleting those replicas is inefficient: as soon as you delete one, the next task that wants to run on that worker has to transfer it back again.

(Of course, once most of the * tasks are done, then you should start reducing replicas. But while the cluster is fully saturated with * tasks, there's no benefit to doing this.)
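To put a rough number on that cost, here is a toy model (not how distributed actually schedules or transfers anything). It assumes one shared key `x` that starts on worker 0, one `*` task per worker per round, and a hypothetical `drop_between_rounds` flag standing in for an aggressive ReduceReplicas pass that cuts `x` back to a single replica after each round.

```python
def count_transfers(n_workers: int, n_rounds: int, drop_between_rounds: bool) -> int:
    """Count how often the shared key has to be fetched by a worker.

    Toy model only: every worker runs one dependent task per round and
    must hold a replica of the shared key before it can run it.
    """
    holders = {0}  # worker 0 computed the key, so it holds the only replica
    transfers = 0
    for _ in range(n_rounds):
        for worker in range(n_workers):
            if worker not in holders:
                # The task cannot start until the key is copied over.
                transfers += 1
                holders.add(worker)
        if drop_between_rounds:
            # Stand-in for an aggressive policy: back to one replica.
            holders = {0}
    return transfers


kept = count_transfers(n_workers=4, n_rounds=3, drop_between_rounds=False)
dropped = count_transfers(n_workers=4, n_rounds=3, drop_between_rounds=True)
print(kept, dropped)  # prints "3 9"
```

In this model, keeping the replicas costs one transfer per extra worker in total, while dropping them costs that many transfers again in every round.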

I'm not sure what metric to use for this. Ideas explored in #4967, #5325, #5326 could be interesting here.

Really, this issue is just about how to calculate a smarter target for this desired_replicas count automatically based on the task's waiters, number of current workers, etc.:

desired_replicas = 1 # TODO have a marker on TaskState
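As a starting point for discussion, one possible heuristic is sketched below. It is only an illustration: the function signature, the `minimum` parameter, and the idea of passing plain counts instead of a TaskState are all assumptions, not anything that exists in distributed today. The reasoning is that each waiter could end up on a different worker, so at most `min(n_waiters, n_workers)` replicas can be useful at once.

```python
def desired_replicas(n_waiters: int, n_workers: int, minimum: int = 1) -> int:
    """Hypothetical target replica count for one key.

    n_waiters: tasks that still need this key as an input.
    n_workers: workers currently in the cluster.
    minimum:   floor on the number of replicas (assumed default of 1,
               matching the hard-coded value quoted above).
    """
    if n_workers < 1:
        raise ValueError("need at least one worker")
    # Never ask for more replicas than there are workers, and never
    # fewer than the floor (unless the cluster itself is smaller).
    return min(n_workers, max(minimum, n_waiters))


# While the cluster is saturated with dependents, keep a copy everywhere.
print(desired_replicas(n_waiters=500, n_workers=82))  # prints 82
# As the dependents drain, the target shrinks with them.
print(desired_replicas(n_waiters=3, n_workers=82))    # prints 3
# With no waiters left, fall back to a single replica.
print(desired_replicas(n_waiters=0, n_workers=82))    # prints 1
```

This ignores replica size, memory pressure, and where the waiters will actually run, all of which a real metric would probably need to weigh.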

Labels: discussion, enhancement, memory, performance
