Queuing does not prevent root task overproduction unless you have enough tasks #7273

Open
@gjoseph92

Description

Queuing #6614 is meant to prevent root task overproduction #5555. And it's shown to be very effective at doing so: #7128.

However, due to the heuristic for what counts as a "root-ish" task, queuing will only stop root task overproduction when you have more than total_nthreads * 2 root tasks.

Overproduction can occur any time there are more than total_nthreads root tasks. So in the middle case (more than total_nthreads but at most total_nthreads * 2 root tasks), queuing won't kick in and the worker-saturation value won't be respected.

This would be confusing behavior for users. If you make your problem size smaller, or make your cluster bigger (two things you'd expect to reduce per-worker memory usage), you may cross an opaque magic threshold at which your workload suddenly uses up to 2x more memory.
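To make the gap concrete, here is a minimal sketch of the threshold mismatch. The `is_rootish` helper below is a simplified stand-in that only models the group-size check from the heuristic (the dependency-count conditions are omitted); it is not the actual `SchedulerState` implementation.

```python
def is_rootish(group_size: int, total_nthreads: int) -> bool:
    # Simplified: queuing only engages when the task group is larger
    # than 2x the total thread count (dependency checks omitted).
    return group_size > total_nthreads * 2


total_nthreads = 100

# 150 root tasks on 100 threads: overproduction is possible (150 > 100)...
can_overproduce = 150 > total_nthreads

# ...but the heuristic does not classify the group as root-ish
# (150 is not > 200), so queuing never kicks in.
queued = is_rootish(150, total_nthreads)

print(can_overproduce, queued)  # True False
```

With the proposed change (`group_size > total_nthreads`), the same 150-task group would be classified as root-ish and queued.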

EDIT:

To be clear, I propose a two-character change to fix this. Just drop the * 2 part:

diff --git a/distributed/scheduler.py b/distributed/scheduler.py
index b99e3f19..df20e807 100644
--- a/distributed/scheduler.py
+++ b/distributed/scheduler.py
@@ -3033,7 +3033,7 @@ class SchedulerState:
         tg = ts.group
         # TODO short-circuit to True if `not ts.dependencies`?
         return (
-            len(tg) > self.total_nthreads * 2
+            len(tg) > self.total_nthreads
             and len(tg.dependencies) < 5
             and sum(map(len, tg.dependencies)) < 5
         )

The * 2 is a number @mrocklin and I just made up back in #4967; there wasn't any benchmarking or empirical reason for it. Simply requiring more tasks than nthreads is more logical and easier to justify.
