adaptive_target: no more workers than runnable tasks #4155
Conversation
Doesn't this mean that we want
Yes, the linked dask_jobqueue issue, but the same also happens for the LocalCluster (see the test case below):

    import time

    from dask.distributed import Client, LocalCluster
    from distributed.worker_client import get_worker


    def test_fun():
        time.sleep(5)
        return get_worker().name


    if __name__ == "__main__":
        # create simple task graph
        n = 4
        graph = {f"test_{i}": (test_fun,) for i in range(n)}
        targets = [f"test_{i}" for i in range(n)]
        workers = 10  # we shouldn't need more than n=4 workers

        # LocalCluster
        kwargs = {
            "n_workers": 1,
            "diagnostics_port": None,
            "threads_per_worker": 1,
            "memory_limit": "2G",
            "processes": True,
        }
        with LocalCluster(**kwargs) as cluster:
            cluster.adapt(minimum=1, maximum=workers)
            with Client(cluster) as client:
                client.register_worker_callbacks(
                    lambda dask_worker: print("setup", dask_worker.name)
                )
                # prints only 0s. If memory_limit=0, more workers are spawned
                print(client.get(graph, targets))
I think the remaining tests would pass if #4108 were fixed.
@mrocklin, if you do not agree with the memory-related part, I can separate the "not spawning more jobs than runnable tasks" change into its own PR.
Bump @mrocklin @guillaumeeb. Other people seem to be running into the same issue (SLURMCluster not adapting): esi-neuroscience/acme#14. I have been using this patch for more than a month on a SLURM cluster, and it addressed my main concern when using dask on a cluster: wasting resources.
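For context, a minimal sketch of the kind of adaptive SLURM setup this patch was used with (the cores/memory/walltime values and the maximum of 10 jobs are assumptions, not taken from the thread):

    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    # One small job template; adaptive scaling decides how many such jobs to run.
    cluster = SLURMCluster(cores=1, memory="2GB", walltime="00:30:00")
    cluster.adapt(minimum=1, maximum=10)  # with this patch, capped by runnable tasks
    client = Client(cluster)

Without the patch, the linked reports describe the cluster either not adapting at all or requesting more jobs than there are runnable tasks.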
I don't think so. For example, if I have a cluster with 100 TB of memory (limit=1e14) and it stores no data (total=0), then I think that we don't want to scale up.
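A quick numeric check of that example (a sketch; the 0.5 and 0.6 thresholds are the ones discussed in this thread):

    # 100 TB of aggregate worker memory, nothing stored yet
    limit = int(100e12)
    used = 0

    # Condition as written in the code: no memory-based reason to scale up.
    assert not used > 0.6 * limit

    # The "swapped" condition from the PR description would ask to scale up here.
    assert used < 0.5 * limit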
Thank you @mrocklin for your comments! I think I understand my misconception :) I will create a separate PR for the number of runnable tasks (after I switched to
        if tasks_processing > cpu:
            break
    else:  # for/else: runs only when the loop over workers was not broken out of
        cpu = min(tasks_processing, cpu)
This code prevented more workers from spawning for me. It is not needed anymore, as the code now limits the target number of workers by the number of runnable tasks.
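To illustrate the behaviour this comment describes, here is a minimal sketch of a worker target that is capped by the number of runnable tasks (the function name, signature, and the simplified occupancy math are assumptions, not the actual Scheduler.adaptive_target implementation):

    import math


    def adaptive_target_sketch(
        total_occupancy: float,  # seconds of queued work across the cluster
        target_duration: float,  # desired time to work through the queue
        runnable_tasks: int,     # tasks that are ready to run
        used: int,               # bytes currently stored on workers
        limit: int,              # sum of worker memory limits in bytes
        n_workers: int,          # current number of workers
    ) -> int:
        # CPU-based target: enough workers to finish the queued work within
        # target_duration, but never more workers than runnable tasks.
        cpu = math.ceil(total_occupancy / target_duration)
        cpu = min(cpu, runnable_tasks)

        # Memory-based target: double the workers under memory pressure,
        # guarded so that an unset limit of 0 does not trigger scaling.
        memory = 2 * n_workers if limit > 0 and used > 0.6 * limit else 0

        return max(cpu, memory, 1)

With the test case from earlier in the thread (four 5-second tasks, maximum of ten workers), such a target never exceeds four.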
    limit = sum(ws.memory_limit for ws in self.workers.values())
    used = sum(ws.nbytes for ws in self.workers.values())
    memory = 0
    if used > 0.6 * limit:
It might be good to change this to used > 0.6 * limit and limit > 0. With no memory limit set, it always tries to increase the number of workers (that is why I initially thought the problem was related to memory).
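A minimal sketch of that suggestion as a standalone helper (the name and signature are assumptions; in the scheduler these values come from self.workers):

    def memory_target(used: int, limit: int, n_workers: int) -> int:
        # Only treat memory pressure as a reason to scale when a real
        # (non-zero) limit is configured; with limit == 0 the bare
        # comparison is true as soon as any data is stored at all.
        if limit > 0 and used > 0.6 * limit:
            return 2 * n_workers
        return 0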
I think that the changes in this PR are not addressing the root cause: if all tasks take much longer than spawning a worker, something like the following is an ugly workaround:

    with dask.config.set({"distributed.adaptive.target-duration": 0.1}):
        ...

The only reason we don't see this behavior when no memory limit is set ( Iterating over
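A hedged usage sketch of that workaround in context (the cluster parameters are borrowed from the test case above; 0.1 seconds is the value from the comment):

    import dask
    from dask.distributed import Client, LocalCluster

    # A low target duration makes the scheduler ask for more workers when
    # tasks run much longer than it takes to spawn a worker.
    with dask.config.set({"distributed.adaptive.target-duration": 0.1}):
        with LocalCluster(n_workers=1, threads_per_worker=1, processes=True) as cluster:
            cluster.adapt(minimum=1, maximum=10)
            with Client(cluster) as client:
                ...  # submit the long-running tasks here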
For the record, the initial problem was that the graph used for the computation did not have enough information to enable adaptive scaling. See dask/dask-jobqueue#463.
I think that the memory condition in Scheduler.adaptive_target is swapped. If the total memory of all workers is less than half (or, for some reason, less than 60%) of the memory limit, then you should spawn more workers (and not the other way round)? If I understand the code correctly, adapt worked only when the memory limit was set to zero.

Fixes dask/dask-jobqueue#463

Edit:
Limit the number of workers by the number of runnable tasks.
Fixes #4108