Description
Hello from a new user! I'm putting this here rather than opening a new issue, but let me know if I should do the latter instead.
Following the documentation, I am trying to run my very first "hello dask" script, which looks like the following:

```python
from dask.distributed import Client
from dask_jobqueue.slurm import SLURMRunner

with SLURMRunner() as runner:
    with Client(runner) as client:
        client.wait_for_workers(runner.n_workers)
        print(f"Number of workers = {runner.n_workers}")
```

When I submit the job using Slurm, I get the following network-related warning:
```
2025-02-12 16:22:11,565 - distributed.scheduler - INFO - State start
/home/sm69/.conda/envs/pyathena/lib/python3.13/site-packages/distributed/utils.py:189: RuntimeWarning: Couldn't detect a suitable IP address for reaching '8.8.8.8', defaulting to hostname: [Errno 101] Network is unreachable
  warnings.warn(
2025-02-12 16:22:11,569 - distributed.scheduler - INFO - Scheduler at: tcp://10.33.81.152:35737
2025-02-12 16:22:11,569 - distributed.scheduler - INFO - dashboard at: http://10.33.81.152:8787/status
2025-02-12 16:22:11,569 - distributed.scheduler - INFO - Registering Worker plugin shuffle
2025-02-12 16:22:11,647 - distributed.scheduler - INFO - Receive client connection: Client-6c2bbb5b-e987-11ef-b579-78ac4413ab38
2025-02-12 16:22:11,647 - distributed.core - INFO - Starting established connection to tcp://10.33.81.152:58686
2025-02-12 16:22:11,658 - distributed.worker - INFO - Start worker at: tcp://10.33.81.152:42115
2025-02-12 16:22:11,658 - distributed.worker - INFO - Listening to: tcp://10.33.81.152:42115
2025-02-12 16:22:11,658 - distributed.worker - INFO - Start worker at: tcp://10.33.81.152:38967
2025-02-12 16:22:11,658 - distributed.worker - INFO - Start worker at: tcp://10.33.81.152:44313
2025-02-12 16:22:11,658 - distributed.worker - INFO - Start worker at: tcp://10.33.81.152:42309
2025-02-12 16:22:11,658 - distributed.worker - INFO - Worker name: 9
2025-02-12 16:22:11,659 - distributed.worker - INFO - dashboard at: 10.33.81.152:46699
2025-02-12 16:22:11,659 - distributed.worker - INFO - Waiting to connect to: tcp://10.33.81.152:35737
2025-02-12 16:22:11,659 - distributed.worker - INFO - Start worker at: tcp://10.33.81.152:34517
...
```
This is followed by `StreamClosedError` and `CommClosedError` exceptions.
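For context on that warning: if I understand `distributed/utils.py` correctly, the IP detection works roughly like the sketch below (the function name `detect_ip` is mine, not the library's). On a compute node with no route to `8.8.8.8`, the `connect` fails and dask falls back to the hostname:

```python
import socket

def detect_ip(host="8.8.8.8", port=80):
    """Roughly how distributed guesses the node's outward-facing IP.

    Connecting a UDP socket does not send any packets; it only asks the
    OS which local address it would use to reach `host`. If there is no
    route at all ("Network is unreachable"), fall back to the hostname,
    which is what the RuntimeWarning above reports.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        try:
            s.connect((host, port))
            return s.getsockname()[0]
        except OSError:
            return socket.gethostname()

print(detect_ip())
```

So on our nodes the scheduler/worker addresses end up chosen by this fallback rather than by the InfiniBand interface, which I suspect is related to the connection errors.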
Before getting into the Runner, I had already tried using the Cluster, e.g.:

```python
ncores = 96
SLURMCluster(cores=ncores, memory='720 GiB', processes=ncores, interface="ib0")
```

As you can see, I had to set `interface="ib0"` (the cluster uses InfiniBand for inter-node communication); otherwise I got a similar error. This made me think that I have to do something similar to `interface="ib0"` when using `SLURMRunner` as well, but I couldn't find such an option in the documentation. Could you guide me on what to do?

Somewhat related feedback from a new user's perspective: it was a surprise to me when I first realized that `SLURMCluster` does not support multi-node jobs. This is not mentioned explicitly in the documentation, and I had to surf through several issues before realizing it is the case. I think one of the main motivations for using dask is to overcome the single-node memory bound when analyzing large simulation data, so I naively assumed that `dask-jobqueue` would support multi-node jobs. It would be very helpful if the documentation explicitly stated that `SLURMCluster` cannot submit multi-node jobs.
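P.S. In case it helps anyone hitting the same interface question: a quick stdlib way to see which network interfaces a node actually exposes (and hence what one might pass to `interface=`) is:

```python
import socket

# List the network interfaces visible on this node (Linux only).
# On a cluster with InfiniBand, the device typically appears as "ib0".
for index, name in socket.if_nameindex():
    print(index, name)
```

Running this inside a Slurm job on a compute node is how I confirmed the interface name to use with `SLURMCluster`.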
Originally posted by @sanghyukmoon in #638