doc/source/cluster/vms/user-guides/community/slurm.rst (20 additions, 41 deletions)
@@ -8,15 +8,9 @@ Slurm usage with Ray can be a little bit unintuitive.
 * SLURM requires that multiple copies of the same program be submitted to the same cluster to do cluster programming. This is particularly well-suited for MPI-based workloads.
 * Ray, on the other hand, expects a head-worker architecture with a single point of entry. That is, you'll need to start a Ray head node and multiple Ray worker nodes, and then run your Ray script on the head node.
 
-.. warning::
+To bridge this gap, Ray 2.49 and above introduces the ``ray symmetric-run`` command, which starts a Ray cluster on all allocated nodes with the given CPU and GPU resources and runs your entrypoint script ONLY on the head node.
 
-  SLURM support is still a work in progress. SLURM users should be aware
-  of current limitations regarding networking.
-  See :ref:`here <slurm-network-ray>` for more explanations.
-
-  SLURM support is community-maintained. Maintainer GitHub handle: tupui.
-
-This document aims to clarify how to run Ray on SLURM.
+Below, we provide a walkthrough using ``ray symmetric-run`` to run Ray on SLURM.
 
 .. contents::
   :local:
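
To give a feel for how this bridges the two models, a complete submission script built around ``ray symmetric-run`` could look roughly like the sketch below. The job name, node count, resource values, and the ``train.py`` entrypoint are placeholders, and the exact flag spelling should be checked against the CLI reference:

.. code-block:: bash

    #!/bin/bash
    #SBATCH --job-name=ray-demo
    #SBATCH --nodes=2
    #SBATCH --tasks-per-node=1
    #SBATCH --cpus-per-task=8

    # One srun task per node: symmetric-run starts Ray on every node in the
    # allocation and runs the entrypoint only on the head node.
    srun ray symmetric-run \
        --num-cpus="$SLURM_CPUS_PER_TASK" \
        -- python -u train.py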
@@ -107,46 +101,27 @@ Next, we'll want to obtain a hostname and a node IP address for the head node.
   :start-after: __doc_head_address_start__
   :end-before: __doc_head_address_end__
 
+.. note:: In Ray 2.49 and above, you can use IPv6 addresses/hostnames.
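
For orientation, the address discovery performed by the included snippet typically looks something like the following sketch in an sbatch script (variable names here are illustrative, not necessarily those used in the referenced file):

.. code-block:: bash

    # Take the first node of the allocation as the head node, then
    # resolve its IP address by running `hostname` on that node.
    nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
    nodes_array=($nodes)
    head_node=${nodes_array[0]}
    head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)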
 
 
-Starting the Ray head node
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+Starting Ray and executing your script
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-After detecting the head node hostname and head node IP, we'll want to create
-a Ray head node runtime. We'll do this by using ``srun`` as a background task
-as a single task/node (recall that ``tasks-per-node=1``).
+.. note:: ``ray symmetric-run`` is available in Ray 2.49 and above. Check older versions of the documentation if you are using an older version of Ray.
+
+Now, we'll use ``ray symmetric-run`` to start Ray on all nodes with the given CPU and GPU resources and run your entrypoint script ONLY on the head node.
 
 Below, you'll see that we explicitly specify the number of CPUs (``num-cpus``)
 and number of GPUs (``num-gpus``) to Ray, as this will prevent Ray from using
 more resources than allocated. We also need to explicitly
-indicate the ``node-ip-address`` for the Ray head runtime:
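
A rough sketch of what such an invocation can look like, with placeholder resource values and ``train.py`` standing in for your entrypoint:

.. code-block:: bash

    # Forward the SLURM allocation to Ray so it does not use more
    # resources than SLURM granted (values are placeholders).
    srun ray symmetric-run \
        --num-cpus="$SLURM_CPUS_PER_TASK" \
        --num-gpus=1 \
        -- python -u train.py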
+After the training job completes, the Ray cluster is stopped automatically.
 
 .. note:: The ``-u`` argument tells Python to print to stdout unbuffered, which is important with how SLURM deals with rerouting output. If this argument is not included, you may see strange printing behavior, such as print statements not being logged by SLURM until the program has terminated.
@@ -165,6 +140,7 @@ One common use of a SLURM cluster is to have multiple users running concurrent
 jobs on the same infrastructure. This can easily conflict with Ray due to the
 way the head node communicates with its workers.
 
+
 Considering 2 users, if they both schedule a SLURM job using Ray
 at the same time, they are both creating a head node. In the backend, Ray will
 assign some internal ports to a few services. The issue is that as soon as the
@@ -183,13 +159,12 @@ adjusted. For an explanation on ports, see :ref:`here <ray-ports>`::
   --ray-client-server-port
   --redis-shard-ports
 
-For instance, again with 2 users, they would have to adapt the instructions
-seen above to:
+For instance, again with 2 users, they would run the following commands. Note that we don't use ``symmetric-run`` here
+because it does not currently work in multi-tenant environments:
 
 .. code-block:: bash
 
    # user 1
-   # same as above
    ...
    srun --nodes=1 --ntasks=1 -w "$head_node" \
        ray start --head --node-ip-address="$head_node_ip" \
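
To make the port adjustment concrete, here is a hedged sketch of how the second user might shift every configurable port so the two clusters cannot collide. The port numbers are illustrative, and the flags are the standard ``ray start`` options discussed above:

.. code-block:: bash

    # user 2: same structure as user 1, but every port is moved to a
    # distinct free range so the two Ray clusters cannot clash.
    srun --nodes=1 --ntasks=1 -w "$head_node" \
        ray start --head --node-ip-address="$head_node_ip" \
        --port=6380 \
        --node-manager-port=6800 \
        --object-manager-port=6801 \
        --ray-client-server-port=20001 \
        --redis-shard-ports=6802 \
        --min-worker-port=25000 \
        --max-worker-port=27000 \
        --block &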