[Slurm-GCP](https://github.com/GoogleCloudPlatform/slurm-gcp) is the set of
scripts and tools that automate the installation, deployment, and certain
operational aspects of [Slurm](https://slurm.schedmd.com/overview.html) on
Google Cloud Platform. The Cluster Toolkit team has finished transitioning to
Slurm-GCP v6 and has removed all v5 modules and blueprints. Slurm-GCP v6 is the
only supported option for provisioning Slurm on Google Cloud.
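
Within the Cluster Toolkit, Slurm-GCP v6 is consumed as a set of blueprint
modules. The following is a minimal, hypothetical blueprint sketch of how those
modules fit together; the module paths and settings shown (for example
`node_count_dynamic_max`) follow the pattern of the Toolkit's Slurm examples
but should be verified against the modules in your Toolkit release.

```yaml
# Hypothetical minimal blueprint using the Slurm-GCP v6 modules.
# Module sources and settings are illustrative; verify them against the
# modules shipped with your Cluster Toolkit release.
blueprint_name: hpc-slurm-v6-example

vars:
  project_id: my-project-id   # replace with your project
  deployment_name: hpc-slurm-v6-example
  region: us-central1
  zone: us-central1-a

deployment_groups:
- group: primary
  modules:
  - id: network
    source: modules/network/vpc

  - id: compute_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      machine_type: c2-standard-60
      node_count_dynamic_max: 20

  - id: compute_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [compute_nodeset]
    settings:
      partition_name: compute

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
    use: [network]
    settings:
      machine_type: n2-standard-4

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    use:
    - network
    - compute_partition
    - slurm_login
```

Deploying with `gcluster deploy <blueprint>` provisions the cluster; as noted
below, redeploying with the `-w` flag triggers reconfiguration.
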
### Major Changes from Slurm-GCP v5 to v6

* Robust reconfiguration

  Reconfiguration is now managed by a service that runs on each instance. This removes the dependency on the Google Cloud Pub/Sub service and provides a more consistent reconfiguration experience (when calling `gcluster deploy blueprint.yaml -w`). Reconfiguration is also enabled by default.

* Faster deployments

  A simple cluster deploys up to 3x faster.

* Lift the restriction on the number of deployments in a single project

  Slurm-GCP v6 has eliminated the use of project metadata to store cluster configuration. Project metadata was slow to update and had an absolute storage limit, which restricted the number of clusters that could be deployed in a single project. Configs are now stored in a Google Cloud Storage bucket.

* Fewer dependencies in the deployment environment

  Reconfiguration and compute node cleanup no longer require users to install local Python dependencies in the deployment environment (where gcluster is called). This has allowed these features to be enabled by default.

* Flexible node-to-partition relation

  The v5 concept of a "node-group" has been replaced by "nodeset" to align with Slurm naming conventions. A nodeset can be assigned to multiple partitions, and a partition can include multiple nodesets; see the sketch after this list.

* Upgrade Slurm to 23.11
* TPU v3, v4 support
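
As a concrete (and still hypothetical) illustration of the nodeset-to-partition
relation, the fragment below extends the blueprint sketch above: one nodeset
module is consumed by two partition modules through `use`. A partition can
likewise list several nodesets in its `use` block.

```yaml
  # Fragment of a blueprint's modules list (same assumptions as the sketch
  # above): a single nodeset shared by two partitions.
  - id: general_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      machine_type: c2-standard-60
      node_count_dynamic_max: 10

  - id: batch_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [general_nodeset]
    settings:
      partition_name: batch

  - id: debug_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [general_nodeset]
    settings:
      partition_name: debug
```
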
### Unsupported use of End-of-Life modules

### v5

The final release of Slurm-GCP v5 was made as part of
[Cluster Toolkit v1.44.1][v1.44.1]. Any remaining use of Slurm-GCP v5 is
unsupported; however, this release can be used to build the Toolkit binary
and review v5 modules and examples as references.

### v4

The final release of Slurm-GCP v4 was made as part of
[Cluster Toolkit v1.27.0][v1.27.0]. Any remaining use of Slurm-GCP v4 is
unsupported; however, this release can be used to build the Toolkit binary
and review v4 modules and examples as references.

The `ghpc_stage` function can be used to copy a file (or directory) into the deployment folder. It will always look first in the path specified in the blueprint. If the file is not found at this path, then `ghpc_stage` will look for the staged file in the deployment folder, if a deployment folder exists.
This means that you can redeploy a blueprint (`gcluster deploy <blueprint> -w`) as long as you have the deployment folder from the original deployment, even if locally referenced files are not available.
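
As an illustration, a blueprint might stage a locally referenced script as
shown below. This is a hypothetical sketch: it assumes the `startup-script`
module and a shell runner whose `source` is wrapped in `ghpc_stage`; adjust the
module and file paths to match your blueprint.

```yaml
  # Hypothetical module entry: ghpc_stage copies the referenced script into
  # the deployment folder, so later redeploys with -w work even if the
  # original local file is no longer present.
  - id: startup
    source: modules/scripts/startup-script
    settings:
      runners:
      - type: shell
        destination: install_deps.sh
        source: $(ghpc_stage("scripts/install_deps.sh"))
```
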
Refer to [Create an AI-optimized GKE cluster with default configuration](https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute#use-cluster-toolkit) for instructions on creating the GKE-A3U cluster.

Refer to [Deploy and run NCCL test with Topology Aware Scheduling (TAS)](https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute#deploy-run-nccl-tas-test) for instructions on running a NCCL test on the GKE-A3U cluster.