Commit 5254651

Merge pull request #3919 from GoogleCloudPlatform/release-candidate
Documentation hotfix release v1.48.1
2 parents 057bfe2 + 1bdd325 commit 5254651

9 files changed: 89 additions, 368 deletions

README.md

Lines changed: 10 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -10,6 +10,16 @@ networking, storage, etc.) following Google Cloud best-practices, in a repeatabl
1010
manner. The Cluster Toolkit is designed to be highly customizable and extensible,
1111
and intends to address the AI/ML and HPC deployment needs of a broad range of customers.
1212

13+
## AI/ML Hypercomputer
14+
15+
The Cluster Toolkit is an integral part of [Google Cloud AI Hypercomputer][aihc].
16+
Documentation concerning AI Hypercomputer solutions is available for
17+
[GKE][aihc-gke] and for [Slurm][aihc-slurm].
18+
19+
[aihc]: https://cloud.google.com/ai-hypercomputer/docs
20+
[aihc-gke]: https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute
21+
[aihc-slurm]: https://cloud.google.com/ai-hypercomputer/docs/create/create-slurm-cluster
22+
1323
## Detailed documentation and examples
1424

1525
The Toolkit comes with a suite of [tutorials], [examples], and full

community/examples/README.md

Lines changed: 5 additions & 4 deletions
@@ -1,6 +1,7 @@
11
# Example Blueprints
22

3-
This directory contains a set of community example blueprint files that can be
4-
fed into gHPC to create a deployment. For more information on how to read, write
5-
and configure a custom blueprint, see
6-
[the core examples folder](../../examples/README.md).
3+
This directory contains blueprints contributed externally by the community.
4+
These can be used by the Toolkit to provision your infrastructure.
5+
6+
They are documented in [the core examples folder](../../examples/README.md)
7+
along with [community support guidelines](../../examples/README.md#blueprint-descriptions).

docs/slurm-gcp-support.md

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
1+
## Completed Migration to Slurm-GCP v6
2+
3+
[Slurm-GCP](https://github.com/GoogleCloudPlatform/slurm-gcp) is the set of
4+
scripts and tools that automate the installation, deployment, and certain
5+
operational aspects of [Slurm](https://slurm.schedmd.com/overview.html) on
6+
Google Cloud Platform. The Cluster Toolkit team has finished transitioning to
7+
Slurm-GCP v6 and has removed all v5 modules and blueprints. Slurm-GCP v6 is the
8+
only supported option for provisioning Slurm on Google Cloud.
9+
10+
### Major Changes from Slurm-GCP v5 to v6
11+
12+
* Robust reconfiguration
13+
14+
Reconfiguration is now managed by a service that runs on each instance. This removes the dependency on the Google Cloud Pub/Sub service and provides a more consistent reconfiguration experience (when calling `gcluster deploy blueprint.yaml -w`). Reconfiguration is also enabled by default.
15+
16+
* Faster deployments
17+
18+
Simple cluster deploys up to 3x faster.
19+
20+
* Lift the restriction on the number of deployments in a single project.
21+
22+
Slurm-GCP v6 has eliminated the use of project metadata to store cluster configuration. Project metadata was both slow to update and had an absolute storage limit, which restricted the number of clusters that could be deployed in a single project. Configs are now stored in a Google Cloud Storage bucket.
23+
24+
* Fewer dependencies in the deployment environment
25+
26+
Reconfiguration and compute node cleanup no longer require users to install local Python dependencies in the
27+
deployment environment (where `gcluster` is called). This has allowed these features to be enabled by default.
28+
29+
* Flexible node to partition relation
30+
31+
The v5 concept of "node-group" was replaced by "nodeset" to align with Slurm naming convention. A nodeset can be
32+
attributed to multiple partitions, and a partition can include multiple nodesets.
33+
34+
* Upgrade Slurm to 23.11
35+
* TPU v3, v4 support
36+
37+
### Unsupported use of End-of-Life modules
38+
39+
### v5
40+
41+
The final release of Slurm-GCP v5 was made as part of
42+
[Cluster Toolkit v1.44.1][v1.44.1]. Any remaining use of Slurm-GCP v5 is
43+
unsupported; however, this release can be used to build the Toolkit binary
44+
and review v5 modules and examples as references.
45+
46+
### v4
47+
48+
The final release of Slurm-GCP v4 was made as part of
49+
[Cluster Toolkit v1.27.0][v1.27.0]. Any remaining use of Slurm-GCP v4 is
50+
unsupported; however, this release can be used to build the Toolkit binary
51+
and review v4 modules and examples as references.
52+
53+
[v1.27.0]: https://github.com/GoogleCloudPlatform/hpc-toolkit/releases/tag/v1.27.0
54+
[v1.44.1]: https://github.com/GoogleCloudPlatform/hpc-toolkit/releases/tag/v1.44.1
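
The "flexible node to partition relation" described above can be illustrated with a blueprint fragment. This is a minimal sketch rather than a configuration taken from this commit: the module ids, partition names, machine types, and node counts are hypothetical, and a complete blueprint would also need a controller module and the usual deployment variables.

```yaml
# Hypothetical blueprint fragment. Ids, names, machine types, and node counts
# are illustrative assumptions, not values from this commit.
deployment_groups:
- group: primary
  modules:
  - id: network
    source: modules/network/vpc

  - id: small_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      node_count_dynamic_max: 4
      machine_type: n2-standard-2

  - id: large_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      node_count_dynamic_max: 2
      machine_type: c2-standard-60

  # One partition that includes two nodesets.
  - id: batch_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [small_nodeset, large_nodeset]
    settings:
      partition_name: batch
      is_default: true

  # A second partition that reuses a nodeset already attributed to "batch".
  - id: debug_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [small_nodeset]
    settings:
      partition_name: debug
```

Here `small_nodeset` belongs to both partitions while `batch` spans both nodesets, which is the many-to-many relation that the v5 "node-group" concept did not express.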

examples/README.md

Lines changed: 4 additions & 88 deletions
@@ -1,10 +1,5 @@
11
# Example Blueprints
22

3-
> [!NOTE]
4-
> Migration to Slurm-GCP v6 is completed. See
5-
> [this update](#completed-migration-to-slurm-gcp-v6) for specific recommendations
6-
> and timelines.
7-
83
This directory contains a set of example blueprint files that can be fed into
94
gHPC to create a deployment.
105

@@ -15,7 +10,6 @@ md_toc github examples/README.md | sed -e "s/\s-\s/ * /"
1510

1611
* [Instructions](#instructions)
1712
* [(Optional) Setting up a remote terraform state](#optional-setting-up-a-remote-terraform-state)
18-
* [Completed Migration to Slurm-GCP v6](#completed-migration-to-slurm-gcp-v6)
1913
* [Blueprint Descriptions](#blueprint-descriptions)
2014
* [hpc-slurm.yaml](#hpc-slurmyaml-) ![core-badge]
2115
* [hpc-enterprise-slurm.yaml](#hpc-enterprise-slurmyaml-) ![core-badge]
@@ -135,55 +129,6 @@ subcommands as well:
135129
[configuration block]: https://developer.hashicorp.com/terraform/language/settings/backends/configuration#using-a-backend-block
136130
[gcs]: https://developer.hashicorp.com/terraform/language/settings/backends/gcs
137131

138-
## Completed Migration to Slurm-GCP v6
139-
140-
[Slurm-GCP](https://github.com/GoogleCloudPlatform/slurm-gcp) is the set of
141-
scripts and tools that automate the installation, deployment, and certain
142-
operational aspects of [Slurm](https://slurm.schedmd.com/overview.html) on
143-
Google Cloud Platform. It is recommended to use Slurm-GCP through the Cluster
144-
Toolkit where it is exposed as various modules.
145-
146-
The Cluster Toolkit team has finished transitioning from Slurm-GCP v5 to Slurm-GCP v6 and
147-
as of 10/11/2024, Slurm-GCP v6 is the recommended option. Blueprint naming is as
148-
follows:
149-
150-
* Slurm v5: hpc-slurm-v5-legacy.yaml
151-
* Slurm v6: hpc-slurm.yaml
152-
153-
> [!IMPORTANT]
154-
> Slurm-GCP v5 modules are now marked as deprecated and will be maintained in our
155-
> repo till January 6, 2025. After that, the modules will be removed from the Cluster
156-
> Toolkit repo and regression tests will no longer run for V5. Those who choose
157-
> to not upgrade to V6 will still be able to use V5 modules by referencing
158-
> specific git tags in the module source lines.
159-
160-
### Major Changes in from Slurm GCP v5 to v6
161-
162-
* Robust reconfiguration
163-
164-
Reconfiguration is now managed by a service that runs on each instance. This has removed the dependency on the Pub/Sub Google cloud service, and provides a more consistent reconfiguration experience (when calling `gcluster deploy blueprint.yaml -w`). Reconfiguration has also been enabled by default.
165-
166-
* Faster deployments
167-
168-
Simple cluster deploys up to 3x faster.
169-
170-
* Lift the restriction on the number of deployments in a single project.
171-
172-
Slurm GCP v6 has eliminated the use of project metadata to store cluster configuration. Project metadata was both slow to update and had an absolute storage limit. This restricted the number of clusters that could be deployed in a single project. Configs are now stored in a Google Storage Bucket.
173-
174-
* Fewer dependencies in the deployment environment
175-
176-
Reconfiguration and compute node cleanup no longer require users to install local python dependencies in the deployment environment (where gcluster is called). This has allowed for these features to be enabled by default.
177-
178-
* Flexible node to partition relation
179-
180-
The v5 concept of "node-group" was replaced by "nodeset" to align with Slurm naming convention. Nodeset can be attributed to multiple partitions, as well as partitions can include multiple nodesets.
181-
182-
* Upgrade Slurm to 23.11
183-
* TPU v3, v4 support
184-
185-
_For a full accounting of changes, see the changelog._
186-
187132
## Blueprint Descriptions
188133

189134
[core-badge]: https://img.shields.io/badge/-core-blue?style=plastic
@@ -1603,37 +1548,8 @@ To avoid these issues, the `ghpc_stage` function can be used to copy a file (or
16031548
The `ghpc_stage` function will always look first in the path specified in the blueprint. If the file is not found at this path then `ghpc_stage` will look for the staged file in the deployment folder, if a deployment folder exists.
16041549
This means that you can redeploy a blueprint (`gcluster deploy <blueprint> -w`) so long as you have the deployment folder from the original deployment, even if locally referenced files are not available.
16051550

1606-
<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
1607-
## Requirements
1608-
1609-
No requirements.
1610-
1611-
## Providers
1612-
1613-
| Name | Version |
1614-
|------|---------|
1615-
| <a name="provider_google-beta"></a> [google-beta](#provider\_google-beta) | n/a |
1616-
1617-
## Modules
1618-
1619-
No modules.
1620-
1621-
## Resources
1622-
1623-
| Name | Type |
1624-
|------|------|
1625-
| [google-beta_google_compute_global_address.private_ip_alloc](https://registry.terraform.io/providers/hashicorp/google-beta/latest/docs/resources/google_compute_global_address) | resource |
1626-
| [google-beta_google_compute_network.network](https://registry.terraform.io/providers/hashicorp/google-beta/latest/docs/resources/google_compute_network) | resource |
1627-
| [google-beta_google_parallelstore_instance.instance](https://registry.terraform.io/providers/hashicorp/google-beta/latest/docs/resources/google_parallelstore_instance) | resource |
1628-
| [google-beta_google_service_networking_connection.default](https://registry.terraform.io/providers/hashicorp/google-beta/latest/docs/resources/google_service_networking_connection) | resource |
1629-
1630-
## Inputs
1631-
1632-
No inputs.
1633-
1634-
## Outputs
1551+
## Completed Migration to Slurm-GCP v6
16351552

1636-
| Name | Description |
1637-
|------|-------------|
1638-
| <a name="output_access_points"></a> [access\_points](#output\_access\_points) | Output access points |
1639-
<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
1553+
Slurm-GCP v5 users should read [Slurm-GCP v5 EOL](../docs/slurm-gcp-support.md)
1554+
for information on v5 retirement and feature highlights for v6. Slurm-GCP v6 is
1555+
the only supported option within the Toolkit.

examples/gke-a3-ultragpu/README.md

Lines changed: 3 additions & 1 deletion
@@ -1 +1,3 @@
1-
Refer to [AI Hypercomputer Documentation](https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute#create-cluster) for instructions.
1+
Refer to [Create an AI-optimized GKE cluster with default configuration](https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute#use-cluster-toolkit) for instructions on creating the GKE-A3U cluster.
2+
3+
Refer to [Deploy and run NCCL test with Topology Aware Scheduling (TAS)](https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute#deploy-run-nccl-tas-test) for instructions on running an NCCL test on the GKE-A3U cluster.
