diff --git a/community/examples/eda/README.md b/community/examples/eda/README.md new file mode 100644 index 0000000000..59dee255d5 --- /dev/null +++ b/community/examples/eda/README.md @@ -0,0 +1,251 @@ +# Electronic Design Automation (EDA) Reference Architecture + +The Electronic Design Automation (EDA) blueprints in +this folder capture a reference architecture in which the right cloud components +are assembled to optimally cater to the requirements of EDA workloads. + +For file I/O, Google Cloud NetApp Volumes provides an NFS storage service that +scales from small to high capacity and high performance and provides fan-out +caching of on-premises ONTAP systems into Google Cloud to enable hybrid cloud +architectures. Workloads are scheduled by a workload +manager. + +## Architecture +The EDA blueprints are intended to be a starting point for more tailored +explorations of EDA. + +The blueprints feature a general setup suited for EDA applications on +Google Cloud, including: + +- Google Compute Engine partitions +- Google Cloud NetApp Volumes NFS-based shared storage +- Slurm workload scheduler + +Two example blueprints are provided. + +### Blueprint [eda-all-on-cloud](eda-all-on-cloud.yaml) + +This blueprint assumes that all compute and data reside in the cloud. + +The base deployment group (see [deployment stages](#deployment-stages)) provisions a new network and multiple NetApp Volumes volumes to store your data. Adjust the volume sizes to suit your requirements before deployment. If your volumes are larger than 15 TiB, creating them as [large volumes](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/overview#large-capacity-volumes) adds performance benefits. A current limitation is that Slurm will only use the first IP of a large volume. If you need the full performance of the six IP addresses a large volume provides, you can instead use the approach with pre-existing volumes and Cloud DNS described in the eda-hybrid-cloud blueprint section. + +The cluster deployment group deploys a group of compute instances that is managed by Slurm. + +When scaling down the deployment, make sure to only destroy the *cluster* deployment group. If you destroy the *base* group too, all the volumes will be deleted and you will lose your data. + +### Blueprint [eda-hybrid-cloud](./eda-hybrid-cloud.yaml) + +This blueprint assumes you are using NetApp Volumes [FlexCache](https://docs.cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/cache-ontap-volumes/overview) to enable a [hybrid cloud EDA](https://community.netapp.com/t5/Tech-ONTAP-Blogs/NetApp-FlexCache-Enhancing-hybrid-EDA-with-Google-Cloud-NetApp-Volumes/ba-p/462768) environment. + +The base deployment group (see [deployment stages](#deployment-stages)) connects to an existing network and mounts multiple NetApp Volumes volumes. This blueprint assumes you have pre-existing volumes for "tools", "libraries", "home" and "scratch". Before deployment, update the `server_ip` and `remote_mount` parameters of the respective volumes in the blueprint declarations to reflect the actual IP and export path of your existing volumes. Using existing volumes also avoids the danger of the volumes being deleted accidentally when you destroy the base deployment group.
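+ +For illustration, a minimal sketch of one such pre-existing volume declaration is shown below. The `server_ip` and `remote_mount` values are placeholders that you must replace with the details of your own volume; `server_ip` can also be a Cloud DNS name that resolves to all IPs of a large or FlexCache volume (see the Cloud DNS note below). + + ```yaml + - id: toolsfs + source: modules/file-system/pre-existing-network-storage + settings: + server_ip: 10.20.30.40 # placeholder: IP (or DNS name) of your existing volume + remote_mount: /toolsfs # placeholder: export path of your existing volume + local_mount: /tools + fs_type: nfs + ```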
+ +The volumes used can be regular NetApp Volumes [volumes](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/overview), [large volumes](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/overview#large-capacity-volumes) or [FlexCache volumes](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/cache-ontap-volumes/overview). + +FlexCache offers the following features, which enable bursting on-premises workloads into Google Cloud to use its powerful compute options: + +- Read-writable sparse volume +- Block-level, “pull only” paradigm +- 100% consistent, coherent, current +- Write-around +- LAN-like latencies after first read +- Fan-out: use multiple caches to scale out the workload + +FlexCache can accelerate metadata- or throughput-heavy read workloads considerably. + +FlexCache and large volumes offer six IP addresses per volume, all of which provide access to the same data. Currently, Cluster Toolkit only uses one of these IPs; support for using all six IPs is planned for a later release. To spread your compute nodes over all IPs today, you can use Cloud DNS to create a DNS record with all six IPs and specify that DNS name instead of individual IPs in the blueprint. Cloud DNS returns one of the six IPs in a round-robin fashion on lookups. + +The cluster deployment group deploys a group of compute instances that is managed by Slurm. + +## Getting Started + +To explore the reference architecture, follow these steps: + +Before you start, make sure your prerequisites and dependencies are set up: +[Set up Cluster Toolkit](https://cloud.google.com/cluster-toolkit/docs/setup/configure-environment). + +To deploy the EDA reference blueprint, follow the +[Deployment Instructions](#deployment-instructions). + +### Deployment Stages + +This blueprint has the following deployment groups: + +- `base`: Sets up backbone infrastructure such as networking and file systems +- `software_installation` (_optional_): This deployment group is a stub for + custom software installation on the network storage before the cluster is brought up +- `cluster`: Deploys an auto-scaling cluster + +Having multiple deployment groups decouples the life cycle of some +infrastructure. For example, a) you can tear down the cluster while leaving the +storage intact and b) you can build software before you deploy your cluster. + +## Deployment Instructions + +> [!WARNING] +> Installing this blueprint uses the following billable components of Google +> Cloud: +> +> - Compute Engine +> - NetApp Volumes +> +> To avoid continued billing after use, closely follow the +> [teardown instructions](#teardown-instructions). To generate a cost estimate based on +> your projected usage, use the [pricing calculator](https://cloud.google.com/products/calculator). +> +> [!WARNING] +> Before attempting to execute the following instructions, it is important to +> consider your project's quota. The blueprints create an +> autoscaling cluster that, when fully scaled up, can deploy many powerful VMs. +> +> This is merely one example instance of this reference architecture. +> Node counts can easily be adjusted in the blueprint. + +1. Clone the repo + + ```bash + git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git + cd cluster-toolkit + ``` + +1. Build the Cluster Toolkit + + ```bash + make + ``` + +1. Change parameters in your blueprint file to reflect your requirements.
Examples are VPC names for existing networks, H4D instance group node limits, or export paths of existing NFS volumes. + +1. Generate the deployment folder after replacing `<blueprint-name>` with the name of the blueprint (`eda-all-on-cloud` or `eda-hybrid-cloud`) and setting `project_id`, `region` and `zone` to your project details. + + ```bash + ./gcluster create community/examples/eda/<blueprint-name>.yaml --vars "project_id=${GOOGLE_CLOUD_PROJECT}" --vars region=us-central1 --vars zone=us-central1-a + ``` + +1. Deploy the `base` group + + Call the following gcluster command to deploy the blueprint. + + ```bash + ./gcluster deploy CLUSTER-NAME + ``` + + Replace `CLUSTER-NAME` with the deployment_name (`eda-all-on-cloud` or + `eda-hybrid-cloud`) used in the blueprint vars block. + + The next `gcluster` prompt will ask you to **display**, **apply**, **stop**, or + **continue** without applying the `base` group. Select 'apply'. + + This group will create a network and file systems to be used by the cluster. + + > [!WARNING] + > This gcluster command will run through 2 deployment groups (3 if you populate + > and activate the `software_installation` stage) and prompt you to apply each one. + > If the command is cancelled or exited by accident before finishing, it can + > be rerun to continue deploying the blueprint. + +1. Deploy the `software_installation` group (_optional_). + + > [!NOTE] + > Installation processes differ between applications. Some come as a + > precompiled binary with all dependencies included, others may need to + > be built from source, while others can be deployed through package + > managers such as Spack. This deployment group is intended to be used + > if the software installation process requires a substantial amount of time (e.g. + > compilation from source). By building the software in a separate + > deployment group, this process can be done before the cluster is + > up, minimizing costs. + > + > [!NOTE] + > By default, this deployment group is disabled in the reference design. See + > [Software Installation Patterns](#software-installation-patterns) for more information. + + If this deployment group is used (it needs to be uncommented in the blueprint first), + you can return to the gcluster command, which will ask you to **display**, **apply**, + **stop**, or **continue** without applying the `software_installation` group. + Select 'apply'. + +1. Deploy the `cluster` group + + The next `gcluster` prompt will ask you to **display**, **apply**, **stop**, or + **continue** without applying the `cluster` group. Select 'apply'. + + This deployment group contains the Slurm cluster and compute partitions. + +## Teardown Instructions + +> [!NOTE] +> If you created a new project for testing the EDA solution, the easiest way to +> eliminate billing is to delete the project. + +When you would like to tear down the deployment, each stage must be destroyed. +Since the `software_installation` and `cluster` groups depend on the network deployed +in the `base` stage, they must be destroyed first. You can use the following +commands to destroy the deployment in this reverse order. You will be prompted +to confirm the deletion of each stage. + +```bash +./gcluster destroy CLUSTER-NAME +``` + +Replace `CLUSTER-NAME` with the deployment_name (`eda-all-on-cloud` or +`eda-hybrid-cloud`) used in the blueprint vars block. + +> [!WARNING] +> If you do not destroy all three deployment groups, there may be continued +> associated costs.
+ +## Software Installation Patterns + +This section illustrates how software can be installed in the context +of the EDA reference solution. + +Depending on the software you want to use, different installation paths may be required. + +- **Installation with binary** + Commercial off-the-shelf applications typically come as precompiled binaries + provided by the ISV. If they are not already available on the toolsfs or libraryfs shares, + you can install the software using the following method. + + In general, you need to bring the binaries to your EDA cluster. A Google Cloud Storage + bucket is useful for this, since it is accessible from any machine using the + gsutil command and can be mounted in the cluster. + + As this installation process only needs to be done once but may take some time, + we recommend doing it in a separate deployment group before you bring up the cluster. + The `software_installation` stage is meant to accommodate this. You can, for example, bring up + a dedicated VM + + ``` {.yaml} + - id: sw-installer-vm + source: modules/compute/vm-instance + use: [network1, toolsfs] + settings: + name_prefix: sw-installer + add_deployment_name_before_prefix: true + threads_per_core: 2 + machine_type: c2-standard-16 + ``` + + on which you can follow the installation steps manually. Alternatively, the process + can be automated using the toolkit's + [startup-script](../../../modules/scripts/startup-script/README.md) module. + + Once that is completed, the software will persist on the NetApp Volumes share for as long as you + do not destroy the `base` stage. + +- **Installation from source/with package manager** + For open-source software, you may want to compile the software from scratch or use a + package manager such as Spack for the installation. This process typically takes + a non-negligible amount of time (~hours). We therefore strongly suggest using + the `software_installation` stage for this purpose. + + Please see the [HCLS Blueprint](../../../docs/videos/healthcare-and-life-sciences/README.md) example + for how the `software_installation` stage can use the Spack package manager + to install all dependencies for a particular version of the software, including compiling + the software or its dependencies from source. + + Please also see the [OpenFOAM](../../../docs/tutorials/openfoam/spack-openfoam.md) example + for how this can be used to install the OpenFOAM software. + + Once that is completed, the software will persist on the NetApp Volumes share for as long as you + do not destroy the `base` stage. diff --git a/community/examples/eda/eda-all-on-cloud.yaml b/community/examples/eda/eda-all-on-cloud.yaml new file mode 100644 index 0000000000..88c32fdd0e --- /dev/null +++ b/community/examples/eda/eda-all-on-cloud.yaml @@ -0,0 +1,235 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
+ +--- + +blueprint_name: eda-all-on-cloud + +vars: + project_id: ## Set GCP Project ID Here ## + deployment_name: eda-all-on-cloud + region: ## Set GCP Region Here ## + zone: ## Set GCP Zone Here ## + pool_service_level: "EXTREME" # Options: "STANDARD", "PREMIUM", "EXTREME" + rdma_net_range: 192.168.128.0/18 # Specify an unused CIDR range for the RDMA network + +deployment_groups: + +# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +# +# +# Deployment Group: Setup +# +# Sets up VPC network, persistent NFS shares +# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +- group: base + modules: + # Frontend network for GCE, NetApp Volumes and other services + - id: eda-net + source: modules/network/vpc + + # Backend RDMA network for GCE instances with RDMA capabilities + - id: eda-rdma-net + source: modules/network/vpc + settings: + network_name: $(vars.deployment_name)-rdma-net-0 + mtu: 8896 + network_profile: https://www.googleapis.com/compute/beta/projects/$(vars.project_id)/global/networkProfiles/$(vars.zone)-vpc-falcon + network_routing_mode: REGIONAL + enable_cloud_router: false + enable_cloud_nat: false + enable_internal_traffic: false + subnetworks: + - subnet_name: $(vars.deployment_name)-rdma-sub-0 + subnet_region: $(vars.region) + subnet_ip: $(vars.rdma_net_range) + region: $(vars.region) + + # PSA is required for Google Cloud NetApp Volumes. + # Private Service Access (PSA) requires the compute.networkAdmin role which is + # included in the Owner role, but not Editor. + # https://cloud.google.com/vpc/docs/configure-private-services-access#permissions + - id: private_service_access + source: community/modules/network/private-service-access + use: [eda-net] + settings: + prefix_length: 24 + service_name: "netapp.servicenetworking.goog" + deletion_policy: "ABANDON" + + # NetApp Storage Pool. All NetApp Volumes will be created in this pool. 
+ - id: netapp_pool + source: modules/file-system/netapp-storage-pool + use: [eda-net, private_service_access] + settings: + pool_name: "eda-pool" + capacity_gib: 4096 + service_level: $(vars.pool_service_level) + region: $(vars.region) + # allow_auto_tiering: true + + # NFS volume for shared tools and utilities + - id: toolsfs + source: modules/file-system/netapp-volume + use: [netapp_pool] + settings: + region: $(vars.region) + volume_name: "toolsfs" + capacity_gib: 1024 # Adjust size as needed + large_capacity: false + local_mount: "/tools" + protocols: ["NFSV3"] + unix_permissions: "0777" + # Mount options are optimized for aggressive caching, assuming rare changes on the volume + mount_options: "nocto,actimeo=600,hard,rsize=262144,wsize=262144,vers=3,tcp,mountproto=tcp" + + # NFS volume for shared libraries + - id: libraryfs + source: modules/file-system/netapp-volume + use: [netapp_pool] + settings: + region: $(vars.region) + volume_name: "libraryfs" + capacity_gib: 1024 # Adjust size as needed + large_capacity: false + local_mount: "/library" + protocols: ["NFSV3"] + unix_permissions: "0777" + # Mount options are optimized for aggressive caching, assuming rare changes on the volume + mount_options: "nocto,actimeo=600,hard,rsize=262144,wsize=262144,vers=3,tcp,mountproto=tcp" + + # NFS volume for home directories + - id: homefs + source: modules/file-system/netapp-volume + use: [netapp_pool] + settings: + region: $(vars.region) + volume_name: "homefs" + capacity_gib: 1024 # Adjust size as needed + large_capacity: false + local_mount: "/home" + protocols: ["NFSV3"] + unix_permissions: "0755" + + # NFS volume for scratch space + - id: scratchfs + source: modules/file-system/netapp-volume + use: [netapp_pool] + settings: + region: $(vars.region) + volume_name: "scratchfs" + capacity_gib: 1024 # Adjust size as needed + large_capacity: false + local_mount: "/scratch" + protocols: ["NFSV3"] + unix_permissions: "0777" + +# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +# +# +# Deployment Group: Software Installation +# +# This deployment group is a stub for installing software before +# bringing up the actual cluster. +# See the README.md for useful software deployment patterns. +# +# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +# - group: software_installation +# modules: + +# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +# +# +# Deployment Group: Cluster +# +# Provisions the actual EDA cluster with compute partitions, +# Connects to the previously set up NFS shares. 
+# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +- group: cluster + modules: + - id: h4d_startup + source: modules/scripts/startup-script + settings: + set_ofi_cloud_rdma_tunables: true + local_ssd_filesystem: + fs_type: ext4 + mountpoint: /mnt/lssd + permissions: "1777" + + - id: h4d_nodeset + source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset + use: + - h4d_startup + - eda-net + - homefs + - toolsfs + - libraryfs + - scratchfs + settings: + bandwidth_tier: gvnic_enabled + machine_type: h4d-highmem-192-lssd + node_count_static: 1 # Adjust as needed + node_count_dynamic_max: 0 # Adjust as needed + enable_placement: false + disk_type: hyperdisk-balanced + on_host_maintenance: TERMINATE + additional_networks: + $(concat( + [{ + network=null, + subnetwork=eda-rdma-net.subnetwork_self_link, + subnetwork_project=vars.project_id, + nic_type="IRDMA", + queue_count=null, + network_ip=null, + stack_type=null, + access_config=null, + ipv6_access_config=[], + alias_ip_range=[] + }] + )) + + - id: h4d_partition + source: community/modules/compute/schedmd-slurm-gcp-v6-partition + use: + - h4d_nodeset + settings: + exclusive: false + partition_name: h4d + is_default: true + partition_conf: + ResumeTimeout: 900 + SuspendTimeout: 600 + + - id: slurm_login + source: community/modules/scheduler/schedmd-slurm-gcp-v6-login + use: [eda-net] + settings: + machine_type: n2-standard-4 + enable_login_public_ips: true + + - id: slurm_controller + source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller + use: + - eda-net + - h4d_partition + - slurm_login + - homefs + - toolsfs + - libraryfs + - scratchfs + settings: + enable_controller_public_ips: true + cloud_parameters: + slurmd_timeout: 900 diff --git a/community/examples/eda/eda-hybrid-cloud.yaml b/community/examples/eda/eda-hybrid-cloud.yaml new file mode 100644 index 0000000000..7e57788155 --- /dev/null +++ b/community/examples/eda/eda-hybrid-cloud.yaml @@ -0,0 +1,232 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +--- + +blueprint_name: eda-hybrid-cloud + +vars: + project_id: ## Set GCP Project ID Here ## + deployment_name: eda-hybrid-cloud + region: ## Set GCP Region Here ## + zone: ## Set GCP Zone Here ## + network: default ## Set name of existing VPC network here + rdma_net_range: 192.168.128.0/18 # Specify an unused CIDR range for the RDMA network + +deployment_groups: + +# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +# +# +# Deployment Group: Setup +# +# Sets up VPC network, persistent NFS shares +# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +- group: base + modules: + # Frontend network for GCE, NetApp Volumes and other services. Make sure it has internet access. 
+ - id: eda-net + source: modules/network/pre-existing-vpc + settings: + project_id: $(vars.project_id) + region: $(vars.region) + network_name: $(vars.network) + + - id: firewall-rule-frontend + source: modules/network/firewall-rules + use: + - eda-net + settings: + ingress_rules: + - name: $(vars.deployment_name)-allow-internal-traffic + description: Allow internal traffic + destination_ranges: + - $(eda-net.subnetwork_address) + source_ranges: + - $(eda-net.subnetwork_address) + allow: + - protocol: tcp + ports: + - 0-65535 + - protocol: udp + ports: + - 0-65535 + - protocol: icmp + - name: $(vars.deployment_name)-allow-iap-ssh + description: Allow IAP-tunneled SSH connections + destination_ranges: + - $(eda-net.subnetwork_address) + source_ranges: + - 35.235.240.0/20 + allow: + - protocol: tcp + ports: + - 22 + + # Backend RDMA network for GCE instances with RDMA capabilities + - id: eda-rdma-net + source: modules/network/vpc + settings: + network_name: $(vars.deployment_name)-rdma-net-0 + mtu: 8896 + network_profile: https://www.googleapis.com/compute/beta/projects/$(vars.project_id)/global/networkProfiles/$(vars.zone)-vpc-falcon + network_routing_mode: REGIONAL + enable_cloud_router: false + enable_cloud_nat: false + enable_internal_traffic: false + subnetworks: + - subnet_name: $(vars.deployment_name)-rdma-sub-0 + subnet_region: $(vars.region) + subnet_ip: $(vars.rdma_net_range) + region: $(vars.region) + +# Connect existing Google Cloud NetApp Volumes +# Replace server_ip, remote_mount, and local_mount values as needed for toolsfs, libraryfs, homefs, scratchfs +# Make sure the root inode of each volume has appropriate permissions for intended users, otherwise SLURM jobs may fail + - id: toolsfs + source: modules/file-system/pre-existing-network-storage + settings: + server_ip: # Set IP address of the NFS server here + remote_mount: # Set exported path of NFS share here + local_mount: /tools + fs_type: nfs + # Mount options are optimized for aggressive caching, assuming rare changes on the volume + mount_options: "nocto,actimeo=600,hard,rsize=262144,wsize=262144,vers=3,tcp,mountproto=tcp" + + - id: libraryfs + source: modules/file-system/pre-existing-network-storage + settings: + server_ip: # Set IP address of the NFS server here + remote_mount: # Set exported path of NFS share here + local_mount: /library + fs_type: nfs + # Mount options are optimized for aggressive caching, assuming rare changes on the volume + mount_options: "nocto,actimeo=600,hard,rsize=262144,wsize=262144,vers=3,tcp,mountproto=tcp" + + - id: homefs + source: modules/file-system/pre-existing-network-storage + settings: + server_ip: # Set IP address of the NFS server here + remote_mount: # Set exported path of NFS share here + local_mount: /home + fs_type: nfs + mount_options: "hard,rsize=262144,wsize=262144,vers=3,tcp,mountproto=tcp" + + - id: scratchfs + source: modules/file-system/pre-existing-network-storage + settings: + server_ip: # Set IP address of the NFS server here + remote_mount: # Set exported path of NFS share here + local_mount: /scratch + fs_type: nfs + mount_options: "hard,rsize=262144,wsize=262144,vers=3,tcp,mountproto=tcp" + +# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +# +# +# Deployment Group: Software Installation +# +# This deployment group is a stub for installing software before +# bringing up the actual cluster. +# See the README.md for useful software deployment patterns. 
+# +# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +# - group: software_installation +# modules: + +# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +# +# +# Deployment Group: Cluster +# +# Provisions the actual EDA cluster with compute partitions, +# Connects to the previously set up NFS shares. +# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +- group: cluster + modules: + - id: h4d_startup + source: modules/scripts/startup-script + settings: + set_ofi_cloud_rdma_tunables: true + local_ssd_filesystem: + fs_type: ext4 + mountpoint: /mnt/lssd + permissions: "1777" + + - id: h4d_nodeset + source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset + use: + - h4d_startup + - eda-net + - homefs + - toolsfs + - libraryfs + - scratchfs + settings: + bandwidth_tier: gvnic_enabled + machine_type: h4d-highmem-192-lssd + node_count_static: 1 # Adjust as needed + node_count_dynamic_max: 0 # Adjust as needed + enable_placement: false + disk_type: hyperdisk-balanced + on_host_maintenance: TERMINATE + additional_networks: + $(concat( + [{ + network=null, + subnetwork=eda-rdma-net.subnetwork_self_link, + subnetwork_project=vars.project_id, + nic_type="IRDMA", + queue_count=null, + network_ip=null, + stack_type=null, + access_config=null, + ipv6_access_config=[], + alias_ip_range=[] + }] + )) + + - id: h4d_partition + source: community/modules/compute/schedmd-slurm-gcp-v6-partition + use: + - h4d_nodeset + settings: + exclusive: false + partition_name: h4d + is_default: true + partition_conf: + ResumeTimeout: 900 + SuspendTimeout: 600 + + - id: slurm_login + source: community/modules/scheduler/schedmd-slurm-gcp-v6-login + use: [eda-net] + settings: + machine_type: n2-standard-4 + enable_login_public_ips: true + + - id: slurm_controller + source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller + use: + - eda-net + - h4d_partition + - slurm_login + - homefs + - toolsfs + - libraryfs + - scratchfs + settings: + enable_controller_public_ips: true + cloud_parameters: + slurmd_timeout: 900 diff --git a/docs/network_storage.md b/docs/network_storage.md index b09cc70e2c..964e99c266 100644 --- a/docs/network_storage.md +++ b/docs/network_storage.md @@ -107,7 +107,7 @@ nfs-server | via USE | via USE | via USE | via STARTUP | via USE | via USE cloud-storage-bucket (GCS)| via USE | via USE | via USE | via STARTUP | via USE | via USE DDN EXAScaler lustre | via USE | via USE | via USE | Needs Testing | via USE | via USE Managed Lustre | via USE | Needs Testing | via USE | Needs Testing | Needs Testing | Needs Testing -netapp-volume | Needs Testing | Needs Testing | via USE | Needs Testing | Needs Testing | Needs Testing +netapp-volume | via USE | Needs Testing | via USE | Needs Testing | Needs Testing | Needs Testing |  |   |   |   |   |   filestore (pre-existing) | via USE | via USE | via USE | via STARTUP | via USE | via USE nfs-server (pre-existing) | via USE | via USE | via USE | via STARTUP | via USE | via USE diff --git a/examples/README.md b/examples/README.md index 5cb3933c51..1175386b70 100644 --- a/examples/README.md +++ b/examples/README.md @@ -66,6 +66,8 @@ md_toc github examples/README.md | sed -e "s/\s-\s/ * /" * [gke-g4](#gke-g4-) ![core-badge] * [netapp-volumes.yaml](#netapp-volumesyaml--) ![core-badge] * [gke-tpu-7x](#gke-tpu-7x-) ![core-badge] + * [eda-all-on-cloud.yaml](#eda-all-on-cloudyaml-) 
![community-badge] + * [eda-hybrid-cloud.yaml](#eda-hybrid-cloudyaml-) ![community-badge] * [Blueprint Schema](#blueprint-schema) * [Writing an HPC Blueprint](#writing-an-hpc-blueprint) * [Blueprint Boilerplate](#blueprint-boilerplate) @@ -1705,6 +1707,24 @@ This example shows how TPU 7x cluster can be created and be used to run a job th [gke-tpu-7x]: ../examples/gke-tpu-7x +### [eda-all-on-cloud.yaml] ![community-badge] + +Creates a basic auto-scaling Slurm cluster intended for EDA use cases. The blueprint also creates two new VPC networks: `eda-net`, which connects VMs, Slurm and storage, and an RDMA network called `eda-rdma-net` between the H4D nodes. In addition, it creates four [Google Cloud NetApp Volumes](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/overview) volumes mounted at `/home`, `/tools`, `/library` and `/scratch`. There is an `h4d` partition that uses the compute-optimized `h4d-highmem-192-lssd` machine type. + +The deployment instructions can be found in the [README](../community/examples/eda/README.md). + +[eda-all-on-cloud.yaml]: ../community/examples/eda/eda-all-on-cloud.yaml + +### [eda-hybrid-cloud.yaml] ![community-badge] + +Creates a basic auto-scaling Slurm cluster intended for EDA use cases. The blueprint connects to an existing user network, which connects VMs, Slurm and storage, and creates an RDMA network called `eda-rdma-net` for low-latency communication between the compute nodes. There is an `h4d` partition that uses the compute-optimized `h4d-highmem-192-lssd` machine type. + +Four pre-existing NFS volumes are mounted at `/home`, `/tools`, `/library` and `/scratch`. Using [FlexCache](https://cloud.google.com/netapp/volumes/docs/configure-and-use/volumes/cache-ontap-volumes/overview) volumes allows you to bring on-premises data to Google Cloud compute without having to manually copy the data. This enables "burst to the cloud" use cases. + +The deployment instructions can be found in the [README](../community/examples/eda/README.md). + +[eda-hybrid-cloud.yaml]: ../community/examples/eda/eda-hybrid-cloud.yaml + ## Blueprint Schema Similar documentation can be found on diff --git a/modules/README.md b/modules/README.md index c9849d51ef..0ead152de0 100644 --- a/modules/README.md +++ b/modules/README.md @@ -90,6 +90,8 @@ Modules that are still in development and less stable are labeled with the * **[filestore]** ![core-badge] : Creates a [filestore](https://cloud.google.com/filestore) file system. +* **[netapp-volume]** ![core-badge] : Creates a + [netapp-volume](https://docs.cloud.google.com/netapp/volumes/docs/discover/overview) file system. * **[parallelstore]** ![core-badge] ![experimental-badge]: Creates a [parallelstore](https://cloud.google.com/parallelstore) file system. * **[pre-existing-network-storage]** ![core-badge] : Specifies a @@ -109,6 +111,7 @@ Modules that are still in development and less stable are labeled with the and mounts [WEKA](https://www.weka.io/) filesystems. [filestore]: file-system/filestore/README.md +[netapp-volume]: file-system/netapp-volume/README.md [parallelstore]: file-system/parallelstore/README.md [pre-existing-network-storage]: file-system/pre-existing-network-storage/README.md [managed-lustre]: file-system/managed-lustre/README.md