| 1 | +## Description |
| 2 | + |
| 3 | +This module creates a [Slinky](https://slinky.ai) cluster and its nodesets for a [Slurm](https://slurm.schedmd.com/documentation.html)-on-Kubernetes HPC setup. |
| 4 | + |
| 5 | +The setup closely follows the [documented quickstart installation](https://github.com/SlinkyProject/slurm-operator/blob/main/docs/quickstart.md), except that it uses a more lightweight monitoring/metrics setup. Consider scraping the Slurm Exporter with [Google Managed Prometheus](https://cloud.google.com/stackdriver/docs/managed-prometheus) and a [PodMonitoring resource](https://cloud.google.com/stackdriver/docs/managed-prometheus/setup-managed#gmp-pod-monitoring) rather than a cluster-local Kube Prometheus Stack. Either option can be configured through the module's inputs. |
| 6 | + |
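As a rough sketch of the Managed Prometheus option, a PodMonitoring resource could look like the one below. The resource name, the `app.kubernetes.io/name: slurm-exporter` label, and the `metrics` port name are assumptions for illustration only. Inspect the exporter pods in your cluster (for example with `kubectl get pods --namespace=slurm --show-labels`) and adjust the selector and port before applying it.

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: slurm-exporter          # hypothetical resource name
  namespace: slurm              # namespace of the Slurm release used in this README
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: slurm-exporter   # assumed label; verify on the exporter pod
  endpoints:
  - port: metrics               # assumed port name; verify on the exporter pod
    interval: 30s               # scrape interval, adjust as needed
```

This manifest is applied separately from the module (for example with `kubectl apply -f`), so it works with the default `install_kube_prometheus_stack=false`.
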
| 7 | +Use `cert_manager_values`, `prometheus_values`, `slurm_operator_values`, and `slurm_values` to customize the Helm releases that make up Slinky. The Cert Manager, Slurm Operator, and Slurm Helm releases are required; the Prometheus Helm chart is optional and not installed by default. Set `install_kube_prometheus_stack=true` to install Prometheus. |
| 8 | + |
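For illustration, a blueprint fragment that opts in to the cluster-local Kube Prometheus Stack might look like this. It is a minimal sketch: the module IDs `gke_cluster` and `base_pool` are assumed to be defined elsewhere in the blueprint, and the values shown simply restate the defaults listed in the Inputs table.

```yaml
- id: slinky
  source: community/modules/scheduler/slinky
  use: [gke_cluster, base_pool]   # assumed IDs of a gke-cluster and a gke-node-pool module
  settings:
    # Opt in to the optional Prometheus release (default: false)
    install_kube_prometheus_stack: true
    # Value overrides passed to the Prometheus Helm release;
    # this restates the documented default
    prometheus_values:
      installCRDs: true
    # Value overrides passed to the Cert Manager Helm release;
    # this restates the documented default
    cert_manager_values:
      crds:
        enabled: true
```
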
| 9 | +### Example |
| 10 | + |
| 11 | +```yaml |
| 12 | +- id: slinky |
| 13 | +  source: community/modules/scheduler/slinky |
| 14 | +  use: [gke_cluster, base_pool] |
| 15 | +  settings: |
| 16 | +    slurm_values: |
| 17 | +      compute: |
| 18 | +        nodesets: |
| 19 | +        - name: h3 |
| 20 | +          enabled: true |
| 21 | +          replicas: 2 |
| 22 | +          image: |
| 23 | +            # Use the default nodeset image |
| 24 | +            repository: "" |
| 25 | +            tag: "" |
| 26 | +          resources: |
| 27 | +            requests: |
| 28 | +              cpu: 86 |
| 29 | +              memory: 324Gi |
| 30 | +            limits: |
| 31 | +              cpu: 86 |
| 32 | +              memory: 324Gi |
| 33 | +          affinity: |
| 34 | +            nodeAffinity: |
| 35 | +              requiredDuringSchedulingIgnoredDuringExecution: |
| 36 | +                nodeSelectorTerms: |
| 37 | +                - matchExpressions: |
| 38 | +                  - key: "node.kubernetes.io/instance-type" |
| 39 | +                    operator: In |
| 40 | +                    values: |
| 41 | +                    - h3-standard-88 |
| 42 | +          partition: |
| 43 | +            enabled: true |
| 44 | +``` |
| 45 | + |
| 46 | +This creates a Slinky cluster with the following attributes: |
| 47 | + |
| 48 | +* Slinky Helm releases are installed on the `gke_cluster` cluster (from the `gke-cluster` module). |
| 49 | +* Slinky system components are scheduled on the `base_pool` node pool (from the `gke-node-pool` module). |
| 50 | +  * This node affinity is recommended for two reasons: it reserves HPC hardware for HPC nodesets, and it ensures Helm releases are fully uninstalled before all node pools are deleted during a `gcluster destroy`. |
| 51 | +* One Slurm nodeset is provisioned, with resource requests/limits and node affinities aligned to `h3-standard-88` VMs. |
| 52 | + |
| 53 | +### Usage |
| 54 | + |
| 55 | +To test Slurm functionality, fetch credentials for the cluster, open a shell on the controller, and use Slurm client commands: |
| 56 | + |
| 57 | +```bash |
| 58 | +gcloud container clusters get-credentials YOUR_CLUSTER --region YOUR_REGION |
| 59 | +``` |
| 60 | + |
| 61 | +```bash |
| 62 | +kubectl exec -it statefulsets/slurm-controller \ |
| 63 | + --namespace=slurm \ |
| 64 | + -- bash --login |
| 65 | +``` |
| 66 | + |
| 67 | +On the controller pod (e.g. at the prompt `slurm@slurm-controller-0`), run the following commands to quickly check that Slurm is functioning: |
| 68 | + |
| 69 | +```bash |
| 70 | +sinfo |
| 71 | +srun hostname |
| 72 | +sbatch --wrap="sleep 60" |
| 73 | +squeue |
| 74 | +``` |
| 75 | + |
| 76 | +<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK --> |
| 77 | +## Requirements |
| 78 | + |
| 79 | +| Name | Version | |
| 80 | +|------|---------| |
| 81 | +| <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >= 1.3 | |
| 82 | +| <a name="requirement_google"></a> [google](#requirement\_google) | >= 6.16 | |
| 83 | +| <a name="requirement_helm"></a> [helm](#requirement\_helm) | ~> 2.17 | |
| 84 | + |
| 85 | +## Providers |
| 86 | + |
| 87 | +| Name | Version | |
| 88 | +|------|---------| |
| 89 | +| <a name="provider_google"></a> [google](#provider\_google) | >= 6.16 | |
| 90 | +| <a name="provider_helm"></a> [helm](#provider\_helm) | ~> 2.17 | |
| 91 | + |
| 92 | +## Modules |
| 93 | + |
| 94 | +No modules. |
| 95 | + |
| 96 | +## Resources |
| 97 | + |
| 98 | +| Name | Type | |
| 99 | +|------|------| |
| 100 | +| [helm_release.cert_manager](https://registry.terraform.io/providers/hashicorp/helm/latest/docs/resources/release) | resource | |
| 101 | +| [helm_release.prometheus](https://registry.terraform.io/providers/hashicorp/helm/latest/docs/resources/release) | resource | |
| 102 | +| [helm_release.slurm](https://registry.terraform.io/providers/hashicorp/helm/latest/docs/resources/release) | resource | |
| 103 | +| [helm_release.slurm_operator](https://registry.terraform.io/providers/hashicorp/helm/latest/docs/resources/release) | resource | |
| 104 | +| [google_client_config.default](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/client_config) | data source | |
| 105 | +| [google_container_cluster.gke_cluster](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/container_cluster) | data source | |
| 106 | + |
| 107 | +## Inputs |
| 108 | + |
| 109 | +| Name | Description | Type | Default | Required | |
| 110 | +|------|-------------|------|---------|:--------:| |
| 111 | +| <a name="input_cert_manager_chart_version"></a> [cert\_manager\_chart\_version](#input\_cert\_manager\_chart\_version) | Version of the Cert Manager chart to install. | `string` | `"v1.17.1"` | no | |
| 112 | +| <a name="input_cert_manager_values"></a> [cert\_manager\_values](#input\_cert\_manager\_values) | Value overrides for the Cert Manager release | `any` | <pre>{<br/> "crds": {<br/> "enabled": true<br/> }<br/>}</pre> | no | |
| 113 | +| <a name="input_cluster_id"></a> [cluster\_id](#input\_cluster\_id) | An identifier for the GKE cluster resource with format projects/<project\_id>/locations/<region>/clusters/<name>. | `string` | n/a | yes | |
| 114 | +| <a name="input_install_kube_prometheus_stack"></a> [install\_kube\_prometheus\_stack](#input\_install\_kube\_prometheus\_stack) | Install the Kube Prometheus Stack. | `bool` | `false` | no | |
| 115 | +| <a name="input_node_pool_names"></a> [node\_pool\_names](#input\_node\_pool\_names) | Names of node pools, for use in node affinities (Slinky system components). | `list(string)` | `null` | no | |
| 116 | +| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | The project ID that hosts the GKE cluster. | `string` | n/a | yes | |
| 117 | +| <a name="input_prometheus_chart_version"></a> [prometheus\_chart\_version](#input\_prometheus\_chart\_version) | Version of the Kube Prometheus Stack chart to install. | `string` | `"70.4.1"` | no | |
| 118 | +| <a name="input_prometheus_values"></a> [prometheus\_values](#input\_prometheus\_values) | Value overrides for the Prometheus release | `any` | <pre>{<br/> "installCRDs": true<br/>}</pre> | no | |
| 119 | +| <a name="input_slurm_chart_version"></a> [slurm\_chart\_version](#input\_slurm\_chart\_version) | Version of the Slurm chart to install. | `string` | `"0.2.1"` | no | |
| 120 | +| <a name="input_slurm_operator_chart_version"></a> [slurm\_operator\_chart\_version](#input\_slurm\_operator\_chart\_version) | Version of the Slurm Operator chart to install. | `string` | `"0.2.1"` | no | |
| 121 | +| <a name="input_slurm_operator_values"></a> [slurm\_operator\_values](#input\_slurm\_operator\_values) | Value overrides for the Slinky release | `any` | `{}` | no | |
| 122 | +| <a name="input_slurm_values"></a> [slurm\_values](#input\_slurm\_values) | Value overrides for the Slurm release | `any` | `{}` | no | |
| 123 | + |
| 124 | +## Outputs |
| 125 | + |
| 126 | +| Name | Description | |
| 127 | +|------|-------------| |
| 128 | +| <a name="output_instructions"></a> [instructions](#output\_instructions) | Post deployment instructions. | |
| 129 | +<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK --> |