
Create a new community scheduler module for Slinky (Slurm on Kubernetes) #3862


Draft · wants to merge 39 commits into develop
Conversation


@ndebuhr ndebuhr commented Mar 31, 2025

Added a community Slinky (Slurm-on-Kubernetes) module and example.

This PR introduces a new community module and an accompanying example blueprint to enable the deployment of Slinky, a Slurm workload manager implementation on Kubernetes.

The new module handles the installation of the necessary components (Cert Manager, the Slinky Operator, and a Slurm Cluster) via their respective Helm charts onto a target GKE cluster. It allows customization through Helm values overrides and supports best-practice node affinities for Slinky system components and node pools.

The example blueprint demonstrates how to deploy a Slinky cluster with both a debug node pool and an H3 HPC node pool. It also includes configuration for monitoring the Slurm cluster with Google Managed Prometheus (GMP) via a PodMonitoring resource.
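For readers unfamiliar with the pattern, the installation chain described above boils down to three ordered Helm releases. Below is a minimal Terraform sketch of that chain, not the module's actual code: the release names, namespaces, and the Slinky OCI chart location are illustrative assumptions.

```hcl
# Illustrative sketch of the three-step Helm install chain (cert-manager ->
# Slinky operator -> Slurm cluster). Names, namespaces, and chart locations
# are assumptions, not the module's actual resources.

resource "helm_release" "cert_manager" {
  name             = "cert-manager"
  repository       = "https://charts.jetstack.io" # upstream cert-manager chart repo
  chart            = "cert-manager"
  namespace        = "cert-manager"
  create_namespace = true
}

resource "helm_release" "slurm_operator" {
  name             = "slurm-operator"
  repository       = "oci://ghcr.io/slinkyproject/charts" # assumed OCI chart location
  chart            = "slurm-operator"
  namespace        = "slinky"
  create_namespace = true

  # The operator's webhooks rely on cert-manager-issued certificates.
  depends_on = [helm_release.cert_manager]
}

resource "helm_release" "slurm" {
  name             = "slurm"
  repository       = "oci://ghcr.io/slinkyproject/charts" # assumed OCI chart location
  chart            = "slurm"
  namespace        = "slurm"
  create_namespace = true

  # The Slurm cluster chart needs the operator's CRDs and controller in place.
  depends_on = [helm_release.slurm_operator]
}
```

In the actual module, each release would additionally carry version pins, values overrides, and the node affinity settings mentioned above.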

Design notes:

  1. The Helm-based approach closely follows the documented "quickstart" Slinky installation.
  2. A lightweight, GCP-native metrics/monitoring system is adopted by default, rather than the Slinky-documented cluster-local Kube Prometheus Stack.
  3. As of 2025/04/03, the quickstart documentation suggests a v0.1.0 installation, but v0.2.0 is implemented as the module default given its improved stability.
  4. The example adopts a zonal implementation, which is common to avoid inter-zonal networking charges and to minimize latency. This includes skipping the GKE Cluster "system node pool", to avoid potential zone-based volume node affinity conflicts (or the need to use custom regional storage classes).
  5. The module is unopinionated on nodeset node affinities, given the need for compute fabric flexibility. However, a module dependency is used for the Slurm system component node affinities, given less need for flexibility and to ensure a node pool remains up during Helm uninstalls (gcluster destroys).
  6. Terraform's concat() is used for Helm values, given nested object type safety constraints (a brief sketch of this pattern follows this list).
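As a brief illustration of design note 6, the sketch below shows how concat() can combine default values documents with caller-supplied overrides before handing them to the Helm provider. The variable and local names are hypothetical, and the chart location is an assumption carried over from the sketch above.

```hcl
# Hypothetical sketch of merging Helm values with concat(). The Helm provider's
# `values` argument takes a list of YAML documents that Helm deep-merges in
# order, so appending overrides after the defaults lets callers win conflicts
# without Terraform needing a single, fully typed nested object.

variable "helm_values_overrides" {
  description = "Extra Helm values documents appended after the module defaults."
  type        = list(string)
  default     = []
}

locals {
  default_slurm_values = [
    yamlencode({
      # Illustrative default; real keys depend on the Slinky chart schema.
      controller = {
        replicas = 1
      }
    }),
  ]
}

resource "helm_release" "slurm" {
  name       = "slurm"
  repository = "oci://ghcr.io/slinkyproject/charts" # assumed OCI chart location
  chart      = "slurm"
  namespace  = "slurm"

  values = concat(local.default_slurm_values, var.helm_values_overrides)
}
```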

The new module and example have been manually tested fairly extensively using Terraform v1.11.3 and Packer v1.12.0.

CC @samskillman FYI, for additional context.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • ✅ Fork your PR branch from the Toolkit "develop" branch (not main)
  • ✅ Test all changes with pre-commit in a local branch
  • ✅ Confirm that "make tests" passes all tests
  • ✅ Add or modify unit tests to cover code changes
  • ✅ Ensure that unit test coverage remains above 80%
  • ✅ Update all applicable documentation
  • ✅ Follow Cluster Toolkit Contribution guidelines

…s), which follows the standard Helm-based installation mechanism
@ighosh98 ighosh98 self-requested a review April 1, 2025 18:54
ndebuhr added 12 commits April 2, 2025 02:19
…ccelerate installations (based on exponential backoffs and dependency timing) and reduce the risk of exceeding context deadlines
…al zone-based volume node affinity conflicts (or the need to use custom regional storage classes)
…nd Slurm nodesets, to improve cluster efficiency and control
…ner dependency management (i.e., avoid potential issues with running-node-dependent namespace finalizers) and streamlined provider configurations
…values and a Kube Prometheus Stack installation (both of which are default/recommended in the Slinky quickstart)
…the HPC Slinky example, for scraping extensive Slurm metrics into Cloud Monitoring (the DIY Kube Prometheus Stack alternative is by-default disabled in the example)
…ep) to the module variables, as only the Slurm Exporter needs a small change/override (the default image does not exist), and make a prerequisite shift from Terraform's shallow merge() to Helm's deep values merge

ndebuhr commented Apr 4, 2025

@ighosh98 This is ready for review. Keep me posted here (or via internal chat messages) on any additional questions/concerns/steps.

@ndebuhr ndebuhr marked this pull request as ready for review April 4, 2025 02:05
@ndebuhr ndebuhr requested review from samskillman and a team as code owners April 4, 2025 02:05

@ighosh98 ighosh98 left a comment


Could you please add an integration test for this module in the PR?


@ighosh98 ighosh98 left a comment


The integration test won't run in the current definition. Have added explanation on how to set it up. We can connect if needed.

Rest of the changes look good to me.

@ighosh98 ighosh98 self-requested a review April 16, 2025 13:15
ighosh98 previously approved these changes Apr 16, 2025

@samskillman samskillman left a comment


Overall looks good. I have a few minor comments below. However, when testing I'm running into several issues / questions:

  1. Why aren't we creating a login service like the quickstart (https://github.com/SlinkyProject/slurm-operator/blob/main/docs/quickstart.md)?

  2. I'm unable to launch a job on the h3 partition that actually spins up, even though I see an h3 node in the GKE node pool.

  3. When I run sbatch --wrap "hostname", I don't see output log files in the expected location, /tmp/slurm-{job id}.out/err.

…r it needs to be added via --vars or inline

Co-authored-by: Sam Skillman <[email protected]>

ndebuhr commented Apr 21, 2025

Thanks @samskillman! Will rework and retest a number of things here, based on Friday's Slinky v0.2.1 release (which, FYI, is also related to "Why aren't we creating a login service like the quickstart?" and "Consider adding a note on how to connect to the cluster," as the quickstart didn't have a login service in v0.2.0 and the documented path to "connect" was kubectl exec). Glad to see some improvements in v0.2.1 and happy to integrate. Will ping you when this is ready for another round of review.

ndebuhr added 7 commits April 21, 2025 19:42
…ount (objectViewer→objectAdmin), in line with recent changes to GKE best practices in other blueprints
…ical specification for exploring multiple nodesets (debug AND h3) and multiple nodes (two per nodeset), while parameterizing these values for easier steady-state setup
…nodeset (minimal, but sufficient for multi-node testing/exploration)
…ed v0.2.0 Slurm Exporter bug workaround (fixed in v0.2.1)

ndebuhr commented Apr 21, 2025

@samskillman Reworked, retested, and ready for your review. Some high-level notes:

  • v0.2.1 did indeed fix the v0.2.0 Slurm Exporter issue - removed the workaround.
  • v0.2.1 does not include the new login node and connection system. I'll keep an eye out and submit a PR as soon as v0.2.2 or v0.3.0 is released. Even if we wanted to (which we don't 🙂), building on top of Slinky main or some commit hash would be very messy, as we'd have to circumvent the officially published charts in the OCI chart repo.
  • The blueprint previously used a default h3 nodeset of 0 replicas - assuming folks would use the debug nodeset for development and then scale out the h3 nodeset as needed. However, as your issue with the h3 partition highlights, this was not intuitive or well documented. Both the debug nodeset and h3 nodeset have been updated to a default of 2 replicas (both can be overridden with blueprint variables), and the documentation is clarified/expanded.

Tested your sbatch --wrap "hostname" command on the debug and h3 partitions. With the aforementioned 0→2 h3 nodeset replica change, they both work fine - you just need to check /tmp on the nodeset replica that ran the job (e.g., via kubectl exec), since there is no RWX shared volume between the nodeset replicas and the controller pod.

@samskillman samskillman added the release-key-new-features Added to release notes under the "Key New Features" heading. label Apr 21, 2025
@samskillman samskillman self-assigned this Apr 21, 2025
@samskillman samskillman added the release-new-modules Added to release notes under the "New Modules" heading. label Apr 21, 2025
@ighosh98 ighosh98 self-requested a review April 25, 2025 06:11

ndebuhr commented Apr 25, 2025

After some discussions on this, I'll wait until Slinky v0.3.0 is released, integrate the latest, and re-ping for review. There's some good stuff in v0.3.0 (especially a login node and RWX mount support) that will significantly improve usability, and it seems like the release will come soon.

@ndebuhr ndebuhr marked this pull request as draft April 26, 2025 20:18