
Create a new community scheduler module for Slinky (Slurm on Kubernetes) #3862


Draft · wants to merge 39 commits into develop
Conversation


@ndebuhr ndebuhr commented Mar 31, 2025

Added a community Slinky (Slurm-on-Kubernetes) module and example.

This PR introduces a new community module and an accompanying example blueprint to enable the deployment of Slinky, a Slurm workload manager implementation on Kubernetes.

The new module handles the installation of the necessary components (Cert Manager, the Slinky Operator, and a Slurm Cluster) via their respective Helm charts onto a target GKE cluster. It allows customization through Helm values overrides and supports best-practice node affinities for Slinky system components and node pools.

The example blueprint demonstrates how to deploy a Slinky cluster with both a debug node pool and an H3 HPC node pool. It also includes configuration for monitoring the Slurm cluster with Google Managed Prometheus (GMP) via a PodMonitoring resource.
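For readers unfamiliar with the pattern, the installation chain described above boils down to three ordered Helm releases. Below is a minimal Terraform sketch of that chain, not the module's actual code: the release names, namespaces, and the Slinky OCI chart location are illustrative assumptions.

```hcl
# Illustrative sketch of the three-step Helm install chain (cert-manager ->
# Slinky operator -> Slurm cluster). Names, namespaces, and chart locations
# are assumptions, not the module's actual resources.

resource "helm_release" "cert_manager" {
  name             = "cert-manager"
  repository       = "https://charts.jetstack.io" # upstream cert-manager chart repo
  chart            = "cert-manager"
  namespace        = "cert-manager"
  create_namespace = true
}

resource "helm_release" "slurm_operator" {
  name             = "slurm-operator"
  repository       = "oci://ghcr.io/slinkyproject/charts" # assumed OCI chart location
  chart            = "slurm-operator"
  namespace        = "slinky"
  create_namespace = true

  # The operator's webhooks rely on cert-manager-issued certificates.
  depends_on = [helm_release.cert_manager]
}

resource "helm_release" "slurm" {
  name             = "slurm"
  repository       = "oci://ghcr.io/slinkyproject/charts" # assumed OCI chart location
  chart            = "slurm"
  namespace        = "slurm"
  create_namespace = true

  # The Slurm cluster chart needs the operator's CRDs and controller in place.
  depends_on = [helm_release.slurm_operator]
}
```

In the actual module, each release would additionally carry version pins, values overrides, and the node affinity settings mentioned above.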

Design notes:

  1. The Helm-based approach closely follows the documented "quickstart" Slinky installation.
  2. A lightweight, GCP-native metrics/monitoring system is adopted by default, rather than the Slinky-documented cluster-local Kube Prometheus Stack.
  3. As of 2025/04/03, the quickstart documentation suggests a v0.1.0 installation, but v0.2.0 is implemented as the module default given its improved stability.
  4. The example adopts a zonal implementation, which is common to avoid inter-zonal networking charges and to minimize latency. This includes skipping the GKE Cluster "system node pool", to avoid potential zone-based volume node affinity conflicts (or the need to use custom regional storage classes).
  5. The module is unopinionated on nodeset node affinities, given the need for compute fabric flexibility. However, a module dependency is used for the Slurm system component node affinities, given less need for flexibility and to ensure a node pool remains up during Helm uninstalls (gcluster destroys).
  6. Terraform's concat() is used for Helm values, given nested object type safety constraints (a brief sketch of this pattern follows this list).
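As a brief illustration of design note 6, the sketch below shows how concat() can combine default values documents with caller-supplied overrides before handing them to the Helm provider. The variable and local names are hypothetical, and the chart location is an assumption carried over from the sketch above.

```hcl
# Hypothetical sketch of merging Helm values with concat(). The Helm provider's
# `values` argument takes a list of YAML documents that Helm deep-merges in
# order, so appending overrides after the defaults lets callers win conflicts
# without Terraform needing a single, fully typed nested object.

variable "helm_values_overrides" {
  description = "Extra Helm values documents appended after the module defaults."
  type        = list(string)
  default     = []
}

locals {
  default_slurm_values = [
    yamlencode({
      # Illustrative default; real keys depend on the Slinky chart schema.
      controller = {
        replicas = 1
      }
    }),
  ]
}

resource "helm_release" "slurm" {
  name       = "slurm"
  repository = "oci://ghcr.io/slinkyproject/charts" # assumed OCI chart location
  chart      = "slurm"
  namespace  = "slurm"

  values = concat(local.default_slurm_values, var.helm_values_overrides)
}
```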

The new module and example have been manually tested fairly extensively using Terraform v1.11.3 and Packer v1.12.0.

CC @samskillman FYI, for additional context.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • ✅ Fork your PR branch from the Toolkit "develop" branch (not main)
  • ✅ Test all changes with pre-commit in a local branch
  • ✅ Confirm that "make tests" passes all tests
  • ✅ Add or modify unit tests to cover code changes
  • ✅ Ensure that unit test coverage remains above 80%
  • ✅ Update all applicable documentation
  • ✅ Follow Cluster Toolkit Contribution guidelines

…s), which follows the standard Helm-based installation mechanism
@ighosh98 ighosh98 self-requested a review April 1, 2025 18:54
ndebuhr added 12 commits April 2, 2025 02:19
…ccelerate installations (based on exponential backoffs and dependency timing) and reduce the risk of exceeding context deadlines
…al zone-based volume node affinity conflicts (or the need to use custom regional storage classes)
…nd Slurm nodesets, to improve cluster efficiency and control
…ner dependency management (i.e., avoid potential issues with running-node-dependent namespace finalizers) and streamlined provider configurations
…values and a Kube Prometheus Stack installation (both of which are default/recommended in the Slinky quickstart)
…the HPC Slinky example, for scraping extensive Slurm metrics into Cloud Monitoring (the DIY Kube Prometheus Stack alternative is by-default disabled in the example)
…ep) to the module variables, as only the Slurm Exporter needs a small change/override (the default image does not exist), and make a prerequisite shift from Terraform's shallow merge() to Helm's deep values merge

ndebuhr commented Apr 4, 2025

@ighosh98 This is ready for review. Keep me posted here (or via internal chat messages) on any additional questions/concerns/steps.

@ndebuhr ndebuhr marked this pull request as ready for review April 4, 2025 02:05
@ndebuhr ndebuhr requested review from samskillman and a team as code owners April 4, 2025 02:05

@ighosh98 ighosh98 left a comment


Could you please add an integration test for this module in the PR?


@ighosh98 ighosh98 left a comment


The integration test won't run in the current definition. Have added explanation on how to set it up. We can connect if needed.

Rest of the changes look good to me.

@ighosh98 ighosh98 self-requested a review April 16, 2025 13:15
ighosh98 previously approved these changes Apr 16, 2025

@samskillman samskillman left a comment


Overall looks good. I have a few minor comments below. However, when testing I'm running into several issues / questions:

  1. Why aren't we creating a login service like the quickstart (https://github.com/SlinkyProject/slurm-operator/blob/main/docs/quickstart.md)?

  2. I'm unable to launch a job on the h3 partition that actually spins up, even though I see an h3 node in the GKE node pool.

  3. When I run sbatch --wrap "hostname", I don't see output log files in the expected location, /tmp/slurm-{job id}.out/err.

…r it needs to be added via --vars or inline

Co-authored-by: Sam Skillman <[email protected]>

ndebuhr commented Apr 21, 2025

Thanks @samskillman! Will rework and retest a number of things here, based on Friday's Slinky v0.2.1 release (which, FYI, is also related to "Why aren't we creating a login service like the quickstart?" and "Consider adding a note on how to connect to the cluster," as the quickstart didn't have a login service in v0.2.0 and the documented path to "connect" was kubectl exec). Glad to see some improvements in v0.2.1 and happy to integrate. Will ping you when this is ready for another round of review.

ndebuhr added 7 commits April 21, 2025 19:42
…ount (objectViewer→objectAdmin), in line with recent changes to GKE best practices in other blueprints
…ical specification for exploring multiple nodesets (debug AND h3) and multiple nodes (two per nodeset), while parameterizing these values for easier steady-state setup
…nodeset (minimal, but sufficient for multi-node testing/exploration)
…ed v0.2.0 Slurm Exporter bug workaround (fixed in v0.2.1)

ndebuhr commented Apr 21, 2025

@samskillman Reworked, retested, and ready for your review. Some high-level notes:

  • v0.2.1 did indeed fix the v0.2.0 Slurm Exporter issue - removed the workaround.
  • v0.2.1 does not include the new login node and connection system. I'll keep an eye out and submit a PR as soon as v0.2.2 or v0.3.0 is released. Even if we wanted to (which we don't 🙂), building on top of Slinky main or some commit hash would be very messy, as we'd have to circumvent the officially published charts in the OCI chart repo.
  • The blueprint previously used a default h3 nodeset of 0 replicas - assuming folks would use the debug nodeset for development and then scale out the h3 nodeset as needed. However, as your issue with the h3 partition highlights, this was not intuitive or well documented. Both the debug nodeset and h3 nodeset have been updated to a default of 2 replicas (both can be overridden with blueprint variables), and the documentation is clarified/expanded.

Tested your sbatch --wrap "hostname" command on the debug and h3 partitions. With the aforementioned 0→2 h3 nodeset replica change, they both work fine - you just need to check /tmp on the nodeset replica that ran the job (e.g., via kubectl exec), since there is no RWX shared volume between the nodeset replicas and the controller pod.

@samskillman samskillman added the release-key-new-features Added to release notes under the "Key New Features" heading. label Apr 21, 2025
@samskillman samskillman self-assigned this Apr 21, 2025
@samskillman samskillman added the release-new-modules Added to release notes under the "New Modules" heading. label Apr 21, 2025
@ighosh98 ighosh98 self-requested a review April 25, 2025 06:11

ndebuhr commented Apr 25, 2025

After some discussions on this, I'll wait until Slinky v0.3.0 is released, integrate the latest, and re-ping for review. There's some good stuff in v0.3.0 (especially a login node and RWX mount support) that will significantly improve usability, and it seems like the release will come soon.

@ndebuhr ndebuhr marked this pull request as draft April 26, 2025 20:18