Create a new community scheduler module for Slinky (Slurm on Kubernetes) #3862
Conversation
…s), which follows the standard Helm-based installation mechanism
…pts, and custom partition configs
…ccelerate installations (based on exponential backoffs and dependency timing) and reduce the risk of exceeding context deadlines
…al zone-based volume node affinity conflicts (or the need to use custom regional storage classes)
…nd Slurm nodesets, to improve cluster efficiency and control
…ner dependency management (i.e., avoid potential issues with running-node-dependent namespace finalizers) and streamlined provider configurations
…values and a Kube Prometheus Stack installation (both of which are default/recommended in the Slinky quickstart)
…the HPC Slinky example, for scraping extensive Slurm metrics into Cloud Monitoring (the DIY Kube Prometheus Stack alternative is by-default disabled in the example)
…ts, customization, and usage
…ep) to the module variables, as only the Slurm Exporter needs a small change/override (the default image does not exist), and make a prerequisite shift from Terraform's shallow merge() to Helm's deep values merge
@ighosh98 This is ready for review. Keep me posted here (or via internal chat messages) on any additional questions/concerns/steps.
Could you please add an integration test for this module in the PR?
…proved security (HPC Slinky blueprint example)
…duce initial node pool requirements and associated provisioning times
… specification to a separate file in the HPC Slinky example
…y example, to follow conventions
The integration test won't run with the current definition. I've added an explanation of how to set it up; we can connect if needed.
The rest of the changes look good to me.
…etup for base-integration-test.yml
…simple srun job successfully executes on the provisioned Slinky cluster
Overall looks good. I have a few minor comments below. However, when testing I'm running into several issues / questions:
- Why aren't we creating a login service like the quickstart (https://github.com/SlinkyProject/slurm-operator/blob/main/docs/quickstart.md)?
- I'm unable to launch a job on the h3 partition that actually spins up, even though I see an h3 node in the GKE nodepool.
- When I run sbatch --wrap "hostname", I don't see output logfiles in the expected location in /tmp/slurm-{job id}.out/err.
…r it needs to be added via --vars or inline
Co-authored-by: Sam Skillman <[email protected]>
Thanks @samskillman! Will rework and retest a number of things here, based on Friday's Slinky v0.2.1 release (which FYI is also related to "Why aren't we creating a login service like the quickstart" and "Consider adding a note on how to connect to the cluster.", as the quickstart didn't have a login service in v0.2.0, and the documented path to "connect" was …
…ount (objectViewer→objectAdmin), in line with recent changes to GKE best practices in other blueprints
…odepool scaling structures
…ical specification for exploring multiple nodesets (debug AND h3) and multiple nodes (two per nodeset), while parameterizing these values for easier steady-state setup
…nodeset (minimal, but sufficient for multi-node testing/exploration)
…ed v0.2.0 Slurm Exporter bug workaround (fixed in v0.2.1)
@samskillman Reworked, retested, and ready for your review. Some high-level notes:
Tested your …
…cluster connection command
After some discussions on this, I'll wait until Slinky v0.3.0 is released, integrate the latest, and re-ping for review. There's some good stuff in v0.3.0 (especially a login node and RWX mount support) that will significantly improve usability, and it seems like the release will come soon.
Added a community Slinky (Slurm-on-Kubernetes) module and example.
This PR introduces a new community module and an accompanying example blueprint to enable the deployment of Slinky, a Slurm workload manager implementation on Kubernetes.
The new module handles the installation of necessary components (Cert Manager, Slinky Operator, and Slurm Cluster) via their respective Helm charts onto a target GKE cluster. It allows for customization through Helm values overrides and supports best practice node affinities for Slinky system components and nodepools.
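As a rough sketch of how the module is consumed from a Cluster Toolkit blueprint, the wiring might look like the following. Note that the Slinky module's source path and settings names shown here are placeholders, not the module's actual interface; the network and GKE cluster modules follow the pattern used by other GKE blueprints in the repo.

```yaml
# Illustrative only: the slinky module source path and its settings are
# placeholders, not the final interface.
blueprint_name: slinky-gke-example

vars:
  project_id: my-project        # placeholder
  deployment_name: slinky-demo
  region: us-central1
  zone: us-central1-a

deployment_groups:
- group: primary
  modules:
  - id: network
    source: modules/network/vpc

  - id: gke_cluster
    source: modules/scheduler/gke-cluster
    use: [network]

  - id: slinky
    source: community/modules/scheduler/slinky   # placeholder path
    use: [gke_cluster]
    settings:
      # Hypothetical setting: extra Helm values passed to the Slurm cluster
      # chart as an additional YAML document (deep-merged by Helm).
      slurm_helm_values_overrides: |
        # nested chart value overrides go here
```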
The example blueprint demonstrates how to deploy a Slinky cluster with both a debug node pool and an H3 HPC node pool. It also includes configuration for monitoring the Slurm cluster using Google Managed Prometheus (GMP) via a PodMonitoring resource.
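The GMP scraping piece is a PodMonitoring resource of roughly the following shape. The name, namespace, label selector, and port name here are illustrative; the actual values depend on how the chart labels the Slurm exporter pods.

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: slurm-exporter          # illustrative name
  namespace: slurm              # illustrative namespace
spec:
  selector:
    matchLabels:
      # illustrative label; match whatever labels the exporter pods carry
      app.kubernetes.io/name: slurm-exporter
  endpoints:
  - port: metrics               # illustrative metrics port name
    interval: 30s
```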
Design notes:
- concat() is used for Helm values (rather than Terraform's shallow merge()), given nested object type safety constraints; see the sketch below.

The new module and example have been manually tested, pretty extensively, using Terraform v1.11.3 and Packer v1.12.0.
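To make the deep-merge point in the design note concrete, here is a hypothetical pair of values documents; the keys are illustrative, not the actual chart values. Helm merges values documents deeply, while a Terraform merge() of the corresponding maps would be shallow.

```yaml
# defaults.yaml -- illustrative keys only
slurm-exporter:
  enabled: true
  image:
    repository: example.org/slurm-exporter
    tag: "0.2.0"

---
# override.yaml -- only the nested tag changes
slurm-exporter:
  image:
    tag: "0.2.1"

# Helm deep-merges these documents, so `enabled` and `image.repository` from
# defaults.yaml are preserved. Terraform's merge() is shallow: the override's
# top-level `slurm-exporter` map would replace the default map wholesale,
# silently dropping `enabled` and `image.repository`.
```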
CC @samskillman as FYI, given additional context
Submission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.