The slurm-cloud-integration project contains Dockerfiles, configuration files, and deployment content designed to enable the prototyping and delivery of capabilities that integrate the Kubernetes and Slurm HPC ecosystems.
The combination of the slurm-jupyter-docker and slurm-single-node Dockerfiles is based upon the excellent work by Rodrigo Ancavil.
The slurm-single-node Dockerfile delivers an image that enables integration testing with a full Slurm stack with one worker (slurmd) node. This Dockerfile is based upon this excellent example written by Lennart Landsmeer.
The slurm-single-node Docker image is built from the project root directory as follows:
docker build -f src/docker/slurm-single-node -t hokiegeek2/slurm-single-node:$VERSION .
To simply run the slurm-single-node docker container, execute the following command:
docker run -it --rm --network=host hokiegeek2/slurm-single-node
To perform integration testing with applications outside of the slurm-single-node container, the munge.key used by the external app must be mounted into the docker container. To mount a munge.key and start the slurm-single-node docker container, execute the following command:
docker run -it --rm --network=host -v $PWD/munge.key:/tmp/munge.key hokiegeek2/slurm-single-node
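When the host path given to `-v` does not exist, Docker creates an empty directory in its place, so a missing munge.key can go unnoticed until munge fails inside the container. A minimal pre-flight sketch, assuming the key sits in the current directory; the `require_munge_key` helper is hypothetical and not part of this project:

```shell
# Hypothetical helper (not part of this project): succeed only if the
# given path is a regular, non-empty file.
require_munge_key() {
  key_path="$1"
  if [ ! -f "$key_path" ]; then
    echo "munge.key not found at $key_path" >&2
    return 1
  fi
  if [ ! -s "$key_path" ]; then
    echo "munge.key at $key_path is empty" >&2
    return 1
  fi
  return 0
}

# Example: only start the container when the key is present.
# require_munge_key "$PWD/munge.key" && \
#   docker run -it --rm --network=host \
#     -v "$PWD/munge.key":/tmp/munge.key hokiegeek2/slurm-single-node
```

Running the check first turns a confusing in-container munge failure into an immediate, readable error on the host.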
Successful startup of slurm-single-node looks like this:
The slurm-jupyter-docker Dockerfile and slurm-jupyter Helm chart enable deployment of the awesome NERSC jupyterlab-slurm application to Kubernetes.
The slurm-jupyter Docker image is built from the project root directory as follows:
docker build -f src/docker/slurm-jupyter-docker -t hokiegeek2/slurm-jupyter:$VERSION .
The command sequence to start slurm-jupyterlab is contained within the start-slurm-jupyter.sh file and is as follows:
#!/bin/bash
# copy munge.key, set ownership and permissions, and move to config dir
sudo cp /tmp/munge/munge.key /tmp/munge.key
sudo mv /tmp/munge.key /etc/munge/munge.key
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
# start munge authorization service
sudo service munge start
jupyter lab --no-browser --allow-root --ip=0.0.0.0 --NotebookApp.token='' \
--NotebookApp.password=''
tail -f /dev/null
Note the munge.key handling section, which is required to handle the munge.key passed in at container startup. Specifically, the munge.key file must be owned by the munge user and the permissions must be 400.
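The ownership and permission requirements can be checked before starting munge. The following is a sketch only: `check_munge_key` is a hypothetical helper, and it assumes GNU `stat` (the `-c` option), as found on most Linux images:

```shell
# Hypothetical helper (not part of start-slurm-jupyter.sh): verify that a
# key file has mode 400 and the expected owner.
# Usage: check_munge_key <path> [expected_owner]   (owner defaults to munge)
check_munge_key() {
  key_path="$1"
  expected_owner="${2:-munge}"
  # GNU stat: %a prints the octal mode, %U prints the owner's user name
  mode=$(stat -c '%a' "$key_path") || return 1
  owner=$(stat -c '%U' "$key_path") || return 1
  if [ "$mode" != "400" ]; then
    echo "expected mode 400 on $key_path, found $mode" >&2
    return 1
  fi
  if [ "$owner" != "$expected_owner" ]; then
    echo "expected owner $expected_owner on $key_path, found $owner" >&2
    return 1
  fi
  return 0
}

# Example, run inside the container after the chown/chmod steps:
# check_munge_key /etc/munge/munge.key && sudo service munge start
```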
The munge.key configured for slurmctld needs to be added as a secret, which is accomplished as follows:
# Add secret encapsulating munge.key
kubectl create secret generic slurm-munge-key --from-file=/tmp/munge.key -n slurm-integration
# Confirm secret was created
kubectl get secret -n slurm-integration
NAME TYPE DATA AGE
slurm-munge-key Opaque 1 18d
Importantly, as with the slurmd workers, the munge.key MUST be the same munge.key used by the munge service running on the slurmctld node.
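One way to confirm that two copies of the key are identical is a byte-for-byte comparison. This is a sketch: `same_munge_key` is a hypothetical helper, and the commented example assumes kubectl access to the slurm-integration namespace and that the secret stores the key under the name `munge.key` (the default when it is created with `--from-file` from a file of that name):

```shell
# Hypothetical helper: succeed only if both key files are byte-for-byte
# identical. cmp -s is silent and returns non-zero on any difference or error.
same_munge_key() {
  cmp -s "$1" "$2"
}

# Example: extract the key stored in the Kubernetes secret and compare it
# with the key used by slurmctld (paths here are illustrative).
# kubectl get secret slurm-munge-key -n slurm-integration \
#   -o jsonpath='{.data.munge\.key}' | base64 -d > /tmp/secret-munge.key
# same_munge_key /tmp/secret-munge.key /tmp/munge.key && echo "keys match"
```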
Deploying slurm-jupyterlab is done via the slurm-jupyter Docker image and the slurm-jupyter Helm chart.
The helm command is executed as follows from the project root directory:
helm install -n slurm-integration slurm-jupyter-server deployment/charts/slurm-jupyter/
In addition to the helm chart artifacts, the slurm-jupyterhub k8s deployment requires the same munge.key used in the slurm cluster that the slurm-jupyterlab will connect to. The munge.key is used to create a Kubernetes secret that is mounted in the pod. The kubectl command is as follows:
kubectl create secret generic slurm-munge-key --from-file=munge.key -n slurm-integration
The configuration logic for loading the k8s munge.key secret is in the slurm-jupyter Helm template.
Successful deployment of slurm-jupyterlab looks like this:
Confirm connectivity to slurm via the following commands:
# generic cluster info including slurmd node names
sinfo
# specific info and statuses for each slurmd node
scontrol show nodes
Integration testing of slurm-jupyterlab on k8s with slurm-single-node involves running the slurm-single-node Docker image. The docker run command is as follows:
docker run -it --rm --network=host -v $PWD/munge.key:/tmp/munge.key hokiegeek2/slurm-single-node:$VERSION
Passing the munge.key into the Docker container is an extremely important detail. Whether Slurm runs in the slurm-single-node docker container or on a bare-metal cluster, its munge.key must be the same munge.key used in the slurm-jupyterlab deployment on k8s. If not, authentication from slurm-jupyterlab on k8s to the slurm cluster will fail with the following message:
Using the test.slurm job, a successful job execution will look as follows in slurm-jupyterlab via terminal...
...as well as this in slurm queue manager:
...and finally this in slurm: