@icompres
Collaborator

Hi Josh, Ralph,

Here are the changes that add Slurm support:

1.- I have separated my changes from the base Dockerfile.ssh into a Dockerfile.slurm.
  • The main difference is that Munge is built into the image.

2.- Helper build scripts for Slurm and MPICH were added.
  • We plan to add more build and helper scripts for other runtime systems later.
  • I may add a module system later. I am not sure we need it yet.

3.- A base slurm.conf and a generator script to dynamically populate a partition.
  • This is necessary since Docker assigns new IPs every time a container cluster is launched.

4.- Updated the README.md to include instructions for the Slurm use case.
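
To illustrate item 3, here is a minimal sketch of what a partition generator could look like. It is not the script from this PR: the function names, the partition name `docker`, the host names, and the IP addresses are all hypothetical. It assumes the base slurm.conf is plain text and that appending one `NodeName` line per container plus one `PartitionName` line is sufficient.

```python
def render_partition(hosts, partition="docker", cpus=1):
    """Return slurm.conf node and partition lines for (hostname, ip) pairs.

    The partition name and CPU count are illustrative defaults only.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    # One NodeName entry per container, pinned to the IP Docker assigned it.
    lines = [
        f"NodeName={name} NodeAddr={addr} CPUs={cpus} State=UNKNOWN"
        for name, addr in hosts
    ]
    # A single partition that spans every node in the current launch.
    names = ",".join(name for name, _ in hosts)
    lines.append(
        f"PartitionName={partition} Nodes={names} "
        "Default=YES MaxTime=INFINITE State=UP"
    )
    return "\n".join(lines) + "\n"


def generate_conf(base_text, hosts, **kwargs):
    """Append the generated node and partition lines to the base config text."""
    return base_text.rstrip("\n") + "\n\n" + render_partition(hosts, **kwargs)


if __name__ == "__main__":
    # Hypothetical base config and host list, for demonstration only.
    base = "ClusterName=swarm\nSlurmctldHost=node1\n"
    hosts = [("node1", "172.18.0.2"), ("node2", "172.18.0.3")]
    print(generate_conf(base, hosts))
```

Because the host list changes with every launch, regenerating the file from a fixed base keeps the static settings in one place and only the node and partition lines vary.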

I tried to avoid impacting any preexisting functionality, but please double check. We can delete this branch when we are done.

Known issue:
If you try to use Slurm at HEAD with Open PMIx at HEAD, it will not work. Ralph and I are working on pushing a simple change to Slurm upstream. In the meantime, if you would like to try this setup with the bleeding edge, you can clone my Slurm fork, found here:
https://github.com/icompres/slurm
The build system is patched already to allow the latest Open PMIx to work with Slurm.

@icompres icompres requested review from jjhursey and rhc54 August 31, 2021 16:29
@rhc54
Collaborator

rhc54 commented Aug 31, 2021

Perhaps I am misunderstanding the script, but it sounds like we would need to rebuild Slurm each time we start the swarm? If so, that would quickly get annoying. Maybe we need some kind of "option" that would cause the Dockerfile to build Slurm into the image, perhaps using a provided version (e.g., something like "--with-slurm=path"). Just thinking here - having to rebuild every time is going to be a lot of overhead. You really don't want to leave these swarms running for long periods of time on your machine, so stopping/restarting them happens multiple times a day.

@icompres
Collaborator Author

icompres commented Sep 1, 2021

Hi Ralph,

You only need to build Slurm once. This is meant to be done by the user in the build/ directory.

In the Dockerfile, the only addition is Munge. I found that distributions are somewhat inconsistent in their Munge setup, so it is better to build it from source.

What you need to do each time is:
1.- Generate a new slurm.conf for every new host list. This is done on the host before dropping in.
2.- Bootstrap the munge and slurm daemons (slurmd and slurmctld). This is done as the root user, inside the first node.

For both these tasks a script is provided.
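
The bootstrap order in step 2 can be sketched as follows. This is an illustration only, not the script provided in this PR. It assumes that `munged` must be running on every node before the Slurm daemons start, that `slurmctld` runs on the first node, that `slurmd` runs on every node including the first, and that the other nodes are reachable over ssh as root. The function and host names are hypothetical.

```python
import subprocess


def bootstrap_plan(hosts):
    """Return the ordered list of commands to start Munge and Slurm.

    The first entry of `hosts` is treated as the node this runs on.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    head = hosts[0]

    def on(host, *cmd):
        # Run locally on the first node, over ssh everywhere else (assumption).
        return list(cmd) if host == head else ["ssh", host, *cmd]

    plan = [on(h, "munged") for h in hosts]   # authentication first
    plan.append(["slurmctld"])                # controller on the first node
    plan += [on(h, "slurmd") for h in hosts]  # compute daemons last
    return plan


def bootstrap(hosts, dry_run=True):
    """Print the plan, or execute it when dry_run is False (needs root)."""
    for cmd in bootstrap_plan(hosts):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run with hypothetical host names; nothing is executed.
    bootstrap(["node1", "node2"])
```

Keeping the plan separate from its execution makes it possible to inspect the command order before running anything as root.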

I added some instructions in the README.md near the bottom. It was done quickly, so it needs some improvements.

Allow the build to save the Slurm and CentOS7 builds as
separate images so we can select between them. Correct
the spelling of the PRTE envars so PRRTE recognizes them.
Add a few missing envars, and set up the RPMBUILD directories
as they prove useful.

Signed-off-by: Ralph Castain <[email protected]>
@rhc54
Collaborator

rhc54 commented Sep 13, 2021

I posted a PR (#12) to correct a couple of things and add a CentOS8 build option. Please see what you think.

@jjhursey This replaced my earlier PR, so I closed it.

jjhursey and others added 3 commits September 16, 2021 10:26

4 participants