This repository contains Terraform code and related scripts for automated provisioning and configuration of a SLURM cluster on the OpenStack-based SecureData4Health (SD4H) cloud.
The code supports creating a fixed SLURM cluster with a constant number of compute nodes, as well as a dynamic (aka elastic) SLURM cluster that automatically scales up or down depending on usage.
Important
- The setup was developed using Ubuntu as a starting image, and adjustments may be needed if switching to other distributions.
- We needed to disable Ubuntu’s built-in DNS caching mechanism provided by `systemd-resolved.service` on the SLURM controller node. This is because, when working in dynamic mode (i.e., destroying and re-creating compute nodes on the fly), the DNS cache would contain obsolete IP mappings and disrupt SLURM’s communication with re-created nodes. The DNS cache remains enabled on the SLURM compute nodes.
- We use s3fs-fuse for mounting the S3 bucket on `/mnt`, as it is the most POSIX-compliant option; however, it is not the fastest for file uploads or downloads. We therefore recommend using it solely for pipeline synchronization, and staging large input/output files with the pre-installed s3cmd tool.
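For example, a typical staging pattern with s3cmd looks roughly like this (a sketch only; the bucket, object, and local paths are placeholders):

```bash
# Stage a large input file from the bucket to local storage before running the pipeline
# (bucket and file names are placeholders).
s3cmd get s3://my-bucket/inputs/sample1.cram /tmp/sample1.cram

# ... run the pipeline on the local copy ...

# Stage the results back to the bucket once the job is done.
s3cmd put /tmp/sample1.vcf.gz s3://my-bucket/outputs/sample1.vcf.gz
```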
Note
Other pre-installed tools/data include:
- apptainer
- nextflow
- samtools, bcftools, bgzip, tabix
- GATK's apptainer image, located in the `/resources` directory.
- GATK's Resource Bundle and human reference genomes, in the `/resources/gatk_resource_bundle` and `/resources/genomes` directories.
- BWA-MEM and additional reference genome index files, in the `/resources/genomes` directory.
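For illustration only, GATK can be run through the pre-installed Apptainer image roughly as follows (the image and reference file names under `/resources` below are assumed placeholders and may differ on the actual image):

```bash
# Run GATK from the pre-installed Apptainer image (image file name is a placeholder).
apptainer exec /resources/gatk.sif gatk --version

# Example: germline variant calling against a pre-installed reference
# (reference, input, and output paths are placeholders).
apptainer exec /resources/gatk.sif gatk HaplotypeCaller \
  -R /resources/genomes/Homo_sapiens_assembly38.fasta \
  -I /mnt/inputs/sample1.bam \
  -O /mnt/outputs/sample1.g.vcf.gz \
  -ERC GVCF
```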
You need to install the Ansible Python package. We use Ansible here to describe and automate the deployment of the required system tools, software packages, and configuration files on the cloud images (it can also be used to create Docker/Apptainer images).
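One way to do this, using a dedicated virtual environment:

```bash
# Install Ansible into its own virtual environment (the path is up to you).
python3 -m venv ~/venvs/ansible
source ~/venvs/ansible/bin/activate
pip install ansible
ansible --version
```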
Download the Packer tool. We use Packer here to handle the automated creation of cloud images. Packer will remotely run the Ansible scripts on a temporary VM, which is then used to create a ready-to-use image with pre-installed software and configuration files.
Download the Terraform tool. We use Terraform here to describe and automatically provision the SLURM cluster infrastructure. Terraform will take care of the automated creation of the required number of VMs with the required CPU/memory/volume resources, using the ready-to-use images created by Packer.
When filling out the Packer and Terraform configurations, it is a good idea to have the OpenStack command-line client installed. It will allow you to retrieve image, flavor, and network IDs and other information much more quickly than through the OpenStack dashboard GUI.
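For example, the client can be installed with pip, and once you are authenticated (see the next step) the relevant IDs can be listed directly:

```bash
# Install the OpenStack command-line client (e.g. into the same virtual environment as Ansible).
pip install python-openstackclient

# Look up the IDs and names needed for the Packer and Terraform configuration files.
openstack image list
openstack flavor list
openstack network list
openstack security group list
```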
To run Packer and Terraform you need to be authenticated in the active terminal session:
- Go to the SD4H OpenStack dashboard at https://juno.calculquebec.ca/ and select the relevant cloud project.
- Go to the API Access tab, click on the Download OpenStack RC File dropdown menu (upper right corner), and download the OpenStack RC file. The OpenStack RC file is a bash script named <project_name>-openrc.sh.
- Before executing Packer or Terraform scripts, run `source <project_name>-openrc.sh` in your active terminal session to authenticate.
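For example (the RC file name is a placeholder; `openstack token issue` is simply a quick way to confirm that the credentials work):

```bash
# Authenticate the current shell session; you will be prompted for your OpenStack password.
source my-project-openrc.sh

# Optional sanity check: if authentication succeeded, a token is printed.
openstack token issue
```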
In this step, we will create an image (default name `slurm-node-image`) with pre-installed SLURM and related software tools that will be used by Terraform to create the SLURM controller and SLURM compute node instances.
- Clone this repo: `git clone https://github.com/bio-portal/slurm-sd4h`.
- If you plan to use the GATK variant calling tools, consider populating the `ansible/roles/genetics/files` directories with the GATK-related resource files.
- Go to the packer directory: `cd slurm-sd4h/packer`.
- Edit the `main.pkr.hcl` file following the comments inside.
- Execute `packer init .`
- Execute `packer validate .`
- Execute `packer build .`
- You should see the new image ID using `openstack image list`.
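Put together, the image build looks roughly like this (a sketch; the last command assumes the default image name `slurm-node-image`):

```bash
cd slurm-sd4h/packer
packer init .
packer validate .
packer build .

# Retrieve the ID of the freshly built image (assumes the default image name).
openstack image show slurm-node-image -f value -c id
```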
Note
If you want to customize the tools and data pre-installed on the image, you can do this by:
- Editing the Ansible scripts inside the `slurm-sd4h/ansible` directory. Remember, if you add additional tools or data, you may need to increase the image's volume size in the `slurm-sd4h/packer/main.pkr.hcl` configuration file.
- Editing the `slurm-sd4h/scripts/compute_node_setup.sh.tftpl` and `slurm-sd4h/scripts/controller_node_setup.sh.tftpl` scripts, which are automatically run at startup on the corresponding VM instances (as part of cloud-init).
- Download clouds.yaml from the SD4H OpenStack dashboard (API Access -> Download OpenStack RC File -> OpenStack clouds.yaml File) to the `slurm-sd4h/terraform` directory. This file is used by Terraform to get OpenStack's project-related and user-related info.
- The SLURM controller node needs to authenticate in order to dynamically create and destroy compute nodes using the OpenStack API. Also, SLURM compute nodes may need to authenticate to perform read/write operations on buckets. For this, create application credentials through the SD4H OpenStack dashboard (Application Credentials -> Create Application Credentials) and download the <some-prefix>-openrc.sh file with the secret (use the Download openrc file button in the dialog) to the `slurm-sd4h/app_credentials` directory.
- To automatically mount the S3 bucket as a POSIX-like filesystem on each SLURM node, we currently use s3fs-fuse. To make it work, the Object Store access key and secret must be distributed in the passwd-s3fs file (a command-line sketch is shown after this list):
  - Create an access key and secret using the `openstack ec2 credentials create` command. Create the `slurm-sd4h/app_credentials/passwd-s3fs` file with the following line: <access key>:<secret> (e.g. 8173ad48294a8as8:za73ad88214nbas9).
  - Alternatively, you can list all keys and secrets created previously using the `openstack ec2 credentials list` command and choose one that matches your username and project.
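A possible way to generate the credentials and the passwd-s3fs file from the command line (a sketch; replace the placeholders with the access key and secret printed by the first command):

```bash
# Create a new EC2-style access key/secret pair for the Object Store.
openstack ec2 credentials create

# Write the key pair into passwd-s3fs (replace the placeholders with the real values)
# and restrict its permissions, since s3fs generally refuses password files readable by others.
echo '<access key>:<secret>' > slurm-sd4h/app_credentials/passwd-s3fs
chmod 600 slurm-sd4h/app_credentials/passwd-s3fs
```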
- Go to the terraform directory: `cd slurm-sd4h/terraform`.
- Edit the `config.tfvars` file following the comments inside. We assume that a bucket and floating IP necessary for the SLURM cluster were already created (e.g. through SD4H's OpenStack dashboard).
- Execute `terraform init`.
- Execute `terraform validate`.
- Execute `terraform apply -var-file="config.tfvars"`.
If successful, you should be able to see the SLURM controller instance and SLURM compute instance(s) (if you specified a fixed number of them) through the SD4H OpenStack dashboard.
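You can also check from the command line (instance names depend on your config.tfvars settings):

```bash
# The SLURM controller and any fixed compute nodes created by Terraform should appear here.
openstack server list
```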
- Go to the terraform directory: `cd slurm-sd4h/terraform`.
- Execute `terraform destroy -var-file="config.tfvars"`.
If you used the dynamic SLURM cluster setup, you need to wait until SLURM automatically scales down (by default, at least 10 minutes after the last use) before executing terraform destroy. Otherwise, you will need to destroy the remaining dynamic SLURM compute nodes manually.
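Before running terraform destroy, a quick way to check that no dynamically created compute nodes are left (the server name in the second command is only a placeholder; actual names depend on your config.tfvars settings):

```bash
# Only the controller (and any fixed compute nodes) should remain in the list.
openstack server list

# If a dynamic compute node lingers after the scale-down window, remove it manually.
openstack server delete slurm-compute-node-1
```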