slurm-sd4h

This repository contains Terraform code and related scripts for automated provisioning and configuration of a SLURM cluster on the OpenStack-based SecureData4Health (SD4H) cloud.

The code supports creating a fixed SLURM cluster with a constant number of compute nodes, as well as a dynamic (aka elastic) SLURM cluster that automatically scales up or down depending on usage.

Important

  • The setup was developed using Ubuntu as a starting image, and adjustments may be needed if switching to other distributions.
  • We needed to disable Ubuntu’s built-in DNS caching mechanism, provided by systemd-resolved.service, on the SLURM controller node. In dynamic mode (i.e., with compute nodes destroyed and re-created on the fly), the DNS cache would retain obsolete IP mappings and disrupt SLURM’s communication with re-created nodes. The DNS cache remains enabled on SLURM compute nodes.
  • We use s3fs-fuse to mount an S3 bucket at /mnt, as it is the most POSIX-compliant option; however, it is not the fastest for file uploads or downloads. We therefore recommend using it solely for pipeline synchronization, and staging large input/output files with the pre-installed s3cmd tool (see the example below).
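
For example, a typical staging workflow might look like the following (the bucket name and paths are placeholders):

    # Stage a large input file from the bucket to local storage with s3cmd,
    # run the pipeline on the local copy, then push the results back.
    s3cmd get s3://my-bucket/inputs/sample1.cram ~/work/sample1.cram
    # ... run the pipeline on ~/work/sample1.cram ...
    s3cmd put ~/work/sample1.vcf.gz s3://my-bucket/outputs/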

Note

Other pre-installed tools/data include:

  • apptainer
  • nextflow
  • samtools, bcftools, bgzip, tabix
  • GATK's apptainer image located in the /resources directory.
  • GATK's Resource Bundle and human reference genomes in the /resources/gatk_resource_bundle and /resources/genomes directories.
  • BWA-MEM and additional reference genome index files in the /resources/genomes directory.

1. Pre-requisites

Ansible

You need to install the Ansible Python package. We use Ansible to describe and automate the deployment of the required system tools, software packages, and configuration files onto the cloud images (it can also be used to create Docker/Apptainer images).
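
For example, Ansible can be installed into a Python virtual environment (a minimal sketch; the environment name and path are arbitrary):

    python3 -m venv ~/venv-ansible        # create an isolated environment
    source ~/venv-ansible/bin/activate    # activate it in the current shell
    pip install ansible                   # install the Ansible Python package
    ansible --version                     # confirm the installation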

Packer

Download the Packer tool. We use Packer to automate the creation of cloud images. Packer remotely runs the Ansible scripts on a temporary VM, which is then used to create a ready-to-use image with pre-installed software and configuration files.
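
A minimal install sketch for Linux (the version below is a placeholder; check HashiCorp's download page for the current release):

    curl -LO https://releases.hashicorp.com/packer/1.11.2/packer_1.11.2_linux_amd64.zip
    unzip packer_1.11.2_linux_amd64.zip
    sudo mv packer /usr/local/bin/        # put the binary on the PATH
    packer version                        # confirm the installation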

Terraform

Download the Terraform tool. We use Terraform to describe and automatically provision the SLURM cluster infrastructure. Terraform handles the automated creation of the required number of VMs with the required CPU, memory, and volume sizes, using the ready-to-use images created by Packer.
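
Terraform can be installed the same way (again, the version is a placeholder):

    curl -LO https://releases.hashicorp.com/terraform/1.9.5/terraform_1.9.5_linux_amd64.zip
    unzip terraform_1.9.5_linux_amd64.zip
    sudo mv terraform /usr/local/bin/     # put the binary on the PATH
    terraform version                     # confirm the installation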

OpenStack

When filling out the Packer and Terraform configurations, it is a good idea to have the OpenStack command-line client installed. It lets you retrieve image, flavor, and network IDs and other information much more quickly than through the OpenStack dashboard GUI.
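
For example, the client can be installed with pip and then used to look up the IDs referenced in the Packer and Terraform configuration files (after authenticating as described in the next section):

    pip install python-openstackclient    # OpenStack command-line client
    openstack image list                  # base image IDs
    openstack flavor list                 # flavor names/IDs (CPU/RAM combinations)
    openstack network list                # network IDs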

2. Authenticating

To run Packer and Terraform, you need to be authenticated in your active terminal session:

  • Go to SD4H OpenStack dashboard at https://juno.calculquebec.ca/ and select the relevant cloud project.
  • Go to the API Access tab, open the Download OpenStack RC File dropdown menu (upper right corner), and download the OpenStack RC File. The OpenStack RC file is a bash script named <project_name>-openrc.sh.
  • Before executing Packer or Terraform scripts, run source <project_name>-openrc.sh in your active terminal session to authenticate.
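
For example (replace <project_name> with your project's name; the script prompts for your OpenStack password):

    source <project_name>-openrc.sh       # exports the OS_* environment variables
    openstack token issue                 # quick check that authentication works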

3. Creating ready-to-use images

In this step, we will create an image (default name slurm-node-image) with pre-installed SLURM and related software tools that will be used by Terraform to create SLURM controller and SLURM compute node instances.

  1. Clone this repo: git clone https://github.com/bio-portal/slurm-sd4h.
  2. If you plan to use the GATK variant calling tools, consider populating the ansible/roles/genetics/files directories with the GATK-related resource files.
  3. Go to the packer directory: cd slurm-sd4h/packer.
  4. Edit the main.pkr.hcl file following the comments inside.
  5. Execute packer init .
  6. Execute packer validate .
  7. Execute packer build .
  8. You should see the new image ID using openstack image list.
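
Putting steps 3-8 together, a typical build session looks like this (assuming you have already sourced the openrc file and edited main.pkr.hcl):

    cd slurm-sd4h/packer
    packer init .                         # install the required Packer plugins
    packer validate .                     # check the configuration
    packer build .                        # build the image on a temporary VM
    openstack image list | grep slurm-node-image   # confirm the new image (default name)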

Note

If you want to customize the tools and data pre-installed on the image, you can do this by:

  1. Editing the Ansible scripts inside the slurm-sd4h/ansible directory. Remember, if you add additional tools or data, you may need to increase the image's volume size in the slurm-sd4h/packer/main.pkr.hcl configuration file.
  2. Editing the slurm-sd4h/scripts/compute_node_setup.sh.tftpl and slurm-sd4h/scripts/controller_node_setup.sh.tftpl scripts, which are automatically run at startup on the corresponding VM instances (part of cloud-init).

4. Provisioning SLURM cluster

  1. Download clouds.yaml from the SD4H OpenStack dashboard (API Access -> Download OpenStack RC File -> OpenStack clouds.yaml File) to the slurm-sd4h/terraform directory. Terraform uses this file to obtain OpenStack project-related and user-related information.
  2. The SLURM controller node needs to authenticate in order to dynamically create and destroy compute nodes through the OpenStack API. SLURM compute nodes may also need to authenticate to read from and write to buckets. For this, create application credentials through the SD4H OpenStack dashboard (Application Credentials -> Create Application Credentials) and download the <some-prefix>-openrc.sh file containing the secret (use the Download openrc file button in the dialog) to the slurm-sd4h/app_credentials directory.
  3. To automatically mount an S3 bucket as a POSIX-like filesystem on each SLURM node, we currently use s3fs-fuse. To make it work, the Object Store access key and secret must be distributed in the passwd-s3fs file (see the example after this list).
    • Create an access key and secret using the openstack ec2 credentials create command. Create the slurm-sd4h/app_credentials/passwd-s3fs file with the following line: <access key>:<secret> (e.g. 8173ad48294a8as8:za73ad88214nbas9).
    • Alternatively, you can list previously created keys and secrets using the openstack ec2 credentials list command and choose one that matches your username and project.
  4. Go to the terraform directory: cd slurm-sd4h/terraform.
  5. Edit the config.tfvars file following the comments inside. We assume that the bucket and floating IP required by the SLURM cluster have already been created (e.g., through SD4H's OpenStack dashboard).
  6. Execute terraform init.
  7. Execute terraform validate.
  8. Execute terraform apply -var-file="config.tfvars"
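
The s3fs-fuse credentials from step 3 can be prepared from the command line, for example (restricting permissions with chmod 600 is recommended, since s3fs typically rejects credential files readable by other users):

    # Create (or reuse) EC2-style credentials for the Object Store:
    openstack ec2 credentials create
    # Copy the "access" and "secret" values from the output into the passwd-s3fs file:
    echo '<access key>:<secret>' > slurm-sd4h/app_credentials/passwd-s3fs
    chmod 600 slurm-sd4h/app_credentials/passwd-s3fs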

If successful, you should be able to see the SLURM controller instance and the SLURM compute instance(s) (if you specified a fixed number of them) through the SD4H OpenStack dashboard.
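
For a quick check beyond the dashboard, you can also use the CLI and log in to the controller (the login user and floating IP below are assumptions; ubuntu is the usual default user for Ubuntu cloud images):

    openstack server list                 # controller (and any static compute nodes) should be ACTIVE
    ssh ubuntu@<floating IP>              # log in to the SLURM controller
    sinfo                                 # SLURM should report the configured partitions/nodes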

5. Destroying SLURM cluster

  1. Go to the terraform directory: cd slurm-sd4h/terraform.
  2. Execute terraform destroy -var-file="config.tfvars"

If you used the dynamic SLURM cluster setup, you need to wait until SLURM automatically scales down (by default, at least 10 minutes after last use) before executing terraform destroy. Otherwise, you will need to destroy the remaining dynamic SLURM compute nodes manually.
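
For example, leftover dynamic compute nodes can be listed and removed with the CLI before re-running the destroy command (the node name is a placeholder):

    openstack server list                 # look for compute nodes that were not scaled down
    openstack server delete <node name>   # remove each leftover node manually
    terraform destroy -var-file="config.tfvars"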
