To submit an analysis (or job) to run on a Digital Research Alliance
of Canada (the Alliance) cluster, you must use the Slurm Workload
Manager scheduler via the sbatch command, ensuring that you
properly define the job's resource requirements and associate it with
a specific Resource Allocation Project (RAP). Other schedulers exist,
but Slurm is by far the most common.
The process generally involves three main steps: preparation, job script creation, and submission.
Before submitting a job, you must ensure your environment and data are ready:
You must connect to a cluster's login node securely using Secure
Shell (SSH). The connection requires your username and password (or
SSH key) and availability of your second factor for multifactor
authentication. You should know the name of the machine you are
connecting to, such as fir.alliancecan.ca or
trillium.alliancecan.ca.
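For example, a connection to one of these clusters (with `someuser` as a placeholder username) looks like:

```bash
# Connect to a login node; you will be prompted for your password
# (or SSH key passphrase) and your second factor.
ssh someuser@fir.alliancecan.ca
```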
Software needed for your analysis is typically accessed using the module system (Lmod).
Using modules is the standard way to access installed software on Digital Research Alliance of Canada (the Alliance) clusters.
Modules are configuration files that contain instructions for modifying your software environment. This modular architecture enables the system to support multiple versions of the same application without conflicts.
- Function: A module file modifies or initializes environment variables, such as `PATH` and `LD_LIBRARY_PATH`, to make an application or library available in the user's login session.
- Management Tool: On newer Alliance servers, modules are managed using the Lmod tool, which is designed to be very similar to the older Environment Modules system.
- Hierarchy: Modules operate under a hierarchy system. This means that the modules available to load often depend on "parent" modules you have already loaded (like compilers or MPI implementations). The hierarchy prevents naming conflicts when supporting many versions of complex software.
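As an illustration of the hierarchy (the OpenMPI module name below is an assumption; available versions vary by cluster), loading a compiler exposes the libraries built with it:

```bash
module load gcc/9.3     # load a "parent" compiler module
module avail openmpi    # now lists only OpenMPI builds compatible with gcc/9.3
```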
To integrate modules into your analysis workflow, the Alliance recommends specific practices, especially when submitting computational jobs:
- Loading: Use the `module load <module-name>` sub-command to access a specific program or library version (e.g., `gcc/9.3`). Loading a module does not execute the software, but makes the program binary accessible.
- Environment Consistency (Purge): You should generally avoid loading modules automatically in your `~/.bashrc` file. To prevent inherited environment settings from your submitting shell from causing problems, it is best practice to include the line `module purge` in your job script before loading all the required modules, ensuring a consistent environment state.
- Module Collections: To repeatedly load a specific set of modules (e.g., for a complex analysis environment like machine learning or bioinformatics), you can create and restore a module collection using the `module save` and `module restore` sub-commands.
- Finding Modules:
  - Use `module list` to see what is currently loaded.
  - Use `module avail` to list modules available to load.
  - Use `module spider` to search the complete tree of all modules, including those that are incompatible or hidden due to the module hierarchy. This command is often needed to see how to load a specific version that has prerequisites.
- Bioinformatics Software: Many specialized bioinformatics tools are available as modules. These tools are often categorized as `bio` in the available software list.
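Putting these practices together, the module-related lines of a job script might look like the sketch below (all module names besides `gcc/9.3` are illustrative):

```bash
# Ensure a consistent environment: drop anything inherited from the
# submitting shell, then load exactly what the analysis needs.
module purge
module load gcc/9.3
module load samtools    # hypothetical bio-category module

# Alternatively, save the loaded set once and restore it in every job:
#   module save my_analysis_env      (run once, interactively)
#   module restore my_analysis_env   (in the job script)
```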
The cluster uses a distributed filesystem, and accessing data from `/project` or `/scratch` has different performance implications than local storage.
- For datasets of 100 GB or less: You should transfer the data to the local storage of the compute node at the beginning of the job, as this storage is orders of magnitude faster and more reliable than shared storage. A temporary directory for this purpose is available at `$SLURM_TMPDIR`.
- For large datasets with many small files (e.g., image datasets): These collections can cause issues with filesystem quotas and significantly slow down software when streaming from shared storage (`/project` or `/scratch`). Such data should be stored in large single-file archives on the distributed filesystem.
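A common staging pattern, sketched here with a hypothetical archive path, is to copy the data to `$SLURM_TMPDIR` at the start of the job and work on the local copy:

```bash
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=01:00:00

# Copy the single-file archive (hypothetical path) to fast node-local storage.
cp /project/def-someuser/data/images.tar "$SLURM_TMPDIR"/

# Extract and read from the local copy instead of streaming many small
# files from /project or /scratch.
cd "$SLURM_TMPDIR"
tar -xf images.tar

./myapplication "$SLURM_TMPDIR/images"
```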
All computational work (jobs) must be submitted via the scheduler, with exceptions only for small tasks not expected to consume more than about 10 CPU-minutes or 4 GB of RAM (which can run on a login node).
A minimal Slurm job script (`simple_job.sh`) requires executable commands and directives prefixed with `#SBATCH`.
For example, many of my job scripts have something like this at the top:
```bash
#!/bin/bash
#SBATCH -J extract-data
#SBATCH --time=00:15:00
#SBATCH --account=rrg-jbpoline
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=8G
#SBATCH -o ./logs/extract-data-%j.out
#SBATCH -e ./logs/extract-data-%j.err
```
The most important features are:
- Account/Project Association (`--account`): Every job must be charged to a specific Resource Allocation Project (RAP).
  - You must specify the account name (Group Name) using the `--account` directive. This account determines job priority based on the associated group's target usage.
  - Example: `#SBATCH --account=def-someuser`.
- Time Limit (`--time`): You must supply at least a time limit for each job. Shorter jobs have more scheduling opportunities, and jobs with time limits under three hours benefit most from backfilling.
  - Example formats: `00:15:00` (15 minutes) or `0-0:5` (5 minutes).
- Resource Request (Cores/Memory/GPUs):
  - For CPU jobs, specify the number of tasks (`--ntasks`) and memory per CPU (`--mem-per-cpu`).
  - If you don't specify memory, a default of 256 MB per core is allocated on general-purpose clusters.
  - If you request memory beyond what is bundled per core (typically 4 GB per core), you will be charged for more core equivalents.
  - For GPU jobs, you must follow specific instructions, typically requesting resources using the `--gres` directive (see the Using GPUs with Slurm documentation).
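For instance, a minimal GPU request might look like the sketch below; the exact `--gres` syntax and available GPU types vary by cluster, so consult the Using GPUs with Slurm documentation:

```bash
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=03:00:00
#SBATCH --gres=gpu:1          # request one GPU
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=4G

nvidia-smi                    # confirm which GPU was allocated
./myapplication
```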
Job arrays, also known as task arrays, are a powerful feature of the Slurm Workload Manager used on the Digital Research Alliance of Canada (the Alliance) clusters. They allow you to submit an entire set of similar jobs with a single command, which is particularly useful when a workflow repeats the same computation over many inputs.
A job array submits multiple instances of the same job script. The individual jobs within the array are differentiated by an environment variable, `$SLURM_ARRAY_TASK_ID`, which is set to a unique value for each instance (task) of the job.
Job arrays are beneficial in situations such as:
- Hyperparameter search.
- Training many variants of the same method.
- Running many optimization processes of similar duration.
- Restarting long computations (checkpointing).
You define a job array using the `#SBATCH --array` directive in your submission script.
- Defining the Array Range
  The `--array` directive specifies the range of task IDs that will be executed. For example, to create 10 tasks with `$SLURM_ARRAY_TASK_ID` ranging from 1 to 10, you would use `#SBATCH --array=1-10`.
- Example Job Array Script
  In your job script, you use the `$SLURM_ARRAY_TASK_ID` variable to control the execution path or input files for each distinct task. A minimal example script (`array_job.sh`) demonstrating 10 tasks running the application `./myapplication` is:
```bash
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=0-0:5
#SBATCH --array=1-10

./myapplication $SLURM_ARRAY_TASK_ID
```

The application `./myapplication` would run 10 times, receiving task IDs from 1 through 10 as an argument.
Note that on clusters like Graham and Béluga, there may be a limit on the number of jobs you can have in the system at one time (e.g., 1000 jobs, queued and running). Each task of a job array counts as one job towards this limit.
Job arrays can also be used to automatically restart computations that exceed the maximum time limit of the cluster (up to 7 days on general-purpose clusters like Fir, Narval, Nibi, and Rorqual). This requires the application to support checkpointing (saving state to a file).
A `%` suffix on the `--array` directive limits how many tasks of the array run simultaneously: `--array=1-100%10` allows at most 10 tasks at once, while a suffix of `%1` ensures that only one task runs at any given time. For checkpoint restarts, the script should be designed so that the last checkpoint file is used to restart the calculation in the next task.
For example, to split a simulation into 10 sequential chunks using a job array, ensuring only one runs at a time: `#SBATCH --array=1-10%1`. The script would then test for the existence of a checkpoint file (`state.cpt`) to decide whether to restart the simulation or start a new one.
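A minimal sketch of such a test, assuming a hypothetical `./mysimulation` binary with a `--restart` flag:

```bash
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=0-12:00
#SBATCH --array=1-10%1        # 10 sequential chunks, one at a time

if [ -f state.cpt ]; then
    # A checkpoint exists: resume from the saved state.
    ./mysimulation --restart state.cpt
else
    # First task of the array: start a fresh simulation.
    ./mysimulation
fi
```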
For more detailed documentation, you should refer to the documentation on Job arrays and Job Array Support. Automating job submission via job arrays is one technique provided, alongside tools like META, GLOST, and GNU Parallel, which can bundle many short computations into fewer tasks of longer duration to improve computational efficiency.
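As an example of the bundling approach, GNU Parallel can run many short computations inside a single multi-core job (the module name and input layout here are assumptions; check your cluster's documentation):

```bash
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=03:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=2G

module load gnu-parallel      # module name may differ on your cluster

# Run ./myapplication once per input file, 8 at a time, instead of
# submitting hundreds of tiny separate jobs to the scheduler.
parallel -j "$SLURM_CPUS_PER_TASK" ./myapplication {} ::: inputs/*.txt
```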
If your analysis is expected to run longer than the cluster time limits (up to 7 days on general-purpose clusters), you must use checkpointing to save your state and resume later.
- For example, a 3-day training job should be split into 24-hour chunks.
- This can be automated using job arrays (for a fixed number of restarts) or by implementing resubmission from the job script.
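A resubmission sketch for the second approach, assuming the job script is named `my_job.sh` and the application leaves a hypothetical `work_remaining` sentinel file while unfinished:

```bash
# At the end of my_job.sh: if the computation has not finished,
# submit a fresh copy of this script to continue from the last checkpoint.
if [ -f work_remaining ]; then
    sbatch my_job.sh
fi
```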
The primary way to interact with jobs on Digital Research Alliance of Canada (the Alliance) clusters is through the Slurm Workload Manager commands.
Here are the basic commands for submitting, monitoring, and controlling your jobs:
The command to submit a job script to the scheduler is `sbatch`.
- `sbatch <script_name>`: Submits a batch job script (e.g., `simple_job.sh`) to the Slurm scheduler. The script must include `#SBATCH` directives specifying necessary resources like `--time` (time limit) and `--account` (Resource Allocation Project/Group Name).
- `sbatch --time=00:30:00 simple_job.sh`: You can override directives in the script by supplying them as command-line arguments to `sbatch`.
You can check the status of jobs using two primary commands: `squeue` or the local customization `sq`. It is important not to run these commands from a script at a high frequency, as they add load to Slurm.
- `sq`: Lists only your own jobs.
- `squeue -u $USER`: The general Slurm command to list only your jobs, filtering by your username.
- `squeue -u <username> -t RUNNING`: Shows only jobs currently in the R (Running) state for a specific user.
- `squeue -u <username> -t PENDING`: Shows only jobs currently in the PD (Pending) state for a specific user.
- `scontrol show job <jobid>`: Shows detailed information for a specific job ID.
To terminate or remove a job from the queue:
- `scancel <jobid>`: Cancels a specific job using its job ID.
- `scancel -u $USER`: Cancels all jobs belonging to your user ID.
- `scancel -t PENDING -u $USER`: Cancels all pending jobs belonging to your user ID.
Once a job is complete, you can review its performance:
- `seff <jobid>`: Provides a short summary of the CPU and memory efficiency of a job, including the wall-clock time and efficiency percentages.
- `sacct -j <jobid>`: Provides more detailed information about a completed job. You can specify the output format using the `--format` option, for example: `sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed`.
Interactive sessions on compute nodes are useful for tasks like debugging, data exploration, or compiling software:
- `salloc`: Starts an interactive session on a compute node, reserving resources for the duration specified.
  - Example: `salloc --time=1:0:0 --mem-per-cpu=3G --ntasks=1 --account=def-someuser`.
You can connect to the node running a job to monitor or troubleshoot (e.g., running `htop` or `nvidia-smi`):
- `srun --jobid <jobid> --pty <command>`: Runs a process on the node assigned to the specified running job ID.
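For example, to watch processes on the node of a running job (the job ID here is hypothetical):

```bash
srun --jobid 123456 --pty htop
```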
Submit your job script using the `sbatch` command: `sbatch simple_job.sh`
- Checking Status: You can list your submitted jobs using the customized utility `sq` or the general command `squeue -u $USER`. The status column (ST) shows if a job is `PD` (pending) or `R` (running).
- Checking Output: Output is typically written to a file named `slurm-<jobid>.out` in the submission directory. Output from non-interactive jobs is normally buffered, so there may be a delay before you see the log file updated. If real-time monitoring is needed, run an interactive job using `salloc`.
- Efficiency: After completion, you can check the efficiency of the job's CPU and memory utilization using `seff <jobid>`.
- Visualizing Job Status with Portail: At https://portail.narval.calculquebec.ca/ you can view visualizations of your job's resource usage during its run. You will also receive pretty clear warnings if a job uses only a small fraction of the resources it requested. This is a very convenient complement to the CLI tools above.