An Introduction to Snakemake by Anastasiia Hyhorzhevska. This repository serves as a guide for understanding and utilizing the workflow engine Snakemake. Your feedback and contributions are highly encouraged!
- About Snakemake
- Setup
- Exercises
- Useful command line arguments
- Useful keywords and functions
- Useful resources
- Authors
Snakemake is a Pythonic workflow engine developed by Dr. Johannes Köster, also the founder of the Bioconda project for sustainably distributing bioinformatics software.
- Software to create reproducible and scalable workflows. A workflow is defined by specifying rules that describe how to create sets of output files from sets of input files.
- Easy to prototype workflows due to embedded Python scripting
- Feature-rich and configurable: many different command-line options and multiple ways to configure rules
- Snakemake doesn't (re-)run a job if the target (output) file already exists and is newer than its input files.
Clone the repository:
git clone \
--depth 3 \
--filter=blob:none \
--sparse \
https://github.molgen.mpg.de/mpip/IT-Wiki/
cd IT-Wiki
git sparse-checkout init --cone
git sparse-checkout set 04\ MPIP\ cluster/snakemake-tutorial/
Then, either use the Snakemake installation available on the cluster or install your own Snakemake environment. To install your own environment, run:
module load anaconda/anaconda3
conda create --name py38 python=3.8 --channel conda-forge
conda install -n py38 -c conda-forge mamba
conda activate py38
mamba create -c conda-forge -c bioconda -n snakemake snakemake
conda activate /home/USERNAME/.conda/envs/py38/envs/snakemake
Minimal examples corresponding to this README content can be found in the exercises folder. To run them, use:
user@slurmgate$ snakemake --core 1 --printshellcmds -- all
rule hellomake:
input: ["hellomake.c", "hellofunc.c"]
output: "hellomake"
shell:
"gcc -o {output} {input} -I."
From the input files, create the output using the shell command. Inside the shell command, all local and global variables, including the input and output files, can be accessed via their names using the Python format mini-language. Instead of a shell command, a rule can run Python code to generate the output. For this, use the run directive instead of shell:
rule hellomake:
input: ["hellomake.c", "hellofunc.c"]
output: "hellomake"
run:
# dummy Python code to create output[0] from input[0] and input[1],
# e.g. by calling the compiler through Snakemake's shell() helper:
shell("gcc -o {output} {input} -I.")
{input} and {output} are placeholders for the input and output files; they are expanded according to Python string formatting rules
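Input and output files can also be addressed individually, by index or by name. A sketch of the same rule written with named inputs (the names main and func are arbitrary):
# {input.main}, {input.func}, or {input[0]} address individual files
rule hellomake:
    input:
        main="hellomake.c",
        func="hellofunc.c"
    output: "hellomake"
    shell:
        "gcc -o {output} {input.main} {input.func} -I."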
Declarative language: tell Snakemake what you want, and it will figure out how to build it:
rule all:
input: "hellomake"
rule all: the rule that lists all target files to be created by the current workflow. This rule has no output files and no script or shell command. It lets you avoid naming the targets on the command line when executing.
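A rule all can list several targets at once, for example (hypothetical file names):
rule all:
    input:
        "hellomake",
        "result/conc.txt"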
import os
data = "data/{id}.txt"
ids, = glob_wildcards(data)
rule concatenate:
input: expand(data, id = ids)
output: "result/conc.txt"
shell:
"cat {input} > {output}"
rule all:
input: rules.concatenate.output
It is possible to use arbitrary Python outside the rules: import os
glob_wildcards: a special Snakemake function to look for all files matching a certain pattern. In this example, it returns ["f1", "f2"]
expand: a special Snakemake function to create a list of strings from a pattern. In this example, it returns ["data/f1.txt", "data/f2.txt"] (a sketch with several wildcards follows after this list)
rules.concatenate.output: It is possible to access the output from other rules programmatically:
- rules: a special variable that contains all the information present in the other rules of the workflow
- concatenate: the name of the rule in the workflow
- output: the output of the concatenate rule
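When expand is given several wildcards, it builds the Cartesian product of all supplied values, which is handy in the later exercises (the file pattern here is hypothetical):
# every combination of id and s is generated:
# result/f1_subsample_12.txt, result/f1_subsample_20.txt,
# result/f2_subsample_12.txt, result/f2_subsample_20.txt
expand("result/{id}_subsample_{s}.txt", id=["f1", "f2"], s=[12, 20])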
To execute, run the following command:
snakemake --core 1 --printshellcmds -- all
where
-- all: the name of the rule to execute. It is good practice to specify the target rule at the end, after two dashes, to make clear that it is not an option argument
--printshellcmds: see the executed commands
Automatic logging is placed in the .snakemake directory by default
This is what you see once you run the command:
[user@slurmgate 01]$ snakemake --core 1 --printshellcmds -- all
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
1 concatenate
2
[Wed Apr 28 12:33:18 2021]
rule concatenate:
input: data/f1.txt, data/f2.txt
output: result/conc.txt
jobid: 1
cat data/f1.txt data/f2.txt > result/conc.txt
[Wed Apr 28 12:33:18 2021]
Finished job 1.
1 of 2 steps (50%) done
[Wed Apr 28 12:33:18 2021]
localrule all:
input: result/conc.txt
jobid: 0
[Wed Apr 28 12:33:19 2021]
Finished job 0.
2 of 2 steps (100%) done
Complete log: /path_to/IT-Wiki/04 MPIP cluster/snakemake-tutorial/exercises/01/.snakemake/log/2021-04-28T123318.778561.snakemake.log
data = "data/{id}.txt"
ids, = glob_wildcards(data)
rule sort:
input: "data/{id}.txt"
output: "result/{id}_sorted.txt"
shell:
"sort {input} > {output}"
rule all:
input: expand(rules.sort.output, id=ids)
{id}: a wildcard that matches any part of the filename (by default, any non-empty string).
The wildcards must be consistent between input and output files.
It is also possible to restrict the wildcards to a specific format by using regular expressions. In this example, to constrain the id to alphanumeric characters, use {id,[A-Za-z0-9]+}.txt:
data = "data/{id,[A-Za-z0-9]+}.txt"
ids, = glob_wildcards(data)
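The same kind of restriction can also be declared once for all rules with the wildcard_constraints directive (note that this applies to rule wildcards, not to glob_wildcards patterns):
# global constraint: {id} may only contain alphanumeric characters
wildcard_constraints:
    id="[A-Za-z0-9]+"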
If you want to run a generic rule differently based on the type of sample, or run the same rule several times with different sets of parameters, you can do so by:
- Specifying input files using input functions
- Specifying parameters using the params directive
Goal 1: Map fastq files to index files
Goal 2: Subsample reads from both samples
data = "data/{id}.fq"
ids, = glob_wildcards(data)
subsampling = [12, 20]
sample_dict = {"chr21.test1": "index/index1/",
"chr21.test2": "index/index2/"}
rule all:
input:
expand("result/{id}_subsample_{subsample}_Aligned.out.sam", id=ids, subsample=subsampling)
rule get_subsample:
input:
data
output:
temp("{id}_tmp_{subsample}.fq")
params:
"{subsample}"
shell:
"head -n {params} {input} > {output}"
rule mapping:
input:
fastq="{id}_tmp_{subsample}.fq",
index=lambda wc: sample_dict[wc.id]
output:
"result/{id}_subsample_{subsample}_Aligned.out.sam"
params:
subsample="{subsample}"
shell:
"STAR "
"--genomeDir {input.index} "
"--readFilesIn {input.fastq} "
"--outFileNamePrefix result/{wildcards.id}_subsample_{params.subsample}_"
To execute the example, run:
snakemake --core 1 -j 8 --printshellcmds -- all
Goal 1:
Use input functions to determine the input of a rule based on the matching wildcard of the output
Input functions take a single variable: the wildcards object
Create a dictionary telling which mapping index to use for which sample:
sample_dict = {"chr21.test1": "index/index1/",
"chr21.test2": "index/index2/"}
Look up the ID using the lambda function:
index=lambda wc: sample_dict[wc.id]
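Instead of a lambda, a named input function can be used, which is easier to read for more involved lookups (the function name is arbitrary):
def get_index(wildcards):
    # pick the mapping index based on the {id} wildcard of the requested output
    return sample_dict[wildcards.id]

rule mapping:
    input:
        fastq="{id}_tmp_{subsample}.fq",
        index=get_index
    ...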
Goal 2:
Use the params keyword to add additional non-file parameters to the rule:
params: subsample="{subsample}"
Add an additional subsample parameter and run each mapping twice using this parameter:
output:
"result/{id}_subsample_{subsample}_Aligned.out.sam"
params:
subsample="{subsample}"
...
rule all:
input:
expand("result/{id}_subsample_{subsample}_Aligned.out.sam", id=ids, subsample=subsampling)
The params object gets passed to the shell directive and can be accessed as {params.subsample}
Lexical scoping and string formatting
Snakemake creates the following 'special' variables as objects: input, output, params, wildcards, log, threads, resources, config.
These are then passed to the rule (the shell directive in this example) and expanded using the .format(**vars) string formatting function.
You can refer to any variable within the scope of the rule, including externally defined variables.
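For example, a variable defined at the top of the Snakefile can be referenced directly inside a shell command (a minimal sketch; variable and file names are hypothetical):
# {GENOME_BUILD} is resolved via lexical scoping, just like {input} and {output}
GENOME_BUILD = "hg38"

rule annotate:
    input: "result/conc.txt"
    output: "result/conc_annotated.txt"
    shell:
        "sed 's/$/ {GENOME_BUILD}/' {input} > {output}"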
Workflow graph
To create the Snakemake workflow graph, execute:
snakemake --dag -- all | dot -Tpdf > dag.pdf
The graph will be written to the dag.pdf file.
Alternatively, metadata can be stored in a config file:
sample_dict:
chr21.test1: data/index
chr21.test2: data/index
subsample:
[12,20]
Then, the Snakefile looks as follows:
data = "data/{id}.fq"
ids, = glob_wildcards(data)
configfile: "config.yaml"
rule all:
input:
expand("result/{id}_subsample_{subsample}_Aligned.out.sam", id=ids, subsample=config["subsample"])
...
rule mapping:
input:
fastq="{id}_tmp_{subsample}.fq",
index=lambda wc: config["sample_dict"][wc.id]
...
It is possible to define isolated software environments per rule.
Upon execution of a workflow, the Conda package manager is used to obtain and deploy the defined software packages in the specified versions.
Packages will be installed into the .snakemake/conda directory inside the working directory.
The environment specification is stored in an environment YAML file:
channels:
- r
dependencies:
- r=4.0.3
- r-ggplot2=3.3.3
To install the required conda environments without running the full workflow, run:
snakemake --use-conda --conda-create-envs-only
To run the full workflow:
snakemake --cores 1 --use-conda -- all
The result of execution:
[user@slurmgate 05]$ snakemake --core 1 -j 2 --use-conda --printshellcmds -- all
Building DAG of jobs...
Creating conda environment envs/plot.yaml...
Downloading and installing remote packages.
Environment for envs/plot.yaml created (location: .snakemake/conda/c00adc15fcb7ce78cd2de9774da7df74)
...
A rule can also point to an external script instead of a shell command.
Inside the script, you have access to an object snakemake that provides access to the same objects that are available in the run and shell directives: input, output, params, wildcards, log, threads, resources, config.
For example, in R, you can use snakemake@input[[1]] to access the first input file, or snakemake@input[["filename"]] to access a named input.
In Python, you would use snakemake.input[0].
Snakefile:
data = "data/{file}.csv"
ids, = glob_wildcards(data)
rule all:
input:
expand("plots/plot_{file}.pdf", file = ids)
rule plot:
input:
data
output:
"plots/plot_{file}.pdf"
conda:
"envs/plot.yaml"
script:
"scripts/plot.R"
plot.R:
library(ggplot2)
gene <- read.csv2(snakemake@input[[1]], dec = ",")
pdf(file = snakemake@output[[1]])
ggplot(gene, aes(x = id)) +
geom_density(alpha=.2, fill="#FF6666") +
ggtitle(paste0("Plot ", snakemake@wildcards[["file"]]))
dev.off()
where
snakemake@input[[1]]: the first input file of the above rule plot
snakemake@output[[1]]: the first output file of the above rule plot
snakemake@wildcards[["file"]]: the wildcards object entry containing the file wildcard for the above rule plot
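For comparison, a hypothetical Python equivalent of plot.R (assuming pandas, scipy and matplotlib are listed in the rule's conda environment) could look like this:
# scripts/plot.py (hypothetical); the snakemake object is injected by Snakemake
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render to file without a display
import matplotlib.pyplot as plt

gene = pd.read_csv(snakemake.input[0], sep=";", decimal=",")  # same format as read.csv2
gene["id"].plot(kind="density")  # KDE, analogous to geom_density (needs scipy)
plt.title("Plot " + snakemake.wildcards.file)
plt.savefig(snakemake.output[0])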
To execute the example, run:
snakemake --core 1 -j 2 --use-conda --printshellcmds -- all
The result of execution:
[user@slurmgate 05]$ snakemake --core 1 -j 2 --use-conda --printshellcmds -- all
Building DAG of jobs...
Creating conda environment envs/plot.yaml...
Downloading and installing remote packages.
Environment for envs/plot.yaml created (location: .snakemake/conda/c00adc15fcb7ce78cd2de9774da7df74)
Using shell: /bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
2 plot
3
Select jobs to execute...
[Thu May 6 18:09:20 2021]
rule plot:
input: data/x.csv
output: plots/plot_x.pdf
jobid: 1
wildcards: file=x
[Thu May 6 18:09:20 2021]
rule plot:
input: data/y.csv
output: plots/plot_y.pdf
jobid: 2
wildcards: file=y
Rscript --vanilla /05/.snakemake/scripts/tmp1imjv2r6.plot.R
Rscript --vanilla /05/.snakemake/scripts/tmpp_00n_lx.plot.R
Activating conda environment: /05/.snakemake/conda/c00adc15fcb7ce78cd2de9774da7df74
Activating conda environment: /05/.snakemake/conda/c00adc15fcb7ce78cd2de9774da7df74
Registered S3 methods overwritten by 'ggplot2':
method from
[.quosures rlang
c.quosures rlang
print.quosures rlang
...
There are several options for parallelizing a workflow.
On a single node with many cores, you can use:
snakemake -j $NUM_THREADS
By default, each rule is assumed to take 1 thread. You can use the threads: keyword to specify how many threads each rule takes.
For example:
rule sort_parallel:
input:
"data/big_data.txt"
output:
"results/big_data_sorted.txt"
threads: 8
shell:
"sort --parallel={threads} {input} > {output}"
where
threads: specifies how many cores a rule takes
The {threads} keyword is accessible in the shell directive through lexical scoping.
To execute the example, run:
snakemake --core 1 -j 8 --use-conda --printshellcmds
Note: If -j is less than threads, threads will be reduced to -j.
To run the rules on the cluster, use the --cluster option to provide the submission command and its keywords:
snakemake --cluster 'sbatch --mem=2g -c 1' -j 4 -- all
Snakemake creates a job script for each job to be run and uses this command to submit that job.
The -j 4 argument limits execution to at most 4 jobs running at the same time.
Note: Using --cluster as shown above causes all jobs to request the same amount of resources from the HPC. To make Snakemake request a different amount of resources for each job, you can:
- Define resources in rule parameters
- Define resources in a cluster configuration file (e.g. cluster.yaml)
Define resources in rule parameters
The string following --cluster can access all variables within a rule, in the same way as a shell command inside a rule can:
rule mapping:
input:
fastq="{id}_tmp_{subsample}.fq",
index="data/index"
output:
"results/{id}_subsample_{subsample}_Aligned.out.sam"
params:
subsample = "{subsample}",
mem = "2g"
threads: 4
shell:
"STAR "
"--genomeDir {input.index} "
"--readFilesIn {input.fastq} "
"--outFileNamePrefix result/{wildcards.id}_subsample_{params.subsample}_"
Then invoke the workflow:
snakemake --cluster 'sbatch --mem={params.mem} -c {threads}' -j 4
Note: These params must be defined for all rules or you will get an error when Snakemake tries to submit a job from a rule that doesn’t have one or more of these params defined.
Define resources in the cluster configuration file
Instead of putting parameters for cluster submission into each individual rule, you can write a separate config file that contains these parameters.
__default__:
partition: "pe"
cpus: "{threads}"
mem: "2g"
get_index:
mem: "8g"
mapping:
mem: "4g"
By default, rules request the resources annotated under __default__.
For specific rules that require more resources, you can annotate them specifically using the rule name in the config file (e.g. get_index or mapping).
Now, in the rule definition, you only need to specify threads (to allow Snakemake to use this parameter for multi-threading when running locally):
rule mapping:
input:
fastq="{id}_tmp_{subsample}.fq",
index="data/index"
output:
"results/{id}_subsample_{subsample}_Aligned.out.sam"
params:
"{subsample}"
threads: 4
shell:
"STAR "
"--genomeDir {input.index} "
"--readFilesIn {input.fastq} "
"--outFileNamePrefix result/{wildcards.id}_subsample_{params}_"
Redirect cluster output
You can tell Snakemake where to place stdout and stderr by specifying the output location in the cluster configuration file:
__default__:
partition: "hp"
cpus: "{threads}"
mem: "2g"
output: "logs_slurm/{rule}_{wildcards.id}_{wildcards.subsample}.out"
error: "logs_slurm/{rule}_{wildcards.id}_{wildcards.subsample}.err"
Then run the workflow:
snakemake -d ~/path_to_ex/07 --cluster-config cluster.yaml --cluster 'sbatch --mem={cluster.mem} -c {cluster.cpus} -o {cluster.output} -e {cluster.error}' -j 4 --printshellcmds -- all
where
-d: specifies the path to the working directory.
This redirects the stdout and stderr of every job to corresponding log files, named by rule and wildcards, in the logs_slurm directory.
Use the log: keyword to define a per-rule log file; you can redirect STDERR to it in shell commands or print to it explicitly, e.g.:
my_vars = ["A", "B", "C"]
rule incorrect:
output:
"results/test.txt"
log: "log.txt"
run:
with open(output[0], "w") as outh, open(log[0], "w") as logh:
print("Running incorrect rule...", file=logh)
print(my_vars[3], file=outh)
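For shell rules, STDERR is typically redirected into the log file explicitly (a minimal sketch; rule and file names are hypothetical):
rule sort_logged:
    input: "data/big_data.txt"
    output: "results/big_data_sorted.txt"
    log: "logs/sort_logged.log"
    shell:
        "sort {input} > {output} 2> {log}"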
You can use pdb, a Python debugger, to step through rules and examine variables at each step:
my_vars = ["A", "B", "C"]
rule incorrect:
output:
"results/test.txt"
run:
import pdb; pdb.set_trace()
with open(output[0], "w") as outh, open(log[0], "w") as logh:
print("Running incorrect rule...", file=logh)
print(my_vars[3], file=outh)
Use the --debug flag to enable pdb in run or script directives:
snakemake -s Snakefile_incorrect --core 1 --debug
The output:
(py38) [ahryhorzhevska@slurmgate 08]$ snakemake -s Snakefile_incorrect --core 1 --debug
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 incorrect
1
[Sat May 8 14:27:26 2021]
rule incorrect:
output: results/test.txt
log: log.txt
jobid: 0
Job counts:
count jobs
1 incorrect
1
> /home/ahryhorzhevska/mpip/code/IT-Wiki/04 MPIP cluster/snakemake-tutorial/exercises/08/Snakefile_incorrect(16)__rule_incorrect()
(Pdb) my_vars
['A', 'B', 'C']
(Pdb) my_vars[1]
'B'
(Pdb) my_vars[3]
*** IndexError: list index out of range
The correct rule with the error fixed:
my_vars = ["A", "B", "C"]
rule correct:
output: "results/test.txt"
log: "log.txt"
run:
with open(output[0], "w") as outh, open(log[0], "w") as logh:
print("Running correct rule...", file=logh)
print(my_vars[2], file=outh)
Note: for debugging the execution of Snakefiles themselves, use the --verbose option:
snakemake -s Snakefile_incorrect --core 1 --verbose
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 incorrect
1
Resources before job selection: {'_cores': 1, '_nodes': 9223372036854775807}
Ready jobs (1):
incorrect
Selected jobs (1):
incorrect
Resources after job selection: {'_cores': 0, '_nodes': 9223372036854775806}
[Sat May 8 14:33:36 2021]
rule incorrect:
output: results/test.txt
log: log.txt
jobid: 0
Job counts:
count jobs
1 incorrect
1
[Sat May 8 14:33:36 2021]
Error in rule incorrect:
jobid: 0
output: results/test.txt
log: log.txt (check log file(s) for error message)
RuleException:
IndexError in line 13 of /home/ahryhorzhevska/mpip/code/IT-Wiki/04 MPIP cluster/snakemake-tutorial/exercises/08/Snakefile_incorrect:
list index out of range
File "/home/ahryhorzhevska/mpip/code/IT-Wiki/04 MPIP cluster/snakemake-tutorial/exercises/08/Snakefile_incorrect", line 13, in __rule_incorrect
File "/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/ahryhorzhevska/mpip/code/IT-Wiki/04 MPIP cluster/snakemake-tutorial/exercises/08/.snakemake/log/2021-05-08T143336.287182.snakemake.log
unlocking
removing lock
removing lock
removed all locks
- Use --forceall to rerun an analysis from scratch
- Use --keep-going to prevent your workflow from stopping if a step fails
- Use --until or --omit-from to stop the pipeline at a certain step
- Use --delete-all-output to remove all files generated by the workflow
- Use --rulegraph | dot -Tpdf > ruledag.pdf to print an overview of the workflow
- Use --archive to create a gzipped tarball of the entire workflow, including Conda packages
- Use the priority keyword or --prioritize TARGET on the command line to increase the priority of a file (e.g., if your PI is bugging you about some particular figure)
- Use the temp() function around files that you do not need, and they will be automatically deleted when no rules depend on them anymore (temp(), protected(), and directory() are combined in the sketch after this list)
- Use protected() around important files to protect them from accidental deletion
- Use directory() to indicate the creation of a directory (by default, Snakemake only recognizes files)
- Use the remote() function to download files via S3, Google, Dropbox, https, ftp, etc.
- Use the include keyword to include sub-Snakefiles and better organize your code
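A minimal sketch combining temp(), protected(), and directory() (rule, file, and directory names are hypothetical):
# final: protected from accidental deletion or modification
# tmp:   deleted automatically once no rule depends on it anymore
# qc:    a directory output (by default, Snakemake only tracks files)
rule copy_and_summarize:
    input: "data/f1.txt"
    output:
        final=protected("results/f1_final.txt"),
        tmp=temp("results/f1_tmp.txt"),
        qc=directory("results/f1_qc/")
    shell:
        "cp {input} {output.tmp} && "
        "sort {output.tmp} > {output.final} && "
        "mkdir -p {output.qc}"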