
Commit a81ced2

feat: add new ensembl-regulation wrapper to systematically download Ensembl regulatory_features (#3017)
### QC

* [x] I confirm that:

For all wrappers added by this PR,

* there is a test case which covers any introduced changes,
* `input:` and `output:` file paths in the resulting rule can be changed arbitrarily,
* either the wrapper can only use a single core, or the example rule contains a `threads: x` statement with `x` being a reasonable default,
* rule names in the test case are in [snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell what the rule is about or match the tools purpose or name (e.g., `map_reads` for a step that maps reads),
* all `environment.yaml` specifications follow [the respective best practices](https://stackoverflow.com/a/64594513/2352071),
* the `environment.yaml` pinning has been updated by running `snakedeploy pin-conda-envs environment.yaml` on a linux machine,
* wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in `input:` or `output:`),
* all fields of the example rules in the `Snakefile`s and their entries are explained via comments (`input:`/`output:`/`params:` etc.),
* `stderr` and/or `stdout` are logged correctly (`log:`), depending on the wrapped tool,
* temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function `tempfile.gettempdir()` points to (see [here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir); this also means that using any Python `tempfile` default behavior works),
* the `meta.yaml` contains a link to the documentation of the respective tool or command,
* `Snakefile`s pass the linting (`snakemake --lint`),
* `Snakefile`s are formatted with [snakefmt](https://github.com/snakemake/snakefmt),
* Python wrapper scripts are formatted with [black](https://black.readthedocs.io).
* Conda environments use a minimal amount of channels, in recommended ordering. E.g. for bioconda, use (conda-forge, bioconda, nodefaults, as conda-forge should have highest priority and defaults channels are usually not needed because most packages are in conda-forge nowadays).
1 parent ba9bae2 commit a81ced2

6 files changed, +183 -0 lines changed
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+# This file may be used to create an environment using:
+# $ conda create --name <env> --file <this file>
+# platform: linux-64
+@EXPLICIT
+https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2#d7c89558ba9fa0495403155b64376d81
+https://conda.anaconda.org/conda-forge/linux-64/ca-certificates-2024.6.2-hbcca054_0.conda#847c3c2905cc467cea52c24f9cfa8080
+https://conda.anaconda.org/conda-forge/linux-64/libgomp-13.2.0-h77fa898_13.conda#d370d1855cca14dff6a819c90c77497c
+https://conda.anaconda.org/conda-forge/linux-64/_openmp_mutex-4.5-2_gnu.tar.bz2#73aaf86a425cc6e73fcf236a5a46396d
+https://conda.anaconda.org/conda-forge/linux-64/libgcc-ng-13.2.0-h77fa898_13.conda#9358cdd61ef0d600d2a0dde2d53b006c
+https://conda.anaconda.org/conda-forge/linux-64/gettext-tools-0.22.5-h59595ed_2.conda#985f2f453fb72408d6b6f1be0f324033
+https://conda.anaconda.org/conda-forge/linux-64/libgettextpo-0.22.5-h59595ed_2.conda#172bcc51059416e7ce99e7b528cede83
+https://conda.anaconda.org/conda-forge/linux-64/libstdcxx-ng-13.2.0-hc0a3c3a_13.conda#1053882642ed5bbc799e1e866ff86826
+https://conda.anaconda.org/conda-forge/linux-64/libunistring-0.9.10-h7f98852_0.tar.bz2#7245a044b4a1980ed83196176b78b73a
+https://conda.anaconda.org/conda-forge/linux-64/libzlib-1.3.1-h4ab18f5_1.conda#57d7dc60e9325e3de37ff8dffd18e814
+https://conda.anaconda.org/conda-forge/linux-64/openssl-3.3.1-h4ab18f5_0.conda#a41fa0e391cc9e0d6b78ac69ca047a6c
+https://conda.anaconda.org/conda-forge/linux-64/libasprintf-0.22.5-h661eb56_2.conda#dd197c968bf9760bba0031888d431ede
+https://conda.anaconda.org/conda-forge/linux-64/libgettextpo-devel-0.22.5-h59595ed_2.conda#b63d9b6da3653179a278077f0de20014
+https://conda.anaconda.org/conda-forge/linux-64/zlib-1.3.1-h4ab18f5_1.conda#9653f1bf3766164d0e65fa723cabbc54
+https://conda.anaconda.org/conda-forge/linux-64/libasprintf-devel-0.22.5-h661eb56_2.conda#02e41ab5834dcdcc8590cf29d9526f50
+https://conda.anaconda.org/conda-forge/linux-64/gettext-0.22.5-h59595ed_2.conda#219ba82e95d7614cf7140d2a4afc0926
+https://conda.anaconda.org/conda-forge/linux-64/libidn2-2.3.7-hd590300_0.conda#2b7b0d827c6447cc1d85dc06d5b5de46
+https://conda.anaconda.org/conda-forge/linux-64/wget-1.21.4-hda4d442_0.conda#361e96b664eac64a33c20dfd11affbff
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+channels:
+  - conda-forge
+  - nodefaults
+dependencies:
+  - wget =1
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+name: ensembl-regulation
+description: >
+  Download annotation of regulatory features (e.g. promoters) for genomes from ENSEMBL FTP servers, and store them in a single .gff or .gff3 file.
+  The output file can be gzipped, which will save space and avoid unzipping during the download.
+  From release 112 onwards, gff3 files are available and the wrapper will require this file extension.
+  For older releases (>=87), only gff files with a different file path are available and the wrapper will require the gff extension.
+  For the available species (human and mouse as of writing), see the "Regulation (GFF)" column on the FTP download site:
+  ``https://www.ensembl.org/info/data/ftp/index.html``
+authors:
+  - Johannes Köster
+  - David Lähnemann
+output:
+  - Ensembl GFF annotation file for regulatory features.
+params:
+  - url: Base URL from where to download the regulatory feature data (optional; by default is ``ftp://ftp.ensembl.org/pub``).
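The release and build rules in the description above can be restated as a small standalone function. This is only a sketch that mirrors the path patterns used by the wrapper script in this commit; the function name `regulation_url` and its signature are hypothetical and not part of the wrapper, and the legacy URL keeps the `*` wildcard because the dated part of the file name is not known in advance.

```python
def regulation_url(species, build, release, fmt, base="ftp://ftp.ensembl.org/pub"):
    """Return the (possibly wildcarded) URL of a regulatory features file.

    Hypothetical helper: it only restates the rules from the description,
    it does not check that the file actually exists on the server.
    """
    species = species.lower()
    release = int(release)
    if release < 87:
        raise ValueError("Regulatory feature GFF files need release 87 or newer.")

    # Releases before 112, and the GRCh37 archive, only provide .gff files.
    legacy = release < 112 or build == "GRCh37"
    expected = "gff" if legacy else "gff3"
    if fmt != expected:
        raise ValueError(f"Expected a .{expected}[.gz] output, got .{fmt}.")

    if legacy:
        prefix = "grch37/" if build == "GRCh37" else ""
        return (
            f"{base}/{prefix}release-{release}/regulation/{species}/"
            f"{species}.{build}.Regulatory_Build.regulatory_features.*.gff.gz"
        )
    return (
        f"{base}/release-{release}/regulation/{species}/{build}/annotation/"
        f"{species.capitalize()}.{build}.regulatory_features.v{release}.gff3.gz"
    )


if __name__ == "__main__":
    print(regulation_url("homo_sapiens", "GRCh38", 112, "gff3"))
    print(regulation_url("homo_sapiens", "GRCh37", 112, "gff"))
```

Keeping the format check next to the path selection means a mismatched output extension fails before any download is attempted.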
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+rule get_regulatory_features_gff3_gz:
+    output:
+        "resources/regulatory_features.gff3.gz",  # presence of .gz determines if the download is kept compressed
+    params:
+        species="homo_sapiens",  # for available species, release and build, search via "Regulation (GFF)" column at: https://www.ensembl.org/info/data/ftp/index.html
+        release="112",
+        build="GRCh38",
+    log:
+        "logs/get_regulatory_features.log",
+    cache: "omit-software"  # save space and time with between-workflow caching (see docs); for data downloads, software does not affect the resulting data
+    wrapper:
+        "master/bio/reference/ensembl-regulation"
+
+
+rule get_regulatory_features_grch37_gff:
+    output:
+        "resources/regulatory_features.gff",
+    params:
+        species="homo_sapiens",
+        release="112",
+        build="GRCh37",
+    log:
+        "logs/get_regulatory_features.log",
+    cache: "omit-software"  # save space and time with between-workflow caching (see docs); for data downloads, software does not affect the resulting data
+    wrapper:
+        "master/bio/reference/ensembl-regulation"
+
+
+rule get_regulatory_features_mouse_gff_gz:
+    output:
+        "resources/regulatory_features.mouse.gff.gz",
+    params:
+        species="mus_musculus",
+        release="98",
+        build="GRCm38",
+    log:
+        "logs/get_regulatory_features.log",
+    cache: "omit-software"  # save space and time with between-workflow caching (see docs); for data downloads, software does not affect the resulting data
+    wrapper:
+        "master/bio/reference/ensembl-regulation"
Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
+__author__ = "Johannes Köster"
+__copyright__ = "Copyright 2024, Johannes Köster"
+__email__ = "[email protected]"
+__license__ = "MIT"
+
+import subprocess
+import sys
+from pathlib import Path
+from snakemake.shell import shell
+
+
+log = snakemake.log_fmt_shell(stdout=False, stderr=True)
+
+
+species = snakemake.params.species.lower()
+build = snakemake.params.build
+release = int(snakemake.params.release)
+gtf_release = release
+out_fmt = Path(snakemake.output[0]).suffixes
+out_gz = (out_fmt.pop() and True) if out_fmt[-1] == ".gz" else False
+out_fmt = out_fmt.pop().lstrip(".")
+
+if release < 87:
+    raise ValueError(
+        "Comprehensive GFF files are only available for release 87 or newer."
+    )
+
+if build == "GRCh37":
+    grch37 = "grch37/"
+else:
+    grch37 = ""
+
+
+suffix = ""
+if out_fmt == "gff":
+    suffix = "gff.gz"
+elif out_fmt == "gff3":
+    suffix = "gff3.gz"
+else:
+    raise ValueError(
+        "Invalid format specified. "
+        "Only 'gff[.gz]' (for releases before 112, and for build GRCh37) and "
+        "'gff3[.gz]' (for any release from 112 onwards) are currently supported."
+    )
+
+
+url = snakemake.params.get("url", "ftp://ftp.ensembl.org/pub")
+if release < 112 or build == "GRCh37":
+    if out_fmt != "gff":
+        raise ValueError(
+            f"Invalid suffix for output file '{snakemake.output[0]}'. "
+            "For releases older than 112 and for human build GRCh37, only .gff or .gff.gz are valid."
+        )
+    url = f"{url}/{grch37}release-{release}/regulation/{species}/{species}.{build}.Regulatory_Build.regulatory_features.*.{suffix}"
+else:
+    if out_fmt != "gff3":
+        raise ValueError(
+            f"Invalid suffix for output file '{snakemake.output[0]}'. "
+            "For (non-GRCh37) releases from 112 onwards, only .gff3 or .gff3.gz are valid."
+        )
+    url = f"{url}/release-{release}/regulation/{species}/{build}/annotation/{species.capitalize()}.{build}.regulatory_features.v{release}.{suffix}"
+
+try:
+    if out_gz:
+        shell('wget "{url}" -O {snakemake.output[0]} {log}')
+    else:
+        shell('(wget "{url}" -O - | gzip -d > {snakemake.output[0]}) {log}')
+except subprocess.CalledProcessError as e:
+    if snakemake.log:
+        sys.stderr = open(snakemake.log[0], "a")
+    print(
+        "Unable to download regulatory feature data from Ensembl. "
+        "Did you check that this combination of species, build, and release is actually provided? "
+        "A good entry point for a search is: https://www.ensembl.org/info/data/ftp/index.html",
+        file=sys.stderr,
+    )
+    exit(1)
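The wrapper above infers both the format and the compression from the suffixes of the output path by popping from the list that `Path(...).suffixes` returns. As a hedged alternative, the same inference could be written as a helper that also reports a clear error when the path has no usable extension, where the pop-based version would raise an `IndexError`. The name `split_output_suffix` is hypothetical and is not used by the wrapper.

```python
from pathlib import Path


def split_output_suffix(path):
    """Return (format, gzipped) for an output path such as 'x.gff3.gz'.

    Hypothetical helper illustrating the suffix handling; the format is the
    last extension once an optional trailing '.gz' has been removed.
    """
    suffixes = [s.lstrip(".") for s in Path(path).suffixes]
    gzipped = bool(suffixes) and suffixes[-1] == "gz"
    if gzipped:
        suffixes.pop()
    if not suffixes:
        raise ValueError(f"Cannot infer a format from output file '{path}'.")
    return suffixes[-1], gzipped


if __name__ == "__main__":
    for example in (
        "resources/regulatory_features.gff3.gz",
        "resources/regulatory_features.gff",
        "resources/regulatory_features.mouse.gff.gz",
    ):
        print(example, split_output_suffix(example))
```

Extra dotted components in the file name, such as `.mouse` in the third example rule, do not affect the result because only the trailing extensions are inspected.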

test.py

Lines changed: 24 additions & 0 deletions
@@ -5598,6 +5598,30 @@ def test_ensembl_annotation_gtf_gz():
     )
 
 
+@skip_if_not_modified
+def test_ensembl_regulatory_gff3_gz():
+    run(
+        "bio/reference/ensembl-regulation",
+        ["snakemake", "--cores", "1", "resources/regulatory_features.gff3.gz", "--use-conda", "-F"],
+    )
+
+
+@skip_if_not_modified
+def test_ensembl_regulatory_features_grch37_gff():
+    run(
+        "bio/reference/ensembl-regulation",
+        ["snakemake", "--cores", "1", "resources/regulatory_features.gff", "--use-conda", "-F"],
+    )
+
+
+@skip_if_not_modified
+def test_ensembl_regulatory_features_mouse_gff_gz():
+    run(
+        "bio/reference/ensembl-regulation",
+        ["snakemake", "--cores", "1", "resources/regulatory_features.mouse.gff.gz", "--use-conda", "-F"],
+    )
+
+
 @skip_if_not_modified
 def test_ensembl_variation():
     run(
