feat: new inphared-db wrapper #1550

Open
wants to merge 3 commits into
base: master
5 changes: 5 additions & 0 deletions bio/reference/inphared-db/environment.yaml
@@ -0,0 +1,5 @@
channels:
  - conda-forge
  - nodefaults
dependencies:
  - curl
4 changes: 4 additions & 0 deletions bio/reference/inphared-db/meta.yaml
@@ -0,0 +1,4 @@
name: inphared-db
description: Download sequence file from the Inphared database (https://github.com/RyanCook94/inphared/blob/main/README.md), and store them in a single .fasta file. Please check the current database available at the above link and adjust the config file.
Review comment (Member):

Suggested change
- description: Download sequence file from the Inphared database (https://github.com/RyanCook94/inphared/blob/main/README.md), and store them in a single .fasta file. Please check the current database available at the above link and adjust the config file.
+ description: Download sequence file from the [inphared database](https://github.com/RyanCook94/inphared/blob/main/README.md), and store them in a single .fasta file. Please check the above link for available database version and adjust the config file.

authors:
- Noriko A. Cassman
80 changes: 80 additions & 0 deletions bio/reference/inphared-db/old_wrapper.py
@@ -0,0 +1,80 @@
__author__ = "Johannes Köster"
Review comment (Member):

This file can simply be deleted here, right?

__copyright__ = "Copyright 2019, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"

import subprocess as sp
import sys
from itertools import product
from snakemake.shell import shell

species = snakemake.params.species.lower()
release = int(snakemake.params.release)
build = snakemake.params.build

branch = ""
if release >= 81 and build == "GRCh37":
    # use the special grch37 branch for new releases
    branch = "grch37/"
elif snakemake.params.get("branch"):
    branch = snakemake.params.branch + "/"

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

spec = ("{build}" if int(release) > 75 else "{build}.{release}").format(
    build=build, release=release
)

suffixes = ""
datatype = snakemake.params.get("datatype", "")
chromosome = snakemake.params.get("chromosome", "")
if datatype == "dna":
    if chromosome:
        suffixes = ["dna.chromosome.{}.fa.gz".format(chromosome)]
    else:
        suffixes = ["dna.primary_assembly.fa.gz", "dna.toplevel.fa.gz"]
elif datatype == "cdna":
    suffixes = ["cdna.all.fa.gz"]
elif datatype == "cds":
    suffixes = ["cds.all.fa.gz"]
elif datatype == "ncrna":
    suffixes = ["ncrna.fa.gz"]
elif datatype == "pep":
    suffixes = ["pep.all.fa.gz"]
else:
    raise ValueError("invalid datatype, must be one of dna, cdna, cds, ncrna, pep")

if chromosome:
    if not datatype == "dna":
        raise ValueError(
            "invalid datatype, to select a single chromosome the datatype must be dna"
        )

spec = spec.format(build=build, release=release)
url_prefix = f"ftp://ftp.ensembl.org/pub/{branch}release-{release}/fasta/{species}/{datatype}/{species.capitalize()}.{spec}"

success = False
for suffix in suffixes:
    url = f"{url_prefix}.{suffix}"

    try:
        shell("curl -sSf {url} > /dev/null 2> /dev/null")
    except sp.CalledProcessError:
        continue

    shell("(curl -L {url} | gzip -d > {snakemake.output[0]}) {log}")
    success = True
    break

if not success:
    if len(suffixes) > 1:
        url = f"{url_prefix}.[{'|'.join(suffixes)}]"
    else:
        url = f"{url_prefix}.{suffixes[0]}"
    print(
        f"Unable to download requested sequence data from Ensembl ({url}). "
        "Please check whether above URL is currently available (might be a temporal server issue). "
        "Apart from that, did you check that this combination of species, build, and release is actually provided?",
        file=sys.stderr,
    )
    exit(1)
12 changes: 12 additions & 0 deletions bio/reference/inphared-db/test/Snakefile
@@ -0,0 +1,12 @@
configfile: "config.yaml"
Review comment (Contributor):

⚠️ Potential issue

Fix the pipeline failure by ensuring config.yaml exists

The pipeline is failing because the config.yaml file is missing or inaccessible. Make sure the config file exists in the same directory as this Snakefile and has the correct permissions.

#!/bin/bash
# Check if config.yaml exists in the same directory as this Snakefile
if [ -f "bio/reference/inphared-db/test/config.yaml" ]; then
  echo "config.yaml exists"
else
  echo "config.yaml does not exist in the expected location"
fi
🧰 Tools
🪛 GitHub Actions: Code quality

[error] 1-1: WorkflowError: configfile 'config.yaml' is defined but not present or accessible at /home/runner/work/snakemake-wrappers/snakemake-wrappers/config.yaml.
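
The error above shows Snakemake looking for config.yaml at the repository root rather than next to the Snakefile, which suggests the relative configfile path is resolved against the working directory of the CI run. The following is a minimal Python sketch of the same existence check as the shell script, but resolved relative to the Snakefile. The helper names `config_next_to` and `config_exists` are hypothetical and not part of Snakemake.

```python
from pathlib import Path


def config_next_to(snakefile: str, name: str = "config.yaml") -> Path:
    """Return the path of `name` in the same directory as `snakefile`."""
    # resolve() makes the path absolute, so the result no longer depends
    # on which directory the workflow is later started from.
    return Path(snakefile).resolve().parent / name


def config_exists(snakefile: str, name: str = "config.yaml") -> bool:
    """Report whether the config file sits next to the Snakefile."""
    return config_next_to(snakefile, name).is_file()


if __name__ == "__main__":
    snakefile = "bio/reference/inphared-db/test/Snakefile"
    state = "exists" if config_exists(snakefile) else "does not exist"
    print(f"{config_next_to(snakefile)} {state}")
```

If the file exists next to the Snakefile but the workflow still fails, the working directory of the failing job is the more likely culprit than a missing file.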


rule get_inphareddb:
    output:
        expand("{date}{suffix}", date=config["date"], suffix=config["suffix"])
Review comment (Member):

One thing we aim for with these wrappers is that they work with arbitrary file names. So the config[] entries should not be used for the output file name in this example Snakefile; instead, the wrapper.py file should read them (via snakemake.params.date, for example).

Suggested change
- expand("{date}{suffix}", date=config["date"], suffix=config["suffix"])
+ "resources/inphared.fasta"

To get the values of the config variables back into the file name, users of the wrapper can add Python code around it. I also like the idea of showcasing how to do that directly here. So maybe we could have two versions of calling the wrapper here: one with a fixed file name (as I suggest here) and one that builds the name from the config[] entries.
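
The two calling styles differ only in how the output path is computed. A minimal sketch in plain Python, assuming the `date` and `suffix` entries from the test config.yaml; the helper `inphared_output` and the `resources` directory default are hypothetical, chosen only for illustration.

```python
def inphared_output(config: dict, directory: str = "resources") -> str:
    """Build an output path from the `date` and `suffix` config entries."""
    return f"{directory}/{config['date']}{config['suffix']}"


# Variant 1: a fixed file name, independent of the config.
FIXED_OUTPUT = "resources/inphared.fasta"

# Variant 2: a file name derived from the config values.
example_config = {"date": "2Jul2023", "suffix": "_refseq_genomes.fa"}
CONFIG_OUTPUT = inphared_output(example_config)

if __name__ == "__main__":
    print(FIXED_OUTPUT)
    print(CONFIG_OUTPUT)
```

Either string could serve as the `output:` of a rule calling the wrapper, since the wrapper itself would only ever see the final path.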

    params:
        prefix = config["prefix"],
Review comment (Member):

Maybe rename this to url:, as that might make its purpose a bit clearer?

Suggested change
- prefix = config["prefix"],
+ url = config["url"],

Obviously, the same rename then applies in the config.yaml file.

        date = config["date"],
        suffix = config["suffix"]
    wrapper:
        "master/bio/reference/inphared-db"

9 changes: 9 additions & 0 deletions bio/reference/inphared-db/test/config.yaml
@@ -0,0 +1,9 @@
date:
  "2Jul2023"

suffix:
  "_refseq_genomes.fa"
Review comment (Member):

Is there a smaller reference FASTA file (for example, of some small subset of viruses) that could be used in the example? Ideally this gets executed regularly in the CI tests, so it shouldn't download much, if possible.

Also, we should still add an actual test run for this wrapper.

  #"_genomes_excluding_refseq.fa"

prefix:
  "https://millardlab-inphared.s3.climb.ac.uk/"
Review comment (Member), on lines +8 to +9:

To keep this in sync with the suggestions elsewhere, we would change this to:

Suggested change
- prefix:
-   "https://millardlab-inphared.s3.climb.ac.uk/"
+ url:
+   "https://millardlab-inphared.s3.climb.ac.uk"

29 changes: 29 additions & 0 deletions bio/reference/inphared-db/test/old_release.smk
@@ -0,0 +1,29 @@
rule get_genome:
Review comment (Member):

This file can be deleted here, right? It's simply a copy-paste leftover, if I understand this correctly.

    output:
        "refs/genome.fasta",
    params:
        species="saccharomyces_cerevisiae",
        datatype="dna",
        build="R64-1-1",
        release="75",
    log:
        "logs/get_genome.log",
    cache: "omit-software"  # save space and time with between workflow caching (see docs)
    wrapper:
        "master/bio/reference/ensembl-sequence"


rule get_chromosome:
    output:
        "refs/old_release.chr1.fasta",
    params:
        species="saccharomyces_cerevisiae",
        datatype="dna",
        build="R64-1-1",
        release="75",
        chromosome="I",
    log:
        "logs/get_genome.log",
Review comment (Contributor):

🛠️ Refactor suggestion

Use separate log files for different rules

Both get_genome and get_chromosome rules use the same log file, which could cause conflicts if both rules run concurrently.

-    log:
-        "logs/get_genome.log",
+    log:
+        "logs/get_chromosome.log",
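
Collisions like this can also be spotted mechanically. A minimal sketch, assuming the rule-to-log mapping has already been collected by hand into a dictionary; `find_shared_logs` is a hypothetical helper and does not use any Snakemake API.

```python
from __future__ import annotations

from collections import defaultdict


def find_shared_logs(rule_logs: dict[str, str]) -> dict[str, list[str]]:
    """Map each log path used by more than one rule to the rules using it."""
    rules_by_log: defaultdict[str, list[str]] = defaultdict(list)
    for rule, log in rule_logs.items():
        rules_by_log[log].append(rule)
    # Keep only the log paths that are claimed by at least two rules.
    return {
        log: sorted(rules)
        for log, rules in rules_by_log.items()
        if len(rules) > 1
    }


if __name__ == "__main__":
    logs = {
        "get_genome": "logs/get_genome.log",
        "get_chromosome": "logs/get_genome.log",
    }
    print(find_shared_logs(logs))
```

With the suggested rename to logs/get_chromosome.log applied, the same call would return an empty dictionary.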

    cache: "omit-software"  # save space and time with between workflow caching (see docs)
    wrapper:
        "master/bio/reference/ensembl-sequence"
30 changes: 30 additions & 0 deletions bio/reference/inphared-db/test/old_snakefile.smk
@@ -0,0 +1,30 @@
rule get_genome:
Review comment (Member):

This file can simply be deleted here, right?

    output:
        "refs/genome.fasta",
    params:
        species="saccharomyces_cerevisiae",
        datatype="dna",
        build="R64-1-1",
        release="98",
    log:
        "logs/get_genome.log",
    cache: "mit-software"  # save space and time with between workflow caching (see docs)
Review comment (Contributor):

⚠️ Potential issue

Fix typo in cache parameter

There appears to be a typo in the cache parameter. This rule uses "mit-software" while the other rule uses "omit-software".

-    cache: "mit-software"  # save space and time with between workflow caching (see docs)
+    cache: "omit-software"  # save space and time with between workflow caching (see docs)

    wrapper:
        "master/bio/reference/ensembl-sequence"


rule get_chromosome:
    output:
        "refs/chr1.fasta",
    params:
        species="saccharomyces_cerevisiae",
        datatype="dna",
        build="R64-1-1",
        release="101",
        chromosome="I",  # optional: restrict to chromosome
        # branch="plants",  # optional: specify branch
    log:
        "logs/get_genome.log",
Review comment (Contributor):

🛠️ Refactor suggestion

Use separate log files for different rules

Both get_genome and get_chromosome rules use the same log file, which could cause conflicts if both rules run concurrently.

-    log:
-        "logs/get_genome.log",
+    log:
+        "logs/get_chromosome.log",

    cache: "omit-software"  # save space and time with between workflow caching (see docs)
    wrapper:
        "master/bio/reference/ensembl-sequence"
9 changes: 9 additions & 0 deletions bio/reference/inphared-db/wrapper.py
@@ -0,0 +1,9 @@
__author__ = "Noriko A. Cassman"
__copyright__ = "Copyright 2023, Noriko A. Cassman"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell

shell:
"curl {params.prefix}{params.date}{params.suffix} -o {params.date}{params.suffix}"
Review comment (Member), on lines +8 to +9:

A few small things here:

  1. Anything from the Snakemake rule has to be referenced via the snakemake object, for example snakemake.params.date.
  2. We should only use the full output file name here, so that users can put in whatever they want.
  3. We have to call the shell() function here, as this is not the body of an actual rule but a plain Python script.
  4. We have to use the f"" construct here so that format strings fill in the variables in this context. Here we are dealing not with Snakemake wildcards but with Python variables. The {} syntax is the same, which makes this very confusing...
  5. I would manually put the separator between URL and file name here, for a slightly clearer structure of the download link. Then we can remove the trailing / from the url in config.yaml.

Suggested change
- shell:
-     "curl {params.prefix}{params.date}{params.suffix} -o {params.date}{params.suffix}"
+ shell(f"curl {snakemake.params.url}/{snakemake.params.date}{snakemake.params.suffix} -o {snakemake.output}")
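
The points in this comment can be sketched as a small, testable helper that only assembles the command string. This is an illustration under assumptions, not the wrapper's final code: `build_download_command` is a hypothetical name, and the `-L` and `--fail` curl flags as well as the shell quoting are additions that do not appear in the suggestion.

```python
import shlex


def build_download_command(url: str, date: str, suffix: str, output: str) -> str:
    """Return a curl command that downloads one INPHARED file to `output`."""
    # Strip any trailing slash so the separator is added exactly once,
    # whether or not the configured url ends with "/".
    source = f"{url.rstrip('/')}/{date}{suffix}"
    # Quote both arguments so unusual characters in a path stay intact.
    # -L follows redirects; --fail makes curl exit non-zero on HTTP errors
    # (both flags are assumptions, not part of the reviewed suggestion).
    return f"curl -L --fail {shlex.quote(source)} -o {shlex.quote(output)}"


if __name__ == "__main__":
    print(
        build_download_command(
            "https://millardlab-inphared.s3.climb.ac.uk",
            "2Jul2023",
            "_refseq_genomes.fa",
            "resources/inphared.fasta",
        )
    )
```

Inside wrapper.py, the arguments would come from snakemake.params.url, snakemake.params.date, snakemake.params.suffix and snakemake.output[0], and the resulting string would be passed to shell().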
