-
Notifications
You must be signed in to change notification settings - Fork 194
feat: new inphared-db wrapper #1550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Is the config file really necessary? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good already. But of course, I have a bunch of suggestions. Maybe we can work on those together during the symposium or even the workshop.
|
||
rule get_inphareddb: | ||
output: | ||
expand("{date}{suffix}", date=config["date"], suffix=config["suffix"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the things we try with these wrappers, is to have them work with arbitrary file names. So the config[]
entries should not be used in this example Snakefile, but rather in the wrapper.py
file (via snakemake.params.date
, for example).
expand("{date}{suffix}", date=config["date"], suffix=config["suffix"]) | |
"resources/inphared.fasta" |
To add the values in the config variables back into the file name, the users of the wrapper should then add python code around it. But I also like the idea of directly showcasing how to do that here. So maybe we could have two versions of calling the wrapper here, one with a fixed file name (like i suggest here), and one that contains the config[]
entries.
@@ -0,0 +1,4 @@ | |||
name: inphared-db | |||
description: Download sequence file from the Inphared database (https://github.com/RyanCook94/inphared/blob/main/README.md), and store them in a single .fasta file. Please check the current database available at the above link and adjust the config file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
description: Download sequence file from the Inphared database (https://github.com/RyanCook94/inphared/blob/main/README.md), and store them in a single .fasta file. Please check the current database available at the above link and adjust the config file. | |
description: Download sequence file from the [inphared database](https://github.com/RyanCook94/inphared/blob/main/README.md), and store them in a single .fasta file. Please check the above link for available database version and adjust the config file. |
output: | ||
expand("{date}{suffix}", date=config["date"], suffix=config["suffix"]) | ||
params: | ||
prefix = config["prefix"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rename this to url:
, as that might make its purpose a bit clearer?
prefix = config["prefix"], | |
url = config["url"], |
Obviously, the same rename then applies in the config.yaml
file.
@@ -0,0 +1,29 @@ | |||
rule get_genome: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file can be deleted here, right? It's simply a copy-paste leftover, if I understand this correctly.
@@ -0,0 +1,80 @@ | |||
__author__ = "Johannes Köster" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file can simply be deleted here, right?
@@ -0,0 +1,30 @@ | |||
rule get_genome: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file can simply be deleted here, right?
shell: | ||
"curl {params.prefix}{params.date}{params.suffix} -o {params.date}{params.suffix}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some little things here:
- You have to reference anything from the snakemake rule via the
snakemake
dict. So for examplesnakemake.params.date
. - We should only use the full
output
file name here, so that users can put in there whatever they want. - We have to call the
shell()
function here, as this is not the body of an actual rule, but rather a plain python script. - We have to use the
f""
construct here to use format strings to fill in the variables in this context. Here, we are not dealing with snakemake wildcards, but rather python variables. The{}
syntax is the same, so this is very confusing... - I would manually put in the separator between URL and file name here, just for a slightly clearer structure of the download link. Then, we can remove the trailing
/
in the url inconfig.yaml
.
shell: | |
"curl {params.prefix}{params.date}{params.suffix} -o {params.date}{params.suffix}" | |
shell(f"curl {snakemake.params.url}/{snakemake.params.date}{snakemake.params.suffix} -o {snakemake.output}" |
prefix: | ||
"https://millardlab-inphared.s3.climb.ac.uk/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To keep this in sync with the suggestions elsewhere, we would change this to:
prefix: | |
"https://millardlab-inphared.s3.climb.ac.uk/" | |
url: | |
"https://millardlab-inphared.s3.climb.ac.uk" |
"2Jul2023" | ||
|
||
suffix: | ||
"_refseq_genomes.fa" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a smaller reference fasta file (of some small subset of viruses, e.g.), that could be used in the example? This should optimally get executed in the CI tests regularly, so shouldn't download much, if possible.
Also, we should still add an actual test run for this wrapper.
This PR was marked as stale because it has been open for 6 months with no activity. |
📝 WalkthroughWalkthroughSeveral new files were added under Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Snakemake
participant Wrapper (wrapper.py)
participant Remote Server
User->>Snakemake: Launch workflow (Snakefile)
Snakemake->>Wrapper (wrapper.py): Run shell command with params (prefix, date, suffix)
Wrapper (wrapper.py)->>Remote Server: curl {prefix}{date}{suffix}
Remote Server-->>Wrapper (wrapper.py): Send file data
Wrapper (wrapper.py)-->>Snakemake: File saved locally
Snakemake-->>User: Workflow completes, output available
sequenceDiagram
participant User
participant Snakemake
participant Old Wrapper (old_wrapper.py)
participant Ensembl FTP
User->>Snakemake: Launch workflow (old_snakefile.smk/old_release.smk)
Snakemake->>Old Wrapper (old_wrapper.py): Pass parameters (species, release, build, etc.)
Old Wrapper (old_wrapper.py)->>Ensembl FTP: Check for file existence (curl)
alt File exists
Ensembl FTP-->>Old Wrapper (old_wrapper.py): File found
Old Wrapper (old_wrapper.py)->>Ensembl FTP: Download and decompress file
Ensembl FTP-->>Old Wrapper (old_wrapper.py): Send file data
Old Wrapper (old_wrapper.py)-->>Snakemake: File saved locally
else File not found
Old Wrapper (old_wrapper.py)-->>Snakemake: Print error, exit with failure
end
Snakemake-->>User: Workflow completes or fails with error
Tip ⚡️ Faster reviews with caching
Enjoy the performance boost—your workflow just got faster. ✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
♻️ Duplicate comments (9)
bio/reference/inphared-db/meta.yaml (1)
2-2
: Fix trailing spaces and improve description format.The description has trailing spaces at the end of the line, which is flagged by the YAML linter. Also, consider using the Markdown link format as previously suggested.
-description: Download sequence file from the Inphared database (https://github.com/RyanCook94/inphared/blob/main/README.md), and store them in a single .fasta file. Please check the current database available at the above link and adjust the config file. +description: Download sequence file from the [inphared database](https://github.com/RyanCook94/inphared/blob/main/README.md), and store them in a single .fasta file. Please check the above link for available database version and adjust the config file.🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 2-2: trailing spaces
(trailing-spaces)
bio/reference/inphared-db/wrapper.py (1)
8-9
:⚠️ Potential issueFix syntax errors in shell command implementation.
The current implementation has syntax errors and doesn't properly use the Snakemake API. The shell command should use Python syntax, not Snakemake rule syntax.
Based on previous feedback, implement these changes:
- Use the
shell()
function directly- Access parameters through the
snakemake
object- Use f-strings for interpolation
- Use the output file specified by the Snakemake rule
- Add separator between URL components for clarity
- shell: - "curl {params.prefix}{params.date}{params.suffix} -o {params.date}{params.suffix}" +shell(f"curl {snakemake.params.url}/{snakemake.params.date}{snakemake.params.suffix} -o {snakemake.output}")Note: This assumes that the config.yaml has been updated to use
url
instead ofprefix
(without trailing slash) as suggested in previous reviews.🧰 Tools
🪛 Ruff (0.8.2)
8-8: SyntaxError: Unexpected indentation
8-9: SyntaxError: Expected an expression
9-9: SyntaxError: Unexpected indentation
🪛 GitHub Actions: Code quality
[error] 8-8: Black formatting check failed: Cannot parse file at line 8. File could not be formatted.
bio/reference/inphared-db/test/config.yaml (2)
8-9
: Update prefix to url without trailing slash.As suggested in previous reviews, rename the prefix key to url and remove the trailing slash.
-prefix: - "https://millardlab-inphared.s3.climb.ac.uk/" +url: + "https://millardlab-inphared.s3.climb.ac.uk"
4-6
: Consider using a smaller reference file for testing.The current reference file (_refseq_genomes.fa) may be large, which could be problematic for CI testing. As suggested in a previous review, find a smaller subset of viruses to use as a test reference.
Is there a smaller test file available from the inphared database that could be used for CI testing purposes? This would improve test execution times and reduce resource usage.
bio/reference/inphared-db/test/Snakefile (2)
5-5
: Use arbitrary file names instead of direct config referencesOne of the best practices for these wrappers is to work with arbitrary file names. The
config[]
entries should not be used directly in the output path in this example Snakefile, but rather in thewrapper.py
file.Consider showing both approaches - one with a fixed filename and one with config values:
- output: - expand("{date}{suffix}", date=config["date"], suffix=config["suffix"]) + output: + "resources/inphared.fasta"Users can add Python code to incorporate config variables in their own workflow if needed.
7-7
: Rename parameter from 'prefix' to 'url' for clarityRenaming this parameter would make its purpose clearer to users of the wrapper.
- prefix = config["prefix"], + url = config["url"],Remember to update the corresponding entry in the config.yaml file as well.
bio/reference/inphared-db/test/old_release.smk (1)
1-14
: Remove this file if it's not needed for the inphared-db wrapperThis file appears to be unrelated to the inphared-db wrapper functionality and might be a copy-paste leftover as noted in previous review comments.
If this file is not needed for the inphared-db wrapper, it should be removed to prevent confusion.
bio/reference/inphared-db/test/old_snakefile.smk (1)
1-14
: Remove this file if it's not needed for the inphared-db wrapperThis file appears to be unrelated to the inphared-db wrapper functionality and might be a leftover file as noted in previous review comments.
If this file is not needed for the inphared-db wrapper, it should be removed to prevent confusion.
bio/reference/inphared-db/old_wrapper.py (1)
1-5
: Remove this file if it's not needed for the inphared-db wrapperThis file appears to be handling Ensembl downloads rather than inphared database downloads, which seems inconsistent with the PR title. Previous review comments suggested this file might be unnecessary.
If this file is not needed for the inphared-db wrapper, it should be removed to prevent confusion.
🧹 Nitpick comments (2)
bio/reference/inphared-db/old_wrapper.py (2)
8-8
: Remove unused importThe
itertools.product
import is never used in this script.-from itertools import product
🧰 Tools
🪛 Ruff (0.8.2)
8-8:
itertools.product
imported but unusedRemove unused import:
itertools.product
(F401)
47-52
: Simplify nested if statementsThe nested if statements can be combined for better readability.
-if chromosome: - if not datatype == "dna": - raise ValueError( - "invalid datatype, to select a single chromosome the datatype must be dna" - ) +if chromosome and datatype != "dna": + raise ValueError( + "invalid datatype, to select a single chromosome the datatype must be dna" + )🧰 Tools
🪛 Ruff (0.8.2)
47-48: Use a single
if
statement instead of nestedif
statementsCombine
if
statements usingand
(SIM102)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
bio/reference/inphared-db/environment.yaml
(1 hunks)bio/reference/inphared-db/meta.yaml
(1 hunks)bio/reference/inphared-db/old_wrapper.py
(1 hunks)bio/reference/inphared-db/test/Snakefile
(1 hunks)bio/reference/inphared-db/test/config.yaml
(1 hunks)bio/reference/inphared-db/test/old_release.smk
(1 hunks)bio/reference/inphared-db/test/old_snakefile.smk
(1 hunks)bio/reference/inphared-db/wrapper.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
`**/*.py`: Do not try to improve formatting. Do not suggest type annotations for functions that are defined inside of functions or methods. Do not suggest type annotation of the `s...
**/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theself
argument of methods.
Do not suggest type annotation of thecls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturn
statement.
bio/reference/inphared-db/wrapper.py
bio/reference/inphared-db/old_wrapper.py
`**/wrapper.py`: Do not complain about use of undefined variable called `snakemake`.
**/wrapper.py
: Do not complain about use of undefined variable calledsnakemake
.
bio/reference/inphared-db/wrapper.py
🪛 YAMLlint (1.35.1)
bio/reference/inphared-db/meta.yaml
[error] 2-2: trailing spaces
(trailing-spaces)
🪛 Ruff (0.8.2)
bio/reference/inphared-db/wrapper.py
8-8: SyntaxError: Unexpected indentation
8-9: SyntaxError: Expected an expression
9-9: SyntaxError: Unexpected indentation
bio/reference/inphared-db/old_wrapper.py
8-8: itertools.product
imported but unused
Remove unused import: itertools.product
(F401)
11-11: Undefined name snakemake
(F821)
12-12: Undefined name snakemake
(F821)
13-13: Undefined name snakemake
(F821)
19-19: Undefined name snakemake
(F821)
20-20: Undefined name snakemake
(F821)
22-22: Undefined name snakemake
(F821)
29-29: Undefined name snakemake
(F821)
30-30: Undefined name snakemake
(F821)
47-48: Use a single if
statement instead of nested if
statements
Combine if
statements using and
(SIM102)
🪛 GitHub Actions: Code quality
bio/reference/inphared-db/wrapper.py
[error] 8-8: Black formatting check failed: Cannot parse file at line 8. File could not be formatted.
bio/reference/inphared-db/test/Snakefile
[error] 1-1: WorkflowError: configfile 'config.yaml' is defined but not present or accessible at /home/runner/work/snakemake-wrappers/snakemake-wrappers/config.yaml.
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: testing
🔇 Additional comments (2)
bio/reference/inphared-db/environment.yaml (1)
1-5
: Environment configuration looks appropriate.The environment configuration correctly specifies conda-forge with nodefaults and includes the curl dependency, which is necessary for downloading from remote URLs.
bio/reference/inphared-db/test/config.yaml (1)
1-9
: Verify if this configuration file is necessary.One of the PR comments questioned if this configuration file is truly necessary. Consider whether these parameters could be directly specified in the Snakefile rules or if this level of configuration provides valuable flexibility to users.
Could you clarify the purpose of having a separate config file rather than hardcoding these parameters in the wrapper or allowing them to be passed directly as rule parameters?
@@ -0,0 +1,12 @@ | |||
configfile: "config.yaml" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix the pipeline failure by ensuring config.yaml exists
The pipeline is failing because the config.yaml file is missing or inaccessible. Make sure the config file exists in the same directory as this Snakefile and has the correct permissions.
#!/bin/bash
# Check if config.yaml exists in the same directory as this Snakefile
if [ -f "bio/reference/inphared-db/test/config.yaml" ]; then
echo "config.yaml exists"
else
echo "config.yaml does not exist in the expected location"
fi
🧰 Tools
🪛 GitHub Actions: Code quality
[error] 1-1: WorkflowError: configfile 'config.yaml' is defined but not present or accessible at /home/runner/work/snakemake-wrappers/snakemake-wrappers/config.yaml.
release="75", | ||
chromosome="I", | ||
log: | ||
"logs/get_genome.log", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Use separate log files for different rules
Both get_genome
and get_chromosome
rules use the same log file, which could cause conflicts if both rules run concurrently.
- log:
- "logs/get_genome.log",
+ log:
+ "logs/get_chromosome.log",
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
"logs/get_genome.log", | |
log: | |
"logs/get_chromosome.log", |
release="98", | ||
log: | ||
"logs/get_genome.log", | ||
cache: "mit-software" # save space and time with between workflow caching (see docs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix typo in cache parameter
There appears to be a typo in the cache parameter. This rule uses "mit-software" while the other rule uses "omit-software".
- cache: "mit-software" # save space and time with between workflow caching (see docs)
+ cache: "omit-software" # save space and time with between workflow caching (see docs)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
cache: "mit-software" # save space and time with between workflow caching (see docs) | |
cache: "omit-software" # save space and time with between workflow caching (see docs) |
chromosome="I", # optional: restrict to chromosome | ||
# branch="plants", # optional: specify branch | ||
log: | ||
"logs/get_genome.log", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Use separate log files for different rules
Both get_genome
and get_chromosome
rules use the same log file, which could cause conflicts if both rules run concurrently.
- log:
- "logs/get_genome.log",
+ log:
+ "logs/get_chromosome.log",
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
"logs/get_genome.log", | |
log: | |
"logs/get_chromosome.log", |
Description
Adding a new wrapper to download the current inphared database.
QC
For all wrappers added by this PR,
input:
andoutput:
file paths in the resulting rule can be changed arbitrarily,threads: x
statement withx
being a reasonable default,map_reads
for a step that maps reads),environment.yaml
specifications follow the respective best practices,input:
oroutput:
),Snakefile
s and their entries are explained via comments (input:
/output:
/params:
etc.),stderr
and/orstdout
are logged correctly (log:
), depending on the wrapped tool,tempfile.gettempdir()
points to (see here; this also means that using any Pythontempfile
default behavior works),meta.yaml
contains a link to the documentation of the respective tool or command,Snakefile
s pass the linting (snakemake --lint
),Snakefile
s are formatted with snakefmt,Summary by CodeRabbit