-
Notifications
You must be signed in to change notification settings - Fork 3
feat: Add VEP annotation with REVEL, Sift and PolyPhen Scores to fp… #154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
📝 WalkthroughWalkthroughAdds VEP+REVEL annotation rules and helper functions, updates FP/FN VCF output paths to use annotated_vcf, adds/updates Conda envs for curl and htslib, and inserts a pre-test "Free Disk Space" CI step. (33 words) Changes
Sequence DiagramsequenceDiagram
participant FP_FN as FP/FN VCFs
participant Cache as VEP Cache/Plugins
participant Revel as REVEL data
participant VEP as VEP Annotator
participant Output as Annotated VCF + stats
rect rgba(230,240,255,0.5)
Cache->>Cache: get_vep_cache
Cache->>Cache: get_vep_plugins
Revel->>Revel: download_revel
Revel->>Revel: process_revel_scores (GRCh37/GRCh38)
Revel->>Revel: tabix_revel_scores
end
rect rgba(230,255,230,0.5)
FP_FN->>VEP: annotate_shared_fn / annotate_unique_fp_fn (input VCFs)
Cache->>VEP: provide VEP cache & plugins
Revel->>VEP: provide REVEL TSV + index
VEP->>Output: produce `.annotated.vcf.gz` + `.html` stats
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Nitpick comments (2)
workflow/rules/common.smk (1)
744-748: Type inconsistency: returns string vs empty list.
get_plugin_auxreturns a string whenplugin == "REVEL"but an empty list[]otherwise. In Snakemake, this mixed return type can work for inputs, but consider returning an empty string or consistently using lists for clarity.🔎 Proposed fix for consistent return type
def get_plugin_aux(plugin, index=False): if plugin == "REVEL": suffix = ".tbi" if index else "" return "resources/revel_scores.tsv.gz{suffix}".format(suffix=suffix) - return [] + return ""workflow/envs/htslib.yaml (1)
1-6: Consider updating htslib to version 1.22.1.htslib 1.12 was released in March 2021. The latest available version in bioconda is 1.22.1, which includes bug fixes and performance improvements. Updating is straightforward and poses minimal compatibility risk for tabix/bgzip operations.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
workflow/Snakefile(2 hunks)workflow/envs/curl.yaml(1 hunks)workflow/envs/htslib.yaml(1 hunks)workflow/rules/annotation.smk(1 hunks)workflow/rules/common.smk(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-11T08:47:27.118Z
Learnt from: BiancaStoecker
Repo: snakemake-workflows/dna-seq-benchmark PR: 149
File: workflow/Snakefile:36-40
Timestamp: 2025-11-11T08:47:27.118Z
Learning: In the dna-seq-benchmark workflow, VCF files are sorted by a generic sort_vcf rule in workflow/rules/utils.smk that transforms {prefix}.vcf.gz to {prefix}.sorted.vcf.gz, so rules that produce VCF outputs don't include .sorted in their filenames - the sorting is handled as a separate downstream step.
Applied to files:
workflow/Snakefileworkflow/rules/annotation.smk
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Testing
🔇 Additional comments (4)
workflow/rules/common.smk (1)
738-741: LGTM!The tabix parameter logic correctly maps the reference genome to the appropriate column for REVEL score indexing.
workflow/Snakefile (2)
18-18: LGTM!The annotation rules are correctly included within the
variant-callsconfig block, making them conditionally available only when variant calls are configured.
33-41: LGTM!The output paths are correctly updated to reference the new annotated VCF outputs, which aligns with the annotation pipeline that takes sorted VCFs as input and produces annotated VCFs.
workflow/rules/annotation.smk (1)
76-99: Clarify theaccess.random()version requirement and confirm intentional parameter difference.The
access.random()function is available in Snakemake 7.17.1+ (the current min_version requirement), not exclusively an 8.x feature. The two annotation rules intentionally differ in their extra parameters:annotate_shared_fnincludes--sift b --polyphen bwhileannotate_unique_fp_fnomits these flags. Verify this difference aligns with your analysis requirements.
| rule process_revel_scores: | ||
| input: | ||
| "resources/revel_scores.zip", | ||
| output: | ||
| "resources/revel_scores.tsv.gz", | ||
| params: | ||
| build=get_reference_genome_build(), | ||
| log: | ||
| "logs/vep_plugins/process_revel_scores.log", | ||
| conda: | ||
| "../envs/htslib.yaml" | ||
| shell: | ||
| """ | ||
| tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX) | ||
| unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile | ||
| if [ "{params.build}" == "GRCh38" ] ; then | ||
| zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output} | ||
| elif [ "{params.build}" == "GRCh37" ] ; then | ||
| cat $tmpfile > {output} | ||
| else | ||
| echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log} | ||
| exit 125 | ||
| fi | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing resources declaration for tmpdir and temp file cleanup.
The shell script references {resources.tmpdir} but no resources block is declared. This will cause a runtime error. Also, the temporary file $tmpfile is not removed after use.
🔎 Proposed fix
rule process_revel_scores:
input:
"resources/revel_scores.zip",
output:
"resources/revel_scores.tsv.gz",
params:
build=get_reference_genome_build(),
log:
"logs/vep_plugins/process_revel_scores.log",
+ resources:
+ tmpdir=temp("tmpdir"),
conda:
"../envs/htslib.yaml"
shell:
"""
- tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX)
+ tmpfile=$(mktemp "${{TMPDIR:-/tmp}}"/revel_scores.XXXXXX)
+ trap "rm -f $tmpfile" EXIT
unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile
if [ "{params.build}" == "GRCh38" ] ; then
zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output}
elif [ "{params.build}" == "GRCh37" ] ; then
cat $tmpfile > {output}
else
echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log}
exit 125
fi
"""📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| rule process_revel_scores: | |
| input: | |
| "resources/revel_scores.zip", | |
| output: | |
| "resources/revel_scores.tsv.gz", | |
| params: | |
| build=get_reference_genome_build(), | |
| log: | |
| "logs/vep_plugins/process_revel_scores.log", | |
| conda: | |
| "../envs/htslib.yaml" | |
| shell: | |
| """ | |
| tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX) | |
| unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile | |
| if [ "{params.build}" == "GRCh38" ] ; then | |
| zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output} | |
| elif [ "{params.build}" == "GRCh37" ] ; then | |
| cat $tmpfile > {output} | |
| else | |
| echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log} | |
| exit 125 | |
| fi | |
| """ | |
| rule process_revel_scores: | |
| input: | |
| "resources/revel_scores.zip", | |
| output: | |
| "resources/revel_scores.tsv.gz", | |
| params: | |
| build=get_reference_genome_build(), | |
| log: | |
| "logs/vep_plugins/process_revel_scores.log", | |
| resources: | |
| tmpdir=temp("tmpdir"), | |
| conda: | |
| "../envs/htslib.yaml" | |
| shell: | |
| """ | |
| tmpfile=$(mktemp "${{TMPDIR:-/tmp}}"/revel_scores.XXXXXX) | |
| trap "rm -f $tmpfile" EXIT | |
| unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile | |
| if [ "{params.build}" == "GRCh38" ] ; then | |
| zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output} | |
| elif [ "{params.build}" == "GRCh37" ] ; then | |
| cat $tmpfile > {output} | |
| else | |
| echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log} | |
| exit 125 | |
| fi | |
| """ |
🤖 Prompt for AI Agents
In workflow/rules/annotation.smk around lines 37 to 60, the rule
process_revel_scores uses {resources.tmpdir} but no resources block is declared
and the temporary file $tmpfile is never cleaned up; add a resources:
tmpdir=<appropriate_tmpdir_path> declaration to the rule (or use resources:
tmpdir=configured_tmpdir variable) so Snakemake provides the directory, and
modify the shell block to create the temp file inside that tmpdir and remove it
on exit (use a trap to rm -f $tmpfile on EXIT or ensure rm -f $tmpfile at the
end of the script) so the temp file is always cleaned up even on errors.
…s/dna-seq-benchmark into feat/add-revel-scores
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (2)
workflow/rules/annotation.smk (2)
26-34: Add-Land--failflags to curl command.The curl command should include
-Lto follow redirects and--failto exit with error status on HTTP failures, ensuring robust downloads from Zenodo.🔎 Proposed fix
shell: - "curl https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}" + "curl -L --fail https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"
37-60: Missingresourcesblock causes runtime error.The shell script references
{resources.tmpdir}on line 50, but noresourcesblock is declared in the rule. This will cause a Snakemake runtime error. Additionally, the temporary file is not cleaned up on exit.🔎 Proposed fix
rule process_revel_scores: input: "resources/revel_scores.zip", output: "resources/revel_scores.tsv.gz", params: build=get_reference_genome_build(), log: "logs/vep_plugins/process_revel_scores.log", + resources: + tmpdir=config.get("tmpdir", "/tmp"), conda: "../envs/htslib.yaml" shell: """ - tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX) + tmpfile=$(mktemp -p {resources.tmpdir} revel_scores.XXXXXX) + trap 'rm -f "$tmpfile"' EXIT unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile if [ "{params.build}" == "GRCh38" ] ; then zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output} elif [ "{params.build}" == "GRCh37" ] ; then cat $tmpfile > {output} else echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log} exit 125 fi """
🧹 Nitpick comments (1)
workflow/rules/annotation.smk (1)
102-125: LGTM! SIFT and PolyPhen flags are correctly included.The rule structure is correct and now includes the
--sift b --polyphen bflags in theextraparameter (line 118) as intended per the PR objectives.Note: There's a minor trailing space after
"...polyphen b "at line 118 that can be trimmed (optional formatting nitpick).
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
workflow/envs/curl.yaml(1 hunks)workflow/rules/annotation.smk(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- workflow/envs/curl.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-11T08:47:27.118Z
Learnt from: BiancaStoecker
Repo: snakemake-workflows/dna-seq-benchmark PR: 149
File: workflow/Snakefile:36-40
Timestamp: 2025-11-11T08:47:27.118Z
Learning: In the dna-seq-benchmark workflow, VCF files are sorted by a generic sort_vcf rule in workflow/rules/utils.smk that transforms {prefix}.vcf.gz to {prefix}.sorted.vcf.gz, so rules that produce VCF outputs don't include .sorted in their filenames - the sorting is handled as a separate downstream step.
Applied to files:
workflow/rules/annotation.smk
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Testing
🔇 Additional comments (5)
workflow/rules/annotation.smk (5)
15-23: LGTM!Standard VEP plugins setup with appropriate wrapper usage.
12-12: No action needed. The snakemake-wrappers version v8.0.2 used throughout the workflow is valid and current.
76-99: No issues found. The helper functionget_plugin_aux()is correctly implemented inworkflow/rules/common.smk, the lambda function calls with arguments are proper, andaccess.random()is the correct Snakemake API for this resource access pattern. The trailing space in theextraparameter at line 92 can optionally be trimmed for consistency.
63-73: Rule structure and implementation are correct.The
get_tabix_revel_params()function inworkflow/rules/common.smk(lines 738-741) correctly returns build-appropriate tabix parameters for REVEL score indexing. It selects the correct column (2 for GRCh37, 3 otherwise) and uses appropriate tabix flags (-f -s 1 -b {column} -e {column}) for indexing the TSV file across different reference genomes.
1-12: Helper function is properly implemented and returns correct build strings.The
get_reference_genome_build()function inworkflow/rules/common.smkis correctly implemented. Wrapper version v8.0.2 exists and is available in the snakemake-wrappers repository. The function validates the configuration and returns the expected values:
- "GRCh37" for grch37 configuration
- "GRCh38" for grch38 configuration
The rule structure correctly passes this value to the VEP cache wrapper as the build parameter.
famosab
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments / questions :)
|
We still get this error: |
Co-authored-by: Famke Bäuerle <[email protected]>
Co-authored-by: Famke Bäuerle <[email protected]>
Co-authored-by: Famke Bäuerle <[email protected]>
Co-authored-by: Famke Bäuerle <[email protected]>
Co-authored-by: Famke Bäuerle <[email protected]>
Co-authored-by: Famke Bäuerle <[email protected]>
Co-authored-by: Famke Bäuerle <[email protected]>
Added a step to free disk space on Ubuntu before testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @.github/workflows/main.yml:
- Around line 45-59: The Free Disk Space step "Free Disk Space (Ubuntu)" has
inconsistent indentation under the with: block and may reference a non-existent
tag; fix by aligning the with: children (tool-cache, android, dotnet, haskell,
large-packages, swap-storage, docker-images) to use the same 8-space indentation
as other workflow steps, and verify that jlumbroso/[email protected] is a
valid released tag — if not, change the action reference to a stable branch like
`@main` or a valid release tag.
.github/workflows/main.yml
Outdated
| - name: Free Disk Space (Ubuntu) | ||
| uses: jlumbroso/[email protected] | ||
| with: | ||
| # this might remove tools that are actually needed, | ||
| # if set to "true" but frees about 6 GB | ||
| tool-cache: false | ||
|
|
||
| # all of these default to true, but feel free to set to | ||
| # "false" if necessary for your workflow | ||
| android: true | ||
| dotnet: true | ||
| haskell: true | ||
| large-packages: true | ||
| swap-storage: true | ||
| docker-images: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
cat -n .github/workflows/main.yml | sed -n '45,70p'Repository: snakemake-workflows/dna-seq-benchmark
Length of output: 1105
🌐 Web query:
jlumbroso/free-disk-space action latest version GitHub
💡 Result:
The repository has no released tags — use the main branch (jlumbroso/free-disk-space@main) as the latest. [1]
Sources:
[1] jlumbroso/free-disk-space (GitHub repository).
Good addition for accommodating VEP/REVEL resources.
This step appropriately frees disk space before running tests, which is necessary given the new VEP annotation features that download large caches. The configuration choices are sensible—keeping tool-cache and docker-images intact prevents breaking the workflow.
Two issues to address:
- Indentation under
with:uses 10 spaces (lines 48–59), but the rest of the file uses 8 spaces for similar blocks (lines 64–66). Align for consistency. - The action repository (
jlumbroso/free-disk-space) has no released tags. Verify thatv1.3.1is a valid version, or consider using@mainas the recommended alternative.
🤖 Prompt for AI Agents
In @.github/workflows/main.yml around lines 45 - 59, The Free Disk Space step
"Free Disk Space (Ubuntu)" has inconsistent indentation under the with: block
and may reference a non-existent tag; fix by aligning the with: children
(tool-cache, android, dotnet, haskell, large-packages, swap-storage,
docker-images) to use the same 8-space indentation as other workflow steps, and
verify that jlumbroso/[email protected] is a valid released tag — if not,
change the action reference to a stable branch like `@main` or a valid release
tag.
…/fn vcfs.
Summary by CodeRabbit
New Features
Chores
Tests
✏️ Tip: You can customize this high-level summary in your review settings.