Skip to content

Conversation

@BiancaStoecker
Copy link
Collaborator

@BiancaStoecker BiancaStoecker commented Dec 19, 2025

…/fn vcfs.

Summary by CodeRabbit

  • New Features

    • Added VEP+REVEL annotation to the pipeline, producing annotated FP/FN VCFs and HTML stats for shared and unique callsets; output paths now reflect annotated VCFs.
    • Added helpers to support REVEL tabix indexing and plugin-aware resource resolution.
  • Chores

    • Updated workflow environments (curl upgraded; new htslib/unzip environment) and added tabix indexing support for REVEL scores.
  • Tests

    • Added a pre-test free-disk-space check in the CI testing job.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 19, 2025

📝 Walkthrough

Walkthrough

Adds VEP+REVEL annotation rules and helper functions, updates FP/FN VCF output paths to use annotated_vcf, adds/updates Conda envs for curl and htslib, and inserts a pre-test "Free Disk Space" CI step. (33 words)

Changes

Cohort / File(s) Summary
Workflow entry
workflow/Snakefile
Includes rules/annotation.smk and updates FP/FN outputs from results/fp-fn/vcf/.../*.vcf.gz to results/fp-fn/annotated_vcf/.../*.annotated.vcf.gz.
Annotation rules
workflow/rules/annotation.smk
New rules: get_vep_cache, get_vep_plugins, download_revel, process_revel_scores, tabix_revel_scores, annotate_shared_fn, annotate_unique_fp_fn — VEP + REVEL integration, per-build branching, tabix indexing, logs, and wrappers.
Shared helpers
workflow/rules/common.smk
Added get_tabix_revel_params() (selects tabix column by config["reference-genome"]) and get_plugin_aux(plugin, index=False) (resolve REVEL resource path / index).
Conda environments
workflow/envs/curl.yaml, workflow/envs/htslib.yaml
curl.yaml updated to curl=8.17.0; new htslib.yaml added with htslib=1.12, unzip=6.0, channels conda-forge and bioconda.
CI workflow
.github/workflows/main.yml
Adds a "Free Disk Space (Ubuntu)" action step to the Testing job before tests run.

Sequence Diagram

sequenceDiagram
    participant FP_FN as FP/FN VCFs
    participant Cache as VEP Cache/Plugins
    participant Revel as REVEL data
    participant VEP as VEP Annotator
    participant Output as Annotated VCF + stats

    rect rgba(230,240,255,0.5)
    Cache->>Cache: get_vep_cache
    Cache->>Cache: get_vep_plugins
    Revel->>Revel: download_revel
    Revel->>Revel: process_revel_scores (GRCh37/GRCh38)
    Revel->>Revel: tabix_revel_scores
    end

    rect rgba(230,255,230,0.5)
    FP_FN->>VEP: annotate_shared_fn / annotate_unique_fp_fn (input VCFs)
    Cache->>VEP: provide VEP cache & plugins
    Revel->>VEP: provide REVEL TSV + index
    VEP->>Output: produce `.annotated.vcf.gz` + `.html` stats
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • famosab
  • johanneskoester

Poem

🐰 I hopped through VCFs at dawn's first light,
fetched caches, zipped scores, and set them right.
REVEL whispered secrets into VEP's ear,
annotated files now bounce with cheer. ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: adding VEP annotation with REVEL, SIFT, and PolyPhen scores to false positive/false negative VCFs, which is clearly reflected in the Snakefile changes and new annotation rules.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (2)
workflow/rules/common.smk (1)

744-748: Type inconsistency: returns string vs empty list.

get_plugin_aux returns a string when plugin == "REVEL" but an empty list [] otherwise. In Snakemake, this mixed return type can work for inputs, but consider returning an empty string or consistently using lists for clarity.

🔎 Proposed fix for consistent return type
 def get_plugin_aux(plugin, index=False):
     if plugin == "REVEL":
         suffix = ".tbi" if index else ""
         return "resources/revel_scores.tsv.gz{suffix}".format(suffix=suffix)
-    return []
+    return ""
workflow/envs/htslib.yaml (1)

1-6: Consider updating htslib to version 1.22.1.

htslib 1.12 was released in March 2021. The latest available version in bioconda is 1.22.1, which includes bug fixes and performance improvements. Updating is straightforward and poses minimal compatibility risk for tabix/bgzip operations.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d820558 and 30894c7.

📒 Files selected for processing (5)
  • workflow/Snakefile (2 hunks)
  • workflow/envs/curl.yaml (1 hunks)
  • workflow/envs/htslib.yaml (1 hunks)
  • workflow/rules/annotation.smk (1 hunks)
  • workflow/rules/common.smk (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-11T08:47:27.118Z
Learnt from: BiancaStoecker
Repo: snakemake-workflows/dna-seq-benchmark PR: 149
File: workflow/Snakefile:36-40
Timestamp: 2025-11-11T08:47:27.118Z
Learning: In the dna-seq-benchmark workflow, VCF files are sorted by a generic sort_vcf rule in workflow/rules/utils.smk that transforms {prefix}.vcf.gz to {prefix}.sorted.vcf.gz, so rules that produce VCF outputs don't include .sorted in their filenames - the sorting is handled as a separate downstream step.

Applied to files:

  • workflow/Snakefile
  • workflow/rules/annotation.smk
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Testing
🔇 Additional comments (4)
workflow/rules/common.smk (1)

738-741: LGTM!

The tabix parameter logic correctly maps the reference genome to the appropriate column for REVEL score indexing.

workflow/Snakefile (2)

18-18: LGTM!

The annotation rules are correctly included within the variant-calls config block, making them conditionally available only when variant calls are configured.


33-41: LGTM!

The output paths are correctly updated to reference the new annotated VCF outputs, which aligns with the annotation pipeline that takes sorted VCFs as input and produces annotated VCFs.

workflow/rules/annotation.smk (1)

76-99: Clarify the access.random() version requirement and confirm intentional parameter difference.

The access.random() function is available in Snakemake 7.17.1+ (the current min_version requirement), not exclusively an 8.x feature. The two annotation rules intentionally differ in their extra parameters: annotate_shared_fn includes --sift b --polyphen b while annotate_unique_fp_fn omits these flags. Verify this difference aligns with your analysis requirements.

Comment on lines 37 to 60
rule process_revel_scores:
input:
"resources/revel_scores.zip",
output:
"resources/revel_scores.tsv.gz",
params:
build=get_reference_genome_build(),
log:
"logs/vep_plugins/process_revel_scores.log",
conda:
"../envs/htslib.yaml"
shell:
"""
tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX)
unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile
if [ "{params.build}" == "GRCh38" ] ; then
zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output}
elif [ "{params.build}" == "GRCh37" ] ; then
cat $tmpfile > {output}
else
echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log}
exit 125
fi
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Missing resources declaration for tmpdir and temp file cleanup.

The shell script references {resources.tmpdir} but no resources block is declared. This will cause a runtime error. Also, the temporary file $tmpfile is not removed after use.

🔎 Proposed fix
 rule process_revel_scores:
     input:
         "resources/revel_scores.zip",
     output:
         "resources/revel_scores.tsv.gz",
     params:
         build=get_reference_genome_build(),
     log:
         "logs/vep_plugins/process_revel_scores.log",
+    resources:
+        tmpdir=temp("tmpdir"),
     conda:
         "../envs/htslib.yaml"
     shell:
         """
-        tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX)
+        tmpfile=$(mktemp "${{TMPDIR:-/tmp}}"/revel_scores.XXXXXX)
+        trap "rm -f $tmpfile" EXIT
         unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile
         if [ "{params.build}" == "GRCh38" ] ; then
             zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output}
         elif [ "{params.build}" == "GRCh37" ] ; then
             cat $tmpfile > {output}
         else
             echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log}
             exit 125
         fi
         """
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
rule process_revel_scores:
input:
"resources/revel_scores.zip",
output:
"resources/revel_scores.tsv.gz",
params:
build=get_reference_genome_build(),
log:
"logs/vep_plugins/process_revel_scores.log",
conda:
"../envs/htslib.yaml"
shell:
"""
tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX)
unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile
if [ "{params.build}" == "GRCh38" ] ; then
zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output}
elif [ "{params.build}" == "GRCh37" ] ; then
cat $tmpfile > {output}
else
echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log}
exit 125
fi
"""
rule process_revel_scores:
input:
"resources/revel_scores.zip",
output:
"resources/revel_scores.tsv.gz",
params:
build=get_reference_genome_build(),
log:
"logs/vep_plugins/process_revel_scores.log",
resources:
tmpdir=temp("tmpdir"),
conda:
"../envs/htslib.yaml"
shell:
"""
tmpfile=$(mktemp "${{TMPDIR:-/tmp}}"/revel_scores.XXXXXX)
trap "rm -f $tmpfile" EXIT
unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile
if [ "{params.build}" == "GRCh38" ] ; then
zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output}
elif [ "{params.build}" == "GRCh37" ] ; then
cat $tmpfile > {output}
else
echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log}
exit 125
fi
"""
🤖 Prompt for AI Agents
In workflow/rules/annotation.smk around lines 37 to 60, the rule
process_revel_scores uses {resources.tmpdir} but no resources block is declared
and the temporary file $tmpfile is never cleaned up; add a resources:
tmpdir=<appropriate_tmpdir_path> declaration to the rule (or use resources:
tmpdir=configured_tmpdir variable) so Snakemake provides the directory, and
modify the shell block to create the temp file inside that tmpdir and remove it
on exit (use a trap to rm -f $tmpfile on EXIT or ensure rm -f $tmpfile at the
end of the script) so the temp file is always cleaned up even on errors.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (2)
workflow/rules/annotation.smk (2)

26-34: Add -L and --fail flags to curl command.

The curl command should include -L to follow redirects and --fail to exit with error status on HTTP failures, ensuring robust downloads from Zenodo.

🔎 Proposed fix
     shell:
-        "curl https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"
+        "curl -L --fail https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"

37-60: Missing resources block causes runtime error.

The shell script references {resources.tmpdir} on line 50, but no resources block is declared in the rule. This will cause a Snakemake runtime error. Additionally, the temporary file is not cleaned up on exit.

🔎 Proposed fix
 rule process_revel_scores:
     input:
         "resources/revel_scores.zip",
     output:
         "resources/revel_scores.tsv.gz",
     params:
         build=get_reference_genome_build(),
     log:
         "logs/vep_plugins/process_revel_scores.log",
+    resources:
+        tmpdir=config.get("tmpdir", "/tmp"),
     conda:
         "../envs/htslib.yaml"
     shell:
         """
-        tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX)
+        tmpfile=$(mktemp -p {resources.tmpdir} revel_scores.XXXXXX)
+        trap 'rm -f "$tmpfile"' EXIT
         unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile
         if [ "{params.build}" == "GRCh38" ] ; then
             zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output}
         elif [ "{params.build}" == "GRCh37" ] ; then
             cat $tmpfile > {output}
         else
             echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log}
             exit 125
         fi
         """
🧹 Nitpick comments (1)
workflow/rules/annotation.smk (1)

102-125: LGTM! SIFT and PolyPhen flags are correctly included.

The rule structure is correct and now includes the --sift b --polyphen b flags in the extra parameter (line 118) as intended per the PR objectives.

Note: There's a minor trailing space after "...polyphen b " at line 118 that can be trimmed (optional formatting nitpick).

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30894c7 and ba99965.

📒 Files selected for processing (2)
  • workflow/envs/curl.yaml (1 hunks)
  • workflow/rules/annotation.smk (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • workflow/envs/curl.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-11T08:47:27.118Z
Learnt from: BiancaStoecker
Repo: snakemake-workflows/dna-seq-benchmark PR: 149
File: workflow/Snakefile:36-40
Timestamp: 2025-11-11T08:47:27.118Z
Learning: In the dna-seq-benchmark workflow, VCF files are sorted by a generic sort_vcf rule in workflow/rules/utils.smk that transforms {prefix}.vcf.gz to {prefix}.sorted.vcf.gz, so rules that produce VCF outputs don't include .sorted in their filenames - the sorting is handled as a separate downstream step.

Applied to files:

  • workflow/rules/annotation.smk
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Testing
🔇 Additional comments (5)
workflow/rules/annotation.smk (5)

15-23: LGTM!

Standard VEP plugins setup with appropriate wrapper usage.


12-12: No action needed. The snakemake-wrappers version v8.0.2 used throughout the workflow is valid and current.


76-99: No issues found. The helper function get_plugin_aux() is correctly implemented in workflow/rules/common.smk, the lambda function calls with arguments are proper, and access.random() is the correct Snakemake API for this resource access pattern. The trailing space in the extra parameter at line 92 can optionally be trimmed for consistency.


63-73: Rule structure and implementation are correct.

The get_tabix_revel_params() function in workflow/rules/common.smk (lines 738-741) correctly returns build-appropriate tabix parameters for REVEL score indexing. It selects the correct column (2 for GRCh37, 3 otherwise) and uses appropriate tabix flags (-f -s 1 -b {column} -e {column}) for indexing the TSV file across different reference genomes.


1-12: Helper function is properly implemented and returns correct build strings.

The get_reference_genome_build() function in workflow/rules/common.smk is correctly implemented. Wrapper version v8.0.2 exists and is available in the snakemake-wrappers repository. The function validates the configuration and returns the expected values:

  • "GRCh37" for grch37 configuration
  • "GRCh38" for grch38 configuration

The rule structure correctly passes this value to the VEP cache wrapper as the build parameter.

Copy link
Collaborator

@famosab famosab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments / questions :)

@famosab
Copy link
Collaborator

famosab commented Jan 16, 2026

We still get this error:

[E::easy_errno] Libcurl reported error 78 (Remote file not found)
[E::easy_errno] Libcurl reported error 78 (Remote file not found)

BiancaStoecker and others added 8 commits January 20, 2026 10:17
Co-authored-by: Famke Bäuerle <[email protected]>
Co-authored-by: Famke Bäuerle <[email protected]>
Added a step to free disk space on Ubuntu before testing.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @.github/workflows/main.yml:
- Around line 45-59: The Free Disk Space step "Free Disk Space (Ubuntu)" has
inconsistent indentation under the with: block and may reference a non-existent
tag; fix by aligning the with: children (tool-cache, android, dotnet, haskell,
large-packages, swap-storage, docker-images) to use the same 8-space indentation
as other workflow steps, and verify that jlumbroso/[email protected] is a
valid released tag — if not, change the action reference to a stable branch like
`@main` or a valid release tag.

Comment on lines 45 to 59
- name: Free Disk Space (Ubuntu)
uses: jlumbroso/[email protected]
with:
# this might remove tools that are actually needed,
# if set to "true" but frees about 6 GB
tool-cache: false

# all of these default to true, but feel free to set to
# "false" if necessary for your workflow
android: true
dotnet: true
haskell: true
large-packages: true
swap-storage: true
docker-images: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n .github/workflows/main.yml | sed -n '45,70p'

Repository: snakemake-workflows/dna-seq-benchmark

Length of output: 1105


🌐 Web query:

jlumbroso/free-disk-space action latest version GitHub

💡 Result:

The repository has no released tags — use the main branch (jlumbroso/free-disk-space@main) as the latest. [1]

Sources:
[1] jlumbroso/free-disk-space (GitHub repository).


Good addition for accommodating VEP/REVEL resources.

This step appropriately frees disk space before running tests, which is necessary given the new VEP annotation features that download large caches. The configuration choices are sensible—keeping tool-cache and docker-images intact prevents breaking the workflow.

Two issues to address:

  1. Indentation under with: uses 10 spaces (lines 48–59), but the rest of the file uses 8 spaces for similar blocks (lines 64–66). Align for consistency.
  2. The action repository (jlumbroso/free-disk-space) has no released tags. Verify that v1.3.1 is a valid version, or consider using @main as the recommended alternative.
🤖 Prompt for AI Agents
In @.github/workflows/main.yml around lines 45 - 59, The Free Disk Space step
"Free Disk Space (Ubuntu)" has inconsistent indentation under the with: block
and may reference a non-existent tag; fix by aligning the with: children
(tool-cache, android, dotnet, haskell, large-packages, swap-storage,
docker-images) to use the same 8-space indentation as other workflow steps, and
verify that jlumbroso/[email protected] is a valid released tag — if not,
change the action reference to a stable branch like `@main` or a valid release
tag.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants