vcftoolbox#226

Merged
bgruening merged 50 commits into galaxyecology:master from GINAMO-EBVs:master
Mar 23, 2026

Conversation

@GINAMO-EBVs
Contributor

VCF Toolbox is a suite of tools for filtering and subsetting VCF files used in population genomics. The toolbox allows users to filter SNPs by read depth, genotype quality, missing data, heterozygosity, and minor allele count; extract or remove individuals; split VCFs by population; and perform subsampling.

@bgruening
Collaborator

@GINAMO-EBVs thanks a lot!

Have you seen that bcftools is already available in Galaxy? Is there any functionality missing?

See https://github.com/galaxyproject/tools-iuc/tree/main/tools/bcftools. There are also many other VCF-related tools already in the Galaxy Tool Shed.

@yvanlebras
Contributor

Woohoo! Hi Laura and Ginamo team! THANK you for initiating the PR! AFAIK you have already checked the existing tools that deal with VCF, but maybe we can dive into it together! With @PaulineSGN, we will test your tools and also investigate, on our side, the potential overlaps with existing tools!

@GINAMO-EBVs
Contributor Author

@bgruening
Hi,

I am aware that bcftools is already available on Galaxy. The tools I propose are easier to use and provide a summary of the different filters applied, in addition to retaining the input file name. I have made a few changes so that all the filters are combined in a single tool. I have also removed one tool that overlapped with one of the bcftools tools.

Comment thread tools/vcftoolbox/split_vcf_by_pop.xml Outdated
Collaborator

@bgruening bgruening left a comment

Comment thread tools/vcftoolbox/split_vcf_by_pop.xml Outdated
Collaborator

@bgruening bgruening left a comment

  • please add <required_files>
  • bcf is not supported as input?

Comment thread tools/vcftoolbox/split_vcf_by_pop.xml Outdated
- <required_files> add
- bcf not supported because vcftools is used for certain parts
- citation type=doi
- macros with citation, requirement, input
- change pattern discover_dataset
- update indexation
Comment thread tools/vcftoolbox/VCF_filtering.xml
Comment thread tools/vcftoolbox/VCF_keep_remove_ind.xml Outdated
Comment thread tools/vcftoolbox/VCF_subsampled.xml Outdated
- remove count="1"
- correction help VCF_keep_remove
@bgruening
Collaborator

@GINAMO-EBVs not sure if relevant for you but maybe something is useful in this restructured bash file:

#!/usr/bin/env bash

set -euo pipefail

die() {
    echo "ERROR: $*" >&2
    exit 1
}

usage() {
    echo "Usage: $0 <input.vcf> <vcf_name> <keep|remove> <individual_list>" >&2
    exit 1
}

# Use `awk` for counting so the whole stream is consumed and `pipefail`
# still surfaces any upstream `bcftools` errors.
count_variants() {
    bcftools view -H "$1" | awk 'END { print NR }'
}

count_individuals() {
    bcftools query -l "$1" | awk 'END { print NR }'
}

vcf_input="${1-}"
vcf_name="${2-}"
action="${3-}"
list_ind="${4-}"

readonly output_dir="vcf_directory"

[[ $# -eq 4 ]] || usage
command -v bcftools >/dev/null 2>&1 || die "bcftools is not installed or not in PATH."
[[ -n "$vcf_name" ]] || die "VCF name is not provided."
[[ -f "$vcf_input" ]] || die "Input VCF was not found: $vcf_input"
[[ -f "$list_ind" ]] || die "Input list of individuals was not found: $list_ind"

case "$action" in
    keep|remove) ;;
    *) die "Action must be 'keep' or 'remove', got: $action" ;;
esac

# Create the destination directory here so callers do not have to pre-create it.
mkdir -p -- "$output_dir"

input_variant_count="$(count_variants "$vcf_input")"
(( input_variant_count > 0 )) || die "Input VCF contains no variant records."

##### Build output filename #####
# Strip the extension first, then prefer the trailing `(label)` when present.
name_without_ext="$(basename -- "$vcf_name")"
name_without_ext="${name_without_ext%.vcf.gz}"
name_without_ext="${name_without_ext%.vcf}"

regex='\(([^)]+)\)[[:space:]]*$'
if [[ "$name_without_ext" =~ $regex ]]; then
    base_name="${BASH_REMATCH[1]}"
else
    base_name="$name_without_ext"
fi

[[ -n "$base_name" ]] || die "Could not derive a valid output filename from: $vcf_name"

output_file="${output_dir}/${base_name}.vcf"

##### Main execution #####
if [[ "$action" == "keep" ]]; then
    echo "Keeping individuals listed"
    bcftools view -S "$list_ind" "$vcf_input" -o "$output_file" --force-samples
else
    echo "Removing individuals listed"
    bcftools view -S "^${list_ind}" "$vcf_input" -o "$output_file" --force-samples
fi

##### Verify that filtered VCF is not empty #####
[[ -f "$output_file" ]] || die "Output VCF was not created: $output_file"

output_variant_count="$(count_variants "$output_file")"
(( output_variant_count > 0 )) || die "Filtered VCF contains no variants."

##### Summary #####
n_ind_b="$(count_individuals "$vcf_input")"
n_ind_a="$(count_individuals "$output_file")"

echo "Individuals before: ${n_ind_b}"
echo "Individuals after: ${n_ind_a}"
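
The comment on `count_variants` relies on `set -o pipefail`: with that option, a pipeline's exit status is that of the last command that failed, so a failing producer is not hidden by a successful `awk`. A minimal sketch of that behaviour follows. `failing_producer` and `count_records` are hypothetical names, and `failing_producer` merely stands in for a `bcftools` call that emits some records and then fails.

```bash
#!/usr/bin/env bash
set -o pipefail

# Hypothetical stand-in for a bcftools call that emits two records, then fails.
failing_producer() {
    printf 'rec1\nrec2\n'
    return 3
}

# Same counting pattern as count_variants in the script above:
# awk reads the whole stream and prints the number of lines seen.
count_records() {
    failing_producer | awk 'END { print NR }'
}

# Capture both the count and the pipeline's exit status.
status=0
count="$(count_records)" || status=$?
echo "count=${count} status=${status}"
```

Here `awk` still reports the two lines it received, but the non-zero status of the producer is preserved, so a caller running under `set -e` would stop instead of continuing with a possibly truncated count.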

@GINAMO-EBVs
Contributor Author

@bgruening Thank you for your suggestion. I applied it to VCF_keep_remove.sh and, partially, to the other scripts.
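
If the same output-naming logic is needed in more than one script, one option is to keep it in a single function. A minimal sketch, assuming the same rules as the script above (strip `.vcf.gz` or `.vcf`, then prefer a trailing `(label)` when present); the function name `derive_base_name` and the example dataset names are hypothetical:

```bash
#!/usr/bin/env bash

# Hypothetical helper: derive an output base name from a dataset name.
derive_base_name() {
    local name regex='\(([^)]+)\)[[:space:]]*$'
    name="$(basename -- "$1")"
    # Strip the extension first so a trailing "(label)" ends the string.
    name="${name%.vcf.gz}"
    name="${name%.vcf}"
    if [[ "$name" =~ $regex ]]; then
        # Prefer the text inside the trailing parentheses.
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        printf '%s\n' "$name"
    fi
}

derive_base_name "cohort.vcf.gz"
derive_base_name "Filtered data (pop_A).vcf"
```

The first call falls back to the name without its extension, and the second picks up the label inside the trailing parentheses.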

Collaborator

@bgruening bgruening left a comment

Cool, glad it was useful, my bash is a bit rusty :)


<required_files>
<include path="VCF_keep_remove_ind.R" />
<include path="VCF_keep_remove_ind_v2.sh" />
Collaborator

the v2 is not called in this tool?

Contributor Author

Sorry, a backup that didn't work properly – there isn't a v2

@bgruening
Collaborator

Please look at this page for a summary of the failing test: https://github.com/galaxyecology/tools-ecology/actions/runs/23291952174?pr=226

Collaborator

@bgruening bgruening left a comment

@GINAMO-EBVs thanks a lot. I think those are my final comments. Thanks a lot.

Comment thread tools/vcftoolbox/.shed.yml Outdated
Comment thread tools/vcftoolbox/split_vcf_by_pop.xml Outdated
Change name of tools adding population_genomics as prefix
@bgruening bgruening merged commit 02f6cba into galaxyecology:master Mar 23, 2026
12 checks passed
@bgruening
Collaborator

Thank you!!!
