The pipeline consolidates variant calls produced by multiple variant callers into a unified representation. Raw variant calls from each algorithm are first filtered and normalized independently, and then merged into a single candidate variant set.
This step ensures that variants detected by different algorithms are represented consistently and that equivalent variants reported by multiple callers are unified into a single record.
Each VCF file produced by the variant callers undergoes preprocessing before merging. This step standardizes variant representation and removes redundant records.
bcftools_PASS_norm_dedup.sh \
-i input.vcf.gz \
-f additional.vcf.gz \
-r reference.fasta
Arguments:
- -f: additional VCF file to merge with the main input (optional, can be specified multiple times).
The preprocessing stage performs the following operations:
- PASS filtering: retain variants that pass the filters generated by each caller.
- Variant selection: retain only single-nucleotide variants (SNVs) and small insertions or deletions (indels), excluding structural variants.
- Variant normalization: left-align and normalize variants relative to the reference genome.
- Variant atomization: decompose complex variants into primitive representations.
- Duplicate removal: remove redundant variant entries.
These operations are implemented using Bcftools.
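The operations listed above could map onto a chain of Bcftools commands roughly as follows. This is a minimal sketch and not the contents of bcftools_PASS_norm_dedup.sh: the helper names `build_preprocess_commands` and `as_shell_pipeline`, the output file name, and the exact choice and order of flags are assumptions made for illustration.

```python
import shlex


def build_preprocess_commands(input_vcf, reference_fasta, output_vcf):
    """Return bcftools commands (as argument lists) for one caller's VCF.

    Hypothetical helper: the real script may order or parameterize
    these steps differently.
    """
    # PASS filtering and variant selection: keep PASS records that are
    # SNVs or indels.
    view = ["bcftools", "view", "-f", "PASS", "-v", "snps,indels", input_vcf]
    # Normalization and atomization: left-align against the reference,
    # split multiallelic sites, and decompose complex variants.
    norm = ["bcftools", "norm", "-f", reference_fasta,
            "-m", "-any", "--atomize", "-"]
    # Duplicate removal: drop records that are exact duplicates after
    # normalization, then write bgzip-compressed output.
    dedup = ["bcftools", "norm", "-d", "exact", "-Oz", "-o", output_vcf, "-"]
    return [view, norm, dedup]


def as_shell_pipeline(commands):
    """Join the commands with pipes into one shell-quoted string."""
    return " | ".join(shlex.join(cmd) for cmd in commands)


commands = build_preprocess_commands(
    "input.vcf.gz", "reference.fasta", "input.norm.vcf.gz"
)
print(as_shell_pipeline(commands))
```

Building the commands as argument lists keeps each stage inspectable before anything is executed; the printed pipeline can be compared against the actual script.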
After preprocessing, the normalized VCF files generated by each variant caller are merged into a unified variant representation.
merge_callers.py \
-i TNhaplotyper2:tnhaplotyper2.vcf.gz \
-i Strelka2:strelka2.vcf.gz \
-i RUFUS:rufus.vcf.gz \
-i longcallD:longcalld.vcf.gz \
-s sample \
-o merged.vcf.gz
Arguments:
- -i: input VCF file specified as CALLER:VCF (can be provided multiple times). Supported callers are TNhaplotyper2, Strelka2, RUFUS, longcallD.
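To illustrate the CALLER:VCF convention, the sketch below parses repeated -i options with argparse. It is not taken from merge_callers.py: the function `parse_caller_spec`, the case-sensitive matching of caller names, and the error messages are assumptions.

```python
import argparse

# Caller names as listed in the documentation; exact-case matching is assumed.
SUPPORTED_CALLERS = {"TNhaplotyper2", "Strelka2", "RUFUS", "longcallD"}


def parse_caller_spec(spec):
    """Split a CALLER:VCF string into (caller, path); reject unknown callers."""
    caller, sep, path = spec.partition(":")
    if not sep or not path:
        raise argparse.ArgumentTypeError(f"expected CALLER:VCF, got {spec!r}")
    if caller not in SUPPORTED_CALLERS:
        raise argparse.ArgumentTypeError(f"unsupported caller: {caller}")
    return caller, path


parser = argparse.ArgumentParser()
# action="append" lets -i be given once per caller.
parser.add_argument("-i", action="append", type=parse_caller_spec, required=True)
parser.add_argument("-s")
parser.add_argument("-o")

args = parser.parse_args([
    "-i", "Strelka2:strelka2.vcf.gz",
    "-i", "RUFUS:rufus.vcf.gz",
    "-s", "sample",
    "-o", "merged.vcf.gz",
])
print(args.i)
```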
The merging step consolidates variants reported by multiple callers into a single record when they share the same genomic position and allele representation (CHROM, POS, REF, ALT).
For each merged variant, the pipeline records which algorithms detected the variant.
This information is stored in the INFO field:
CALLERS=Strelka2,TNhaplotyper2
This annotation allows downstream filtering steps to evaluate support for each variant across independent algorithms.
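The merge rule described above can be sketched as a dictionary keyed on (CHROM, POS, REF, ALT). This is an illustrative sketch under stated assumptions, not the code in merge_callers.py: the functions `merge_calls` and `callers_info`, the in-memory tuples, and the alphabetical ordering of caller names are all hypothetical.

```python
def merge_calls(calls_by_caller):
    """Merge per-caller records keyed on (CHROM, POS, REF, ALT).

    calls_by_caller maps a caller name to an iterable of
    (chrom, pos, ref, alt) tuples. Returns a dict mapping each key to
    the sorted list of callers that reported it.
    """
    merged = {}
    for caller, records in calls_by_caller.items():
        for key in records:
            merged.setdefault(key, set()).add(caller)
    # Sorting the caller names is an assumption made for stable output.
    return {key: sorted(callers) for key, callers in merged.items()}


def callers_info(callers):
    """Format the caller list as a CALLERS INFO entry."""
    return "CALLERS=" + ",".join(callers)


# Made-up records for illustration only.
calls = {
    "TNhaplotyper2": [("chr1", 12345, "A", "T"), ("chr2", 500, "G", "GA")],
    "Strelka2": [("chr1", 12345, "A", "T")],
}
merged = merge_calls(calls)
print(callers_info(merged[("chr1", 12345, "A", "T")]))
# prints: CALLERS=Strelka2,TNhaplotyper2
```

Because the key includes REF and ALT, two callers are only unified when they report the same allele representation, which is why normalization has to happen before merging.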
During merging, the pipeline reconstructs the VCF header to ensure consistency across callers. The process includes:
- Rebuilding ##contig definitions
- Adding standardized INFO and FILTER definitions
- Adding the sample identifier to the header (##SAMPLE=<ID=...>)
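As a rough illustration of the header steps listed above, the sketch below assembles the meta-information lines of a VCF header. The function `build_meta_lines`, the VCF version, the contig names and lengths, and the Description strings are assumptions; the actual definitions written by the pipeline may differ, and the column header line is left out.

```python
def build_meta_lines(contigs, sample_id):
    """Build VCF meta-information lines (hypothetical helper).

    contigs is a list of (name, length) pairs.
    """
    lines = ["##fileformat=VCFv4.2"]  # version is an assumption
    # Rebuild ##contig definitions from a single source of truth.
    for name, length in contigs:
        lines.append(f"##contig=<ID={name},length={length}>")
    # Standardized INFO and FILTER definitions (descriptions are illustrative).
    lines.append(
        '##INFO=<ID=CALLERS,Number=.,Type=String,'
        'Description="Callers that reported the variant">'
    )
    lines.append('##FILTER=<ID=PASS,Description="All filters passed">')
    # Sample identifier.
    lines.append(f"##SAMPLE=<ID={sample_id}>")
    return lines


header = build_meta_lines([("chr1", 248956422), ("chr2", 242193529)], "sample")
print("\n".join(header))
```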
After validation, variants are annotated based on the level of cross-technology support observed across sequencing platforms.
Variants are labeled CrossCaller when they are independently detected by two or more somatic callers (listed in the CALLERS INFO field).
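The CrossCaller rule can be sketched as a count over the CALLERS entry of the INFO field. This is a minimal sketch: the functions `parse_callers` and `is_cross_caller` are hypothetical, every listed caller is assumed to count toward the threshold, and how the pipeline actually stores the label is not shown.

```python
def parse_callers(info):
    """Extract the caller list from a VCF INFO string."""
    for field in info.split(";"):
        if field.startswith("CALLERS="):
            return [c for c in field[len("CALLERS="):].split(",") if c]
    return []


def is_cross_caller(info, min_callers=2):
    """Return True when at least min_callers distinct callers are listed."""
    return len(set(parse_callers(info))) >= min_callers


print(is_cross_caller("CALLERS=Strelka2,TNhaplotyper2"))  # prints: True
print(is_cross_caller("DP=10;CALLERS=RUFUS"))             # prints: False
```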
All the relevant code can be accessed in the GitHub repository:
- bcftools_PASS_norm_dedup.sh [Bcftools]
- merge_callers.py [merge_callers]