Description
When --normalize_vcfs is enabled, sarek runs bcftools norm --multiallelics -both --rm-dup all. Because -m -both splits a multiallelic record into per-allele rows at the same position, the subsequent --rm-dup all (which de-duplicates by position) deletes all but the first row — silently dropping a real ALT allele at heterozygous two-alt (1/2) sites. This can lose genuine variants (e.g. compound-heterozygous-relevant calls).
Steps to reproduce
-profile test,docker --tools deepvariant --skip_tools baserecalibrator --normalize_vcfs --filter_vcfs
At chr22:13575, DeepVariant calls G → C,T with genotype 1/2. Running bcftools norm on the filtered VCF three ways:
| bcftools norm args | records | site 13575 |
| --- | --- | --- |
| -m -both --rm-dup all (current sarek) | 17 | only G→C; G→T lost |
| -m -both --rm-dup none | 18 | G→C and G→T |
| -m -both --rm-dup exact | 18 | G→C and G→T |
Root cause
conf/modules/post_variant_calling.config, withName: 'VCFS_NORM':
ext.args = { [
'--multiallelics -both',
'--rm-dup all' // comment: "output only the first instance of a record which is present multiple times"
].join(' ') }
The comment's stated intent — drop records that are identical — corresponds to --rm-dup exact, not all. The all mode removes by position, which collides with the per-allele rows produced by -m -both.
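To make the collision concrete, here is a toy model in shell. It is not bcftools itself: awk de-duplicates a small tab-separated table (CHROM, POS, REF, ALT), once keyed on position only as a stand-in for --rm-dup all, and once keyed on the full record as a stand-in for --rm-dup exact. The function names and the rows are illustrative assumptions; the first two rows mimic what -m -both produces at chr22:13575, and the last two are a truly identical duplicate at a made-up position.

```shell
#!/bin/sh
# Toy rows as they might look after `-m -both` splitting (hypothetical data).
# Columns: CHROM  POS  REF  ALT
split_rows() {
    printf 'chr22\t13575\tG\tC\n'
    printf 'chr22\t13575\tG\tT\n'
    printf 'chr22\t20000\tA\tG\n'
    printf 'chr22\t20000\tA\tG\n'   # a truly identical duplicate
}

# Position-keyed de-duplication: keep only the first row per CHROM+POS.
dedup_by_position() {
    awk -F '\t' '!seen[$1 FS $2]++'
}

# Exact de-duplication: keep only the first row per CHROM+POS+REF+ALT.
dedup_exact() {
    awk -F '\t' '!seen[$1 FS $2 FS $3 FS $4]++'
}

echo "by position:"; split_rows | dedup_by_position
echo "exact:";       split_rows | dedup_exact
```

In this model the position-keyed pass keeps G→C and discards G→T at 13575, while the exact pass keeps both alleles and still collapses the identical pair at 20000.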
Proposed fix
Change --rm-dup all → --rm-dup exact. Verified end-to-end on the test profile: the normalized VCF then keeps both alleles (18 vs 17 records) while still removing truly identical duplicate records.
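A rough way to sanity-check the change beyond a single site: after -m -both, every ALT allele in the input should end up as its own record unless it was a genuinely identical duplicate. The sketch below counts ALT alleles in VCF body lines read from stdin. The function name count_alt_alleles and the file names in the usage note are hypothetical, and the count includes any symbolic alleles present in ALT.

```shell
#!/bin/sh
# Count ALT alleles in tab-separated VCF body lines read from stdin.
# Column 5 is ALT; multiallelic records list their alleles separated by commas.
count_alt_alleles() {
    awk -F '\t' '
        /^#/ { next }                                   # skip header lines
        $5 != "" && $5 != "." { total += split($5, alleles, ",") }
        END { print total + 0 }                         # prints 0 on empty input
    '
}
```

Possible usage, assuming bcftools is on the PATH: compare `bcftools view -H filtered.vcf.gz | count_alt_alleles` with `bcftools view -H normalized.vcf.gz | wc -l`. If the normalized record count is lower and the input contained no identical duplicate records, alleles were probably dropped during normalization.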
Environment
- sarek 3.8.1
- bcftools 1.21
- Nextflow 26.04.4
References
--rm-dup behavior on multiallelic sites