PanSVMerger-A-Pipeline-for-Merging-Multi-Allele-Structural-Variants

PanSVMerger is a bioinformatics pipeline designed to merge multi-allele structural variants (SVs) located at the same genomic position based on their sequence similarity and length. This tool aims to provide a reliable and efficient method for consolidating SV data, which is crucial for accurate genomic analysis and interpretation.

Pipeline-Description

vcfwave: Decomposes complex SVs from the original MC VCF file, resolving nested bubble structures by realigning reference and alternate alleles to parse out primitive alleles
vcfcreatemulti: Reconstructs non-overlapping multi-allelic sites by removing nested alleles
vsearch: Clusters alternative alleles based on sequence similarity, further divides clusters when sequence length differences exceed 50bp
genotype updating: Reassigns genotypes based on cluster membership

why-PanSVMerger

When working with Asian-Pan-Graph, we observed a high proportion of multi-allelic sites at single genomic loci. While existing tools like:

could partially merge these alleles, they still left a significant number of multi-allelic sites unresolved.

To address this, we developed PanSVMerger - a specialized method that:

Leverages the structure of APG-pan graphs
Implements sequence similarity-based clustering
Incorporates length difference metrics
Provides optimized allele merging for complex variation patterns

This approach significantly reduces fragmented multi-allelic representations while maintaining biological accuracy of the variation.

Requirements-and-Install

vcflib
pysam (stable version)
vsearch

Usage

Prepare-non-overlapping-multiallelic-VCF-file

Due to the complexity of variants called from graph, particulally those complex regions with nested bubbles, VCF file genereated by MC still contains tricky and problematic sites although after "poping" using vcfbubble. As Garrison et al. suggested, we should first realign reference and alternate alleles to parse out the original "primitive" alleles into multiple records and next put them together again, which would be provided for PanSVMerger to consolidate SVs based on similarity and length (see below).

vcfwave -t16 MC.vcf.gz (-L 100000) > MC.wave.vcf ## you can ignore SVs longer than 100Kb
bcftools norm --threads 16 -m- MC.wave.vcf -Ov -o MC.wave.bi.vcf ## there are still few multiallelic sites
vcfcreatemulti MC.wave.bi.vcf > MC.wave.bi.createmulti.vcf

Merge-SVs

After removing and reconstructing overlapping and nested sites, we would see multiple alleles in one complex SV record. However, they might be slightly different and should be considered as the same one, which would have great impacts on downstream analysis. Therefore, we'd like to consolidate and merge them to confident consensus. See our paper for detailed methods.

python SV.merge_variants.py MC.wave.bi.createmulti.vcf MC.wave.bi.createmulti.merge -t 16 ## if input compressed vcf file, please index first
bgzip -@16 MC.wave.bi.createmulti.merge.vcf ## if input compressed file, `SV.merge_variants.py` will output compressed merged vcf file
bcftools index -t --threads 16 MC.wave.bi.createmulti.merge.vcf.gz

Construct-Benchmarking

This workflow integrates both assembly-based and long-read SV callers, followed by comprehensive merging to generate a high-confidence SV set.

1. Assembly-based SV Calling

mkdir ./long_reads.SV/ && cd ./long_reads.SV/
# SVIM-asm
~/bin/minimap2/2.17/minimap2 -a -x asm5 --cs -r2k -t 4 genome.fa query.fa > alignments.sam
~/bin/samtools sort -m4G -@4 -o alignments.sorted.bam alignments.sam
~/bin/samtools index alignments.sorted.bam
~/bin/svim-asm diploid ./ alignments.sorted.bam genome.fa

# SyRI
~/bin/mummer-4.0.0/bin/nucmer --prefix=sample -l 100 genome.fa query.fa
~/bin/mummer-4.0.0/delta-filter -l 100 -i 90 -1 sample.delta > sample.delta.filter
~/bin/mummer-4.0.0/bin/show-coords -THrd sample.delta.filter > sample.filtered.coords

python3 ~/bin/syri-1.5/syri/bin/syri -c sample.filtered.coords -r genome.fa -q query.fa \
  -d sample.delta.filter -k --prefix sample -s ~/bin/show-snps --lf syri.log

2. Long-read SV Calling

mkdir ./long_reads.SV/ && cd ./long_reads.SV/
# Sniffles
~/bin/samtools calmd --threads 6 -b HiFi_reads.sort.head.bam genome.fa > HiFi_reads.sort.baq.bam
~/bin/sniffles-core-1.0.12/sniffles -m HiFi_reads.sort.baq.bam -v HiFi_reads.sniffles.vcf \
  --report_BND --skip_parameter_estimation --min_support 3

# SVIM
~/bin/svim alignment ./ HiFi_reads.sort.head.bam genome.fa

# CuteSV
~/bin/cuteSV HiFi_reads.sort.head.bam genome.fa HiFi_reads.cuteSV.vcf ./ \
  --max_cluster_bias_INS 1000 --diff_ratio_merging_INS 0.9 \
  --max_cluster_bias_DEL 1000 --diff_ratio_merging_DEL 0.5 -t 8 --genotype

# PBSV
~/bin/pbsv discover HiFi_reads.sort.head.bam HiFi_reads.svsig.gz
tabix -c '#' -s 3 -b 4 -e 4 HiFi_reads.svsig.gz
~/bin/pbsv call --ccs genome.fa HiFi_reads.svsig.gz HiFi_reads.pbsv.vcf

3. Merging Results

# Combine VCFs from all callers
ls ./long_reads.SV/*.vcf > combine.txt
ls ./assembly_based.SV/*.vcf >> combine.txt

# Merge with SURVIVOR (max distance 1000bp, require 2 callers support)
~/bin/SURVIVOR merge combine.txt 1000 2 1 1 0 50 sampleN.truth.sv.vcf.gz

Output Files

sampleN.truth.sv.vcf.gz: Final merged structural variant calls for each individual

./long_reads.SV/: Individual long-reads-based SV VCFs

./assembly_based.SV/: Individual Assembly-based SV VCFs

Compare-with-benchmarking

# 01_subsample_svs.sh
vcftools --gzvcf MC.APGpangraph.vcf.gz --indv sampleN --recode --stdout |bgzip -c > sampleN.sv.vcf.gz
tabix -p vcf sampleN.sv.vcf.gz

# 02_filter_large_svs.sh
vcfbub -l 0 -a 100000 -i sampleN.vcf.gz -o sampleN.filtered.vcf.gz

# 03_decompose_alleles.sh
vcfwave -I 1000 -i sampleN.filtered.vcf.gz -o sampleN.wave.vcf.gz

# 04_split_biallelic.sh
bcftools norm -m -any -i sampleN.wave.vcf.gz -o sampleN.biallelic.vcf.gz

# 05_run_truvari.sh
truvari bench -b sampleN.truth.sv.vcf.gz -c sampleN.biallelic.vcf.gz -o truvari_results

Contacts

([email protected]) ([email protected]) ([email protected])

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
scripts		scripts
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

PanSVMerger-A-Pipeline-for-Merging-Multi-Allele-Structural-Variants

Table of Contents

About

Pipeline-Description

why-PanSVMerger

Requirements-and-Install

Usage

Prepare-non-overlapping-multiallelic-VCF-file

Merge-SVs

Construct-Benchmarking

1. Assembly-based SV Calling

2. Long-read SV Calling

3. Merging Results

Output Files

Compare-with-benchmarking

Contacts

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

Asian-Pan-Genome/PanSVMerger

Folders and files

Latest commit

History

Repository files navigation

PanSVMerger-A-Pipeline-for-Merging-Multi-Allele-Structural-Variants

Table of Contents

About

Pipeline-Description

why-PanSVMerger

Requirements-and-Install

Usage

Prepare-non-overlapping-multiallelic-VCF-file

Merge-SVs

Construct-Benchmarking

1. Assembly-based SV Calling

2. Long-read SV Calling

3. Merging Results

Output Files

Compare-with-benchmarking

Contacts

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages