Reference-based SNP calling. The workflow is used to continually add samples to a dataset and to organize that data in a way that allows for SNP validation.
- high resolution SNP analysis
- confidence in SNP calls
- visualize SNP differences in tables
- workflow provides a predictable and familiar data structure
- handles large datasets
- reference based
- time intensive
- reference setup
- data validation
- subjective SNP filtering
vSNP can be installed via Anaconda.
If Anaconda is not installed, follow the steps at package_manager setup.
Follow the setup and testing steps on the vSNP3 GitHub page.
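After installation, a quick check that the scripts used in this walkthrough are on the PATH can save a failed run later. This is a sketch only: `missing_tools` is a hypothetical helper, not part of vSNP3, and it assumes the conda environment has already been activated (for example with `conda activate vsnp3`).

```shell
# missing_tools (hypothetical helper): print each named command not found on PATH.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Script names are the ones used in this walkthrough.
# No output means all four were found.
missing_tools vsnp3_step1.py vsnp3_step2.py vsnp3_path_adder.py vsnp3_excel_merge_files.py
```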
- reference.fasta
- define_filter.xlsx
- metadata.xlsx
- Organized by reference type
- Step 1 - alignments
- Step 2 - VCF collection
conda create -c conda-forge -c bioconda -n vsnp3 vsnp3=3.24
cd ~; git clone https://github.com/USDA-VS/vsnp3_test_dataset.git
cd ~/vsnp3_test_dataset/vsnp_dependencies
vsnp3_path_adder.py -d `pwd`
vsnp3_path_adder.py -s
vsnp3_step1.py -r1 ERR766214_R1.fastq.gz -r2 ERR766214_R2.fastq.gz -t Mycobacterium_AF2122
Look over stats
Add VCF to database and run step 2
vsnp3_step2.py -t Mycobacterium_AF2122 -a
BCG samples
ERR766216
ERR766219
ERR766220
ERR766213
ERR766225
ERR766224
SRR398629
ERR766223
ERR234151
SRR7983756
ERR017778
ERR766218
ERR766215
ERR766217
ERR766214
ERR766222
ERR766221
ERR766226
Package FASTQs
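Before moving files, a dry run can show which directory each FASTQ would land in. This is a sketch only: `sample_name` is a hypothetical helper that applies the same `sed` expression as the packaging command, so everything from the first `.` or `_` onward is dropped (a file named `my_sample_R1.fastq.gz` would map to `my`).

```shell
# sample_name (hypothetical helper): derive the directory name from a FASTQ
# file name by cutting at the first "." or "_".
sample_name() {
    printf '%s\n' "$1" | sed 's/[._].*//'
}

# Dry run: print the planned destination for each FASTQ without moving anything.
for fastq in *.fastq.gz; do
    [ -e "$fastq" ] || continue   # pattern stays literal when nothing matches
    printf '%s -> %s/\n' "$fastq" "$(sample_name "$fastq")"
done
```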
for fastq in *.fastq.gz; do name=$(echo $fastq | sed 's/[._].*//'); mkdir -p $name; mv -v $fastq $name/; done
Loop directories
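The loop for this step starts one background job per sample directory and calls `wait` after every `NUM_PER_CYCLE` jobs. A minimal sketch of that batching pattern follows, with a hypothetical `run_sample` function standing in for `vsnp3_step1.py`. It also adds a final `wait`, an addition not in the original loop: 18 samples in batches of 4 leave a last partial batch of 2 that the in-loop `wait` never catches.

```shell
# run_sample (hypothetical stand-in for the real per-sample command).
run_sample() {
    echo "finished: $1"
}

# run_in_batches N item...: start each item in the background and pause
# after every N jobs until that batch has finished.
run_in_batches() {
    per_cycle=$1
    shift
    count=0
    for item in "$@"; do
        run_sample "$item" &
        count=$((count + 1))
        if [ $((count % per_cycle)) -eq 0 ]; then
            wait
        fi
    done
    wait   # also wait for the final, possibly partial, batch
}

run_in_batches 4 ERR766214 ERR766215 ERR766216 ERR766217 ERR766218
```

Without the final `wait`, the prompt can return while the last jobs are still running, so stats collected immediately afterwards may be incomplete.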
NUM_PER_CYCLE=4; starting_dir=$(pwd); for dir in ./*/; do (echo "starting: $dir"; cd ./$dir; vsnp3_step1.py -r1 *_R1*.fastq.gz -r2 *_R2*.fastq.gz; cd $starting_dir) & let count+=1; [[ $((count%NUM_PER_CYCLE)) -eq 0 ]] && wait; done
Collect Stats
mkdir stats; cp ./*/*stats.xlsx stats; cd stats; vsnp3_excel_merge_files.py
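Before relying on the merged stats, it may be worth listing sample directories that have no `*stats.xlsx` file, since the `cp` glob only picks up files that exist. A sketch, meant to be run from the directory that holds the sample folders; `missing_stats` is a hypothetical helper, not part of vSNP3.

```shell
# missing_stats DIR (hypothetical helper): print each subdirectory of DIR
# that contains no file matching *stats.xlsx.
missing_stats() {
    for dir in "$1"/*/; do
        [ -d "$dir" ] || continue
        found=no
        for f in "$dir"*stats.xlsx; do
            if [ -e "$f" ]; then
                found=yes
                break
            fi
        done
        if [ "$found" = no ]; then
            echo "$dir"
        fi
    done
}

# No output means every sample directory has a stats file.
missing_stats .
```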