vSNP3 is a powerful tool for high-resolution bacterial and viral SNP analysis, designed specifically for disease tracing and outbreak investigations in diagnostic laboratories.
- Superior Resolution: Confidently identifies strain differences down to the single-nucleotide level
- Flexible Database: Build, maintain, and update your strain database without rerunning all samples
- Intelligent Sample Classification: Automatically group samples based on defining SNPs
- Computational Efficiency: Focus analysis on relevant sample subsets, saving time and resources
- Comprehensive Output: Complete suite of BAM, VCF, annotated SNP matrices, and phylogenetic trees
- Zero Coverage Tracking: Unique capability to track regions with no sequence data
- Mixed SNP Handling: Represents positions with multiple alleles using IUPAC ambiguity codes, which makes it possible to identify mixed strains
Most SNP callers force you to reprocess all your samples each time you add new ones. vSNP3's two-step approach is different:
- Step 1: Process Alignment Once - Align reads and call SNPs for each sample individually
- Step 2: Combine and run VCF files - Generate matrices and trees from any combination of samples
This approach lets you:
- Add new samples to your analysis without reprocessing existing ones
- Create different sample groupings for different investigations
- Save computational resources and time
- Maintain a growing, curated database of SNP profiles
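To illustrate why the second step is cheap to rerun, the sketch below builds a SNP matrix from per-sample calls that are already stored. The `calls` dictionary, the sample names, the positions, and the `build_matrix` helper are all hypothetical and are not part of vSNP3, which works from VCF files rather than Python dictionaries.

```python
# Hypothetical per-sample SNP calls, as if produced once by an alignment step.
# Keys are positions, values are the called base; this is not vSNP3's format.
REFERENCE = {1001: "A", 2050: "G", 3300: "C"}

calls = {
    "sample_01": {1001: "T"},
    "sample_02": {1001: "T", 3300: "G"},
    "sample_03": {2050: "A"},
}


def build_matrix(stored_calls, sample_names):
    """Return (positions, rows) for the chosen samples only."""
    # Keep positions where at least one chosen sample differs from the reference.
    positions = sorted({pos for name in sample_names for pos in stored_calls[name]})
    # Samples without a call at a position are assumed to carry the reference base.
    rows = {
        name: [stored_calls[name].get(pos, REFERENCE[pos]) for pos in positions]
        for name in sample_names
    }
    return positions, rows


# Any subset can be combined without recomputing the stored calls.
positions, rows = build_matrix(calls, ["sample_01", "sample_02"])
print(positions)
for name, row in rows.items():
    print(name, "".join(row))
```

Choosing a different list of sample names yields a different matrix from the same stored calls, which is the property the two-step design relies on.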
A unique feature of vSNP3 is its use of defining SNPs to automatically categorize samples:
Full Dataset (100 samples)
│
├── Group A (40 samples) - Defining SNP: position 123456 = T
│   │
│   ├── Subgroup A1 (15 samples) - Defining SNP: position 234567 = G
│   │
│   └── Subgroup A2 (25 samples) - Defining SNP: position 234567 = A
│
└── Group B (60 samples) - Defining SNP: position 123456 = C
    │
    ├── Subgroup B1 (20 samples) - Defining SNP: position 345678 = T
    │
    └── Subgroup B2 (40 samples) - Defining SNP: position 345678 = C
Benefits of defining SNPs:
- Automatic Grouping: Samples are classified into groups based on specific SNP patterns
- Focused Analysis: Quickly drill down to specific subsets of related samples
- Computational Efficiency: Reduce analysis time by working with smaller, relevant sample sets
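The grouping in the example tree above can be sketched in a few lines of Python. This is only an illustration: the `DEFINING_SNPS` layout and the `classify` helper are hypothetical, and the rule that a subgroup is assigned only when its parent group also matches is an assumption made here, not a statement about how vSNP3 implements classification.

```python
# Defining SNPs taken from the example tree; the data layout is hypothetical.
# Each entry: (group name, parent group or None, position, defining base).
# Parents must be listed before their subgroups for the parent check to work.
DEFINING_SNPS = [
    ("Group A", None, 123456, "T"),
    ("Group B", None, 123456, "C"),
    ("Subgroup A1", "Group A", 234567, "G"),
    ("Subgroup A2", "Group A", 234567, "A"),
    ("Subgroup B1", "Group B", 345678, "T"),
    ("Subgroup B2", "Group B", 345678, "C"),
]


def classify(sample_bases):
    """Return the groups whose defining SNP matches the sample's bases."""
    assigned = []
    for name, parent, position, base in DEFINING_SNPS:
        # Assumption: a subgroup only applies if its parent group was assigned.
        parent_ok = parent is None or parent in assigned
        if parent_ok and sample_bases.get(position) == base:
            assigned.append(name)
    return assigned


# A sample with T at 123456 and A at 234567 falls into Group A, Subgroup A2.
sample = {123456: "T", 234567: "A", 345678: "C"}
groups = classify(sample)
print(groups)
```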
conda create -c conda-forge -c bioconda -n vsnp3 vsnp3=3.30
conda activate vsnp3
For detailed setup instructions, see conda instructions.
# Verify installation
vsnp3_step1.py -h
vsnp3_step2.py -h
# Download test dataset and add reference types
cd ${HOME}
git clone https://github.com/USDA-VS/vsnp3_test_dataset.git
cd vsnp3_test_dataset/vsnp_dependencies
vsnp3_path_adder.py -d "$(pwd)"
# Run Step 1: Process a single sample (only needed once per sample)
cd ~/vsnp3_test_dataset/AF2122_test_files/step1
vsnp3_step1.py -r1 *_R1*.fastq.gz -r2 *_R2*.fastq.gz -t Mycobacterium_AF2122
# Run Step 2: Generate SNP matrix and tree (can be run with any sample combination)
cd ~/vsnp3_test_dataset/AF2122_test_files/step2
vsnp3_step2.py -a -t Mycobacterium_AF2122
Imagine you're tracking a bacterial outbreak over time:
- Initial Investigation: Process your first 10 samples through Step 1, then use Step 2 to generate a phylogenetic tree
- New Sample Analysis: When you receive 5 new samples, only run Step 1 on these new samples
- Updated Results: Run Step 2 again using all 15 samples to see how the new samples relate to the existing ones
- Focused Investigation: Use defining SNPs to identify a specific cluster, then create a detailed analysis with just those samples
This workflow saves time and resources while maintaining a comprehensive database of all processed samples.
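The bookkeeping behind this scenario is small. The sketch below, which uses hypothetical sample names and a hypothetical `samples_needing_step1` helper, shows one way to work out which samples still need Step 1 and which set goes into Step 2.

```python
def samples_needing_step1(received, already_processed):
    """Return the received samples that have not been through Step 1 yet."""
    return sorted(set(received) - set(already_processed))


# Ten samples were processed during the initial investigation.
processed = [f"sample_{i:02d}" for i in range(1, 11)]
# Five new samples arrive later.
received = processed + [f"sample_{i:02d}" for i in range(11, 16)]

pending = samples_needing_step1(received, processed)
step2_input = sorted(set(processed) | set(pending))

print(pending)           # only these go through Step 1
print(len(step2_input))  # all samples are available to Step 2
```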
Step 1 processes raw sequencing data for each sample individually:
- Aligns reads to your reference genome
- Calls high-quality SNPs
- Tracks regions with zero coverage
- Generates comprehensive quality metrics
- Automatically assigns samples to groups based on defining SNPs
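Tracking zero coverage amounts to finding runs of positions where no reads aligned. The sketch below assumes a plain list of per-position read depths, which is a simplification for illustration; the `zero_coverage_regions` helper is hypothetical and does not reflect how vSNP3 stores or reports this information.

```python
def zero_coverage_regions(depths):
    """Return 1-based inclusive (start, end) intervals where depth is zero."""
    regions = []
    start = None
    for index, depth in enumerate(depths, start=1):
        if depth == 0 and start is None:
            start = index  # a zero-coverage run begins here
        elif depth != 0 and start is not None:
            regions.append((start, index - 1))  # the run ended at the previous position
            start = None
    if start is not None:
        # The final run extends to the end of the sequence.
        regions.append((start, len(depths)))
    return regions


depths = [12, 9, 0, 0, 0, 15, 20, 0, 7, 0, 0]
regions = zero_coverage_regions(depths)
print(regions)
```

Recording these intervals makes it possible to tell a position that matches the reference apart from a position with no data at all.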
Step 2 combines results from multiple samples:
- Creates SNP matrices from any combination of processed samples
- Builds phylogenetic trees showing evolutionary relationships
- Handles mixed SNPs using IUPAC ambiguity codes
- Generates HTML summary reports for easy interpretation
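As an illustration of how a mixed position can be written as a single character, the sketch below maps the set of alleles seen at a position to its standard IUPAC nucleotide code. The `iupac_code` helper and the `min_fraction` threshold of 0.2 are assumptions chosen for this example; they are not vSNP3's actual calling criteria.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases they represent.
IUPAC_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def iupac_code(allele_counts, min_fraction=0.2):
    """Return the IUPAC code for alleles supported by enough reads.

    min_fraction is a hypothetical threshold used only in this sketch.
    """
    total = sum(allele_counts.values())
    if total == 0:
        return "N"  # no reads, so the base is unknown
    present = {
        base for base, count in allele_counts.items()
        if count / total >= min_fraction
    }
    # Fall back to N for anything outside the table.
    return IUPAC_CODES.get(frozenset(present), "N")


mixed = iupac_code({"A": 30, "G": 25})   # both alleles well supported
clean = iupac_code({"C": 48, "T": 2})    # T is below the threshold
print(mixed, clean)
```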
vSNP3's defining SNP capability allows you to:
- Automatically classify samples into hierarchical groups
- Focus your analysis on biologically relevant sample subsets
- Quickly identify related samples in an outbreak scenario
- Build a labeled sample database
One of vSNP3's most powerful features is its ability to automatically classify samples using defining SNPs. Each reference type has its own defining SNP Excel file that defines these critical positions.
After installation, you can find the path to your defining SNP files with:
vsnp3_path_adder.py -s
This will show all installed reference types and their associated file paths.
The defining SNP Excel file has a structured format:
- Row 1: Contains chromosome:position identifiers for each SNP position
- Row 2: Names of each group/subgroup (e.g., Mbovis-All, Mbovis-01, Mbovis-01A)
- Remaining Rows: Positions to be filtered from the analysis for each specific group
When vSNP3 analyzes a sample:
- It checks the sample's nucleotides at each defining position
- Based on the SNP pattern, it automatically assigns the sample to the appropriate group
- During analysis, it filters out the problematic positions listed below each group's column
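The layout described above can be modelled in memory to show how a group's column drives filtering. In this sketch the sheet is a plain list of rows rather than a real Excel file, and the chromosome name `chrom1`, the positions, and the `parse_sheet` and `apply_filter` helpers are hypothetical. Only the group names come from the examples above.

```python
# Hypothetical in-memory copy of the sheet layout; None marks an empty cell.
SHEET = [
    ["chrom1:123456", "chrom1:234567"],  # row 1: defining positions
    ["Mbovis-01", "Mbovis-01A"],         # row 2: group names
    ["chrom1:500", "chrom1:900"],        # remaining rows: filtered positions
    ["chrom1:750", None],
]


def parse_sheet(rows):
    """Turn the column-oriented sheet into a dictionary keyed by group name."""
    defining_row, name_row, *filter_rows = rows
    groups = {}
    for column, (position, name) in enumerate(zip(defining_row, name_row)):
        # Collect the non-empty cells listed under this group's column.
        filtered = {row[column] for row in filter_rows if row[column] is not None}
        groups[name] = {"defining_position": position, "filtered": filtered}
    return groups


def apply_filter(snp_positions, group):
    """Drop the positions that the group lists as problematic."""
    return [pos for pos in snp_positions if pos not in group["filtered"]]


groups = parse_sheet(SHEET)
kept = apply_filter(["chrom1:500", "chrom1:600", "chrom1:750"], groups["Mbovis-01"])
print(kept)
```

Because each group carries its own filter set, the same SNP list can give different results depending on which group a sample was assigned to.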
The beauty of this system is its flexibility:
- You can define hierarchical groups based on evolutionary relationships
- Each group can have its own set of filtered positions to improve analysis quality
- As you discover new lineages, you can update the defining SNP file to reflect them
This classification system allows you to:
- Automatically organize samples as they're processed
- Focus your analysis on specific groups of interest
- Maintain consistent classifications across your entire database
- Filter out positions known to be problematic for specific lineages
The defining SNP system transforms vSNP3 from a simple SNP caller into an intelligent analysis platform that grows more valuable as your sample database expands.
Reference types have key files that provide structure to your analysis:
- Defining filter file: Identifies group-specific SNPs
- Metadata file: Maps sample names
- FASTA reference: For read alignment
- GenBank file: For annotation
Adding a reference is simple:
vsnp3_path_adder.py -d /path/to/reference_files
Reference types are called based on their directory names once their parent directory is added.
One of the most important first steps in using vSNP3 is setting up your reference types. This only needs to be done once, and it enables vSNP3's automatic sample classification and group-specific filtering.
A reference type in vSNP3 is a collection of files for a specific organism that includes:
- A reference genome (FASTA)
- Annotation information (GenBank)
- Defining SNP positions (Excel file)
- Sample name mapping (Excel file)
These files work together to provide the foundation for your analyses.
Adding a reference type is simple using the vsnp3_path_adder.py utility. Point it at the parent directory that contains the reference directory. A parent directory may contain many reference types, each in its own subfolder.
# Add a reference parent directory containing all necessary files.
vsnp3_path_adder.py -d /path/to/parent_directory
This command tells vSNP3 where to find the reference files for a particular organism. The reference type name is taken directly from the directory name. For example, if your files are in a directory called Mycobacterium_AF2122, that becomes the reference type name you'll use in your commands.
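The naming rule can be pictured as listing the subdirectories of the parent directory. The sketch below uses a temporary directory and a hypothetical `reference_type_names` helper. It illustrates the convention only and does not call or reproduce vsnp3_path_adder.py.

```python
import tempfile
from pathlib import Path


def reference_type_names(parent_directory):
    """Return the subdirectory names, which act as reference type names."""
    return sorted(p.name for p in Path(parent_directory).iterdir() if p.is_dir())


# Build a throwaway parent directory with two reference subfolders.
with tempfile.TemporaryDirectory() as parent:
    for name in ("Mycobacterium_AF2122", "Another_Reference"):
        (Path(parent) / name).mkdir()
    # A stray file in the parent directory is not a reference type.
    (Path(parent) / "notes.txt").touch()
    names = reference_type_names(parent)

print(names)
```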
Let's walk through a complete example:
- Prepare your reference directory
  Create a directory with these files:
  Parent_Directory/
  └── Mycobacterium_AF2122/
      ├── defining_filter.xlsx    # Contains defining SNPs and filter positions
      ├── metadata.xlsx           # Sample name mapping
      ├── AF2122.fasta            # Reference genome
      └── AF2122.gbk              # GenBank annotation file
- Add the reference type to vSNP3
  vsnp3_path_adder.py -d /path/to/Parent_Directory
- Verify the reference was added
  vsnp3_path_adder.py -s
You should see your reference type listed, along with paths to all associated files.
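Before adding a parent directory, it can help to confirm that a reference folder holds the four kinds of files listed in the example. The check below is a hypothetical convenience, not a vSNP3 feature. It assumes the file extensions shown in the example tree (`.fasta`, `.gbk`, and two `.xlsx` files) and ignores file contents.

```python
import tempfile
from pathlib import Path

# Assumed from the example layout: one FASTA, one GenBank, two Excel files.
REQUIRED_SUFFIXES = {".fasta": 1, ".gbk": 1, ".xlsx": 2}


def missing_files(reference_directory):
    """Return {suffix: number still missing} for an incomplete directory."""
    counts = {}
    for path in Path(reference_directory).iterdir():
        if path.is_file():
            counts[path.suffix] = counts.get(path.suffix, 0) + 1
    return {
        suffix: needed - counts.get(suffix, 0)
        for suffix, needed in REQUIRED_SUFFIXES.items()
        if counts.get(suffix, 0) < needed
    }


# Demonstrate with a reference folder that lacks its metadata file.
with tempfile.TemporaryDirectory() as parent:
    reference = Path(parent) / "Mycobacterium_AF2122"
    reference.mkdir()
    for filename in ("defining_filter.xlsx", "AF2122.fasta", "AF2122.gbk"):
        (reference / filename).touch()
    result = missing_files(reference)

print(result)
```

An empty dictionary means every expected file type is present.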
vSNP3 allows you to work with multiple reference types:
- Adding additional references: Simply run the path adder for each new reference
  vsnp3_path_adder.py -d /path/to/another_parent_directory
- Viewing all references: Check which references are available
  vsnp3_path_adder.py -s
- Organize by organism: Keep reference files for each organism in separate directories
- Use descriptive names: Choose reference type names that clearly identify the organism
- Keep references consistent: Use the same reference across all related analyses
- Back up your reference files: Save your defining SNP files especially, as they contain valuable classification information
By properly setting up your reference types, you're creating a foundation for consistent, repeatable analyses that grow more valuable as your sample database expands.
vSNP3 includes utility scripts for:
- Adding reference paths
- MLST typing
- Downloading reference genomes
- Filter optimization
- Spoligotyping
For full details, see Additional Tools.
- Disease outbreak investigation: Track transmission chains in real time
- Surveillance programs: Monitor pathogen evolution over time
- Vaccine strain monitoring: Detect drift from vaccine strains
- Mixed strain evaluation: Identify mixed strains
- Antimicrobial resistance tracking: Link resistance profiles to genetic markers
For support, please open an issue on GitHub or email directly.
If you use vSNP3 in your research, please cite our article.
For archived documentation from previous versions, see Archived Detail.