name	bio-metagenomics-strain-tracking
description	Track bacterial strains using MASH, sourmash, fastANI, and inStrain. Compare genomes, detect contamination, and monitor strain-level variation. Use when needing sub-species resolution for outbreak tracking, transmission analysis, or within-host strain dynamics.
tool_type	cli
primary_tool	MASH

Version Compatibility

Reference examples tested with: Bowtie2 2.5.3+, MetaPhlAn 4.1+, numpy 1.26+, pandas 2.2+, samtools 1.19+, scipy 1.12+

Before using code patterns, verify installed versions match. If versions differ:

Python: pip show <package> then help(module.function) to check signatures
CLI: <tool> --version then <tool> --help to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Strain Tracking

"Track bacterial strains across my samples" → Resolve sub-species variation using genome sketching (Mash/sourmash), average nucleotide identity (fastANI), or within-sample strain profiling (inStrain) for outbreak tracking and transmission analysis.

CLI: mash dist, sourmash compare, fastANI, inStrain profile

Identify and track bacterial strains at sub-species resolution.

Tool Comparison

Tool	Method	Best For
MASH	MinHash sketches	Fast distance estimation
sourmash	MinHash + containment	Metagenome comparisons
fastANI	ANI calculation	Accurate species/strain ID
inStrain	SNV profiling	Strain dynamics in metagenomes

MASH

Installation

conda install -c bioconda mash

Create Sketch

# Single genome
mash sketch -o genome.msh genome.fasta

# Multiple genomes
mash sketch -o reference_db.msh genomes/*.fasta

# From reads (with coverage)
mash sketch -m 2 -r -o reads.msh reads.fastq.gz

Calculate Distance

# Pairwise distance
mash dist genome1.fasta genome2.fasta

# Query against database
mash dist reference_db.msh query.fasta > distances.tsv

# Screen for containment (metagenome)
mash screen reference_db.msh reads.fastq.gz > screen_results.tsv

Interpret MASH Distance

Distance	Interpretation
< 0.05	Same species/strain
0.05-0.15	Same species
0.15-0.25	Same genus
> 0.25	Different genus

Cluster Genomes

# All-vs-all distances
mash triangle genomes/*.fasta > distances.phylip

# Build tree
mash triangle -E genomes/*.fasta > distances.tsv

sourmash

Installation

conda install -c bioconda sourmash

Create Signatures

# Genome signature
sourmash sketch dna -p scaled=1000,k=31 genome.fasta -o genome.sig

# Multiple genomes
sourmash sketch dna -p scaled=1000,k=31 genomes/*.fasta -o genomes.sig

# Protein signatures
sourmash sketch protein -p scaled=100,k=10 proteins.faa -o proteins.sig

Compare Signatures

# Pairwise comparison
sourmash compare *.sig -o comparison.npy --csv comparison.csv

# Search against database
sourmash search query.sig database.sig --threshold 0.8

# Gather (metagenome decomposition)
sourmash gather metagenome.sig database.sig -o gather_results.csv

Taxonomy Assignment

# Download taxonomy database
sourmash database download gtdb-rs214-k31.zip

# Classify
sourmash lca classify --db gtdb-rs214-k31.lca.json.gz --query query.sig

# Summarize metagenome
sourmash lca summarize --db gtdb-rs214-k31.lca.json.gz --query metagenome.sig

fastANI

Installation

conda install -c bioconda fastani

Calculate ANI

# Single pair
fastANI -q query.fasta -r reference.fasta -o ani_result.txt

# Query vs multiple references
fastANI -q query.fasta --rl reference_list.txt -o ani_results.txt

# All-vs-all
fastANI --ql genome_list.txt --rl genome_list.txt -o all_vs_all.txt --matrix

Interpret ANI

ANI	Interpretation
>99%	Same strain
95-99%	Same species
<95%	Different species

inStrain

For strain-level analysis in metagenomes.

Installation

conda install -c bioconda instrain

Profile Strains

# Map reads to reference
bowtie2 -x reference -1 reads_1.fq -2 reads_2.fq | \
    samtools sort -o mapped.bam

# Profile with inStrain
inStrain profile mapped.bam reference.fasta -o instrain_output -p 8

Compare Samples

# Profile multiple samples
for bam in sample*.bam; do
    inStrain profile $bam reference.fasta -o ${bam%.bam}_IS -p 8
done

# Compare strain populations
inStrain compare -i sample*_IS -o comparison_IS -p 8

Key Outputs

# SNV table
cat instrain_output/output/SNVs.tsv

# Gene-level info
cat instrain_output/output/gene_info.tsv

# Genome info
cat instrain_output/output/genome_info.tsv

Complete Workflow: Outbreak Tracking

Goal: Identify potential outbreak clusters by computing pairwise genomic distances across isolate genomes using multiple complementary methods.

Approach: Sketch genomes with MASH for fast distance estimation, compute ANI with fastANI for accurate species-level resolution, compare sourmash signatures for containment analysis, and cluster close matches to identify transmission pairs.

#!/bin/bash
set -euo pipefail

GENOMES_DIR=$1
OUTPUT_DIR=$2

mkdir -p $OUTPUT_DIR

echo "=== MASH sketching ==="
mash sketch -o $OUTPUT_DIR/genomes.msh $GENOMES_DIR/*.fasta

echo "=== MASH distances ==="
mash dist $OUTPUT_DIR/genomes.msh $OUTPUT_DIR/genomes.msh > $OUTPUT_DIR/mash_distances.tsv

echo "=== fastANI ==="
ls $GENOMES_DIR/*.fasta > $OUTPUT_DIR/genome_list.txt
fastANI --ql $OUTPUT_DIR/genome_list.txt \
        --rl $OUTPUT_DIR/genome_list.txt \
        -o $OUTPUT_DIR/fastani_results.txt \
        --matrix

echo "=== sourmash signatures ==="
sourmash sketch dna -p scaled=1000,k=31 $GENOMES_DIR/*.fasta -o $OUTPUT_DIR/all.sig
sourmash compare $OUTPUT_DIR/all.sig -o $OUTPUT_DIR/sourmash.npy --csv $OUTPUT_DIR/sourmash.csv

echo "=== Identify clusters ==="
python3 << 'EOF'
import pandas as pd
import numpy as np

# Load MASH distances
mash = pd.read_csv('${OUTPUT_DIR}/mash_distances.tsv', sep='\t', header=None,
                   names=['ref', 'query', 'distance', 'pvalue', 'shared'])

# Filter for close matches (potential outbreak cluster)
close = mash[(mash['distance'] < 0.001) & (mash['ref'] != mash['query'])]
print("Potential outbreak pairs (MASH distance < 0.001):")
print(close[['ref', 'query', 'distance']])
EOF

echo "=== Complete ==="

Python Analysis

import pandas as pd
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Load MASH distances
mash = pd.read_csv('mash_distances.tsv', sep='\t', header=None,
                   names=['ref', 'query', 'dist', 'pval', 'shared'])

# Pivot to matrix
samples = sorted(set(mash['ref'].tolist()))
dist_matrix = mash.pivot(index='ref', columns='query', values='dist').fillna(0)
dist_matrix = dist_matrix.loc[samples, samples]

# Cluster
condensed = squareform(dist_matrix.values)
Z = linkage(condensed, method='average')

# Cut tree at species level (0.05)
clusters = fcluster(Z, t=0.05, criterion='distance')
cluster_df = pd.DataFrame({'sample': samples, 'cluster': clusters})
print(cluster_df.groupby('cluster').size())

Related Skills

metagenomics/kraken-classification - Taxonomic classification
genome-assembly/contamination-detection - Contamination screening
phylogenetics/modern-tree-inference - Phylogenetic analysis
metagenomics/metaphlan-profiling - Species profiling

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Version Compatibility

Strain Tracking

Tool Comparison

MASH

Installation

Create Sketch

Calculate Distance

Interpret MASH Distance

Cluster Genomes

sourmash

Installation

Create Signatures

Compare Signatures

Taxonomy Assignment

fastANI

Installation

Calculate ANI

Interpret ANI

inStrain

Installation

Profile Strains

Compare Samples

Key Outputs

Complete Workflow: Outbreak Tracking

Python Analysis

Related Skills

FilesExpand file tree

SKILL.md

Latest commit

History

SKILL.md

File metadata and controls

Version Compatibility

Strain Tracking

Tool Comparison

MASH

Installation

Create Sketch

Calculate Distance

Interpret MASH Distance

Cluster Genomes

sourmash

Installation

Create Signatures

Compare Signatures

Taxonomy Assignment

fastANI

Installation

Calculate ANI

Interpret ANI

inStrain

Installation

Profile Strains

Compare Samples

Key Outputs

Complete Workflow: Outbreak Tracking

Python Analysis

Related Skills