Convert a FASTA alignment to a SNP distance matrix
% cat test/good.aln
>seq1
AGTCAGTC
>seq2
AGGCAGTC
>seq3
AGTGAGTA
>seq4
TGTTAGAC
% snp-dists test/good.aln > distances.tab
Read 4 sequences of length 8
% cat distances.tab
snp-dists 0.7 seq1 seq2 seq3 seq4
seq1 0 1 2 3
seq2 1 0 3 4
seq3 2 3 0 4
seq4 3 4 4 0
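The matrix above can be reproduced with a few lines of Python. This is only a minimal sketch of the counting rule described in this README (compare aligned columns, count a difference when both letters are A, G, T or C and they differ); it is not the tool's C implementation, and the names `snp_distance` and `distance_matrix` are made up for illustration.

```python
# Illustrative sketch only; snp-dists itself is written in C.

def snp_distance(a: str, b: str) -> int:
    """Count aligned columns where both letters are A/C/G/T and differ."""
    valid = set("ACGT")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x != y and x in valid and y in valid
    )

def distance_matrix(seqs: dict) -> list:
    """Build the full symmetric matrix in the order the sequences were given."""
    names = list(seqs)
    return [[snp_distance(seqs[r], seqs[c]) for c in names] for r in names]

# Same sequences as test/good.aln above
alignment = {
    "seq1": "AGTCAGTC",
    "seq2": "AGGCAGTC",
    "seq3": "AGTGAGTA",
    "seq4": "TGTTAGAC",
}

matrix = distance_matrix(alignment)
for name, row in zip(alignment, matrix):
    print(name, *row, sep="\t")
```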
snp-dists is written in C to the C99 standard and only depends on zlib.
conda install -c bioconda snp-dists
Docker images are available on dockerhub and quay.io. These are maintained by the StaPH-B workgroup. Dockerfiles can be found here.
# Docker
docker pull staphb/snp-dists:latest
docker run staphb/snp-dists:latest snp-dists -h
# Singularity
singularity build snp-dists.sif docker://staphb/snp-dists:latest
singularity exec snp-dists.sif snp-dists -h
git clone https://github.com/tseemann/snp-dists.git
cd snp-dists
make
# run tests
make check
bats test/test.sh # if you have BATS installed
# install into $HOME/.local/bin
make install
USAGE
snp-dists [opts] aligned.fasta[.gz] > matrix.tsv
OPTIONS
-h Show this help
-v Print version and exit
-j CPUS Threads to use [1]
-q Quiet mode; no progress messages
-a Count all differences not just [AGTC]
-k Keep case, don't uppercase all letters
-m Output MOLTEN instead of TSV
-L Output lower-triangle only (unique pairs)
-c Use comma instead of tab in output
-b Blank top left corner cell
-t Add column headers when using molten format
-x INT Stop counting distance beyond this [99999]
URL
https://github.com/tseemann/snp-dists
Prints the name and version separated by a space in standard Unix fashion.
snp-dists 0.9.0
Don't print informational messages, only errors.
snp-dists 0.7.0,seq1,seq2,seq3,seq4
seq1,0,1,2,3
seq2,1,0,3,4
seq3,2,3,0,4
seq4,3,4,4,0
seq1 seq2 seq3 seq4
seq1 0 1 2 3
seq2 1 0 3 4
seq3 2 3 0 4
seq4 3 4 4 0
seq1 seq2 seq3 seq4
seq1 0
seq2 1 0
seq3 2 3 0
seq4 3 4 4 0
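If you already have a full matrix, the lower-triangle layout shown above is easy to derive yourself. The following Python sketch (the helper name `lower_triangle` is hypothetical) keeps, for row i, only the first i+1 cells, so each unique pair appears once along with the diagonal.

```python
# Illustrative sketch: trim a full symmetric matrix to its lower triangle.

def lower_triangle(matrix):
    """Keep cells on or below the diagonal of a square matrix."""
    return [row[: i + 1] for i, row in enumerate(matrix)]

names = ["seq1", "seq2", "seq3", "seq4"]
full = [
    [0, 1, 2, 3],
    [1, 0, 3, 4],
    [2, 3, 0, 4],
    [3, 4, 4, 0],
]

tri = lower_triangle(full)
for name, row in zip(names, tri):
    print(name, *row, sep="\t")
```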
By default, all letters are (1) uppercased and (2) ignored if not A,G,T or C.
Normally one would not want to count ambiguous letters and gaps as a "difference" but if you desire, you can enable this option.
>seq1
NGTCAGTC
>seq2
AG-CAGTC
>seq3
AGTGNGTA
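To make the difference concrete, here is a small Python sketch of the two counting modes applied to the three sequences above. It assumes that the default mode skips any column where either letter is not A, G, T or C, and that counting all differences treats any mismatching pair of characters as a difference; the function name `count_differences` and its `count_all` parameter are invented for this example.

```python
# Illustrative sketch of default counting versus counting all differences.

def count_differences(a: str, b: str, count_all: bool = False) -> int:
    """Count mismatching columns; by default only A/C/G/T mismatches count."""
    valid = set("ACGT")
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == y:
            continue
        if count_all or (x in valid and y in valid):
            diffs += 1
    return diffs

seq1 = "NGTCAGTC"
seq2 = "AG-CAGTC"
seq3 = "AGTGNGTA"

# seq1 vs seq2 differ only at an N and a gap
print(count_differences(seq1, seq2))                  # default counting
print(count_differences(seq1, seq2, count_all=True))  # count every mismatch
```

Under these assumptions seq1 and seq2 are 0 apart by default, because their only mismatches involve an N and a gap, but 2 apart when all differences are counted.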
You may wish to preserve case, for example so that lower-case characters are masked (ignored) in the comparison.
>seq1
AgTCAgTC
>seq2
AggCAgTC
>seq3
AgTgAgTA
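The masking effect can be sketched in Python as follows. The assumption here is that, when case is kept, lower-case letters are simply not among the counted letters A, G, T and C, so any column containing one is skipped; `masked_distance` and `keep_case` are hypothetical names, not part of the tool.

```python
# Illustrative sketch of how keeping case can mask lower-case positions.

def masked_distance(a: str, b: str, keep_case: bool = False) -> int:
    """Count A/C/G/T mismatches; lower-case letters are skipped if case is kept."""
    if not keep_case:
        a, b = a.upper(), b.upper()
    valid = set("ACGT")  # upper-case only, so 'g' is not a counted letter
    return sum(1 for x, y in zip(a, b) if x != y and x in valid and y in valid)

seq1 = "AgTCAgTC"
seq2 = "AggCAgTC"
seq3 = "AgTgAgTA"

print(masked_distance(seq1, seq2))                  # letters uppercased first
print(masked_distance(seq1, seq2, keep_case=True))  # lower-case columns skipped
```

With the sequences above, seq1 and seq2 differ at one column when everything is uppercased, but that column holds a lower-case `g` in seq2, so it is skipped when case is kept.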
seq1 seq1 0
seq1 seq2 1
seq1 seq3 2
seq1 seq4 3
seq2 seq1 1
seq2 seq2 0
seq2 seq3 3
seq2 seq4 4
seq3 seq1 2
seq3 seq2 3
seq3 seq3 0
seq3 seq4 4
seq4 seq1 3
seq4 seq2 4
seq4 seq3 4
seq4 seq4 0
sequence_1 sequence_2 distance
seq1 seq1 0
seq1 seq2 1
seq1 seq3 2
<snip>
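The molten ("long") layout is just the square matrix unrolled into one row per pair. A minimal Python sketch of that reshaping is shown below; the helper name `melt` is hypothetical, and the header text is copied from the example output above.

```python
# Illustrative sketch: unroll a square distance matrix into one row per pair.

def melt(names, matrix):
    """Return (row_name, column_name, distance) for every cell, row by row."""
    return [
        (r, c, matrix[i][j])
        for i, r in enumerate(names)
        for j, c in enumerate(names)
    ]

names = ["seq1", "seq2", "seq3", "seq4"]
matrix = [
    [0, 1, 2, 3],
    [1, 0, 3, 4],
    [2, 3, 0, 4],
    [3, 4, 4, 0],
]

rows = melt(names, matrix)
print("sequence_1", "sequence_2", "distance", sep="\t")  # optional header row
for r, c, d in rows:
    print(r, c, d, sep="\t")
```

This form is often easier to filter or load into a database than the square matrix, at the cost of repeating each pair in both directions.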
Once the distance between two samples becomes very large, there is often not much point in continuing to count. The -x option allows you to "short-circuit" the counting. This can reduce computation time significantly on large alignments if you only care about small distances.
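The idea of short-circuiting can be sketched in Python as below. This is an assumption-laden illustration rather than the tool's code: it stops scanning as soon as the running count reaches the limit and returns the limit itself, and the names `capped_distance` and `max_diff` are invented for the example.

```python
# Illustrative sketch of early-exit ("short-circuit") distance counting.

def capped_distance(a: str, b: str, max_diff: int = 99999) -> int:
    """Count A/C/G/T mismatches, stopping once max_diff has been reached."""
    valid = set("ACGT")
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x != y and x in valid and y in valid:
            diffs += 1
            if diffs >= max_diff:
                return max_diff  # remaining columns are never inspected
    return diffs

print(capped_distance("AGTCAGTC", "TGTTAGAC"))              # full count
print(capped_distance("AGTCAGTC", "TGTTAGAC", max_diff=2))  # stops early
```

Any pair reported at the limit should be read as "at least this far apart", which is usually all that is needed when screening for closely related samples.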
Report bugs and give suggestions on the Issues page.