This repository was archived by the owner on Jul 1, 2025. It is now read-only.
Closed
8 changes: 4 additions & 4 deletions docs/interpreting_plots.rst
@@ -44,7 +44,7 @@ Percentage Identity/ANI
Percentage identity matrix for *Candidatus Blochmannia* ANIm analysis

Each cell represents a pairwise comparison between the named genomes on rows and columns, and the number in each cell is the pairwise identity *of all aligned regions*. The dendrograms are produced by single-linkage hierarchical clustering trees from the matrix of pairwise identity results. The default colour scheme colours cells with identity > 0.95 as red, and those with < 0.95 as blue. This division corresponds to a widely-used convention for bacterial species boundaries.

.. note::

No single ANI threshold should be considered universally applicable to distinguish between species for all bacterial genomes.
@@ -56,7 +56,7 @@ Taking the 95% threshold between red and blue cells to be equivalent to a specie
* the two genomes BPEN and 640 could be classified as the same species
* the remaining four genomes each represent a distinct species

-In particular, we can see that the off-diagonal identity values are all around 85%, consistent with the limit of detection for homologous nucleotide regions.
+In particular, we can see that the off-diagonal identity values are all around 85%, consistent with the limit of detection for homologous nucleotide regions.

.. note::

@@ -143,13 +143,13 @@ Plot Asymmetry

Each ANI method in `pyani` calculates results by a different method. The difference between methods is usually that alternative third-party alignment tools are used. However, there may also be differences between the ways those alignment outputs are used. Please see the relevant documentation for details of each method.

-**Average nucleotide identity** is a measure of similarity between two genomes. Depending on the ANI method used, this may be symmetrical: comparing genome A to genome B is the same as comparing genome B to genome A; or asymmetrical: the result of comparing genome A with genome B can be different from comparing genome B with genome A.
+**Average nucleotide identity** is a measure of similarity between two genomes. Depending on the ANI method used, this may be symmetrical: comparing genome A to genome B is the same as comparing genome B to genome A; or asymmetrical: the result of comparing genome A with genome B can be different from comparing genome B with genome A.

Asymmetry can arise as a consequence of the way the sequence alignment algorithm used for calculating genome alignments works. For instance, the initial seed alignment for a pair of genomes may be very similar, but not identical, and this difference may propagate through an extension step into differences in the final alignment. Alternatively, an aspect of the ANI algorithm may introduce asymmetry. For instance, the genome fragmentation step in ANIb may break each participating genome in different ways.

`pyani` provides both symmetrical and asymmetrical ANI methods:

-- ANIm — symmetrical
+- ANIm — asymmetrical
- FastANI — asymmetrical (only available in version 0.3.0-alpha)
- ANIb — asymmetrical
- ANIblastall — asymmetrical
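
The distinction between symmetrical and asymmetrical methods can be checked on a set of results by comparing the two directions of each genome pair. The following is a minimal sketch, not part of `pyani`: the function name `asymmetric_pairs`, the tolerance, and the identity values are all illustrative assumptions.

```python
from itertools import combinations


def asymmetric_pairs(scores, tol=1e-9):
    """Return unordered genome pairs whose two directions disagree.

    `scores` maps (query, subject) tuples to an identity value.
    This helper is hypothetical and only illustrates the idea.
    """
    genomes = sorted({genome for pair in scores for genome in pair})
    flagged = []
    for a, b in combinations(genomes, 2):
        # Compare the A-vs-B result with the B-vs-A result
        if abs(scores[(a, b)] - scores[(b, a)]) > tol:
            flagged.append((a, b))
    return flagged


# Made-up identity values keyed by (query, subject)
scores = {
    ("A", "B"): 0.9712, ("B", "A"): 0.9708,
    ("A", "C"): 0.8531, ("C", "A"): 0.8531,
    ("B", "C"): 0.8490, ("C", "B"): 0.8502,
}
print(asymmetric_pairs(scores))
```

A symmetrical method would return an empty list here; any flagged pair indicates that the direction of comparison affected the result.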
3 changes: 2 additions & 1 deletion docs/run_anim.rst
@@ -20,7 +20,8 @@ In brief, the analysis proceeds as follows for a set of input prokaryotic genome
The output values are recorded in the ``pyani`` database.

.. NOTE::
-A single ``MUMmer`` comparison is performed between each pair of genomes. Input genomes are sorted into alphabetical order by filename, and the query sequence is the genome that occurs earliest in the list; the subject sequence is the genome that occurs latest in the list.
+Two ``Mummer`` comparisons are performed between each pair of genomes, so that each genome serves as the reference sequence, and as the subject sequence.
+.. A single ``MUMmer`` comparison is performed between each pair of genomes. Input genomes are sorted into alphabetical order by filename, and the query sequence is the genome that occurs earliest in the list; the subject sequence is the genome that occurs latest in the list.

.. TIP::
The ``MUMmer`` comparisons are embarrassingly parallel, and can be distributed across cores on an `Open Grid Scheduler`_-compatible cluster, using the ``--scheduler SGE`` option.
3 changes: 1 addition & 2 deletions pyani/anim.py
@@ -209,7 +209,6 @@ def generate_nucmer_commands(
     pairwise comparison.
     """
     nucmer_cmdlines, delta_filter_cmdlines = [], []
-    filenames = sorted(filenames)  # enforce ordering of filenames
     for idx, fname1 in enumerate(filenames[:-1]):
         for fname2 in filenames[idx + 1 :]:
             ncmd, dcmd = construct_nucmer_cmdline(
@@ -248,7 +247,7 @@ def construct_nucmer_cmdline(
     outdir, called "nucmer_output".
     """
     # Cast path strings to pathlib.Path for safety
-    fname1, fname2 = sorted([Path(fname1), Path(fname2)])
+    fname1, fname2 = Path(fname1), Path(fname2)

     # Compile commands
     # Nested output folders to avoid N^2 scaling in files-per-folder
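
The effect of dropping the `sorted()` call in `construct_nucmer_cmdline` can be illustrated with a small stand-in. This is a sketch only: `sketch_nucmer_cmdline` is a hypothetical helper, not the `pyani` function, and its command strings are modelled on the expected values in the `test_genome_sorting` test that this PR removes.

```python
from pathlib import Path


def sketch_nucmer_cmdline(query, subject, outdir):
    """Build nucmer and delta-filter command strings for one ordered pair.

    Hypothetical stand-in: the argument order is preserved, so the first
    argument is always treated as the query.
    """
    query, subject = Path(query), Path(subject)  # no sorting applied
    outprefix = (
        Path(outdir) / "nucmer_output" / query.stem
        / f"{query.stem}_vs_{subject.stem}"
    )
    nucmer = f"nucmer --mum -p {outprefix} {query} {subject}"
    dfilter = (
        f"delta_filter_wrapper.py delta-filter -1 "
        f"{outprefix}.delta {outprefix}.filter"
    )
    return nucmer, dfilter


# The two directions of the same pair now give distinct commands
forward = sketch_nucmer_cmdline("genome_b.fna", "genome_a.fna", "out")
reverse = sketch_nucmer_cmdline("genome_a.fna", "genome_b.fna", "out")
print(forward[0])
print(reverse[0])
```

With sorting in place, both calls would have produced the same command, which is why a single comparison per pair was sufficient before this change.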
6 changes: 0 additions & 6 deletions pyani/pyani_orm.py
@@ -651,12 +651,6 @@ def update_comparison_matrices(session, run) -> None:
         df_alnlength.loc[qid, sid] = cmp.aln_length
         df_simerrors.loc[qid, sid] = cmp.sim_errs
         df_hadamard.loc[qid, sid] = cmp.identity * cmp.cov_query
-        if cmp.program in ["nucmer"]:
-            df_hadamard.loc[sid, qid] = cmp.identity * cmp.cov_subject
-            df_simerrors.loc[sid, qid] = cmp.sim_errs
-            df_alnlength.loc[sid, qid] = cmp.aln_length
-            df_coverage.loc[sid, qid] = cmp.cov_subject
-            df_identity.loc[sid, qid] = cmp.identity

     # Add matrices to the database
     run.df_identity = df_identity.to_json()
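
With the mirrored `[sid, qid]` assignments removed, each matrix cell is filled only by the comparison run in that direction. A minimal sketch of the consequence follows, using nested dictionaries in place of the pandas DataFrames; the `Comparison` tuple and `fill_matrices` function are hypothetical simplifications, and the values are made up.

```python
from typing import NamedTuple


class Comparison(NamedTuple):
    """Hypothetical, reduced stand-in for a stored comparison record."""
    query_id: str
    subject_id: str
    identity: float
    cov_query: float


def fill_matrices(ids, comparisons):
    """Fill identity and Hadamard cells for the query -> subject direction only."""
    identity = {q: {s: None for s in ids} for q in ids}
    hadamard = {q: {s: None for s in ids} for q in ids}
    for cmp in comparisons:
        identity[cmp.query_id][cmp.subject_id] = cmp.identity
        hadamard[cmp.query_id][cmp.subject_id] = cmp.identity * cmp.cov_query
    return identity, hadamard


ids = ["A", "B"]
one_way = [Comparison("A", "B", 0.97, 0.90)]
both_ways = one_way + [Comparison("B", "A", 0.96, 0.85)]

identity_one, _ = fill_matrices(ids, one_way)
identity_both, _ = fill_matrices(ids, both_ways)
print(identity_one["B"]["A"])   # unfilled: no B -> A comparison was run
print(identity_both["B"]["A"])  # filled by the reverse comparison
```

This is why the change depends on both directions of every pair being run: a missing reverse comparison leaves the corresponding cell empty rather than copying the forward result.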
4 changes: 2 additions & 2 deletions pyani/scripts/subcommands/subcmd_anim.py
@@ -43,7 +43,7 @@
 import logging

 from argparse import Namespace
-from itertools import combinations
+from itertools import permutations
 from pathlib import Path
 from typing import List, NamedTuple, Tuple

@@ -234,7 +234,7 @@ def subcmd_anim(args: Namespace) -> None:
     logger.info(
         "Compiling pairwise comparisons (this can take time for large datasets)..."
     )
-    comparisons = list(combinations(tqdm(genomes, disable=args.disable_tqdm), 2))
+    comparisons = list(permutations(tqdm(genomes, disable=args.disable_tqdm), 2))
     logger.info("\t...total pairwise comparisons to be performed: %s", len(comparisons))

     # Check for existing comparisons; if one has been done (for the same
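
Switching from `combinations` to `permutations` doubles the number of comparisons, because every unordered pair is replaced by both of its orderings. A short sketch with placeholder genome labels (the labels are illustrative, not real inputs):

```python
from itertools import combinations, permutations

genomes = ["A", "B", "C", "D"]  # placeholder labels

unordered = list(combinations(genomes, 2))  # n * (n - 1) / 2 pairs
ordered = list(permutations(genomes, 2))    # n * (n - 1) ordered pairs

print(len(unordered), len(ordered))  # 6 12

# Every ordered pair is an unordered pair or its reverse
reversed_pairs = [(b, a) for a, b in unordered]
print(set(ordered) == set(unordered) | set(reversed_pairs))  # True
```

For large datasets this means roughly twice as many `nucmer` jobs as before, which is worth bearing in mind when estimating run time.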
@@ -1 +1 @@
-8b0cab310cb638c977d453ff06eceb64 /Users/lpritc/Development/GitHub/pyani/tests/fixtures/single_genome_download/GCF_000011605.1_ASM1160v1_genomic.fna
+8b0cab310cb638c977d453ff06eceb64 /Users/baileythegreen/Software/pyani/tests/fixtures/single_genome_download/GCF_000011605.1_ASM1160v1_genomic.fna
11 changes: 0 additions & 11 deletions tests/test_anim.py
@@ -275,14 +275,3 @@ def test_mummer_job_generation(mummer_cmds_four):
         assert job.name == "test_%06d-f" % idx  # filter job name
         assert len(job.dependencies) == 1  # has NUCmer job
         assert job.dependencies[0].name == "test_%06d-n" % idx
-
-
-def test_genome_sorting(tmp_path, unsorted_genomes):
-    second, first = [Path(_.path) for _ in unsorted_genomes]
-    outprefix = f"{tmp_path}/nucmer_output/{first.stem}/{first.stem}_vs_{second.stem}"
-    expected = (
-        f"nucmer --mum -p {outprefix} {first} {second}",
-        f"delta_filter_wrapper.py delta-filter -1 {outprefix}.delta {outprefix}.filter",
-    )
-    nucmercmd, filtercmd = anim.construct_nucmer_cmdline(second, first, tmp_path)
-    assert (nucmercmd, filtercmd) == expected