widdowquinn · baileythegreen · Jan 24, 2022 · Jan 24, 2022 · Jan 24, 2022 · Apr 5, 2022
@@ -44,7 +44,7 @@ Percentage Identity/ANI
     Percentage identity matrix for *Candidatus Blochmannia* ANIm analysis
 
     Each cell represents a pairwise comparison between the named genomes on rows and columns, and the number in each cell is the pairwise identity *of all aligned regions*. The dendrograms are produced by single-linkage hierarchical clustering of the matrix of pairwise identity results. The default colour scheme colours cells with identity > 0.95 red, and those with identity < 0.95 blue. This division corresponds to a widely-used convention for bacterial species boundaries.
-    
+
 .. note::
 
     No single ANI threshold should be considered universally applicable to distinguish between species for all bacterial genomes.
@@ -56,7 +56,7 @@ Taking the 95% threshold between red and blue cells to be equivalent to a specie
 * the two genomes BPEN and 640 could be classified as the same species
 * the remaining four genomes each represent a distinct species
 
-In particular, we can see that the off-diagonal identity values are all around 85%, consistent with the limit of detection for homologous nucleotide regions. 
+In particular, we can see that the off-diagonal identity values are all around 85%, consistent with the limit of detection for homologous nucleotide regions.
 
 .. note::
 
@@ -143,13 +143,13 @@ Plot Asymmetry
 
     Each ANI method in `pyani` calculates results by a different method. The difference between methods is usually that alternative third-party alignment tools are used. However, there may also be differences between the ways those alignment outputs are used. Please see the relevant documentation for details of each method.
 
-**Average nucleotide identity** is a measure of similarity between two genomes. Depending on the ANI method used, this may be symmetrical: comparing genome A to genome B is the same as comparing genome B to genome A; or asymmetrical: the result of comparing genome A with genome B can be different from comparing genome B with genome A. 
+**Average nucleotide identity** is a measure of similarity between two genomes. Depending on the ANI method used, this may be symmetrical: comparing genome A to genome B is the same as comparing genome B to genome A; or asymmetrical: the result of comparing genome A with genome B can be different from comparing genome B with genome A.
 
 Asymmetry can arise as a consequence of the way the sequence alignment algorithm used for calculating genome alignments works. For instance, the initial seed alignment for a pair of genomes may be very similar, but not identical, and this difference may propagate through an extension step into differences in the final alignment. Alternatively, an aspect of the ANI algorithm may introduce asymmetry. For instance, the genome fragmentation step in ANIb may break each participating genome in different ways.
 
 `pyani` provides both symmetrical and asymmetrical ANI methods:
 
-  - ANIm — symmetrical
+  - ANIm — asymmetrical
   - FastANI — asymmetrical (only available in version 0.3.0-alpha)
   - ANIb — asymmetrical
   - ANIblastall — asymmetrical
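
As an illustration of what asymmetry means in practice, the sketch below flags genome pairs whose two directions give different identity values. The identity values and the `asymmetric_pairs` helper are hypothetical, chosen for illustration; they are not pyani output and not part of the pyani API.

```python
# Hypothetical identity values keyed by (query, subject); not real pyani output.
identity = {
    ("A", "B"): 0.912,
    ("B", "A"): 0.908,
    ("A", "C"): 0.851,
    ("C", "A"): 0.851,
}


def asymmetric_pairs(scores, tol=1e-9):
    """Return (query, subject, difference) for pairs whose two directions differ."""
    found = []
    for (query, subject), value in scores.items():
        reverse = scores.get((subject, query))
        # Report each unordered pair once, from its alphabetically first member
        if reverse is not None and query < subject and abs(value - reverse) > tol:
            found.append((query, subject, round(value - reverse, 6)))
    return sorted(found)


result = asymmetric_pairs(identity)
```

Here only the A/B pair is reported, because the A/C values agree in both directions. A symmetrical method would return an empty list for any input.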

@@ -20,7 +20,8 @@ In brief, the analysis proceeds as follows for a set of input prokaryotic genome
 The output values are recorded in the ``pyani`` database.
 
 .. NOTE::
-    A single ``MUMmer`` comparison is performed between each pair of genomes. Input genomes are sorted into alphabetical order by filename, and the query sequence is the genome that occurs earliest in the list; the subject sequence is the genome that occurs latest in the list.
+    Two ``MUMmer`` comparisons are performed between each pair of genomes, so that each genome serves once as the query sequence and once as the subject sequence.
+    .. A single ``MUMmer`` comparison is performed between each pair of genomes. Input genomes are sorted into alphabetical order by filename, and the query sequence is the genome that occurs earliest in the list; the subject sequence is the genome that occurs latest in the list.
 
 .. TIP::
     The ``MUMmer`` comparisons are embarrassingly parallel, and can be distributed across cores on an `Open Grid Scheduler`_-compatible cluster, using the ``--scheduler SGE`` option.

@@ -209,7 +209,6 @@ def generate_nucmer_commands(
     pairwise comparison.
     """
     nucmer_cmdlines, delta_filter_cmdlines = [], []
-    filenames = sorted(filenames)  # enforce ordering of filenames
     for idx, fname1 in enumerate(filenames[:-1]):
         for fname2 in filenames[idx + 1 :]:
             ncmd, dcmd = construct_nucmer_cmdline(
@@ -248,7 +247,7 @@ def construct_nucmer_cmdline(
     outdir, called "nucmer_output".
     """
     # Cast path strings to pathlib.Path for safety
-    fname1, fname2 = sorted([Path(fname1), Path(fname2)])
+    fname1, fname2 = Path(fname1), Path(fname2)
 
     # Compile commands
     # Nested output folders to avoid N^2 scaling in files-per-folder
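
To see the effect of dropping the `sorted()` call, the following minimal sketch builds a command string that preserves the order in which the two files are passed. `sketch_nucmer_cmdline` is a hypothetical stand-in, not the real `construct_nucmer_cmdline`; its output layout is assumed from the command pattern in the removed `test_genome_sorting` test, and the file names are invented.

```python
from pathlib import Path


def sketch_nucmer_cmdline(fname1, fname2, outdir):
    """Hypothetical stand-in: build a nucmer command, preserving argument order."""
    fname1, fname2 = Path(fname1), Path(fname2)  # cast only, no sorting
    # Assumed layout: <outdir>/nucmer_output/<first stem>/<first>_vs_<second>
    outprefix = (
        Path(outdir) / "nucmer_output" / fname1.stem / f"{fname1.stem}_vs_{fname2.stem}"
    )
    return f"nucmer --mum -p {outprefix} {fname1} {fname2}"


forward = sketch_nucmer_cmdline("genomes/b.fna", "genomes/a.fna", "out")
reverse = sketch_nucmer_cmdline("genomes/a.fna", "genomes/b.fna", "out")
```

With sorting in place, both calls would have produced the same command. Without it, `forward` and `reverse` differ, which is what allows each genome pair to be run in both directions.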

@@ -651,12 +651,6 @@ def update_comparison_matrices(session, run) -> None:
         df_alnlength.loc[qid, sid] = cmp.aln_length
         df_simerrors.loc[qid, sid] = cmp.sim_errs
         df_hadamard.loc[qid, sid] = cmp.identity * cmp.cov_query
-        if cmp.program in ["nucmer"]:
-            df_hadamard.loc[sid, qid] = cmp.identity * cmp.cov_subject
-            df_simerrors.loc[sid, qid] = cmp.sim_errs
-            df_alnlength.loc[sid, qid] = cmp.aln_length
-            df_coverage.loc[sid, qid] = cmp.cov_subject
-            df_identity.loc[sid, qid] = cmp.identity
 
     # Add matrices to the database
     run.df_identity = df_identity.to_json()
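
With the mirroring branch removed, each comparison fills only its own (query, subject) cell, so the reverse cell must come from a separate comparison run in the other direction. The sketch below shows this with plain dictionaries instead of dataframes. The `Comparison` record, its `query_id` and `subject_id` fields, and the `fill_matrix` helper are assumptions made for illustration; only the `identity` and `cov_query` attribute names come from the hunk above.

```python
from collections import namedtuple

# Hypothetical record; query_id and subject_id are assumed field names
Comparison = namedtuple("Comparison", "query_id subject_id identity cov_query")


def fill_matrix(ids, comparisons, attr, diagonal=1.0):
    """Fill only the (query, subject) cell for each directed comparison."""
    matrix = {q: {s: (diagonal if q == s else None) for s in ids} for q in ids}
    for cmp in comparisons:
        matrix[cmp.query_id][cmp.subject_id] = getattr(cmp, attr)
    return matrix


one_way = [Comparison("A", "B", 0.91, 0.80)]
both_ways = one_way + [Comparison("B", "A", 0.90, 0.75)]

partial = fill_matrix(["A", "B"], one_way, "identity")
full = fill_matrix(["A", "B"], both_ways, "identity")
```

In `partial` the B-versus-A cell stays empty (`None`), whereas `full` holds a distinct value for each direction.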

@@ -43,7 +43,7 @@
 import logging
 
 from argparse import Namespace
-from itertools import combinations
+from itertools import permutations
 from pathlib import Path
 from typing import List, NamedTuple, Tuple
 
@@ -234,7 +234,7 @@ def subcmd_anim(args: Namespace) -> None:
     logger.info(
         "Compiling pairwise comparisons (this can take time for large datasets)..."
     )
-    comparisons = list(combinations(tqdm(genomes, disable=args.disable_tqdm), 2))
+    comparisons = list(permutations(tqdm(genomes, disable=args.disable_tqdm), 2))
     logger.info("\t...total pairwise comparisons to be performed: %s", len(comparisons))
 
     # Check for existing comparisons; if one has been done (for the same
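
The switch from `combinations` to `permutations` doubles the number of comparisons, because each unordered pair is replaced by two ordered pairs. A short sketch with hypothetical genome labels:

```python
from itertools import combinations, permutations

genomes = ["A", "B", "C", "D"]  # hypothetical labels

unordered = list(combinations(genomes, 2))  # n * (n - 1) / 2 pairs
ordered = list(permutations(genomes, 2))    # n * (n - 1) pairs

n_unordered = len(unordered)
n_ordered = len(ordered)
```

For four genomes this gives 6 unordered pairs and 12 ordered pairs; `("B", "A")` appears only in the ordered list.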

@@ -1 +1 @@
-8b0cab310cb638c977d453ff06eceb64	/Users/lpritc/Development/GitHub/pyani/tests/fixtures/single_genome_download/GCF_000011605.1_ASM1160v1_genomic.fna
+8b0cab310cb638c977d453ff06eceb64	/Users/baileythegreen/Software/pyani/tests/fixtures/single_genome_download/GCF_000011605.1_ASM1160v1_genomic.fna
@@ -275,14 +275,3 @@ def test_mummer_job_generation(mummer_cmds_four):
         assert job.name == "test_%06d-f" % idx  # filter job name
         assert len(job.dependencies) == 1  # has NUCmer job
         assert job.dependencies[0].name == "test_%06d-n" % idx
-
-
-def test_genome_sorting(tmp_path, unsorted_genomes):
-    second, first = [Path(_.path) for _ in unsorted_genomes]
-    outprefix = f"{tmp_path}/nucmer_output/{first.stem}/{first.stem}_vs_{second.stem}"
-    expected = (
-        f"nucmer --mum -p {outprefix} {first} {second}",
-        f"delta_filter_wrapper.py delta-filter -1 {outprefix}.delta {outprefix}.filter",
-    )
-    nucmercmd, filtercmd = anim.construct_nucmer_cmdline(second, first, tmp_path)
-    assert (nucmercmd, filtercmd) == expected