diff --git a/docs/interpreting_plots.rst b/docs/interpreting_plots.rst index 70dcb469..2ce0a2d4 100644 --- a/docs/interpreting_plots.rst +++ b/docs/interpreting_plots.rst @@ -44,7 +44,7 @@ Percentage Identity/ANI Percentage identity matrix for *Candidatus Blochmannia* ANIm analysis Each cell represents a pairwise comparison between the named genomes on rows and columns, and the number in each cell is the pairwise identity *of all aligned regions*. The dendrograms are produced by single-linkage hierarchical clustering trees from the matrix of pairwise identity results. The default colour scheme colours cells with identity > 0.95 as red, and those with < 0.95 as blue. This division corresponds to a widely-used convention for bacterial species boundaries. - + .. note:: No single ANI threshold should be considered universally applicable to distinguish between species for all bacterial genomes. @@ -56,7 +56,7 @@ Taking the 95% threshold between red and blue cells to be equivalent to a specie * the two genomes BPEN and 640 could be classified as the same species * the remaining four genomes each represent a distinct species -In particular, we can see that the off-diagonal identity values are all around 85%, consistent with the limit of detection for homologous nucleotide regions. +In particular, we can see that the off-diagonal identity values are all around 85%, consistent with the limit of detection for homologous nucleotide regions. .. note:: @@ -143,13 +143,13 @@ Plot Asymmetry Each ANI method in `pyani` calculates results by a different method. The difference between methods is usually that alternative third-party alignment tools are used. However, there may also be differences between the ways those alignment outputs are used. Please see the relevant documentation for details of each method. -**Average nucleotide identity** is a measure of similarity between two genomes. 
Depending on the ANI method used, this may be symmetrical: comparing genome A to genome B is the same as comparing genome B to genome A; or asymmetrical: the result of comparing genome A with genome B can be different from comparing genome B with genome A. +**Average nucleotide identity** is a measure of similarity between two genomes. Depending on the ANI method used, this may be symmetrical: comparing genome A to genome B is the same as comparing genome B to genome A; or asymmetrical: the result of comparing genome A with genome B can be different from comparing genome B with genome A. Asymmetry can arise as a consequence of the way the sequence alignment algorithm used for calculating genome alignments works. For instance, the initial seed alignment for a pair of genomes may be very similar, but not identical, and this difference may propagate through an extension step into differences in the final alignment. Alternatively, an aspect of the ANI algorithm may introduce asymmetry. For instance, the genome fragmentation step in ANIb may break each participating genome in different ways. `pyani` provides both symmetrical and asymmetrical ANI methods: - - ANIm — symmetrical + - ANIm — asymmetrical - FastANI — asymmetrical (only available in version 0.3.0-alpha) - ANIb — asymmetrical - ANIblastall — asymmetrical diff --git a/docs/run_anim.rst b/docs/run_anim.rst index 379b5585..6a85cdba 100644 --- a/docs/run_anim.rst +++ b/docs/run_anim.rst @@ -20,7 +20,8 @@ In brief, the analysis proceeds as follows for a set of input prokaryotic genome The output values are recorded in the ``pyani`` database. .. NOTE:: - A single ``MUMmer`` comparison is performed between each pair of genomes. Input genomes are sorted into alphabetical order by filename, and the query sequence is the genome that occurs earliest in the list; the subject sequence is the genome that occurs latest in the list. 
+ Two ``MUMmer`` comparisons are performed between each pair of genomes, so that each genome serves both as the query sequence and as the subject sequence. + .. A single ``MUMmer`` comparison is performed between each pair of genomes. Input genomes are sorted into alphabetical order by filename, and the query sequence is the genome that occurs earliest in the list; the subject sequence is the genome that occurs latest in the list. .. TIP:: The ``MUMmer`` comparisons are embarrasingly parallel, and can be distributed across cores on an `Open Grid Scheduler`_-compatible cluster, using the ``--scheduler SGE`` option. diff --git a/pyani/anim.py b/pyani/anim.py index a638a04d..e2a0c490 100644 --- a/pyani/anim.py +++ b/pyani/anim.py @@ -209,7 +209,6 @@ def generate_nucmer_commands( pairwise comparison. """ nucmer_cmdlines, delta_filter_cmdlines = [], [] - filenames = sorted(filenames) # enforce ordering of filenames for idx, fname1 in enumerate(filenames[:-1]): for fname2 in filenames[idx + 1 :]: ncmd, dcmd = construct_nucmer_cmdline( @@ -248,7 +247,7 @@ def construct_nucmer_cmdline( outdir, called "nucmer_output". 
""" # Cast path strings to pathlib.Path for safety - fname1, fname2 = sorted([Path(fname1), Path(fname2)]) + fname1, fname2 = Path(fname1), Path(fname2) # Compile commands # Nested output folders to avoid N^2 scaling in files-per-folder diff --git a/pyani/pyani_orm.py b/pyani/pyani_orm.py index 40864afc..9b13d1e9 100644 --- a/pyani/pyani_orm.py +++ b/pyani/pyani_orm.py @@ -651,12 +651,6 @@ def update_comparison_matrices(session, run) -> None: df_alnlength.loc[qid, sid] = cmp.aln_length df_simerrors.loc[qid, sid] = cmp.sim_errs df_hadamard.loc[qid, sid] = cmp.identity * cmp.cov_query - if cmp.program in ["nucmer"]: - df_hadamard.loc[sid, qid] = cmp.identity * cmp.cov_subject - df_simerrors.loc[sid, qid] = cmp.sim_errs - df_alnlength.loc[sid, qid] = cmp.aln_length - df_coverage.loc[sid, qid] = cmp.cov_subject - df_identity.loc[sid, qid] = cmp.identity # Add matrices to the database run.df_identity = df_identity.to_json() diff --git a/pyani/scripts/subcommands/subcmd_anim.py b/pyani/scripts/subcommands/subcmd_anim.py index 6e294895..dd4455f2 100644 --- a/pyani/scripts/subcommands/subcmd_anim.py +++ b/pyani/scripts/subcommands/subcmd_anim.py @@ -43,7 +43,7 @@ import logging from argparse import Namespace -from itertools import combinations +from itertools import permutations from pathlib import Path from typing import List, NamedTuple, Tuple @@ -234,7 +234,7 @@ def subcmd_anim(args: Namespace) -> None: logger.info( "Compiling pairwise comparisons (this can take time for large datasets)..." 
) - comparisons = list(combinations(tqdm(genomes, disable=args.disable_tqdm), 2)) + comparisons = list(permutations(tqdm(genomes, disable=args.disable_tqdm), 2)) logger.info("\t...total pairwise comparisons to be performed: %s", len(comparisons)) # Check for existing comparisons; if one has been done (for the same diff --git a/tests/fixtures/single_genome_download/GCF_000011605.1_ASM1160v1_genomic.md5 b/tests/fixtures/single_genome_download/GCF_000011605.1_ASM1160v1_genomic.md5 index 0b4c29ef..0aa3cf44 100644 --- a/tests/fixtures/single_genome_download/GCF_000011605.1_ASM1160v1_genomic.md5 +++ b/tests/fixtures/single_genome_download/GCF_000011605.1_ASM1160v1_genomic.md5 @@ -1 +1 @@ -8b0cab310cb638c977d453ff06eceb64 /Users/lpritc/Development/GitHub/pyani/tests/fixtures/single_genome_download/GCF_000011605.1_ASM1160v1_genomic.fna +8b0cab310cb638c977d453ff06eceb64 /Users/baileythegreen/Software/pyani/tests/fixtures/single_genome_download/GCF_000011605.1_ASM1160v1_genomic.fna diff --git a/tests/test_anim.py b/tests/test_anim.py index baf5b5b9..39222706 100644 --- a/tests/test_anim.py +++ b/tests/test_anim.py @@ -275,14 +275,3 @@ def test_mummer_job_generation(mummer_cmds_four): assert job.name == "test_%06d-f" % idx # filter job name assert len(job.dependencies) == 1 # has NUCmer job assert job.dependencies[0].name == "test_%06d-n" % idx - - -def test_genome_sorting(tmp_path, unsorted_genomes): - second, first = [Path(_.path) for _ in unsorted_genomes] - outprefix = f"{tmp_path}/nucmer_output/{first.stem}/{first.stem}_vs_{second.stem}" - expected = ( - f"nucmer --mum -p {outprefix} {first} {second}", - f"delta_filter_wrapper.py delta-filter -1 {outprefix}.delta {outprefix}.filter", - ) - nucmercmd, filtercmd = anim.construct_nucmer_cmdline(second, first, tmp_path) - assert (nucmercmd, filtercmd) == expected diff --git a/tests/test_input/subcmd_anim/GCF_000011745.1_ASM1174v1_genomic.md5 b/tests/test_input/subcmd_anim/GCF_000011745.1_ASM1174v1_genomic.fna.md5 similarity 
index 100% rename from tests/test_input/subcmd_anim/GCF_000011745.1_ASM1174v1_genomic.md5 rename to tests/test_input/subcmd_anim/GCF_000011745.1_ASM1174v1_genomic.fna.md5 diff --git a/tests/test_input/subcmd_anim/GCF_000043285.1_ASM4328v1_genomic.md5 b/tests/test_input/subcmd_anim/GCF_000043285.1_ASM4328v1_genomic.fna.md5 similarity index 100% rename from tests/test_input/subcmd_anim/GCF_000043285.1_ASM4328v1_genomic.md5 rename to tests/test_input/subcmd_anim/GCF_000043285.1_ASM4328v1_genomic.fna.md5 diff --git a/tests/test_input/subcmd_anim/GCF_000185985.2_ASM18598v2_genomic.md5 b/tests/test_input/subcmd_anim/GCF_000185985.2_ASM18598v2_genomic.fna.md5 similarity index 100% rename from tests/test_input/subcmd_anim/GCF_000185985.2_ASM18598v2_genomic.md5 rename to tests/test_input/subcmd_anim/GCF_000185985.2_ASM18598v2_genomic.fna.md5 diff --git a/tests/test_input/subcmd_anim/GCF_000331065.1_ASM33106v1_genomic.md5 b/tests/test_input/subcmd_anim/GCF_000331065.1_ASM33106v1_genomic.fna.md5 similarity index 100% rename from tests/test_input/subcmd_anim/GCF_000331065.1_ASM33106v1_genomic.md5 rename to tests/test_input/subcmd_anim/GCF_000331065.1_ASM33106v1_genomic.fna.md5 diff --git a/tests/test_input/subcmd_anim/GCF_000973505.1_ASM97350v1_genomic.md5 b/tests/test_input/subcmd_anim/GCF_000973505.1_ASM97350v1_genomic.fna.md5 similarity index 100% rename from tests/test_input/subcmd_anim/GCF_000973505.1_ASM97350v1_genomic.md5 rename to tests/test_input/subcmd_anim/GCF_000973505.1_ASM97350v1_genomic.fna.md5 diff --git a/tests/test_input/subcmd_anim/GCF_000973545.1_ASM97354v1_genomic.md5 b/tests/test_input/subcmd_anim/GCF_000973545.1_ASM97354v1_genomic.fna.md5 similarity index 100% rename from tests/test_input/subcmd_anim/GCF_000973545.1_ASM97354v1_genomic.md5 rename to tests/test_input/subcmd_anim/GCF_000973545.1_ASM97354v1_genomic.fna.md5
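Note on the `combinations` to `permutations` change in `subcmd_anim.py`: the short sketch below is not part of the patch. It only illustrates, with hypothetical genome labels, how the number of pairwise comparisons changes when each pair is run in both directions. The names `genomes`, `unordered_pairs` and `ordered_pairs` are assumptions made for this illustration and do not appear in `pyani`.

```python
from itertools import combinations, permutations

# Hypothetical genome labels, for illustration only (not pyani objects)
genomes = ["genome_A", "genome_B", "genome_C"]

# Previous behaviour: one comparison per unordered pair (A vs B only)
unordered_pairs = list(combinations(genomes, 2))

# New behaviour: one comparison per ordered pair (A vs B and B vs A)
ordered_pairs = list(permutations(genomes, 2))

print(len(unordered_pairs))  # 3, i.e. n * (n - 1) / 2 for n = 3
print(len(ordered_pairs))    # 6, i.e. n * (n - 1) for n = 3
```

For n input genomes this doubles the number of comparisons from n(n-1)/2 to n(n-1), which is consistent with the removal of the `nucmer`-specific block in `update_comparison_matrices` that previously mirrored each result into the `[sid, qid]` cells.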