@@ -31,11 +31,12 @@ Vclust is an alignment-based tool for fast and accurate calculation of Average N
31
31
3. [Dereplicate viral contigs into representative genomes](#73-dereplicate-viral-contigs-into-representative-genomes)
32
32
4. [Calculate pairwise similarities between all-versus-all genomes](#74-calculate-pairwise-similarities-between-all-versus-all-genomes)
33
33
5. [Process large dataset of diverse virus genomes (IMG/VR)](#75-process-large-dataset-of-diverse-virus-genomes-imgvr)
34
-
6. [Process large dataset of redundant and highly similar virus genomes](#76-process-large-dataset-of-redundant-and-highly-similar-virus-genomes)
34
+
6. [Process large dataset of highly redundant virus genomes](#76-process-large-dataset-of-highly-redundant-virus-genomes)
35
35
7. [Cluster plasmid genomes into pOTUs](#77-cluster-plasmid-genomes-into-potus)
36
-
8. [Tests](#8-tests)
37
-
9. [Citation](#9-citation)
38
-
10. [License](#10-license)
36
+
8. [FAQ](#8-faq)
37
+
9. [Tests](#9-tests)
38
+
10. [Citation](#10-citation)
39
+
11. [License](#11-license)
39
40
40
41
41
42
## 1. Features
@@ -549,17 +550,9 @@ Vclust is optimized for efficient comparison of large viral genome and contig da
549
550
--metric ani --ani 0.95 --qcov 0.85
550
551
```
551
552
552
-
### 7.6. Process large dataset of redundant and highly similar virus genomes
553
+
### 7.6. Process large dataset of highly redundant virus genomes
553
554
554
-
When working with large datasets containing highly redundant sequences (e.g., hundreds of thousands of nearly identical genomes), prefiltering may still pass a large number of genome pairs for alignment, even when using high thresholds for `--min-kmers` and `--min-ident`. Since most sequences in these datasets are almost identical, this can lead to increased memory usage and longer runtimes for all three Vclust commands (`prefilter`, `align`, `cluster`).
555
-
556
-
To address this, Vclust offers three additional options in the prefilter step to reduce RAM usage and improve performance (as detailed in: [6.1.1. Optimizing RAM usage and speed](#611-optimizing-ram-usage-and-speed)). In summary:
557
-
558
-
1. **Batch processing**: Processing genomes in smaller batches reduces RAM usage, with a slight increase in runtime.
559
-
2. **Partial k-mer analysis**: Analyzing a fraction of *k*-mers (instead of the full set) significantly improves runtime and slightly reduces RAM usage.
560
-
3. **Limiting target sequences**: Limiting the number of target sequences per query genome significantly improves both RAM usage and runtime.
561
-
562
-
The example below shows the use of all three options simultaneously:
555
+
When working with large datasets containing highly redundant sequences (e.g., hundreds of thousands of nearly identical genomes), prefiltering may still pass a large number of genome pairs for alignment, even with high thresholds for `--min-kmers` and `--min-ident`. Because most sequences in these datasets are almost identical, this increases memory usage and runtime for all three Vclust commands (`prefilter`, `align`, `cluster`). To address this, Vclust offers three additional options in the `prefilter` step that reduce RAM usage and improve performance (see [6.1.1. Optimizing RAM usage and speed](#611-optimizing-ram-usage-and-speed)). The example below uses all three options together:
563
556
564
557
```bash
565
558
# Create a pre-alignment filter by processing batches of 100,000 genomes,
@@ -585,35 +578,58 @@ The example below shows the use of all three options simultaneously:
585
578
586
579
### 7.7. Cluster plasmid genomes into pOTUs
587
580
588
-
Vclust can process non-viral short genomes like plasmids.
581
+
The following commands cluster plasmid genomes into plasmid taxonomic units (PTUs).
589
582
590
583
```bash
591
-
# Create a pre-alignment filter
584
+
# Create a pre-alignment filter passing genome pairs with at least 30 common k-mers
**Q: Does Vclust handle circularly permuted bacteriophage genomes?**
605
+
606
+
**A:** Yes. Vclust is robust to sequence rearrangements such as translocations and circular permutations. It calculates ANI and alignment fraction (coverage) across all local alignments between two genomes, even when homologous segments are reordered. In tests with circularly permuted genomes, the mean absolute error in ANI and coverage was 0.04% compared to non-permuted genomes. These small discrepancies arise from short alignment breaks at the breakpoint positions in circular genomes.
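
To make the term concrete, the sketch below builds a circularly permuted copy of a toy sequence by moving its first few bases to the end. The sequence and the `offset` value are made up for illustration; this only shows what a circular permutation is and is not how Vclust processes genomes.

```bash
# Toy genome (hypothetical sequence, 12 bases) and an arbitrary breakpoint.
genome="ATGCCGTAAGCT"
offset=5

# Split the sequence at the breakpoint.
head_part=$(printf '%s' "$genome" | cut -c1-"$offset")
tail_part=$(printf '%s' "$genome" | cut -c"$((offset + 1))"-)

# Rejoin the two parts in swapped order: same bases, different start position.
permuted="${tail_part}${head_part}"

echo "original: $genome"
echo "permuted: $permuted"
```

Both strings contain the same bases in the same cyclic order, so a comparison that sums over local alignments sees two homologous segments, separated only by the breakpoint.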
607
+
608
+
609
+
**Q: How does Vclust's sensitivity compare to BLASTn and MegaBLAST?**
610
+
611
+
**A:** Vclust is designed to match the sensitivity of BLASTn, which is considered highly reliable for estimating ANI. Like BLASTn, Vclust uses an anchor length of 11 nucleotides to align sequences with high precision. MegaBLAST, in comparison, uses a larger word size of 28 nucleotides, making it less sensitive.
612
+
613
+
614
+
**Q: Can I increase the default minimum sequence identity (0.7) in prefilter if I'm aiming for a higher ANI threshold (0.95)?**
615
+
616
+
**A:** Yes, you can safely increase the minimum sequence identity (`--min-ident`) in the `prefilter` step when targeting a higher ANI threshold. The sequence identity calculated by `prefilter` is designed to be higher than the ANI obtained in the subsequent `align` step: although it is calculated similarly to ANI in Mash, Vclust bases the calculation on the shorter sequence. As a result, the default `--min-ident` of `0.7` can be raised to values closer to the final alignment-based ANI threshold.
617
+
618
+
In our tests for vOTU clustering (ANI ≥ 95% and AF ≥ 85%), even increasing `--min-ident` to `0.95` during prefiltering did not exclude any genome pairs with an alignment-based ANI of ≥ 95%. Additionally, raising the default `--min-ident` from `0.7` to `0.95` significantly reduces the number of genome pairs requiring alignment, thereby speeding up the alignment step.
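
As a rough sketch of the adjustment described above, the snippet below assembles a `prefilter` command with `--min-ident` raised to `0.95`. The file names `genomes.fna` and `fltr.txt` are hypothetical, and the `vclust` executable name and the `-i`/`-o` options are assumptions here; check `prefilter --help` for your installed version before relying on them.

```bash
# Hypothetical input file; replace with your own FASTA file.
genomes="genomes.fna"

# Raised from the default 0.7 to match a final ANI threshold of 0.95.
min_ident="0.95"

# Assemble the command (executable name and -i/-o are assumed, see above).
prefilter_cmd="vclust prefilter -i $genomes -o fltr.txt --min-ident $min_ident"

# Print the command for inspection.
echo "$prefilter_cmd"

# Run it only if vclust is on PATH and the input file exists.
if command -v vclust >/dev/null 2>&1 && [ -f "$genomes" ]; then
    $prefilter_cmd
fi
```

Printing the command first makes it easy to confirm the threshold before starting a long-running job on a large dataset.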
619
+
604
620
605
-
## 8. Tests
621
+
## 9. Tests
606
622
607
623
To ensure that Vclust works as expected, you can run tests using [pytest](https://docs.pytest.org/).
608
624
609
625
```bash
610
626
pytest test.py
611
627
```
612
628
613
-
## 9. Citation
629
+
## 10. Citation
614
630
615
631
Zielezinski A, Gudyś A, Barylski J, Siminski K, Rozwalak P, Dutilh BE, Deorowicz S. *Ultrafast and accurate sequence alignment and clustering of viral genomes*. bioRxiv [[doi:10.1101/2024.06.27.601020](https://www.biorxiv.org/content/10.1101/2024.06.27.601020)].
616
632
617
-
## 10. License
633
+
## 11. License
618
634
619
635
[GNU General Public License, version 3](https://www.gnu.org/licenses/gpl-3.0.html)