Description
(This is intended to be a long-running infodump. Please suggest additions or amendments in comments!!)
sourmash branchwater provides real-time search of the metagenomes in the Sequence Read Archive (SRA), along with a map of geographic coordinates for the samples it finds.
The live web site (below) is backed by a RocksDB-based inverted index that supports containment search. This index is implemented in the disk_revindex.rs code in sourmash-core, and is somewhat accessible at the command line via the branchwater plugin for sourmash.
- Live Web site: https://branchwater.jgi.doe.gov/
- tech preprint: Sourmash Branchwater Enables Lightweight Petabyte-Scale Sequence Search, Irber et al., 2022 - decent description of use cases and so on, but not up-to-date with the RocksDB-based index; the current version describes the manysearch code in the branchwater plugin instead.
- paper showing one use case + validation of matches: Biogeographic distribution of five Antarctic cyanobacteria using large-scale k-mer searching with sourmash branchwater
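To make the inverted-index idea above concrete, here is a minimal sketch of containment search, assuming a plain in-memory Python dict as a stand-in for RocksDB. The function names, dataset names, and hash values are all hypothetical; this is not the disk_revindex.rs implementation or the sourmash/branchwater API, only an illustration of the lookup pattern.

```python
from collections import defaultdict


def build_inverted_index(sketches):
    """Map each hash value to the set of dataset names whose sketch contains it."""
    index = defaultdict(set)
    for name, hashes in sketches.items():
        for h in hashes:
            index[h].add(name)
    return index


def containment_search(index, query_hashes, threshold=0.0):
    """Return (dataset, containment) pairs, best match first.

    Containment is the fraction of the query's hashes found in a dataset.
    """
    query = set(query_hashes)
    if not query:
        return []
    # One index lookup per query hash; tally shared hashes per dataset.
    counts = defaultdict(int)
    for h in query:
        for name in index.get(h, ()):
            counts[name] += 1
    results = [
        (name, shared / len(query))
        for name, shared in counts.items()
        if shared / len(query) >= threshold
    ]
    results.sort(key=lambda item: (-item[1], item[0]))
    return results


# Toy data (hypothetical): three "metagenomes" with tiny sketches.
sketches = {
    "dataset_A": {1, 2, 3, 4},
    "dataset_B": {3, 4, 5},
    "dataset_C": {9},
}
index = build_inverted_index(sketches)
hits = containment_search(index, {1, 2, 3, 4}, threshold=0.5)
print(hits)
```

The point of the design is that the work scales with the number of hashes in the query rather than with the number of datasets indexed, which is roughly why an index like this can support interactive search.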
Drawbacks/challenges & context:
- sourmash uses FracMinHash sketching to compress sequences for search; with the current parameters, search is mainly limited to finding sequences > 5kb in size. This is described in detail in the preprint and paper above.
- as a result of this compression, branchwater can be deployed on relatively lightweight hardware; it is also open source. We are currently working with several groups to help them stand up search of non-public databases.
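The size limit mentioned above follows from how FracMinHash picks hashes. Below is a rough sketch, assuming illustrative parameters (ksize=21, scaled=1000) that are not taken from the text above, and using blake2b from the standard library as a stand-in hash. sourmash itself uses a different hash function and also handles reverse complements, both of which are omitted here.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space used in this sketch


def kmer_hash(kmer):
    """Hash a k-mer to a 64-bit integer (blake2b is a stand-in hash function)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def fracminhash(sequence, ksize=21, scaled=1000):
    """Keep only hashes that fall in the lowest 1/scaled of the hash space."""
    threshold = MAX_HASH // scaled
    kept = set()
    for i in range(len(sequence) - ksize + 1):
        h = kmer_hash(sequence[i:i + ksize])
        if h < threshold:
            kept.add(h)
    return kept


def expected_sketch_size(seq_len, ksize=21, scaled=1000):
    """Expected number of retained hashes, assuming all k-mers are distinct."""
    n_kmers = max(seq_len - ksize + 1, 0)
    return n_kmers / scaled


for length in (500, 5_000, 50_000):
    print(length, expected_sketch_size(length))
```

Under these assumed parameters a 5 kb sequence is expected to contribute only about five hashes to its sketch, and a 500 bp sequence often contributes none, so matches to short queries cannot be detected reliably.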
Related projects that do/did similar things
- MetaGraph - https://metagraph.ethz.ch/ - is a fantastic project that we've used elsewhere (see our SV paper).
- Pebblescout - https://www.nature.com/articles/s41592-024-02280-z - uses an inverted index to do k-mer searching.
- searchSRA - https://www.searchsra.org/ - uses bowtie mapping to find matches to queries of interest in the SRA.
- Project Logan - https://github.com/IndexThePlanet/Logan - focused on building unitigs from SRA metagenomes, supporting search.
Related projects with a different focus/emphasis
- Serratus (https://serratus.io/; https://www.nature.com/articles/s41586-021-04332-2) used a mapping-based approach + massive parallelism in the cloud to find many novel RNA-dependent RNA polymerase domains => new RNA viruses.
- AllTheBacteria provides assemblies of all isolate sequences (NOT metagenomes) in the SRA.