feat: finish batched search and add doc

feldroop · feldroop · commit 0bbbee5eb48c · 2025-09-24T21:57:07.000+02:00
diff --git a/README.md b/README.md
@@ -7,12 +7,13 @@
 The [FM-Index] is a full-text index data structure that allows efficiently counting and retrieving all occurrences of short sequences in very large texts. It is widely used in sequence analysis and bioinformatics.
 
 The implementation of this library is based on an encoding for the text with rank support data structure (a.k.a. occurrence table)
-by Simon Gene Gottlieb, who also was a great help while developing this library. This data structure is central to the inner workings of
+by Simon Gene Gottlieb, who also was a great help while developing the library. This data structure is central to the inner workings of
 the FM-Index. The encoding attempts to provide a good trade-off between memory usage and running time of queries. 
 A second, faster and less memory efficient encoding is also implemented in this library. Further benefits of `genedex` include:
 
 - Fast, parallel and memory efficient index construction by leveraging [`libsais-rs`] and [`rayon`].
 - Support for indexing a set of texts, like chromosomes of a genome.
+- Optimized functions for searching multiple queries at once (per thread!).
 - A flexible cursor API.
 - Fast reading and writing the FM-Index from/to files, using [`savefile`].
 - Thoroughly tested using [`proptest`].
diff --git a/ROADMAP.md b/ROADMAP.md
@@ -17,6 +17,11 @@
         the condensed text with rank support will get smaller and maybe faster. 
         A text sampled suffix array could be an option, or a "sparse" text with rank support substructure.
 - paired blocks for improved memory usage when using larger alphabets
+- in the search, `lookup_tables::compute_lookup_idx_static_len` still seems to be one of the bottlenecks. this
+    should be investigated further, maybe it can be optimized or it's a measuring error.
+- the batching of search queries could be improved. Currently, it is not efficient if the queries have very different lengths
+    or if many of them quickly get an empty interval, while others need to be searched to the very end.
+- more documentation tests
 
 ### Large topics, is the goal to eventually support
 
@@ -31,6 +36,10 @@
 - optional functionality for text recovery
 - text sampled suffix array (with text ids and optionally other annotations)
 - suffix array, lookup table compression using unconventional int widths (e.g. 33 bit)
+- optimized functions for reading directly from input files: both for texts to build the index and queries to search.
+    the latter might be more important, because for simple searches, the search can be faster than reading the 
+    queries from disk.
+- optimize `sais-drum` to make the low memory construction mode less painful
 
 ### Large topics, might never happen
 
diff --git a/examples/basic_usage.rs b/examples/basic_usage.rs
@@ -1,7 +1,7 @@
 use genedex::{FmIndexConfig, PerformancePriority, alphabet};
 
 fn main() {
-    // This example shows how to use the FM-Index in the most basic way.
+    // This example shows how to use the FM-Index in a basic way.
 
     let dna_n_alphabet = alphabet::ascii_dna_with_n();
     let texts = [b"aACGT", b"acGtn"];
@@ -21,4 +21,16 @@ fn main() {
             hit.text_id, hit.position
         );
     }
+
+    // for many queries, the locate_many function can be used for convenience and to improve running time
+    let many_queries = [b"AC".as_slice(), b"CG", b"GT", b"GTN"];
+
+    for (query_id, hits) in index.locate_many(many_queries).enumerate() {
+        for hit in hits {
+            println!(
+                "Found query {query_id} in text {} at position {}.",
+                hit.text_id, hit.position
+            );
+        }
+    }
 }
diff --git a/src/alphabet.rs b/src/alphabet.rs
@@ -246,17 +246,20 @@ impl Alphabet {
     }
 }
 
-/// Includes only the four bases of DNA A,C,G and T (case-insensitive).
+/// Includes only the four bases of DNA A, C, G and T (case-insensitive).
 pub fn ascii_dna() -> Alphabet {
     Alphabet::from_ambiguous_io_symbols([b"Aa", b"Cc", b"Gg", b"Tt"], 0)
 }
 
-/// Includes the four bases of DNA A,C,G and T, and the N character (case-insensitive). The N character is not allowed to be searched.
+/// Includes the four bases of DNA A, C, G and T, and the N character (case-insensitive). The N character is not allowed to be searched.
 pub fn ascii_dna_with_n() -> Alphabet {
     Alphabet::from_ambiguous_io_symbols([b"Aa", b"Cc", b"Gg", b"Tt", b"Nn"], 1)
 }
 
 /// Includes all values of the IUPAC standard (or .fasta format) for DNA bases, except for gaps (case-insensitive).
+///
+/// All symbols are allowed to be searched, but the "degenerate" symbols are not resolved to match their base symbols.
+/// For example, M means "A or C", but an M in the searched query does not match at an A or C of the indexed texts.
 pub fn ascii_dna_iupac() -> Alphabet {
     Alphabet::from_ambiguous_io_symbols(
         [
@@ -282,7 +285,7 @@ pub fn ascii_dna_iupac_as_dna_with_n() -> Alphabet {
     )
 }
 
-/// Includes only values that correspond to single amino acids (case-insensitive).
+/// Includes only values that correspond to single amino acids in the IUPAC standard (case-insensitive).
 pub fn ascii_amino_acid() -> Alphabet {
     Alphabet::from_ambiguous_io_symbols(
         [
@@ -294,6 +297,7 @@ pub fn ascii_amino_acid() -> Alphabet {
 }
 
 /// Includes all values of the IUPAC standard (or .fasta format) for amino acids, except for gaps (case-insensitive).
+/// This alphabet therefore contains all letters of the basic latin alphabet, and the symbol `*`.
 pub fn ascii_amino_acid_iupac() -> Alphabet {
     Alphabet::from_ambiguous_io_symbols(
         [
@@ -329,12 +333,12 @@ pub fn ascii_amino_acid_iupac() -> Alphabet {
     )
 }
 
-/// Includes all u8 values until the `max_symbol` value.
+/// Includes all u8 values up to and including the `max_symbol` value.
 pub fn u8_until(max_symbol: u8) -> Alphabet {
     Alphabet::from_io_symbols(0..=max_symbol, 0)
 }
 
-/// Includes all printable symbols of the ASCII code (case-sensitive).
+/// Includes all 95 printable symbols of the ASCII code (case-sensitive).
 pub fn ascii_printable() -> Alphabet {
     Alphabet::from_io_symbols(b" !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~", 0)
 }
diff --git a/src/batch_computed_cursors.rs b/src/batch_computed_cursors.rs
@@ -68,20 +68,15 @@ where
         let depths = &mut self.buffers.buffer1[..self.curr_batch_size];
         let idxs = &mut self.buffers.buffer2[..self.curr_batch_size];
 
-        for ((&query, depth), idx) in self
-            .buffers
-            .queries
-            .iter()
-            .take(self.curr_batch_size)
-            .zip(depths)
-            .zip(idxs)
-        {
+        for ((&query, depth), idx) in self.buffers.queries.iter().zip(depths).zip(idxs) {
             let query = query.unwrap();
             *depth = std::cmp::min(query.len(), self.index.lookup_tables.max_depth());
+            let suffix_idx = query.len() - *depth;
+
             *idx = self
                 .index
                 .lookup_tables
-                .compute_lookup_idx(&mut self.index.get_query_iter(query), *depth);
+                .compute_lookup_idx(&query[suffix_idx..], &self.index.alphabet);
         }
 
         let depths = &mut self.buffers.buffer1[..self.curr_batch_size];
diff --git a/src/config.rs b/src/config.rs
@@ -88,11 +88,15 @@ impl<I: IndexStorage, R: TextWithRankSupport<I>> Default for FmIndexConfig<I, R>
 #[derive(Debug, Clone, Copy, PartialEq, Eq)]
 pub enum PerformancePriority {
     HighSpeed,
-    /// This is currently equivalent to `HighSpeed`, but that will change in the future.
+    /// For alphabets with 16 or fewer symbols in dense encoding, the temporary concatenated text and
+    /// the BWT buffer can be compressed to use less memory while constructing the BWT. This reduces the peak memory usage
+    /// of the construction by about 10-15% and only takes a small amount of additional running time.
     Balanced,
-    /// A slower, not parallel suffix array construction algorithm will be used for `u32`-based FM-Indices,
-    /// if the `u32-saca` feature is activated (by default it is).
+    /// In addition to the space improvements of the `Balanced` variant, a much slower, not parallel suffix
+    /// array construction algorithm will be used for `u32`-based FM-Indices.
+    /// This only happens if the `u32-saca` feature is activated (by default it is).
     /// This can save a lot of memory when the sum of text lengths fits into a `u32`, but not into a `i32`.
+    /// The downside is that the construction is much slower (at least 5 times slower should be expected).
     LowMemory,
 }
 
diff --git a/src/construction/mod.rs b/src/construction/mod.rs
@@ -66,6 +66,7 @@ pub(crate) fn create_data_structures<I: IndexStorage, R: TextWithRankSupport<I>,
 pub trait IndexStorage:
     PrimInt + Pod + maybe_savefile::MaybeSavefile + sealed::Sealed + Send + Sync + 'static
 {
+    #[doc(hidden)]
     type LibsaisOutput: OutputElement + IndexStorage;
 
     #[doc(hidden)]
diff --git a/src/lib.rs b/src/lib.rs
@@ -31,8 +31,14 @@
  * }
  * ```
  *
- * More information about the flexible [cursor](Cursor) API, build [configuration](FmIndexConfig) and [variants](TextWithRankSupport) of the FM-Index can
- * be found in the module-level and struct-level documentation.
+ * More information about the flexible [cursor](Cursor) API, build [configuration](FmIndexConfig)
+ * and [variants](TextWithRankSupport) of the FM-Index can be found in the module-level and struct-level documentation.
+ *
+ * Optimized functions such as [`FmIndex::locate_many`] exist for searching multiple queries at once. They do not use
+ * multi-threading, but can still be significantly faster (around 2x) than calling the respective functions for single
+ * queries in a loop. The reason for the improved performance is that the queries are searched in batches, which allows
+ * different kinds of parallelism inside the CPU to be used. An example of how such a function is used can be found
+ * [here](https://github.com/feldroop/genedex/blob/master/examples/basic_usage.rs).
  *
  * [original paper]: https://doi.org/10.1109/SFCS.2000.892127
  * [`libsais-rs`]: https://github.com/feldroop/libsais-rs
@@ -42,6 +48,9 @@
 pub mod alphabet;
 
 /// Different implementations of the text with rank support (a.k.a. occurrence table) data structure that powers the FM-Index.
+///
+/// The [`TextWithRankSupport`] and [`Block`](text_with_rank_support::Block) traits are good places to start
+///  learning about this module.
 pub mod text_with_rank_support;
 
 mod batch_computed_cursors;
@@ -89,10 +98,11 @@ pub struct FmIndex<I, R = CondensedTextWithRankSupport<I, Block64>> {
     lookup_tables: LookupTables<I>,
 }
 
-/// A little faster than [`FmIndexCondensed512`], but still space efficient for larger alphabets.
+/// A little faster than [`FmIndexCondensed512`], and still space efficient for larger alphabets.
+/// This is the default version.
 pub type FmIndexCondensed64<I> = FmIndex<I, CondensedTextWithRankSupport<I, Block64>>;
 
-/// The most space efficent version.
+/// The most space efficient version.
 pub type FmIndexCondensed512<I> = FmIndex<I, CondensedTextWithRankSupport<I, Block512>>;
 
 /// The fastest version.
@@ -101,6 +111,8 @@ pub type FmIndexFlat64<I> = FmIndex<I, FlatTextWithRankSupport<I, Block64>>;
 /// A little smaller and slower than [`FmIndexFlat64`]. [`FmIndexCondensed64`] should be a better trade-off for most applications.
 pub type FmIndexFlat512<I> = FmIndex<I, FlatTextWithRankSupport<I, Block512>>;
 
+const BATCH_SIZE: usize = 64;
+
 impl<I: IndexStorage, R: TextWithRankSupport<I>> FmIndex<I, R> {
     fn new<T: AsRef<[u8]>>(
         texts: impl IntoIterator<Item = T>,
@@ -141,6 +153,18 @@ impl<I: IndexStorage, R: TextWithRankSupport<I>> FmIndex<I, R> {
         self.cursor_for_query(query).count()
     }
 
+    /// The results of [`Self::count`] for multiple queries.
+    ///
+    /// The order of the queries is preserved for the counts. This function can improve the running
+    /// time when many queries are searched.
+    pub fn count_many<'a>(
+        &'a self,
+        queries: impl IntoIterator<Item = &'a [u8]>,
+    ) -> impl Iterator<Item = usize> {
+        self.cursors_for_many_queries(queries)
+            .map(|cursor| cursor.count())
+    }
+
     /// Returns the number of occurrences of `query` in the set of indexed texts.
     ///
     /// The initial running time is the same as for [`count`](Self::count).
@@ -153,19 +177,15 @@ impl<I: IndexStorage, R: TextWithRankSupport<I>> FmIndex<I, R> {
         self.locate_interval(cursor.interval())
     }
 
-    pub fn count_many<'a>(
-        &'a self,
-        queries: impl IntoIterator<Item = &'a [u8]> + 'a,
-    ) -> impl Iterator<Item = usize> {
-        BatchComputedCursors::<I, R, _, 32>::new(self, queries.into_iter())
-            .map(|cursor| cursor.count())
-    }
-
+    /// The results of [`Self::locate`] for multiple queries.
+    ///
+    /// The order of the queries is preserved for the hits. This function can improve the running
+    /// time when many queries are searched.
     pub fn locate_many<'a>(
         &'a self,
-        queries: impl IntoIterator<Item = &'a [u8]> + 'a,
+        queries: impl IntoIterator<Item = &'a [u8]>,
     ) -> impl Iterator<Item: Iterator<Item = Hit>> {
-        BatchComputedCursors::<I, R, _, 32>::new(self, queries.into_iter())
+        self.cursors_for_many_queries(queries)
             .map(|cursor| self.locate_interval(cursor.interval()))
     }
 
@@ -200,27 +220,16 @@ impl<I: IndexStorage, R: TextWithRankSupport<I>> FmIndex<I, R> {
     /// This allows using a lookup table jump and therefore can be more efficient than creating
     /// an empty cursor and repeatedly calling [`Cursor::extend_query_front`].
     pub fn cursor_for_query<'a>(&'a self, query: &[u8]) -> Cursor<'a, I, R> {
-        let query_iter = self.get_query_iter(query);
-        self.cursor_for_iter_without_alphabet_translation(query_iter)
-    }
-
-    fn cursor_for_iter_without_alphabet_translation<'a, Q>(
-        &'a self,
-        query: impl IntoIterator<IntoIter = Q>,
-    ) -> Cursor<'a, I, R>
-    where
-        Q: ExactSizeIterator<Item = u8>,
-    {
-        let mut query_iter = query.into_iter();
-        let interval = self.initial_lookup_table_jump(&mut query_iter);
+        let (remaining_query, query_suffix) = self.split_query_for_lookup(query);
+        let interval = self.lookup_tables.lookup(query_suffix, &self.alphabet);
 
         let mut cursor = Cursor {
             index: self,
             interval,
         };
 
-        for symbol in query_iter {
-            cursor.extend_front_without_alphabet_translation(symbol);
+        for &symbol in remaining_query.iter().rev() {
+            cursor.extend_query_front(symbol);
 
             if cursor.count() == 0 {
                 break;
@@ -230,25 +239,52 @@ impl<I: IndexStorage, R: TextWithRankSupport<I>> FmIndex<I, R> {
         cursor
     }
 
-    fn get_query_iter(&self, query: &[u8]) -> impl ExactSizeIterator<Item = u8> {
-        query
-            .iter()
-            .rev()
-            .map(|&s| self.alphabet.io_to_dense_representation(s))
+    /// The results of [`Self::cursor_for_query`] for multiple queries.
+    ///
+    /// The order of the queries is preserved for the cursors. This function can improve the running
+    /// time when many queries are searched.
+    pub fn cursors_for_many_queries<'a>(
+        &'a self,
+        queries: impl IntoIterator<Item = &'a [u8]>,
+    ) -> impl Iterator<Item = Cursor<'a, I, R>> {
+        BatchComputedCursors::<I, R, _, BATCH_SIZE>::new(self, queries.into_iter())
     }
 
-    fn initial_lookup_table_jump(
-        &self,
-        query_iter: &mut impl ExactSizeIterator<Item = u8>,
-    ) -> HalfOpenInterval {
-        let lookup_depth = std::cmp::min(query_iter.len(), self.lookup_tables.max_depth());
-        self.lookup_tables.lookup(query_iter, lookup_depth)
+    fn cursor_for_query_without_alphabet_translation<'a>(
+        &'a self,
+        query: &[u8],
+    ) -> Cursor<'a, I, R> {
+        let (remaining_query, query_suffix) = self.split_query_for_lookup(query);
+        let interval = self
+            .lookup_tables
+            .lookup_without_alphabet_translation(query_suffix);
+
+        let mut cursor = Cursor {
+            index: self,
+            interval,
+        };
+
+        for &symbol in remaining_query.iter().rev() {
+            cursor.extend_front_without_alphabet_translation(symbol);
+
+            if cursor.count() == 0 {
+                break;
+            }
+        }
+
+        cursor
     }
 
     fn lf_mapping_step(&self, symbol: u8, idx: usize) -> usize {
         self.count[symbol as usize] + self.text_with_rank_support.rank(symbol, idx)
     }
 
+    fn split_query_for_lookup<'a>(&self, query: &'a [u8]) -> (&'a [u8], &'a [u8]) {
+        let lookup_depth = std::cmp::min(query.len(), self.lookup_tables.max_depth());
+        let suffix_idx = query.len() - lookup_depth;
+        query.split_at(suffix_idx)
+    }
+
     pub fn alphabet(&self) -> &Alphabet {
         &self.alphabet
     }
diff --git a/src/lookup_table.rs b/src/lookup_table.rs
diff --git a/src/text_with_rank_support/mod.rs b/src/text_with_rank_support/mod.rs
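
The new `split_query_for_lookup` helper in `src/lib.rs` is the piece that replaces the old iterator-based lookup jump. The following is a minimal standalone sketch of the same splitting logic, not the crate's actual code: `max_depth` is passed in as a plain parameter here (in the diff it comes from `self.lookup_tables.max_depth()`), and the helper `remaining_in_search_order` is hypothetical and only added for illustration.

```rust
/// Splits `query` into `(remaining_query, query_suffix)`.
///
/// `query_suffix` holds the last `min(query.len(), max_depth)` symbols, which
/// can be resolved with a single lookup table jump. `remaining_query` holds
/// the symbols in front of it, which still need regular search steps.
/// Assumption: `max_depth` stands in for `self.lookup_tables.max_depth()`.
fn split_query_for_lookup(query: &[u8], max_depth: usize) -> (&[u8], &[u8]) {
    let lookup_depth = std::cmp::min(query.len(), max_depth);
    let suffix_idx = query.len() - lookup_depth;
    query.split_at(suffix_idx)
}

/// Hypothetical helper: returns the remaining symbols in the order in which
/// the backward search visits them, i.e. from the back to the front.
fn remaining_in_search_order(remaining_query: &[u8]) -> Vec<u8> {
    remaining_query.iter().rev().copied().collect()
}
```

With `max_depth = 4`, the query `ACGTAC` is split into the remaining part `AC` and the suffix `GTAC`. The suffix is covered by the lookup table, and the remaining symbols are then prepended one at a time in the order `C`, `A`, which matches the `remaining_query.iter().rev()` loop in `cursor_for_query`. A query shorter than `max_depth` ends up entirely in the suffix, so no further search steps are needed.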
