Skip to content

Commit 0bbbee5

Browse files
committed
feat: finish batched search and add doc
1 parent ba1923f commit 0bbbee5

File tree

10 files changed

+249
-102
lines changed

10 files changed

+249
-102
lines changed

README.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7,12 +7,13 @@
77
The [FM-Index] is a full-text index data structure that allows efficiently counting and retrieving all occurrenes of short sequences in very large texts. It is widely used in sequence analysis and bioinformatics.
88

99
The implementation of this library is based on an encoding for the text with rank support data structure (a.k.a. occurrence table)
10-
by Simon Gene Gottlieb, who also was a great help while developing this library. This data structure is central to the inner workings of
10+
by Simon Gene Gottlieb, who also was a great help while developing the library. This data structure is central to the inner workings of
1111
the FM-Index. The encoding attemps to provide a good trade-off between memory usage and running time of queries.
1212
A second, faster and less memory efficient encoding is also implemented in this library. Further benefits of `genedex` include:
1313

1414
- Fast, parallel and memory efficient index construction by leveraging [`libsais-rs`] and [`rayon`].
1515
- Support for indexing a set of texts, like chromosomes of a genome.
16+
- Optimized functions for searching multiple queries at once (per thread!).
1617
- A flexible cursor API.
1718
- Fast reading and writing the FM-Index from/to files, using [`savefile`].
1819
- Thoroughly tested using [`proptest`].

ROADMAP.md

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,11 @@
1717
the condensed text with rank support will get smaller and maybe faster.
1818
A text sampled suffix array could be an option, or a "sparse" text with rank support substructure.
1919
- paired blocks for improved memory usage when using larger alphabets
20+
- in the search, `lookup_tables::compute_lookup_idx_static_len` still seems to be one of the bottlenecks. this
21+
should be investigated further, maybe it can be optimized or it's a measuring error.
22+
- the batching of search queries could be improved. Currenty, it is not efficent if the queries have very different lengths
23+
or if many of them quickly get an empty interval, while other need ot be searched to the very end.
24+
- more documentation tests
2025

2126
### Large topics, is the goal to eventually support
2227

@@ -31,6 +36,10 @@
3136
- optional functionality for text recovery
3237
- text sampled suffix array (with text ids and optionally other annotations)
3338
- suffix array, lookup table compression using unconventional int widths (e.g. 33 bit)
39+
- optimized functions for reading directly from input files: both for texts to build the index and queries to search.
40+
the latter might be more important, because for simple searches, the search can be faster than reading the
41+
queries from disk.
42+
- optimize `sais-drum` to make the low memory construction mode less painful
3443

3544
### Large topics, might never happen
3645

examples/basic_usage.rs

Lines changed: 13 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
use genedex::{FmIndexConfig, PerformancePriority, alphabet};
22

33
fn main() {
4-
// This example shows how to use the FM-Index in the most basic way.
4+
// This example shows how to use the FM-Index in a basic way.
55

66
let dna_n_alphabet = alphabet::ascii_dna_with_n();
77
let texts = [b"aACGT", b"acGtn"];
@@ -21,4 +21,16 @@ fn main() {
2121
hit.text_id, hit.position
2222
);
2323
}
24+
25+
// for many queries, the locate_many function can be used for convenience and to improve running time
26+
let many_queries = [b"AC".as_slice(), b"CG", b"GT", b"GTN"];
27+
28+
for (query_id, hits) in index.locate_many(many_queries).enumerate() {
29+
for hit in hits {
30+
println!(
31+
"Found query {query_id} in text {} at position {}.",
32+
hit.text_id, hit.position
33+
);
34+
}
35+
}
2436
}

src/alphabet.rs

Lines changed: 9 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -246,17 +246,20 @@ impl Alphabet {
246246
}
247247
}
248248

249-
/// Includes only the four bases of DNA A,C,G and T (case-insensitive).
249+
/// Includes only the four bases of DNA A, C, G and T (case-insensitive).
250250
pub fn ascii_dna() -> Alphabet {
251251
Alphabet::from_ambiguous_io_symbols([b"Aa", b"Cc", b"Gg", b"Tt"], 0)
252252
}
253253

254-
/// Includes the four bases of DNA A,C,G and T, and the N character (case-insensitive). The N character is not allowed to be searched.
254+
/// Includes the four bases of DNA A, C, G and T, and the N character (case-insensitive). The N character is not allowed to be searched.
255255
pub fn ascii_dna_with_n() -> Alphabet {
256256
Alphabet::from_ambiguous_io_symbols([b"Aa", b"Cc", b"Gg", b"Tt", b"Nn"], 1)
257257
}
258258

259259
/// Includes all values of the IUPAC standard (or .fasta format) for DNA bases, except for gaps (case-insensitive).
260+
///
261+
/// All symbols are allowed to be searched, but the "degenerate" symbols are not resolved to match their base symbols.
262+
/// For example, M means "A or C", but an M in the searched query does not match at an A or C of the indexed texts.
260263
pub fn ascii_dna_iupac() -> Alphabet {
261264
Alphabet::from_ambiguous_io_symbols(
262265
[
@@ -282,7 +285,7 @@ pub fn ascii_dna_iupac_as_dna_with_n() -> Alphabet {
282285
)
283286
}
284287

285-
/// Includes only values that correspond to single amino acids (case-insensitive).
288+
/// Includes only values that correspond to single amino acids in the IUPAC standard (case-insensitive).
286289
pub fn ascii_amino_acid() -> Alphabet {
287290
Alphabet::from_ambiguous_io_symbols(
288291
[
@@ -294,6 +297,7 @@ pub fn ascii_amino_acid() -> Alphabet {
294297
}
295298

296299
/// Includes all values of the IUPAC standard (or .fasta format) for amino acids, except for gaps (case-insensitive).
300+
/// This alphabet therefore contains all letters of the basic latin alphabet, and the symbol `*`.
297301
pub fn ascii_amino_acid_iupac() -> Alphabet {
298302
Alphabet::from_ambiguous_io_symbols(
299303
[
@@ -329,12 +333,12 @@ pub fn ascii_amino_acid_iupac() -> Alphabet {
329333
)
330334
}
331335

332-
/// Includes all u8 values until the `max_symbol` value.
336+
/// Includes all u8 values until including the `max_symbol` value.
333337
pub fn u8_until(max_symbol: u8) -> Alphabet {
334338
Alphabet::from_io_symbols(0..=max_symbol, 0)
335339
}
336340

337-
/// Includes all printable symbols of the ASCII code (case-sensitive).
341+
/// Includes all 95 printable symbols of the ASCII code (case-sensitive).
338342
pub fn ascii_printable() -> Alphabet {
339343
Alphabet::from_io_symbols(b" !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~", 0)
340344
}

src/batch_computed_cursors.rs

Lines changed: 4 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -68,20 +68,15 @@ where
6868
let depths = &mut self.buffers.buffer1[..self.curr_batch_size];
6969
let idxs = &mut self.buffers.buffer2[..self.curr_batch_size];
7070

71-
for ((&query, depth), idx) in self
72-
.buffers
73-
.queries
74-
.iter()
75-
.take(self.curr_batch_size)
76-
.zip(depths)
77-
.zip(idxs)
78-
{
71+
for ((&query, depth), idx) in self.buffers.queries.iter().zip(depths).zip(idxs) {
7972
let query = query.unwrap();
8073
*depth = std::cmp::min(query.len(), self.index.lookup_tables.max_depth());
74+
let suffix_idx = query.len() - *depth;
75+
8176
*idx = self
8277
.index
8378
.lookup_tables
84-
.compute_lookup_idx(&mut self.index.get_query_iter(query), *depth);
79+
.compute_lookup_idx(&query[suffix_idx..], &self.index.alphabet);
8580
}
8681

8782
let depths = &mut self.buffers.buffer1[..self.curr_batch_size];

src/config.rs

Lines changed: 7 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -88,11 +88,15 @@ impl<I: IndexStorage, R: TextWithRankSupport<I>> Default for FmIndexConfig<I, R>
8888
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
8989
pub enum PerformancePriority {
9090
HighSpeed,
91-
/// This is currently equivalent to `HighSpeed`, but that will change in the future.
91+
/// For alphabets with 16 or less symbols in dense encoding, the temporary concatenated text and
92+
/// the BWT buffer can be compressed to use less memory while constructing the BWT. This reduces the peak memory usage
93+
/// of the construction by about 10-15% and only takes a small amount of additional running time.
9294
Balanced,
93-
/// A slower, not parallel suffix array construction algorithm will be used for `u32`-based FM-Indices,
94-
/// if the `u32-saca` feature is activated (by default it is).
95+
/// In addition to the space improvements of the `Balanced` variant, a much slower, not parallel suffix
96+
/// array construction algorithm will be used for `u32`-based FM-Indices.
97+
/// This only happens if the `u32-saca` feature is activated (by default it is).
9598
/// This can save a lot of memory when the sum of text lengths fits into a `u32`, but not into a `i32`.
99+
/// The downside is that the construction is much slower (at least 5 times slower should be expected).
96100
LowMemory,
97101
}
98102

src/construction/mod.rs

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -66,6 +66,7 @@ pub(crate) fn create_data_structures<I: IndexStorage, R: TextWithRankSupport<I>,
6666
pub trait IndexStorage:
6767
PrimInt + Pod + maybe_savefile::MaybeSavefile + sealed::Sealed + Send + Sync + 'static
6868
{
69+
#[doc(hidden)]
6970
type LibsaisOutput: OutputElement + IndexStorage;
7071

7172
#[doc(hidden)]

src/lib.rs

Lines changed: 76 additions & 40 deletions
Original file line numberDiff line numberDiff line change
@@ -31,8 +31,14 @@
3131
* }
3232
* ```
3333
*
34-
* More information about the flexible [cursor](Cursor) API, build [configuration](FmIndexConfig) and [variants](TextWithRankSupport) of the FM-Index can
35-
* be found in the module-level and struct-level documentation.
34+
* More information about the flexible [cursor](Cursor) API, build [configuration](FmIndexConfig)
35+
* and [variants](TextWithRankSupport) of the FM-Index can be found in the module-level and struct-level documentation.
36+
*
37+
* Optimized functions such as [`FmIndex::locate_many`] exist for searching multiple queries at once. They do not use
38+
* multi-threading, but can still be significantly faster (around 2x) than calling the respective functions for single
39+
* queries in a loop. The reason for the improved performance is that the queries are searched in batches, which allows
40+
* different kinds of parallelism inside the CPU to be used. An example of how such a function is used can be found
41+
* [here](https://github.com/feldroop/genedex/blob/master/examples/basic_usage.rs).
3642
*
3743
* [original paper]: https://doi.org/10.1109/SFCS.2000.892127
3844
* [`libsais-rs`]: https://github.com/feldroop/libsais-rs
@@ -42,6 +48,9 @@
4248
pub mod alphabet;
4349

4450
/// Different implementations of the text with rank support (a.k.a. occurrence table) data structure that powers the FM-Index.
51+
///
52+
/// The [`TextWithRankSupport`] and [`Block`](text_with_rank_support::Block) traits are good places to start
53+
/// learning about this module.
4554
pub mod text_with_rank_support;
4655

4756
mod batch_computed_cursors;
@@ -89,10 +98,11 @@ pub struct FmIndex<I, R = CondensedTextWithRankSupport<I, Block64>> {
8998
lookup_tables: LookupTables<I>,
9099
}
91100

92-
/// A little faster than [`FmIndexCondensed512`], but still space efficient for larger alphabets.
101+
/// A little faster than [`FmIndexCondensed512`], and still space efficient for larger alphabets.
102+
/// This is the default version.
93103
pub type FmIndexCondensed64<I> = FmIndex<I, CondensedTextWithRankSupport<I, Block64>>;
94104

95-
/// The most space efficent version.
105+
/// The most space efficient version.
96106
pub type FmIndexCondensed512<I> = FmIndex<I, CondensedTextWithRankSupport<I, Block512>>;
97107

98108
/// The fastest version.
@@ -101,6 +111,8 @@ pub type FmIndexFlat64<I> = FmIndex<I, FlatTextWithRankSupport<I, Block64>>;
101111
/// A little smaller and slower than [`FmIndexFlat64`]. [`FmIndexCondensed64`] should be a better trade-off for most applications.
102112
pub type FmIndexFlat512<I> = FmIndex<I, FlatTextWithRankSupport<I, Block512>>;
103113

114+
const BATCH_SIZE: usize = 64;
115+
104116
impl<I: IndexStorage, R: TextWithRankSupport<I>> FmIndex<I, R> {
105117
fn new<T: AsRef<[u8]>>(
106118
texts: impl IntoIterator<Item = T>,
@@ -141,6 +153,18 @@ impl<I: IndexStorage, R: TextWithRankSupport<I>> FmIndex<I, R> {
141153
self.cursor_for_query(query).count()
142154
}
143155

156+
/// The results of [`Self::count`] for multiple queries.
157+
///
158+
/// The order of the queries is preserved for the counts. This function can improve the running
159+
/// time when many queries are searched.
160+
pub fn count_many<'a>(
161+
&'a self,
162+
queries: impl IntoIterator<Item = &'a [u8]>,
163+
) -> impl Iterator<Item = usize> {
164+
self.cursors_for_many_queries(queries)
165+
.map(|cursor| cursor.count())
166+
}
167+
144168
/// Returns the number of occurrences of `query` in the set of indexed texts.
145169
///
146170
/// The initial running time is the same as for [`count`](Self::count).
@@ -153,19 +177,15 @@ impl<I: IndexStorage, R: TextWithRankSupport<I>> FmIndex<I, R> {
153177
self.locate_interval(cursor.interval())
154178
}
155179

156-
pub fn count_many<'a>(
157-
&'a self,
158-
queries: impl IntoIterator<Item = &'a [u8]> + 'a,
159-
) -> impl Iterator<Item = usize> {
160-
BatchComputedCursors::<I, R, _, 32>::new(self, queries.into_iter())
161-
.map(|cursor| cursor.count())
162-
}
163-
180+
/// The results of [`Self::locate`] for multiple queries.
181+
///
182+
/// The order of the queries is preserved for the hits. This function can improve the running
183+
/// time when many queries are searched.
164184
pub fn locate_many<'a>(
165185
&'a self,
166-
queries: impl IntoIterator<Item = &'a [u8]> + 'a,
186+
queries: impl IntoIterator<Item = &'a [u8]>,
167187
) -> impl Iterator<Item: Iterator<Item = Hit>> {
168-
BatchComputedCursors::<I, R, _, 32>::new(self, queries.into_iter())
188+
self.cursors_for_many_queries(queries)
169189
.map(|cursor| self.locate_interval(cursor.interval()))
170190
}
171191

@@ -200,27 +220,16 @@ impl<I: IndexStorage, R: TextWithRankSupport<I>> FmIndex<I, R> {
200220
/// This allows using a lookup table jump and therefore can be more efficient than creating
201221
/// an empty cursor and repeatedly calling [`Cursor::extend_query_front`].
202222
pub fn cursor_for_query<'a>(&'a self, query: &[u8]) -> Cursor<'a, I, R> {
203-
let query_iter = self.get_query_iter(query);
204-
self.cursor_for_iter_without_alphabet_translation(query_iter)
205-
}
206-
207-
fn cursor_for_iter_without_alphabet_translation<'a, Q>(
208-
&'a self,
209-
query: impl IntoIterator<IntoIter = Q>,
210-
) -> Cursor<'a, I, R>
211-
where
212-
Q: ExactSizeIterator<Item = u8>,
213-
{
214-
let mut query_iter = query.into_iter();
215-
let interval = self.initial_lookup_table_jump(&mut query_iter);
223+
let (remaining_query, query_suffix) = self.split_query_for_lookup(query);
224+
let interval = self.lookup_tables.lookup(query_suffix, &self.alphabet);
216225

217226
let mut cursor = Cursor {
218227
index: self,
219228
interval,
220229
};
221230

222-
for symbol in query_iter {
223-
cursor.extend_front_without_alphabet_translation(symbol);
231+
for &symbol in remaining_query.iter().rev() {
232+
cursor.extend_query_front(symbol);
224233

225234
if cursor.count() == 0 {
226235
break;
@@ -230,25 +239,52 @@ impl<I: IndexStorage, R: TextWithRankSupport<I>> FmIndex<I, R> {
230239
cursor
231240
}
232241

233-
fn get_query_iter(&self, query: &[u8]) -> impl ExactSizeIterator<Item = u8> {
234-
query
235-
.iter()
236-
.rev()
237-
.map(|&s| self.alphabet.io_to_dense_representation(s))
242+
/// The results of [`Self::cursor_for_query`] for multiple queries.
243+
///
244+
/// The order of the queries is preserved for the cursors. This function can improve the running
245+
/// time when many queries are searched.
246+
pub fn cursors_for_many_queries<'a>(
247+
&'a self,
248+
queries: impl IntoIterator<Item = &'a [u8]>,
249+
) -> impl Iterator<Item = Cursor<'a, I, R>> {
250+
BatchComputedCursors::<I, R, _, BATCH_SIZE>::new(self, queries.into_iter())
238251
}
239252

240-
fn initial_lookup_table_jump(
241-
&self,
242-
query_iter: &mut impl ExactSizeIterator<Item = u8>,
243-
) -> HalfOpenInterval {
244-
let lookup_depth = std::cmp::min(query_iter.len(), self.lookup_tables.max_depth());
245-
self.lookup_tables.lookup(query_iter, lookup_depth)
253+
fn cursor_for_query_without_alphabet_translation<'a>(
254+
&'a self,
255+
query: &[u8],
256+
) -> Cursor<'a, I, R> {
257+
let (remaining_query, query_suffix) = self.split_query_for_lookup(query);
258+
let interval = self
259+
.lookup_tables
260+
.lookup_without_alphabet_translation(query_suffix);
261+
262+
let mut cursor = Cursor {
263+
index: self,
264+
interval,
265+
};
266+
267+
for &symbol in remaining_query.iter().rev() {
268+
cursor.extend_front_without_alphabet_translation(symbol);
269+
270+
if cursor.count() == 0 {
271+
break;
272+
}
273+
}
274+
275+
cursor
246276
}
247277

248278
fn lf_mapping_step(&self, symbol: u8, idx: usize) -> usize {
249279
self.count[symbol as usize] + self.text_with_rank_support.rank(symbol, idx)
250280
}
251281

282+
fn split_query_for_lookup<'a>(&self, query: &'a [u8]) -> (&'a [u8], &'a [u8]) {
283+
let lookup_depth = std::cmp::min(query.len(), self.lookup_tables.max_depth());
284+
let suffix_idx = query.len() - lookup_depth;
285+
query.split_at(suffix_idx)
286+
}
287+
252288
pub fn alphabet(&self) -> &Alphabet {
253289
&self.alphabet
254290
}

0 commit comments

Comments
 (0)