The current tokenizers use `rust-lapper` to compute overlaps. This works well, but we should be using our own algorithms, namely AIList. Moreover, tokenization could become even faster if we make some assumptions about our data, for example that it is already sorted.
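To illustrate the kind of lookup AIList enables, here is a minimal single-list sketch in Rust (all types and names are hypothetical, not the existing tokenizer or `rust-lapper` API): intervals are kept sorted by start and augmented with a running maximum end, so a backward scan from a binary-search position can stop as soon as that running maximum falls below the query start. The real AIList additionally decomposes the list into a few components so that long, containing intervals do not defeat the early exit; this sketch skips that step.

```rust
/// Hypothetical interval type for illustration only.
#[derive(Clone, Copy, Debug)]
struct Interval {
    start: u32,
    end: u32, // half-open: [start, end)
}

/// Single-component, AIList-style augmented list (simplified sketch).
struct AugmentedList {
    intervals: Vec<Interval>, // sorted by start
    max_ends: Vec<u32>,       // max_ends[i] = max end over intervals[0..=i]
}

impl AugmentedList {
    fn new(mut intervals: Vec<Interval>) -> Self {
        intervals.sort_by_key(|iv| iv.start);
        let mut max_ends = Vec::with_capacity(intervals.len());
        let mut running_max = 0;
        for iv in &intervals {
            running_max = running_max.max(iv.end);
            max_ends.push(running_max);
        }
        AugmentedList { intervals, max_ends }
    }

    /// Return all intervals overlapping the half-open query [start, end).
    fn find(&self, start: u32, end: u32) -> Vec<Interval> {
        let mut hits = Vec::new();
        // Index of the first interval whose start is >= end; nothing at or
        // past this index can overlap the query.
        let mut i = self.intervals.partition_point(|iv| iv.start < end);
        // Scan backwards; once the running max end drops to or below `start`,
        // no earlier interval can reach the query, so we stop.
        while i > 0 {
            i -= 1;
            if self.max_ends[i] <= start {
                break;
            }
            if self.intervals[i].end > start {
                hits.push(self.intervals[i]);
            }
        }
        hits
    }
}
```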
Things we should decide on:
- Can we replace `rust-lapper` with AIList?
- Can we create a version of the tokenizers that assumes sorted files? Call it a `SpeedTokenizer` (a rough sketch follows this list).
- Should the above `SpeedTokenizer` check for sortedness?
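For the sorted-input idea, a hypothetical `SpeedTokenizer` core could reduce tokenization to a single merge-style sweep over both sorted inputs. The sketch below uses illustrative names only (nothing here is an existing type in this repo) and also shows that a sortedness check is just one linear pass:

```rust
/// Hypothetical region type for illustration only.
#[derive(Clone, Copy, Debug)]
struct Region {
    start: u32,
    end: u32, // half-open: [start, end)
}

/// One linear pass to verify the sorted-by-start assumption.
fn is_sorted(regions: &[Region]) -> bool {
    regions.windows(2).all(|w| w[0].start <= w[1].start)
}

/// For each query region, collect indices of overlapping universe regions.
/// Assumes both slices are sorted by start; optionally verifies that first.
fn tokenize_sorted(universe: &[Region], queries: &[Region], check: bool) -> Vec<Vec<usize>> {
    if check {
        assert!(
            is_sorted(universe) && is_sorted(queries),
            "inputs must be sorted by start"
        );
    }
    let mut tokens = Vec::with_capacity(queries.len());
    let mut lo = 0; // first universe region that could still overlap anything
    for q in queries {
        // Regions ending at or before this query's start can never overlap
        // this query or any later one (queries are sorted by start).
        while lo < universe.len() && universe[lo].end <= q.start {
            lo += 1;
        }
        let mut hits = Vec::new();
        let mut i = lo;
        // Collect every remaining universe region that starts before the
        // query ends and actually reaches past the query start.
        while i < universe.len() && universe[i].start < q.end {
            if universe[i].end > q.start {
                hits.push(i);
            }
            i += 1;
        }
        tokens.push(hits);
    }
    tokens
}
```

Since the sortedness check is a single O(n) pass, it would be cheap next to file parsing, so one option is to leave it on by default and let callers that guarantee sorted input opt out.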