
Rethink the core overlap computation for the tokenizers -- AIList and speed tokenizers #81

Open
@nleroy917

Description

The current tokenizers use rust-lapper for overlap computation. It works well, but we should be using our own algorithms, namely AIList. Moreover, tokenization could get even faster if we make some assumptions about our data, e.g. that the input is sorted.
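
For context, here is a rough sketch of the core AIList query idea: intervals sorted by start, augmented with a running maximum of end positions, queried with a binary search plus a backward scan. A real AIList additionally decomposes the list into a few sub-lists so that long, containing intervals don't slow the scan; this single-component version is only illustrative, and the `Interval` / `AugmentedList` names are made up here, not from our codebase.

```rust
/// Minimal single-component sketch of the AIList query idea.
#[derive(Debug, Clone, Copy)]
struct Interval {
    start: u32,
    end: u32,
}

struct AugmentedList {
    intervals: Vec<Interval>, // sorted by start
    max_ends: Vec<u32>,       // max_ends[i] = max end over intervals[0..=i]
}

impl AugmentedList {
    fn new(mut intervals: Vec<Interval>) -> Self {
        intervals.sort_by_key(|iv| iv.start);
        let mut max_ends = Vec::with_capacity(intervals.len());
        let mut running = 0;
        for iv in &intervals {
            running = running.max(iv.end);
            max_ends.push(running);
        }
        Self { intervals, max_ends }
    }

    /// Return all intervals overlapping the half-open query [start, end).
    fn query(&self, start: u32, end: u32) -> Vec<Interval> {
        // Intervals at or after this index start too late to overlap.
        let hi = self.intervals.partition_point(|iv| iv.start < end);
        let mut hits = Vec::new();
        // Scan backwards; the running max end tells us when no earlier
        // interval can still reach the query start, so we can stop.
        for i in (0..hi).rev() {
            if self.max_ends[i] <= start {
                break;
            }
            if self.intervals[i].end > start {
                hits.push(self.intervals[i]);
            }
        }
        hits
    }
}
```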

Things we should decide on:

  1. Can we replace rust-lapper with AIList?
  2. Can we create a version of the tokenizers that assumes sorted files? Call it a SpeedTokenizer (see the sketch below).
  3. Should the above SpeedTokenizer be checking for sortedness?
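
For 2 and 3, a hypothetical sketch of what a sorted fast path plus a sortedness check could look like, assuming both the universe and the query regions are sorted by start. `Region`, `is_sorted_by_start`, and `tokenize_sorted` are placeholder names for illustration only.

```rust
#[derive(Debug, Clone, Copy)]
struct Region {
    start: u32,
    end: u32,
}

/// Cheap guard for item 3: verify the input really is sorted before
/// taking the fast path (one O(n) pass).
fn is_sorted_by_start(regions: &[Region]) -> bool {
    regions.windows(2).all(|w| w[0].start <= w[1].start)
}

/// For each query region, collect indices of overlapping universe regions
/// with a single merge-style sweep. Assumes both slices are sorted by start.
fn tokenize_sorted(universe: &[Region], queries: &[Region]) -> Vec<Vec<usize>> {
    let mut results = Vec::with_capacity(queries.len());
    let mut cursor = 0; // never moves backwards across queries
    for q in queries {
        // Universe regions ending before this query starts can never
        // overlap a later (still sorted) query either, so drop them.
        while cursor < universe.len() && universe[cursor].end <= q.start {
            cursor += 1;
        }
        // Collect overlaps without advancing the shared cursor, since the
        // next query may overlap the same regions.
        let mut hits = Vec::new();
        let mut i = cursor;
        while i < universe.len() && universe[i].start < q.end {
            if universe[i].end > q.start {
                hits.push(i);
            }
            i += 1;
        }
        results.push(hits);
    }
    results
}
```

The sortedness check is a one-time linear pass, so a SpeedTokenizer could run it by default and fall back to the general tokenizer (or error out) when the input turns out not to be sorted.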
