Rethink the core overlap computation for the tokenizers -- AIList and speed tokenizers #81

Open
@nleroy917

Description

The current tokenizers use rust-lapper for their overlap computation. That works well, but we should be using our own algorithms -- namely AIList. Moreover, tokenization could become even faster if we make some assumptions about our data, e.g. that it is already sorted.
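
As a rough sketch of what the swap could look like (the trait and type names below are illustrative, not existing code in this repo), the tokenizers could be written against a small overlap-index trait so that the rust-lapper backend and an AIList backend are interchangeable:

```rust
// Hypothetical sketch: a minimal overlap-index trait the tokenizers could
// code against, so rust-lapper can be swapped for an AIList backend without
// touching the tokenization logic.
use rust_lapper::{Interval, Lapper};

/// A genomic region (0-based, half-open), as the tokenizers already use.
#[derive(Clone, Copy, Debug)]
pub struct Region {
    pub start: u32,
    pub end: u32,
}

/// The minimal query surface the tokenizers need from any overlap structure.
pub trait OverlapIndex {
    fn build(regions: &[Region]) -> Self;
    fn find(&self, start: u32, end: u32) -> Vec<Region>;
}

/// Current behavior, backed by rust-lapper.
pub struct LapperIndex {
    lapper: Lapper<u32, ()>,
}

impl OverlapIndex for LapperIndex {
    fn build(regions: &[Region]) -> Self {
        let intervals = regions
            .iter()
            .map(|r| Interval { start: r.start, stop: r.end, val: () })
            .collect();
        LapperIndex { lapper: Lapper::new(intervals) }
    }

    fn find(&self, start: u32, end: u32) -> Vec<Region> {
        self.lapper
            .find(start, end)
            .map(|iv| Region { start: iv.start, end: iv.stop })
            .collect()
    }
}

// An AIList-backed index would plug in the same way
// (`impl OverlapIndex for AiListIndex { ... }`): AIList decomposes the
// sorted regions into a few containment-free sublists, each augmented with
// a running max-end, so a query is a binary search plus a short backward
// scan per sublist.
```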

Things we should decide on:

  1. Can we replace rust-lapper with AIList?
  2. Can we create a version of the tokenizers that assumes sorted files? Call it a SpeedTokenizer (see the sketch after this list).
  3. Should that SpeedTokenizer check for sortedness up front?
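
For points 2 and 3, here is a minimal sketch of what a SpeedTokenizer could look like, assuming both the universe and the query regions are coordinate-sorted within a chromosome. The names (`SpeedTokenizer`, `is_sorted_regions`) and the structure are hypothetical, not existing code; the point is just that sortedness turns per-region index lookups into a single forward sweep, and that the sortedness check itself is a cheap linear pass:

```rust
// Hypothetical sketch of a sorted-input tokenizer. Same Region struct as in
// the sketch above.
#[derive(Clone, Copy, Debug)]
pub struct Region {
    pub start: u32,
    pub end: u32,
}

/// Cheap up-front check for point 3: is the input coordinate-sorted?
pub fn is_sorted_regions(regions: &[Region]) -> bool {
    regions
        .windows(2)
        .all(|w| (w[0].start, w[0].end) <= (w[1].start, w[1].end))
}

/// Hypothetical SpeedTokenizer: maps each sorted query region to the indices
/// of the universe regions it overlaps, advancing a cursor that never moves
/// backward. With typical sparse overlap this is roughly O(n + m) instead of
/// an index lookup per query region.
pub struct SpeedTokenizer {
    universe: Vec<Region>,
}

impl SpeedTokenizer {
    pub fn new(universe: Vec<Region>) -> Result<Self, &'static str> {
        if !is_sorted_regions(&universe) {
            return Err("universe must be coordinate-sorted");
        }
        Ok(SpeedTokenizer { universe })
    }

    pub fn tokenize(&self, queries: &[Region]) -> Vec<Vec<usize>> {
        // Queries must also be sorted for the cursor trick to be valid.
        debug_assert!(is_sorted_regions(queries));
        let mut tokens = Vec::with_capacity(queries.len());
        let mut cursor = 0usize;
        for q in queries {
            // Drop universe regions that end before this (and every later) query.
            while cursor < self.universe.len() && self.universe[cursor].end <= q.start {
                cursor += 1;
            }
            // Collect overlaps starting at the cursor; stop once starts pass q.end.
            let mut hits = Vec::new();
            let mut i = cursor;
            while i < self.universe.len() && self.universe[i].start < q.end {
                if self.universe[i].end > q.start {
                    hits.push(i);
                }
                i += 1;
            }
            tokens.push(hits);
        }
        tokens
    }
}
```

Whether the sortedness check should be mandatory or opt-in (point 3) probably comes down to how much we trust upstream sorting; it costs one linear pass over the file, so it could also live behind a debug assertion or a constructor flag.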
