Rethink the core overlap computation for the tokenizers -- AIList and speed tokenizers #81

Open
@nleroy917

Description

The current tokenizers use rust-lapper for their overlap computation. That works well, but we should be using our own algorithms -- namely AIList. Moreover, tokenization could become even faster if we make some assumptions about our data, e.g. that it is already sorted.
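
As a rough sketch of what the swap could look like (the trait and type names below are illustrative, not existing code in this repo), the tokenizers could be written against a small overlap-index trait so that the rust-lapper backend and an AIList backend are interchangeable:

```rust
// Hypothetical sketch: a minimal overlap-index trait the tokenizers could
// code against, so rust-lapper can be swapped for an AIList backend without
// touching the tokenization logic.
use rust_lapper::{Interval, Lapper};

/// A genomic region (0-based, half-open), as the tokenizers already use.
#[derive(Clone, Copy, Debug)]
pub struct Region {
    pub start: u32,
    pub end: u32,
}

/// The minimal query surface the tokenizers need from any overlap structure.
pub trait OverlapIndex {
    fn build(regions: &[Region]) -> Self;
    fn find(&self, start: u32, end: u32) -> Vec<Region>;
}

/// Current behavior, backed by rust-lapper.
pub struct LapperIndex {
    lapper: Lapper<u32, ()>,
}

impl OverlapIndex for LapperIndex {
    fn build(regions: &[Region]) -> Self {
        let intervals = regions
            .iter()
            .map(|r| Interval { start: r.start, stop: r.end, val: () })
            .collect();
        LapperIndex { lapper: Lapper::new(intervals) }
    }

    fn find(&self, start: u32, end: u32) -> Vec<Region> {
        self.lapper
            .find(start, end)
            .map(|iv| Region { start: iv.start, end: iv.stop })
            .collect()
    }
}

// An AIList-backed index would plug in the same way
// (`impl OverlapIndex for AiListIndex { ... }`): AIList decomposes the
// sorted regions into a few containment-free sublists, each augmented with
// a running max-end, so a query is a binary search plus a short backward
// scan per sublist.
```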

Things we should decide on:

  1. Can we replace rust-lapper with AIList?
  2. Can we create a version of the tokenizers that assumes sorted files? Call it a SpeedTokenizer (see the sketch after this list).
  3. Should that SpeedTokenizer check for sortedness up front?
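
For points 2 and 3, here is a minimal sketch of what a SpeedTokenizer could look like, assuming both the universe and the query regions are coordinate-sorted within a chromosome. The names (`SpeedTokenizer`, `is_sorted_regions`) and the structure are hypothetical, not existing code; the point is just that sortedness turns per-region index lookups into a single forward sweep, and that the sortedness check itself is a cheap linear pass:

```rust
// Hypothetical sketch of a sorted-input tokenizer. Same Region struct as in
// the sketch above.
#[derive(Clone, Copy, Debug)]
pub struct Region {
    pub start: u32,
    pub end: u32,
}

/// Cheap up-front check for point 3: is the input coordinate-sorted?
pub fn is_sorted_regions(regions: &[Region]) -> bool {
    regions
        .windows(2)
        .all(|w| (w[0].start, w[0].end) <= (w[1].start, w[1].end))
}

/// Hypothetical SpeedTokenizer: maps each sorted query region to the indices
/// of the universe regions it overlaps, advancing a cursor that never moves
/// backward. With typical sparse overlap this is roughly O(n + m) instead of
/// an index lookup per query region.
pub struct SpeedTokenizer {
    universe: Vec<Region>,
}

impl SpeedTokenizer {
    pub fn new(universe: Vec<Region>) -> Result<Self, &'static str> {
        if !is_sorted_regions(&universe) {
            return Err("universe must be coordinate-sorted");
        }
        Ok(SpeedTokenizer { universe })
    }

    pub fn tokenize(&self, queries: &[Region]) -> Vec<Vec<usize>> {
        // Queries must also be sorted for the cursor trick to be valid.
        debug_assert!(is_sorted_regions(queries));
        let mut tokens = Vec::with_capacity(queries.len());
        let mut cursor = 0usize;
        for q in queries {
            // Drop universe regions that end before this (and every later) query.
            while cursor < self.universe.len() && self.universe[cursor].end <= q.start {
                cursor += 1;
            }
            // Collect overlaps starting at the cursor; stop once starts pass q.end.
            let mut hits = Vec::new();
            let mut i = cursor;
            while i < self.universe.len() && self.universe[i].start < q.end {
                if self.universe[i].end > q.start {
                    hits.push(i);
                }
                i += 1;
            }
            tokens.push(hits);
        }
        tokens
    }
}
```

Whether the sortedness check should be mandatory or opt-in (point 3) probably comes down to how much we trust upstream sorting; it costs one linear pass over the file, so it could also live behind a debug assertion or a constructor flag.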
