A living knowledge base for a collaborative research project on fixing issues with BPE tokenization in language models.
The project systematically investigates and addresses known deficiencies in Byte Pair Encoding (BPE) tokenization, progressing from lightweight interventions to training-based solutions.
| Phase | Approach | Invasiveness |
|---|---|---|
| 1 | Training-free heuristics (pre-tokenization, vocabulary pruning) | Non-invasive |
| 2 | Auxiliary prediction model for tokenization scoring | Moderate |
| 3 | RL training — learn a tokenization reward model | Invasive |
| 4 | Domain-specific corpus ablations | Variable |
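As a concrete illustration of the Phase 1 direction, the sketch below shows how a digit-splitting pre-tokenization rule blocks BPE merges across number boundaries. It is a minimal, self-contained toy: the regex patterns, the `MERGES` table, and the helper names are illustrative assumptions, not the project's code or any target model's tokenizer.

```python
import re

# Two pre-tokenization rules: a baseline that keeps digit runs together, and a
# Phase 1-style heuristic that makes every digit its own pre-token so BPE
# merges can never form multi-digit number tokens.
BASELINE = re.compile(r"\d+|[A-Za-z]+|\s+|[^\w\s]+")
DIGIT_SPLIT = re.compile(r"\d|[A-Za-z]+|\s+|[^\w\s]+")

# Toy BPE merge table (pair -> rank); in practice this comes from a trained tokenizer.
MERGES = {("2", "0"): 0, ("20", "2"): 1, ("202", "4"): 2}

def bpe_merge(piece):
    """Greedily apply the best-ranked adjacent merge until none applies."""
    symbols = list(piece)
    while True:
        candidates = [
            (MERGES[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in MERGES
        ]
        if not candidates:
            return symbols
        _, i = min(candidates)
        symbols[i : i + 2] = ["".join(symbols[i : i + 2])]

def tokenize(text, pretokenizer):
    """Pre-tokenize, then run BPE merges inside each pre-token independently."""
    tokens = []
    for piece in pretokenizer.findall(text):
        tokens.extend(bpe_merge(piece))
    return tokens

print(tokenize("2024", BASELINE))     # ['2024']: merges form one opaque number token
print(tokenize("2024", DIGIT_SPLIT))  # ['2', '0', '2', '4']: merges blocked at digit boundaries
```

The same harness could host other Phase 1 experiments, for example dropping suspect entries from the merge table to emulate vocabulary pruning, without touching model weights.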
Target model families for experiments: Qwen3.5 and OLMo v2 (chosen for their known pretraining distributions).
```
docs/
  research-plan.md   - phased roadmap with open questions
  related-work/
    index.md         - annotated bibliography
    *.md             - individual paper notes
```