Syllable-aware BPE tokenizer for the Amharic language (አማርኛ) – fast, accurate, trainable.
-
Updated
Nov 17, 2025 - Python
Syllable-aware BPE tokenizer for the Amharic language (አማርኛ) – fast, accurate, trainable.
A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
Implemented GPT from scratch
A PHP implementation of OpenAI's BPE tokenizer tiktoken.
Teaching transformer-based architectures
High-Performance Tokenizer implementation in PHP.
BPE tokenizer for LLMs in Pure Zig
Byte-Pair Encoding tokenizer for training large language models on huge datasets
R-BPE: Improving BPE-Tokenizers with Token Reuse
implementation of Byte-Pair Encoding (BPE) for subword tokenization, written entirely in C++ . The tokenizer learns merges from raw text and supports encoding/decoding with UTF-8
(1) Train large language models to help people with automatic essay scoring. (2) Extract essay features and train new tokenizer to build tree models for score prediction.
🐍This is a fast, lightweight, and clean CPython extension for the Byte Pair Encoding (BPE) algorithm, which is commonly used in LLM tokenization and NLP tasks.
Multi-language BPE tokenizer implementation for Qwen3 models. Lightweight byte-pair encoding for C#/.NET
a parallel and minimal implementation of Byte Pair Encoding (BPE) from scratch in less than 200 lines of python.
[Rust] Unofficial implementation of "SuperBPE: Space Travel for Language Models" in Rust
GPT-style language model with Byte Pair Encoding tokenizer, built from scratch in PyTorch.
Visualize HuggingFace Byte-Pair Encoding (BPE) Tokenizer encoding process
Transformer Models for Humorous Text Generation. Fine-tuned on Russian jokes dataset with ALiBi, RoPE, GQA, and SwiGLU.Plus a custom Byte-level BPE tokenizer.
Add a description, image, and links to the bpe-tokenizer topic page so that developers can more easily learn about it.
To associate your repository with the bpe-tokenizer topic, visit your repo's landing page and select "manage topics."