Skip to content

Latest commit

 

History

History
56 lines (47 loc) · 1.42 KB

File metadata and controls

56 lines (47 loc) · 1.42 KB

Tokenizer

Alt text

BPE tokenizer implemented entirely in Zig.

Example integration with LLMs at nnx-lm.

Requirement

zig v0.13.0

Install

git clone https://github.com/jaco-bro/tokenizer
cd tokenizer
zig build exe --release=fast

Usage

  • zig-out/bin/tokenizer_exe [--model MODEL_NAME] COMMAND INPUT
  • zig build run -- [--model MODEL_NAME] COMMAND INPUT
zig build run -- --encode "hello world"
zig build run -- --decode "{14990, 1879}"
zig build run -- --model "phi-4-4bit" --encode "hello world"
zig build run -- --model "phi-4-4bit" --decode "15339 1917"
zig build run -- --repo "Qwen" --model "Qwen3-0.6B" --encode "안녕"
zig build run -- --repo "Qwen" --model "Qwen3-0.6B" --decode "126246 144370"

Python (optional)

Tokenizer is also pip-installable for use from Python:

pip install tokenizerz
python

Usage:

>>> import tokenizerz
>>> tokenizer = tokenizerz.Tokenizer()
Directory 'Qwen2.5-Coder-0.5B' created successfully.
DL% UL%  Dled  Uled  Xfers  Live Total     Current  Left    Speed
100 --  6866k     0     1     0   0:00:01  0:00:01 --:--:-- 4904k
Download successful.
>>> tokens = tokenizer.encode("Hello, world!")
>>> print(tokens)
[9707, 11, 1879, 0]
>>> tokenizer.decode(tokens)
'Hello, world!'

Shell:

bpe --encode "hello world"