# [`205b17f`](https://github.com/iree-org/iree/commit/205b17f142756058f731f2a14d010ca4ae2d6d2c) [Tokenizer] Add tiktoken format loader for OpenAI BPE vocabularies. (#23663)
OpenAI's tiktoken is the second major tokenizer format in the ML ecosystem (alongside HuggingFace's `tokenizer.json`). This adds a complete tiktoken loader so IREE can ingest tokenizer definitions from either ecosystem without external conversion tools.
## Loader
The tiktoken format stores BPE vocabularies as base64-encoded byte tokens with integer ranks: no explicit merge list, no regex patterns, no special tokens. The loader reconstructs the full BPE merge table from ranks alone via simulation: for each multi-byte token at rank R, it simulates BPE encoding of that token's raw bytes using only merges with rank < R; when the simulation leaves exactly two parts, those parts form the merge pair for rank R. This produces a tokenizer behaviorally indistinguishable from the HuggingFace equivalent.
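A minimal Python sketch of this rank-driven merge recovery (illustration only, not IREE's C implementation; `ranks` is assumed to map token bytes to their integer rank, as parsed from a `.tiktoken` file):

```python
def recover_merge(token: bytes, rank: int,
                  ranks: dict[bytes, int]) -> tuple[bytes, bytes]:
    """Recover the merge pair that produces `token` at `rank`."""
    # Start from individual bytes and repeatedly apply the lowest-rank
    # applicable merge, considering only merges with rank < `rank`.
    parts = [token[i:i + 1] for i in range(len(token))]
    while len(parts) > 2:
        best_idx, best_rank = None, rank
        for i in range(len(parts) - 1):
            r = ranks.get(parts[i] + parts[i + 1])
            if r is not None and r < best_rank:
                best_idx, best_rank = i, r
        if best_idx is None:
            break  # no applicable merge; vocabulary would be malformed
        parts[best_idx:best_idx + 2] = [parts[best_idx] + parts[best_idx + 1]]
    return parts[0], parts[1]
```

Running this for every multi-byte token in rank order yields the complete merge table.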
Rank gaps are handled (p50k_base skips rank 50256, which is reserved for `<|endoftext|>`): zero-length placeholder entries fill the gaps to preserve the `entry_index == rank` invariant, and explicit token IDs are assigned to ensure correct vocab construction.
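In sketch form (again illustrative Python, reusing the `ranks` map from above):

```python
# Fill rank gaps with zero-length placeholders so that entry i always
# corresponds to rank i (e.g. rank 50256 is absent from p50k_base.tiktoken).
entries = [b""] * (max(ranks.values()) + 1)
for token, rank in ranks.items():
    entries[rank] = token
```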
## Encoding Configs
All 7 standard OpenAI encoding names are supported via predefined
configs:
| Encoding | BPE File | BPE Tokens | Special Tokens | Models |
|---|---|---|---|---|
| `cl100k_base` | cl100k_base.tiktoken | 100,256 | 5 | GPT-4, GPT-3.5-turbo, text-embedding-ada-002 |
| `o200k_base` | o200k_base.tiktoken | 199,998 | 2 | GPT-4o, GPT-4o-mini |
| `o200k_harmony` | o200k_base.tiktoken | 199,998 | 10 named | GPT-4o (ChatGPT message format) |
| `r50k_base` | r50k_base.tiktoken | 50,256 | 1 | GPT-3, text-davinci-002/003 |
| `gpt2` | r50k_base.tiktoken | 50,256 | 1 | GPT-2 (identical to r50k_base) |
| `p50k_base` | p50k_base.tiktoken | 50,280 | 1 | Codex, code-davinci-002 |
| `p50k_edit` | p50k_base.tiktoken | 50,280 | 4 | Codex edit models (adds FIM tokens) |
`iree_tokenizer_tiktoken_config_by_name()` resolves any of these names
to a config. Custom encodings are supported via the public
`iree_tokenizer_tiktoken_config_t` struct — populate it with your own
regex pattern, special tokens, and IDs.
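For reference, the Python tiktoken library expresses the same custom-encoding data through its `Encoding` constructor; a hedged sketch of the analogous fields (the encoding name and the extra special token here are hypothetical):

```python
import tiktoken

# A custom encoding carries the same ingredients iree_tokenizer_tiktoken_config_t
# does: a regex pre-tokenization pattern, the BPE ranks, and special tokens
# with explicit IDs. Shown here via the Python tiktoken library.
base = tiktoken.get_encoding("cl100k_base")
custom = tiktoken.Encoding(
    name="cl100k_chat",                     # hypothetical custom encoding name
    pat_str=base._pat_str,                  # regex split pattern
    mergeable_ranks=base._mergeable_ranks,  # BPE vocabulary ranks
    special_tokens={
        **base._special_tokens,
        "<|my_marker|>": 100300,            # hypothetical extra special token
    },
)
```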
## Integration Testing
72 test cases across 4 BPE files (18 per encoding × 4 encodings), validated **token-for-token** against OpenAI's Python tiktoken library.
Test corpus covers: ASCII, code, numbers, punctuation, mixed case, CJK,
accented text, emoji, whitespace variations, empty strings, repeated
characters, special characters, leading spaces, mixed scripts, long
words, carriage returns, CRLF sequences, and special token matching
(`<|endoftext|>`).
Infrastructure:
- `generate_tiktoken_golden_ids.py` — generates golden token IDs from
the Python tiktoken library into `tokenizer_corpus.json`
- `tiktoken_smoketest.py` — downloads `.tiktoken` files from OpenAI's
CDN, runs IREE's tokenizer against the corpus, and compares output
against goldens
- `run_tiktoken_smoketest.sh` — uvx wrapper that installs dependencies
and invokes the smoketest
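Roughly how the golden IDs are produced (a hedged Python sketch against the real tiktoken library; the corpus strings below are illustrative stand-ins, not the actual test corpus):

```python
import json
import tiktoken

CORPUS = ["hello world", "def f(x):\n    return x", "你好，世界", "<|endoftext|>"]
golden = {}
for name in ("cl100k_base", "o200k_base", "r50k_base", "p50k_base"):
    enc = tiktoken.get_encoding(name)
    # allowed_special="all" lets special tokens like <|endoftext|> encode
    # to their reserved IDs instead of raising an error.
    golden[name] = [enc.encode(text, allowed_special="all") for text in CORPUS]
with open("tokenizer_corpus.json", "w") as f:
    json.dump(golden, f, indent=2)
```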
## Performance
Benchmark: `comprehensive_benchmark` with O3/march=native/thin_lto,
256KB per-corpus text (ASCII/CJK/Code), single-threaded, cache-hot.
**IREE tiktoken encode throughput (one-shot, MB/s):**
| Encoding | ASCII | CJK | Code |
|---|---|---|---|
| cl100k_base | 70.2 | 63.3 | 69.4 |
| o200k_base | 70.1 | 59.4 | 68.7 |
| r50k_base | 67.6 | 62.2 | 67.6 |
| p50k_base | 68.3 | 62.4 | 6.5 ¹ |
**vs Python tiktoken (code one-shot, 256KB):**
| Encoding | IREE | Python tiktoken | Speedup |
|---|---|---|---|
| cl100k_base | 69.4 MB/s | 13.6 MB/s | **5.1×** |
| o200k_base | 68.7 MB/s | 8.1 MB/s | **8.5×** |
| r50k_base | 67.6 MB/s | 13.7 MB/s | **4.9×** |
| p50k_base | 6.5 MB/s | 13.4 MB/s | 0.5× ¹ |
**Decode throughput:** ~2 GB/s across all encodings (decode is a simple
vocab lookup).
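In sketch form (hedged; assuming `vocab` maps token ID to its raw bytes):

```python
# Decode is a per-ID byte-string lookup plus concatenation, which is why
# it runs at memory-bandwidth-like speeds.
def decode(ids: list[int], vocab: list[bytes]) -> bytes:
    return b"".join(vocab[i] for i in ids)
```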
¹ p50k_base Code is slower due to 24 extra whitespace tokens (2-25 consecutive spaces, Codex indentation vocabulary) that cause combinatorial work in BPE pair merging.