
[Tokenizer] Add tiktoken format loader for OpenAI BPE vocabularies. (… #47404

Triggered via push March 5, 2026 22:14
Status: Failure
Total duration: 47m 14s

ci.yml

on: push
Matrix: runtime
Matrix: runtime_tracing
runtime_small: 3m 46s
linux_x64_bazel / linux_x64_bazel: 36m 41s
linux_x64_clang / linux_x64_clang: 26m 13s
linux_x64_clang_debug / linux_x64_clang_debug: 34m 28s
linux_x64_clang_asan / linux_x64_clang_asan: 11m 13s
linux_x64_clang_ubsan / linux_x64_clang_ubsan: 7m 9s
windows_x64_msvc / windows_x64_msvc: 25m 12s
linux_arm64_clang / linux_arm64_clang
linux_x64_clang_byollvm / linux_x64_clang_byollvm
linux_x64_clang_tsan / linux_x64_clang_tsan
linux_x64_gcc / linux_x64_gcc
macos_arm64_clang / macos_arm64_clang
macos_x64_clang / macos_x64_clang

Annotations

2 errors and 4 warnings
windows_x64_msvc / windows_x64_msvc
Process completed with exit code 1.
ci_summary / summary
Process completed with exit code 1.
runtime_tracing :: macos-14 :: tracy
ninja 1.13.2 is already installed and up-to-date. To reinstall 1.13.2, run: brew reinstall ninja
runtime_tracing :: macos-14 :: console
ninja 1.13.2 is already installed and up-to-date. To reinstall 1.13.2, run: brew reinstall ninja
runtime :: macos-14
ninja 1.13.2 is already installed and up-to-date. To reinstall 1.13.2, run: brew reinstall ninja
ci_summary / summary
embed field value must be shorter than 1024, got 4896

[`205b17f`](https://github.com/iree-org/iree/commit/205b17f142756058f731f2a14d010ca4ae2d6d2c) [Tokenizer] Add tiktoken format loader for OpenAI BPE vocabularies. (#23663)

OpenAI's tiktoken is the second major tokenizer format in the ML ecosystem (alongside HuggingFace's tokenizer.json). This adds a complete tiktoken loader so IREE can ingest tokenizer definitions from either ecosystem without external conversion tools.

## Loader

The tiktoken format stores BPE vocabularies as base64-encoded byte tokens with integer ranks — no explicit merge list, no regex patterns, no special tokens. The loader reconstructs the full BPE merge table from ranks alone via simulation: for each multi-byte token at rank R, it simulates BPE encoding of that token's raw bytes using only merges with rank < R; when two parts remain, those form the merge pair (see the Python sketch after the Integration Testing section below). This produces a tokenizer behaviorally indistinguishable from the HuggingFace equivalent.

Rank gaps are handled (p50k_base skips rank 50256, reserved for `<|endoftext|>`): zero-length placeholder entries fill the gaps to preserve the entry_index == rank invariant, and explicit token IDs are assigned to ensure correct vocab construction.

## Encoding Configs

All 7 standard OpenAI encoding names are supported via predefined configs:

| Encoding | BPE File | BPE Tokens | Special Tokens | Models |
|---|---|---|---|---|
| `cl100k_base` | cl100k_base.tiktoken | 100,256 | 5 | GPT-4, GPT-3.5-turbo, text-embedding-ada-002 |
| `o200k_base` | o200k_base.tiktoken | 199,998 | 2 | GPT-4o, GPT-4o-mini |
| `o200k_harmony` | o200k_base.tiktoken | 199,998 | 10 named | GPT-4o (ChatGPT message format) |
| `r50k_base` | r50k_base.tiktoken | 50,256 | 1 | GPT-3, text-davinci-002/003 |
| `gpt2` | r50k_base.tiktoken | 50,256 | 1 | GPT-2 (identical to r50k_base) |
| `p50k_base` | p50k_base.tiktoken | 50,280 | 1 | Codex, code-davinci-002 |
| `p50k_edit` | p50k_base.tiktoken | 50,280 | 4 | Codex edit models (adds FIM tokens) |

`iree_tokenizer_tiktoken_config_by_name()` resolves any of these names to a config. Custom encodings are supported via the public `iree_tokenizer_tiktoken_config_t` struct — populate it with your own regex pattern, special tokens, and IDs.

## Integration Testing

72 test cases across 4 BPE files (18 per encoding × 4 encodings), validated **token-for-token** against OpenAI's Python tiktoken library. The test corpus covers: ASCII, code, numbers, punctuation, mixed case, CJK, accented text, emoji, whitespace variations, empty strings, repeated characters, special characters, leading spaces, mixed scripts, long words, carriage returns, CRLF sequences, and special-token matching (`<|endoftext|>`).

Infrastructure:

- `generate_tiktoken_golden_ids.py` — generates golden token IDs from the Python tiktoken library into `tokenizer_corpus.json`
- `tiktoken_smoketest.py` — downloads `.tiktoken` files from OpenAI's CDN, runs IREE's tokenizer against the corpus, and compares output against the goldens
- `run_tiktoken_smoketest.sh` — uvx wrapper that installs dependencies and invokes the smoketest
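To make the rank-simulation idea from the Loader section concrete, here is a minimal Python sketch of the same reconstruction over a parsed `.tiktoken` rank table. It is illustrative only, not the PR's C implementation; `load_tiktoken_bpe`, `bpe_with_rank_limit`, and `recover_merge_pairs` are names invented for this sketch, and the zero-length rank-gap placeholders described above are omitted for brevity.

```python
import base64

def load_tiktoken_bpe(path):
    """Parse a .tiktoken file: one 'base64(token) rank' pair per line."""
    ranks = {}
    with open(path, "rb") as f:
        for line in f:
            if line.strip():
                token_b64, rank = line.split()
                ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks

def bpe_with_rank_limit(ranks, token, max_rank):
    """Byte-pair-merge `token`'s raw bytes, using only merges ranked < max_rank."""
    parts = [bytes([b]) for b in token]
    while len(parts) > 1:
        best_idx, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and rank < max_rank and (
                best_rank is None or rank < best_rank
            ):
                best_idx, best_rank = i, rank
        if best_idx is None:
            break  # no merge below the rank limit applies
        parts[best_idx:best_idx + 2] = [parts[best_idx] + parts[best_idx + 1]]
    return parts

def recover_merge_pairs(ranks):
    """For each multi-byte token at rank R, simulate BPE using only merges
    ranked < R; the two surviving parts are exactly that token's merge pair."""
    merges = {}
    for token, rank in ranks.items():
        if len(token) == 1:
            continue  # single bytes are base vocabulary, not merges
        parts = bpe_with_rank_limit(ranks, token, max_rank=rank)
        assert len(parts) == 2, f"rank {rank}: expected 2 parts, got {len(parts)}"
        merges[rank] = (parts[0], parts[1])
    return merges
```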
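The custom-config escape hatch (`iree_tokenizer_tiktoken_config_t`) parallels how the Python library lets you define your own encoding. For comparison, the Python-side equivalent looks roughly like this; the `cl100k_custom` name and `<|my_sep|>` token are made up for illustration, and the private-attribute access mirrors the "extending tiktoken" example in the library's own README:

```python
import tiktoken

base = tiktoken.get_encoding("cl100k_base")

# Build a custom encoding: reuse cl100k_base's split regex and BPE ranks,
# append one extra special token at the next free id (base.n_vocab).
custom = tiktoken.Encoding(
    name="cl100k_custom",
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={**base._special_tokens, "<|my_sep|>": base.n_vocab},
)

# Special tokens must be explicitly allowed at encode time.
ids = custom.encode("left<|my_sep|>right", allowed_special={"<|my_sep|>"})
print(ids)
```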
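The golden-ID generation step described under Integration Testing boils down to encoding a fixed corpus with the reference library and dumping the IDs. A minimal sketch of that flow, not the actual `generate_tiktoken_golden_ids.py` (the corpus strings and JSON layout here are stand-ins):

```python
import json
import tiktoken

# Stand-in corpus; the real tokenizer_corpus.json covers ASCII, code, CJK,
# emoji, whitespace edge cases, CRLF, special tokens, and more.
CORPUS = ["hello world", "    if x:\r\n        y()", "汉字と日本語", "<|endoftext|>"]
ENCODINGS = ["cl100k_base", "o200k_base", "r50k_base", "p50k_base"]

golden = {}
for name in ENCODINGS:
    enc = tiktoken.get_encoding(name)
    # allowed_special="all" lets <|endoftext|> match as a special token.
    golden[name] = [enc.encode(text, allowed_special="all") for text in CORPUS]

with open("tokenizer_corpus.json", "w", encoding="utf-8") as f:
    json.dump({"corpus": CORPUS, "golden": golden}, f, ensure_ascii=False, indent=2)
```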
## Performance

Benchmark: `comprehensive_benchmark` with O3/march=native/thin_lto, 256KB per-corpus text (ASCII/CJK/Code), single-threaded, cache-hot.

**IREE tiktoken encode throughput (one-shot, MB/s):**

| Encoding | ASCII | CJK | Code |
|---|---|---|---|
| cl100k_base | 70.2 | 63.3 | 69.4 |
| o200k_base | 70.1 | 59.4 | 68.7 |
| r50k_base | 67.6 | 62.2 | 67.6 |
| p50k_base | 68.3 | 62.4 | 6.5 ¹ |

**vs Python tiktoken (code one-shot, 256KB):**

| Encoding | IREE | Python tiktoken | Speedup |
|---|---|---|---|
| cl100k_base | 69.4 MB/s | 13.6 MB/s | **5.1×** |
| o200k_base | 68.7 MB/s | 8.1 MB/s | **8.5×** |
| r50k_base | 67.6 MB/s | 13.7 MB/s | **4.9×** |
| p50k_base | 6.5 MB/s | 13.4 MB/s | 0.5× ¹ |

**Decode throughput:** ~2 GB/s across all encodings (decode is a simple vocab lookup).

¹ p50k_base Code is slower due to 24 extra whitespace tokens (2-25 consecutive spaces, the Codex indentation vocabulary) that cause combinatorial work in BPE pair merging.
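The Python-side baseline numbers above can be reproduced in spirit with a simple cache-hot loop. A hedged sketch (the corpus text, iteration count, and `encode_mbps` helper are arbitrary choices for illustration, not the PR's `comprehensive_benchmark`):

```python
import time
import tiktoken

def encode_mbps(enc, text, iters=10):
    """One-shot encode throughput in MB/s, cache-hot (one warm-up pass)."""
    nbytes = len(text.encode("utf-8"))
    enc.encode(text)  # warm-up so caches and lazy init don't skew timing
    start = time.perf_counter()
    for _ in range(iters):
        enc.encode(text)
    return nbytes * iters / (time.perf_counter() - start) / 1e6

text = "def f(x):\n    return x + 1\n\n" * 9000  # ~250 KB of code-like ASCII
for name in ("cl100k_base", "o200k_base", "r50k_base", "p50k_base"):
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {encode_mbps(enc, text):.1f} MB/s")
```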