# [`205b17f`](https://github.com/iree-org/iree/commit/205b17f142756058f731f2a14d010ca4ae2d6d2c) [Tokenizer] Add tiktoken format loader for OpenAI BPE vocabularies. (#23663)
OpenAI's tiktoken is the second major tokenizer format in the ML ecosystem (alongside HuggingFace's `tokenizer.json`). This adds a complete tiktoken loader so IREE can ingest tokenizer definitions from either ecosystem without external conversion tools.
## Loader
The tiktoken format stores BPE vocabularies as base64-encoded byte tokens with integer ranks: no explicit merge list, no regex patterns, no special tokens. The loader reconstructs the full BPE merge table from ranks alone via simulation: for each multi-byte token at rank R, it simulates BPE encoding of that token's raw bytes using only merges with rank < R; when the simulation leaves exactly two parts, those parts form the merge pair for rank R. This produces a tokenizer behaviorally indistinguishable from the HuggingFace equivalent.
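A minimal Python sketch of this rank-driven merge recovery (illustration only, not IREE's C implementation; `ranks` is assumed to map token bytes to their integer rank, as parsed from a `.tiktoken` file):

```python
def recover_merge(token: bytes, rank: int,
                  ranks: dict[bytes, int]) -> tuple[bytes, bytes]:
    """Recover the merge pair that produces `token` at `rank`."""
    # Start from individual bytes and repeatedly apply the lowest-rank
    # applicable merge, considering only merges with rank < `rank`.
    parts = [token[i:i + 1] for i in range(len(token))]
    while len(parts) > 2:
        best_idx, best_rank = None, rank
        for i in range(len(parts) - 1):
            r = ranks.get(parts[i] + parts[i + 1])
            if r is not None and r < best_rank:
                best_idx, best_rank = i, r
        if best_idx is None:
            break  # no applicable merge; vocabulary would be malformed
        parts[best_idx:best_idx + 2] = [parts[best_idx] + parts[best_idx + 1]]
    return parts[0], parts[1]
```

Running this for every multi-byte token in rank order yields the complete merge table.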
Rank gaps are handled (p50k_base skips rank 50256, which is reserved for `<|endoftext|>`): zero-length placeholder entries fill the gaps to preserve the `entry_index == rank` invariant, and explicit token IDs are assigned to ensure correct vocab construction.
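In sketch form (again illustrative Python, reusing the `ranks` map from above):

```python
# Fill rank gaps with zero-length placeholders so that entry i always
# corresponds to rank i (e.g. rank 50256 is absent from p50k_base.tiktoken).
entries = [b""] * (max(ranks.values()) + 1)
for token, rank in ranks.items():
    entries[rank] = token
```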
## Encoding Configs
All 7 standard OpenAI encoding names are supported via predefined
configs:
| Encoding | BPE File | BPE Tokens | Special Tokens | Models |
|---|---|---|---|---|
| `cl100k_base` | cl100k_base.tiktoken | 100,256 | 5 | GPT-4, GPT-3.5-turbo, text-embedding-ada-002 |
| `o200k_base` | o200k_base.tiktoken | 199,998 | 2 | GPT-4o, GPT-4o-mini |
| `o200k_harmony` | o200k_base.tiktoken | 199,998 | 10 named | GPT-4o (ChatGPT message format) |
| `r50k_base` | r50k_base.tiktoken | 50,256 | 1 | GPT-3, text-davinci-002/003 |
| `gpt2` | r50k_base.tiktoken | 50,256 | 1 | GPT-2 (identical to r50k_base) |
| `p50k_base` | p50k_base.tiktoken | 50,280 | 1 | Codex, code-davinci-002 |
| `p50k_edit` | p50k_base.tiktoken | 50,280 | 4 | Codex edit models (adds FIM tokens) |
`iree_tokenizer_tiktoken_config_by_name()` resolves any of these names
to a config. Custom encodings are supported via the public
`iree_tokenizer_tiktoken_config_t` struct — populate it with your own
regex pattern, special tokens, and IDs.
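For reference, the Python tiktoken library expresses the same custom-encoding data through its `Encoding` constructor; a hedged sketch of the analogous fields (the encoding name and the extra special token here are hypothetical):

```python
import tiktoken

# A custom encoding carries the same ingredients iree_tokenizer_tiktoken_config_t
# does: a regex pre-tokenization pattern, the BPE ranks, and special tokens
# with explicit IDs. Shown here via the Python tiktoken library.
base = tiktoken.get_encoding("cl100k_base")
custom = tiktoken.Encoding(
    name="cl100k_chat",                     # hypothetical custom encoding name
    pat_str=base._pat_str,                  # regex split pattern
    mergeable_ranks=base._mergeable_ranks,  # BPE vocabulary ranks
    special_tokens={
        **base._special_tokens,
        "<|my_marker|>": 100300,            # hypothetical extra special token
    },
)
```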
## Integration Testing
72 test cases across 4 BPE files (18 per encoding × 4 encodings), validated **token-for-token** against OpenAI's Python tiktoken library.
Test corpus covers: ASCII, code, numbers, punctuation, mixed case, CJK,
accented text, emoji, whitespace variations, empty strings, repeated
characters, special characters, leading spaces, mixed scripts, long
words, carriage returns, CRLF sequences, and special token matching
(`<|endoftext|>`).
Infrastructure:
- `generate_tiktoken_golden_ids.py` — generates golden token IDs from
the Python tiktoken library into `tokenizer_corpus.json`
- `tiktoken_smoketest.py` — downloads `.tiktoken` files from OpenAI's
CDN, runs IREE's tokenizer against the corpus, and compares output
against goldens
- `run_tiktoken_smoketest.sh` — uvx wrapper that installs dependencies
and invokes the smoketest
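Roughly how the golden IDs are produced (a hedged Python sketch against the real tiktoken library; the corpus strings below are illustrative stand-ins, not the actual test corpus):

```python
import json
import tiktoken

CORPUS = ["hello world", "def f(x):\n    return x", "你好，世界", "<|endoftext|>"]
golden = {}
for name in ("cl100k_base", "o200k_base", "r50k_base", "p50k_base"):
    enc = tiktoken.get_encoding(name)
    # allowed_special="all" lets special tokens like <|endoftext|> encode
    # to their reserved IDs instead of raising an error.
    golden[name] = [enc.encode(text, allowed_special="all") for text in CORPUS]
with open("tokenizer_corpus.json", "w") as f:
    json.dump(golden, f, indent=2)
```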
## Performance
Benchmark: `comprehensive_benchmark` with O3/march=native/thin_lto,
256KB per-corpus text (ASCII/CJK/Code), single-threaded, cache-hot.
**IREE tiktoken encode throughput (one-shot, MB/s):**
| Encoding | ASCII | CJK | Code |
|---|---|---|---|
| cl100k_base | 70.2 | 63.3 | 69.4 |
| o200k_base | 70.1 | 59.4 | 68.7 |
| r50k_base | 67.6 | 62.2 | 67.6 |
| p50k_base | 68.3 | 62.4 | 6.5 ¹ |
**vs Python tiktoken (code one-shot, 256KB):**
| Encoding | IREE | Python tiktoken | Speedup |
|---|---|---|---|
| cl100k_base | 69.4 MB/s | 13.6 MB/s | **5.1×** |
| o200k_base | 68.7 MB/s | 8.1 MB/s | **8.5×** |
| r50k_base | 67.6 MB/s | 13.7 MB/s | **4.9×** |
| p50k_base | 6.5 MB/s | 13.4 MB/s | 0.5× ¹ |
**Decode throughput:** ~2 GB/s across all encodings (decode is a simple
vocab lookup).
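In sketch form (hedged; assuming `vocab` maps token ID to its raw bytes):

```python
# Decode is a per-ID byte-string lookup plus concatenation, which is why
# it runs at memory-bandwidth-like speeds.
def decode(ids: list[int], vocab: list[bytes]) -> bytes:
    return b"".join(vocab[i] for i in ids)
```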
¹ p50k_base Code is slower due to 24 extra whitespace tokens (2-25 consecutive spaces, Codex indentation vocabulary) that cause combinatorial work in BPE pair merging.