## Description

When passing long texts via the `text_pair` argument to the tokenizer, performance degrades drastically compared to concatenating the texts manually and passing them as a single `text` input. The difference is not marginal: the `text_pair` call is roughly 50x slower and consumes more than 10 GB of RAM, whereas the manual approach is fast and barely uses any additional memory.
## Environment

- Model: `jhu-clsp/mmBERT-base`
- Library: `transformers` (HuggingFace)
- Python: 3.11.6
## Reproduction

```python
from time import time
from transformers import AutoTokenizer

model_name = "jhu-clsp/mmBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
sep = tokenizer.sep_token

length = 10000
text = "text " * length
context_parts = ["context " * length] * 10
context = " ".join(f"<start_of_turn>{t}<end_of_turn>" for t in context_parts)

# --- Approach 1: Manual concatenation (fast) ---
full_text = context + sep + text
start_time = time()
tok = tokenizer(text=full_text, truncation=True, max_length=256)
print(f"Approach 1 (single text): {time() - start_time:.3f}s")
# Output: ~0.307s, minimal RAM usage

# --- Approach 2: text_pair (slow) ---
start_time = time()
tok = tokenizer(text=context, text_pair=text, truncation=True, max_length=256)
print(f"Approach 2 (text_pair): {time() - start_time:.3f}s")
# Output: ~15.47s, >10 GB RAM
```
## Observed Behavior

| Approach | Time | RAM Impact |
|---|---|---|
| `text=full_text` (manual concat) | ~0.31s | negligible |
| `text=context, text_pair=text` | ~15.47s | >10 GB |
## Expected Behavior

With `truncation=True` and a small `max_length`, the tokenizer should not need to fully tokenize both inputs before truncating. Since only 256 tokens will be kept regardless, processing millions of tokens internally is wasteful. The tokenizer should apply truncation eagerly, ideally by pre-truncating the raw text or stopping tokenization early once the token budget is reached.
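Until this is addressed in the library, one mitigation consistent with this expectation is to pre-truncate the raw strings by characters before calling the tokenizer. The sketch below is only an illustration of that idea, not part of the transformers API; `CHARS_PER_TOKEN` is an assumed heuristic that may need to be raised for vocabularies with long tokens, and character-level pre-truncation can change results when one segment is much shorter than the other.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")

max_length = 256
CHARS_PER_TOKEN = 8                        # assumed heuristic, not a library value
char_budget = max_length * CHARS_PER_TOKEN

# Same toy inputs as in the reproduction above.
length = 10000
text = "text " * length
context = " ".join(
    f"<start_of_turn>{'context ' * length}<end_of_turn>" for _ in range(10)
)

# Keep only as many leading characters as could plausibly survive truncation,
# then let the tokenizer apply its normal pair truncation to the short slices.
tok = tokenizer(
    text=context[:char_budget],
    text_pair=text[:char_budget],
    truncation=True,
    max_length=max_length,
)
print(len(tok["input_ids"]))               # 256
```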
## Root Cause Hypothesis

When `text_pair` is used, both sequences appear to be fully tokenized first, producing token lists that can be millions of tokens long, before the truncation strategy (`longest_first`, `only_first`, etc.) is applied. This results in O(n) memory and likely O(n²) time complexity for very long inputs. Eager truncation at the character/subword level before full tokenization would avoid this entirely.
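As a rough check of this hypothesis, the sketch below (using the same toy inputs as the reproduction) times full tokenization of the raw strings on its own and then the pair call with truncation; comparing the two numbers indicates how much of the cost comes from tokenization itself versus the truncation step that follows.

```python
from time import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")

# Same toy inputs as in the reproduction above.
length = 10000
text = "text " * length
context = " ".join(
    f"<start_of_turn>{'context ' * length}<end_of_turn>" for _ in range(10)
)

# Full tokenization of both raw strings, with no truncation involved.
start = time()
n_tokens = len(tokenizer.tokenize(context)) + len(tokenizer.tokenize(text))
print(f"full tokenization: {time() - start:.3f}s, {n_tokens} tokens produced")

# The slow pair call from the reproduction.
start = time()
tokenizer(text=context, text_pair=text, truncation=True, max_length=256)
print(f"pair call with truncation: {time() - start:.3f}s")
```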