Massive performance degradation and memory spike when using text_pair with long inputs #2035

@Leonater

Description


When long texts are passed via the text_pair argument to the tokenizer, performance degrades drastically compared to concatenating the texts manually and passing them as a single text input. The difference is not marginal: the text_pair call is roughly 50x slower and consumes more than 10 GB of RAM, while the manual approach is fast and uses barely any additional memory.

Environment

  • Model: jhu-clsp/mmBERT-base
  • Library: transformers (HuggingFace)
  • Python: 3.11.6

Reproduction

from time import time
from transformers import AutoTokenizer

model_name = "jhu-clsp/mmBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

sep = tokenizer.sep_token
length = 10000
text = "text " * length

context_parts = ["context " * length] * 10
context = " ".join(f"<start_of_turn>{t}<end_of_turn>" for t in context_parts)

# --- Approach 1: Manual concatenation (fast) ---
full_text = context + sep + text

start_time = time()
tok = tokenizer(text=full_text, truncation=True, max_length=256)
print(f"Approach 1 (single text): {time() - start_time:.3f}s")
# Output: ~0.307s, minimal RAM usage

# --- Approach 2: text_pair (slow) ---
start_time = time()
tok = tokenizer(text=context, text_pair=text, truncation=True, max_length=256)
print(f"Approach 2 (text_pair):   {time() - start_time:.3f}s")
# Output: ~15.47s, >10 GB RAM
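
To make the memory numbers reproducible in the same script, peak usage can be read after each approach. One option on Unix-like systems is ru_maxrss from the standard resource module (units are kilobytes on Linux and bytes on macOS):

import resource

# Peak resident set size of the process so far; print once after each
# approach and compare the two readings.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Peak RSS so far: {peak} (kB on Linux, bytes on macOS)")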

Observed Behavior

| Approach                       | Time     | RAM impact |
|--------------------------------|----------|------------|
| text=full_text (manual concat) | ~0.31 s  | negligible |
| text=context, text_pair=text   | ~15.47 s | >10 GB     |

Expected Behavior

With truncation=True and a small max_length, the tokenizer should not need to fully tokenize both inputs before truncating. Since only 256 tokens will be kept regardless, processing millions of tokens internally is wasteful. The tokenizer should apply truncation eagerly, ideally by pre-truncating the raw text or stopping tokenization early once the budget is reached.
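
Until then, one possible workaround is to pre-truncate the raw strings at the character level before tokenization. The sketch below assumes a generous CHARS_PER_TOKEN upper bound so that enough characters survive to fill max_length tokens; pre_truncate and CHARS_PER_TOKEN are illustrative names, not transformers API:

CHARS_PER_TOKEN = 8  # assumed upper bound on characters per subword token

def pre_truncate(s: str, max_tokens: int) -> str:
    # Keeping surplus characters is safe: the tokenizer's own truncation
    # still discards anything beyond max_length afterwards.
    return s[: max_tokens * CHARS_PER_TOKEN]

max_length = 256
tok = tokenizer(
    text=pre_truncate(context, max_length),
    text_pair=pre_truncate(text, max_length),
    truncation=True,
    max_length=max_length,
)

This bounds each input to a few thousand characters regardless of the original size, so the paired call never sees more text than it can possibly keep.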

Root Cause Hypothesis

When text_pair is used, both sequences appear to be fully tokenized first, producing token lists that can be millions of tokens long, before the truncation strategy (longest_first, only_first, etc.) is applied. This results in O(n) memory and likely O(n²) time complexity for very long inputs. Eager truncation at the character/subword level before full tokenization would avoid this entirely.
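
A quick way to localize the cost (a sketch; no numbers asserted) is to time full tokenization of each input in isolation and compare it against the paired call. If the standalone runs are fast, the overhead sits in the truncation strategy rather than in tokenization itself:

from time import time

# Time full tokenization of each input alone, without truncation.
for name, s in [("context", context), ("text", text)]:
    start_time = time()
    n_tokens = len(tokenizer(text=s, truncation=False)["input_ids"])
    print(f"{name}: {time() - start_time:.3f}s for {n_tokens} tokens")

# Compare against the paired call with the longest_first strategy.
start_time = time()
tokenizer(text=context, text_pair=text, truncation="longest_first", max_length=256)
print(f"pair + longest_first truncation: {time() - start_time:.3f}s")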
