## Description

When passing long texts via the `text_pair` argument to the tokenizer, performance degrades drastically compared to concatenating the texts manually and passing them as a single `text` input. The difference is not marginal: the `text_pair` call is roughly 50x slower and consumes more than 10 GB of RAM, whereas the manual approach is fast and barely uses any additional memory.
## Environment

- Model: `jhu-clsp/mmBERT-base`
- Library: `transformers` (HuggingFace)
- Python: 3.11.6
## Reproduction

```python
from time import time
from transformers import AutoTokenizer

model_name = "jhu-clsp/mmBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
sep = tokenizer.sep_token

length = 10000
text = "text " * length
context_parts = ["context " * length] * 10
context = " ".join(f"<start_of_turn>{t}<end_of_turn>" for t in context_parts)

# --- Approach 1: Manual concatenation (fast) ---
full_text = context + sep + text
start_time = time()
tok = tokenizer(text=full_text, truncation=True, max_length=256)
print(f"Approach 1 (single text): {time() - start_time:.3f}s")
# Output: ~0.307s, minimal RAM usage

# --- Approach 2: text_pair (slow) ---
start_time = time()
tok = tokenizer(text=context, text_pair=text, truncation=True, max_length=256)
print(f"Approach 2 (text_pair): {time() - start_time:.3f}s")
# Output: ~15.47s, >10 GB RAM
```
## Observed Behavior

| Approach | Time | RAM Impact |
|---|---|---|
| `text=full_text` (manual concat) | ~0.31s | negligible |
| `text=context, text_pair=text` | ~15.47s | >10 GB |
## Expected Behavior

With `truncation=True` and a small `max_length`, the tokenizer should not need to fully tokenize both inputs before truncating. Since only 256 tokens will be kept regardless, processing millions of tokens internally is wasteful. The tokenizer should apply truncation eagerly, ideally by pre-truncating the raw text or stopping tokenization early once the token budget is reached.
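Until this is addressed in the library, one mitigation consistent with this expectation is to pre-truncate the raw strings by characters before calling the tokenizer. The sketch below is only an illustration of that idea, not part of the transformers API; `CHARS_PER_TOKEN` is an assumed heuristic that may need to be raised for vocabularies with long tokens, and character-level pre-truncation can change results when one segment is much shorter than the other.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")

max_length = 256
CHARS_PER_TOKEN = 8                        # assumed heuristic, not a library value
char_budget = max_length * CHARS_PER_TOKEN

# Same toy inputs as in the reproduction above.
length = 10000
text = "text " * length
context = " ".join(
    f"<start_of_turn>{'context ' * length}<end_of_turn>" for _ in range(10)
)

# Keep only as many leading characters as could plausibly survive truncation,
# then let the tokenizer apply its normal pair truncation to the short slices.
tok = tokenizer(
    text=context[:char_budget],
    text_pair=text[:char_budget],
    truncation=True,
    max_length=max_length,
)
print(len(tok["input_ids"]))               # 256
```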
## Root Cause Hypothesis

When `text_pair` is used, both sequences appear to be fully tokenized first, producing token lists that can be millions of tokens long, before the truncation strategy (`longest_first`, `only_first`, etc.) is applied. This results in O(n) memory and likely O(n²) time complexity for very long inputs. Eager truncation at the character/subword level before full tokenization would avoid this entirely.
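As a rough check of this hypothesis, the sketch below (using the same toy inputs as the reproduction) times full tokenization of the raw strings on its own and then the pair call with truncation; comparing the two numbers indicates how much of the cost comes from tokenization itself versus the truncation step that follows.

```python
from time import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")

# Same toy inputs as in the reproduction above.
length = 10000
text = "text " * length
context = " ".join(
    f"<start_of_turn>{'context ' * length}<end_of_turn>" for _ in range(10)
)

# Full tokenization of both raw strings, with no truncation involved.
start = time()
n_tokens = len(tokenizer.tokenize(context)) + len(tokenizer.tokenize(text))
print(f"full tokenization: {time() - start:.3f}s, {n_tokens} tokens produced")

# The slow pair call from the reproduction.
start = time()
tokenizer(text=context, text_pair=text, truncation=True, max_length=256)
print(f"pair call with truncation: {time() - start:.3f}s")
```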