
tokenize: bound peak memory on outlier records #5231

Merged
ravwojdyla merged 4 commits into main from rav-enable-tokenize-split-by-default
Apr 28, 2026

Conversation

@ravwojdyla
Contributor

  • lib/marin/src/marin/processing/tokenize/tokenize.py: flip batch_processor._long_string_workaround = True unconditionally — Levanter's BatchTokenizer then splits each record longer than _workaround_len (10K chars) at safe whitespace boundaries before encode_batch, so a 64M-char outlier never reaches the underlying Rust tokenizer as one giant string
    • lib/levanter/src/levanter/data/text/_batch_tokenizer.py: rewrite the workaround path to bound peak Python memory (a sketch follows this list)
      • encode per-record so an outlier's pieces never coexist with the rest of the batch's encodings
      • new _encode_long_string helper accumulates ids in-place via ids.extend(...) instead of building a fresh concatenated list per record
      • sub-batch the per-outlier encode_batch calls in groups of _LONG_STRING_BATCH_SIZE (256) — peak in-flight strings + tokens for one outlier are bounded by one sub-batch, not the full record
      • prepend BOS via ids.insert(0, bos_id) instead of [bos_id, *ids] to skip a full-list copy at peak
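
A minimal sketch of that bounded path, to make the description concrete. This is not Levanter's actual code: only the helper name `_encode_long_string`, the sub-batch size `_LONG_STRING_BATCH_SIZE` (256), and the in-place `extend`/`insert` pattern come from this PR; the Hugging Face `tokenizers`-style `encode_batch`/`Encoding.ids` interface and the shape of the piece iterator are assumptions.

```python
from typing import Iterable

_LONG_STRING_BATCH_SIZE = 256  # pieces per encode_batch call (from the PR)


def _encode_long_string(tokenizer, pieces: Iterable[str], bos_id: int | None) -> list[int]:
    """Encode one outlier record piece by piece with bounded peak memory.

    ``pieces`` is a lazy iterable of whitespace-bounded chunks of the record;
    ``tokenizer`` is assumed to expose a ``tokenizers``-style ``encode_batch``
    returning ``Encoding`` objects with an ``.ids`` list.
    """
    ids: list[int] = []
    batch: list[str] = []

    def flush() -> None:
        # Encode one sub-batch and fold its ids into the running list, so at
        # most _LONG_STRING_BATCH_SIZE piece-encodings are alive at once.
        for enc in tokenizer.encode_batch(batch, add_special_tokens=False):
            ids.extend(enc.ids)
        batch.clear()

    for piece in pieces:
        batch.append(piece)
        if len(batch) >= _LONG_STRING_BATCH_SIZE:
            flush()
    if batch:
        flush()

    if bos_id is not None:
        # insert(0, ...) shifts in place; [bos_id, *ids] would briefly hold a
        # second full-size list at exactly the point memory is tightest.
        ids.insert(0, bos_id)
    return ids
```

The property that matters is that at any moment only one sub-batch of piece strings and their token ids is in flight, so peak memory for a 64M-char record is set by the sub-batch size rather than by the record length.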

@ravwojdyla
Contributor Author

@dlwh is there a reason why we should not enable _long_string_workaround by default?

tokens, regardless of how long the original text is.
"""
ids: list[int] = []
with log_time(f"BatchTokenizer encoded {len(text):,}-char outlier record"):
Member

do we need this log

Contributor Author

nope - I can remove it

Contributor Author

removed

ravwojdyla and others added 4 commits April 28, 2026 22:18
Multi-MB outliers (e.g. concatenated books) passed through whole would
OOM the worker via the underlying Rust tokenizer. The Levanter flag
splits per-record texts >10K chars at safe whitespace boundaries before
encode_batch and merges results back; no-op for short records.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
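
As a rough illustration of the "safe whitespace boundary" idea (a sketch, not the Levanter splitter): the 10K-char threshold is from this commit, while the choice to leave the whitespace at the start of the next chunk is an assumption about what keeps space-prefixed BPE tokens intact across a cut.

```python
from typing import Iterator

_WORKAROUND_LEN = 10_000  # records longer than this are split before encoding


def _iter_whitespace_pieces(text: str, max_len: int = _WORKAROUND_LEN) -> Iterator[str]:
    """Yield chunks of at most ``max_len`` chars, cutting just before a space.

    Yielding lazily means the caller never holds every chunk of a huge record
    at once, and short records come back as a single untouched chunk.
    """
    start = 0
    n = len(text)
    while start < n:
        end = min(start + max_len, n)
        if end < n:
            ws = text.rfind(" ", start + 1, end)
            if ws > start:
                end = ws  # cut before the space; it opens the next chunk
            # otherwise: no space in the window, fall back to a hard cut
        yield text[start:end]
        start = end
```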
Three coupled changes in BatchTokenizer.__call__:

1. Encode per-record instead of the whole batch's pieces at once. Outlier
   pieces never coexist with the rest of the batch's encodings.
2. Replace the two-pass split-then-merge (which materialized a fresh
   concatenated list per record) with in-place ``ids.extend(...)``
   accumulation in the new ``_encode_long_string`` helper.
3. Sub-batch the workaround's ``encode_batch`` calls in groups of
   ``_LONG_STRING_BATCH_SIZE`` (256) so peak in-flight strings + tokens
   for one outlier are bounded by one sub-batch, not the full record.

Also prepends BOS via ``ids.insert(0, bos_id)`` instead of
``[bos_id, *ids]`` to avoid a full-list copy on huge ids.

Deletes the now-unused ``_break_for_long_sequences``,
``_merge_split_encodings``, and ``itertools.chain`` import. Existing
parity tests (split-vs-whole equality and single-BOS invariant) still
pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
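
Composing the two sketches above, the per-record shape of change 1 could look like the following. This is a sketch of the idea, not the actual BatchTokenizer.__call__; `encode_records`, and the reuse of `_iter_whitespace_pieces` and `_encode_long_string` from the earlier sketches, are illustrative.

```python
def encode_records(tokenizer, batch: list[str], bos_id: int | None,
                   workaround_len: int = 10_000) -> list[list[int]]:
    # One record at a time: an outlier goes through the bounded split path and
    # its intermediate pieces/encodings are released before the next record is
    # touched, so they never coexist with the rest of the batch's encodings.
    out: list[list[int]] = []
    for text in batch:
        if len(text) > workaround_len:
            pieces = _iter_whitespace_pieces(text, workaround_len)
            out.append(_encode_long_string(tokenizer, pieces, bos_id))
        else:
            ids = list(tokenizer.encode(text, add_special_tokens=False).ids)
            if bos_id is not None:
                ids.insert(0, bos_id)
            out.append(ids)
    return out
```

The parity the existing tests check (split-vs-whole equality, single BOS) is what makes a boundary "safe": cuts have to land where the tokenizer would have split anyway.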
Wrap _encode_long_string in rigging.timing.log_time so wedged or
unexpectedly slow outliers surface in worker logs without needing
external profiling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
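
rigging.timing.log_time is a project utility whose code is not part of this PR; purely for illustration, a generic equivalent could be a context manager like this:

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def log_time(message: str):
    # Log the wall-clock time of the wrapped block so an unexpectedly slow or
    # wedged outlier record shows up in the worker logs with its size.
    start = time.monotonic()
    try:
        yield
    finally:
        logger.info("%s (%.1fs)", message, time.monotonic() - start)
```

This is the call the review comment above asked about (`with log_time(f"BatchTokenizer encoded {len(text):,}-char outlier record"):`), and it was dropped from the final version of the PR.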
@ravwojdyla ravwojdyla force-pushed the rav-enable-tokenize-split-by-default branch from a43e229 to a202a30 on April 28, 2026 20:18
@ravwojdyla ravwojdyla enabled auto-merge (squash) April 28, 2026 20:19
@ravwojdyla ravwojdyla merged commit 413277e into main Apr 28, 2026
42 checks passed
@ravwojdyla ravwojdyla deleted the rav-enable-tokenize-split-by-default branch April 28, 2026 20:29