
tokenize: bound peak memory on outlier records #5231

Merged
ravwojdyla merged 4 commits into main from rav-enable-tokenize-split-by-default
Apr 28, 2026

Conversation

@ravwojdyla
Contributor

  • lib/marin/src/marin/processing/tokenize/tokenize.py: flip batch_processor._long_string_workaround = True unconditionally — Levanter's BatchTokenizer then splits each record longer than _workaround_len (10K chars) at safe whitespace boundaries before encode_batch, so a 64M-char outlier never reaches the underlying Rust tokenizer as one giant string
    • lib/levanter/src/levanter/data/text/_batch_tokenizer.py: rewrite the workaround path to bound peak Python memory (a sketch follows this list)
      • encode per-record so an outlier's pieces never coexist with the rest of the batch's encodings
      • new _encode_long_string helper accumulates ids in-place via ids.extend(...) instead of building a fresh concatenated list per record
      • sub-batch the per-outlier encode_batch calls in groups of _LONG_STRING_BATCH_SIZE (256) — peak in-flight strings + tokens for one outlier are bounded by one sub-batch, not the full record
      • prepend BOS via ids.insert(0, bos_id) instead of [bos_id, *ids] to skip a full-list copy at peak
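
A minimal sketch of that bounded path, to make the description concrete. This is not Levanter's actual code: only the helper name `_encode_long_string`, the sub-batch size `_LONG_STRING_BATCH_SIZE` (256), and the in-place `extend`/`insert` pattern come from this PR; the Hugging Face `tokenizers`-style `encode_batch`/`Encoding.ids` interface and the shape of the piece iterator are assumptions.

```python
from typing import Iterable

_LONG_STRING_BATCH_SIZE = 256  # pieces per encode_batch call (from the PR)


def _encode_long_string(tokenizer, pieces: Iterable[str], bos_id: int | None) -> list[int]:
    """Encode one outlier record piece by piece with bounded peak memory.

    ``pieces`` is a lazy iterable of whitespace-bounded chunks of the record;
    ``tokenizer`` is assumed to expose a ``tokenizers``-style ``encode_batch``
    returning ``Encoding`` objects with an ``.ids`` list.
    """
    ids: list[int] = []
    batch: list[str] = []

    def flush() -> None:
        # Encode one sub-batch and fold its ids into the running list, so at
        # most _LONG_STRING_BATCH_SIZE piece-encodings are alive at once.
        for enc in tokenizer.encode_batch(batch, add_special_tokens=False):
            ids.extend(enc.ids)
        batch.clear()

    for piece in pieces:
        batch.append(piece)
        if len(batch) >= _LONG_STRING_BATCH_SIZE:
            flush()
    if batch:
        flush()

    if bos_id is not None:
        # insert(0, ...) shifts in place; [bos_id, *ids] would briefly hold a
        # second full-size list at exactly the point memory is tightest.
        ids.insert(0, bos_id)
    return ids
```

The property that matters is that at any moment only one sub-batch of piece strings and their token ids is in flight, so peak memory for a 64M-char record is set by the sub-batch size rather than by the record length.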

@ravwojdyla
Contributor Author

@dlwh is there a reason why we should not enable _long_string_workaround by default?

tokens, regardless of how long the original text is.
"""
ids: list[int] = []
with log_time(f"BatchTokenizer encoded {len(text):,}-char outlier record"):
Member

do we need this log

Contributor Author

nope - I can remove it

Contributor Author

removed

ravwojdyla and others added 4 commits April 28, 2026 22:18
Multi-MB outliers (e.g. concatenated books) passed through whole would
OOM the worker via the underlying Rust tokenizer. The Levanter flag
splits per-record texts >10K chars at safe whitespace boundaries before
encode_batch and merges results back; no-op for short records.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
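
As a rough illustration of the "safe whitespace boundary" idea (a sketch, not the Levanter splitter): the 10K-char threshold is from this commit, while the choice to leave the whitespace at the start of the next chunk is an assumption about what keeps space-prefixed BPE tokens intact across a cut.

```python
from typing import Iterator

_WORKAROUND_LEN = 10_000  # records longer than this are split before encoding


def _iter_whitespace_pieces(text: str, max_len: int = _WORKAROUND_LEN) -> Iterator[str]:
    """Yield chunks of at most ``max_len`` chars, cutting just before a space.

    Yielding lazily means the caller never holds every chunk of a huge record
    at once, and short records come back as a single untouched chunk.
    """
    start = 0
    n = len(text)
    while start < n:
        end = min(start + max_len, n)
        if end < n:
            ws = text.rfind(" ", start + 1, end)
            if ws > start:
                end = ws  # cut before the space; it opens the next chunk
            # otherwise: no space in the window, fall back to a hard cut
        yield text[start:end]
        start = end
```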
Three coupled changes in BatchTokenizer.__call__:

1. Encode per-record instead of the whole batch's pieces at once. Outlier
   pieces never coexist with the rest of the batch's encodings.
2. Replace the two-pass split-then-merge (which materialized a fresh
   concatenated list per record) with in-place ``ids.extend(...)``
   accumulation in the new ``_encode_long_string`` helper.
3. Sub-batch the workaround's ``encode_batch`` calls in groups of
   ``_LONG_STRING_BATCH_SIZE`` (256) so peak in-flight strings + tokens
   for one outlier are bounded by one sub-batch, not the full record.

Also prepends BOS via ``ids.insert(0, bos_id)`` instead of
``[bos_id, *ids]`` to avoid a full-list copy on huge ids.

Deletes the now-unused ``_break_for_long_sequences``,
``_merge_split_encodings``, and ``itertools.chain`` import. Existing
parity tests (split-vs-whole equality and single-BOS invariant) still
pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
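
Composing the two sketches above, the per-record shape of change 1 could look like the following. This is a sketch of the idea, not the actual BatchTokenizer.__call__; `encode_records`, and the reuse of `_iter_whitespace_pieces` and `_encode_long_string` from the earlier sketches, are illustrative.

```python
def encode_records(tokenizer, batch: list[str], bos_id: int | None,
                   workaround_len: int = 10_000) -> list[list[int]]:
    # One record at a time: an outlier goes through the bounded split path and
    # its intermediate pieces/encodings are released before the next record is
    # touched, so they never coexist with the rest of the batch's encodings.
    out: list[list[int]] = []
    for text in batch:
        if len(text) > workaround_len:
            pieces = _iter_whitespace_pieces(text, workaround_len)
            out.append(_encode_long_string(tokenizer, pieces, bos_id))
        else:
            ids = list(tokenizer.encode(text, add_special_tokens=False).ids)
            if bos_id is not None:
                ids.insert(0, bos_id)
            out.append(ids)
    return out
```

The parity the existing tests check (split-vs-whole equality, single BOS) is what makes a boundary "safe": cuts have to land where the tokenizer would have split anyway.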
Wrap _encode_long_string in rigging.timing.log_time so wedged or
unexpectedly slow outliers surface in worker logs without needing
external profiling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
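
rigging.timing.log_time is a project utility whose code is not part of this PR; purely for illustration, a generic equivalent could be a context manager like this:

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def log_time(message: str):
    # Log the wall-clock time of the wrapped block so an unexpectedly slow or
    # wedged outlier record shows up in the worker logs with its size.
    start = time.monotonic()
    try:
        yield
    finally:
        logger.info("%s (%.1fs)", message, time.monotonic() - start)
```

This is the call the review comment above asked about (`with log_time(f"BatchTokenizer encoded {len(text):,}-char outlier record"):`), and it was dropped from the final version of the PR.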
@ravwojdyla ravwojdyla force-pushed the rav-enable-tokenize-split-by-default branch from a43e229 to a202a30 on April 28, 2026 20:18
@ravwojdyla ravwojdyla enabled auto-merge (squash) April 28, 2026 20:19
@ravwojdyla ravwojdyla merged commit 413277e into main Apr 28, 2026
42 checks passed
@ravwojdyla ravwojdyla deleted the rav-enable-tokenize-split-by-default branch April 28, 2026 20:29