tokenize: bound peak memory on outlier records #5231
Merged
ravwojdyla merged 4 commits into main on Apr 28, 2026
Conversation
Contributor (Author)

@dlwh is there a reason why we should not enable
dlwh approved these changes on Apr 28, 2026
        tokens, regardless of how long the original text is.
        """
        ids: list[int] = []
        with log_time(f"BatchTokenizer encoded {len(text):,}-char outlier record"):
Contributor (Author)
nope - I can remove it
Multi-MB outliers (e.g. concatenated books) passed through whole would OOM the worker via the underlying Rust tokenizer. The Levanter flag splits per-record texts >10K chars at safe whitespace boundaries before encode_batch and merges results back; no-op for short records. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
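The splitting strategy described above can be sketched as follows. This is an illustrative stand-in, not Levanter's actual implementation: `split_at_whitespace` and `max_len` are hypothetical names, with the 10K-char threshold taken from the commit message. The key invariant is that chunks concatenate back to the original text and cuts land on whitespace where possible.

```python
def split_at_whitespace(text: str, max_len: int = 10_000) -> list[str]:
    """Split text into pieces of at most max_len chars, preferring
    whitespace boundaries so no word is cut mid-token. Joining the
    pieces reproduces the original text exactly."""
    if len(text) <= max_len:
        return [text]  # no-op for short records
    pieces: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + max_len, len(text))
        if end < len(text):
            # back up to the last whitespace inside the window, if any
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        pieces.append(text[start:end])
        start = end
    return pieces
```

Because the pieces are contiguous slices, encoding them separately and concatenating the token ids is equivalent to encoding the whole string, provided the tokenizer does not merge across whitespace (which is why the cut points matter).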
Three coupled changes in BatchTokenizer.__call__:

1. Encode per-record instead of the whole batch's pieces at once. Outlier pieces never coexist with the rest of the batch's encodings.
2. Replace the two-pass split-then-merge (which materialized a fresh concatenated list per record) with in-place ``ids.extend(...)`` accumulation in the new ``_encode_long_string`` helper.
3. Sub-batch the workaround's ``encode_batch`` calls in groups of ``_LONG_STRING_BATCH_SIZE`` (256) so peak in-flight strings + tokens for one outlier are bounded by one sub-batch, not the full record.

Also prepends BOS via ``ids.insert(0, bos_id)`` instead of ``[bos_id, *ids]`` to avoid a full-list copy on huge ids. Deletes the now-unused ``_break_for_long_sequences``, ``_merge_split_encodings``, and ``itertools.chain`` import. Existing parity tests (split-vs-whole equality and single-BOS invariant) still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
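A minimal sketch of the accumulation pattern items 2 and 3 describe, assuming a tokenizer exposing an ``encode_batch(list[str]) -> list[list[int]]`` callable (here passed in as a parameter so the sketch is self-contained; ``encode_long_string`` is a hypothetical stand-in for the ``_encode_long_string`` helper the commit names):

```python
_LONG_STRING_BATCH_SIZE = 256  # sub-batch size named in the commit message

def encode_long_string(chunks, encode_batch, bos_id=None):
    """Encode one outlier record's chunks in fixed-size sub-batches,
    extending a single ids list in place. Peak memory holds one
    sub-batch of strings plus its encodings, never the full record."""
    ids: list[int] = []
    for i in range(0, len(chunks), _LONG_STRING_BATCH_SIZE):
        sub = chunks[i : i + _LONG_STRING_BATCH_SIZE]
        for encoded in encode_batch(sub):
            ids.extend(encoded)  # in place: no fresh concatenated list
    if bos_id is not None:
        ids.insert(0, bos_id)  # avoids the [bos_id, *ids] full copy
    return ids
```

``ids.insert(0, ...)`` is O(n) too, but it shifts elements within the existing buffer rather than allocating a second full-size list, which is what matters at peak.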
Wrap _encode_long_string in rigging.timing.log_time so wedged or unexpectedly slow outliers surface in worker logs without needing external profiling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
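The wrapping pattern this commit describes looks roughly like the sketch below. The real ``rigging.timing.log_time`` is not shown in this PR, so this is an assumed minimal equivalent (a context manager that logs a labelled wall-clock duration); only the usage shape matches the diff hunk above.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)

@contextmanager
def log_time(label: str):
    """Assumed stand-in for rigging.timing.log_time: log how long the
    wrapped block took, even if it raises."""
    start = time.monotonic()
    try:
        yield
    finally:
        logger.info("%s took %.3fs", label, time.monotonic() - start)

# Usage mirroring the diff: a slow or wedged outlier leaves a timed
# log line in the worker output without external profiling.
text = "x" * 12_345
with log_time(f"BatchTokenizer encoded {len(text):,}-char outlier record"):
    pass  # the per-record encode would run here
```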
This reverts commit fd2be14.
Force-pushed a43e229 to a202a30
lib/marin/src/marin/processing/tokenize/tokenize.py: flip batch_processor._long_string_workaround = True unconditionally — Levanter's BatchTokenizer then splits each record over _workaround_len (10K chars) at safe whitespace boundaries before encode_batch, so a 64M-char outlier never reaches the underlying Rust tokenizer as one giant string

lib/levanter/src/levanter/data/text/_batch_tokenizer.py: rewrite the workaround path to bound peak Python memory:
- _encode_long_string helper accumulates ids in-place via ids.extend(...) instead of building a fresh concatenated list per record
- sub-batch the workaround's encode_batch calls in groups of _LONG_STRING_BATCH_SIZE (256) — peak in-flight strings + tokens for one outlier are bounded by one sub-batch, not the full record
- prepend BOS via ids.insert(0, bos_id) instead of [bos_id, *ids] to skip a full-list copy at peak
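The per-record flow the summary describes can be sketched end to end. All names here are illustrative (``tokenize_batch`` and ``_WORKAROUND_LEN`` are assumptions, and this version hard-cuts chunks rather than cutting at whitespace, to keep the sketch short); what it demonstrates is item 1 of the change: short records pass straight through, while each outlier is split, encoded, and merged on its own, so its chunks never coexist with the rest of the batch.

```python
_WORKAROUND_LEN = 10_000  # per-record split threshold from the summary

def tokenize_batch(texts, encode_batch):
    """Encode records one at a time. Short records go straight to
    encode_batch; long ones are split into bounded chunks whose
    encodings are merged in place."""
    out = []
    for text in texts:  # per-record: an outlier never joins the batch
        if len(text) <= _WORKAROUND_LEN:
            out.append(encode_batch([text])[0])
            continue
        chunks = [
            text[i : i + _WORKAROUND_LEN]
            for i in range(0, len(text), _WORKAROUND_LEN)
        ]
        ids: list[int] = []
        for encoded in encode_batch(chunks):
            ids.extend(encoded)
        out.append(ids)
    return out
```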