Use incremental binary checkpoint for tokenization resume #1633
finbarrtimbers merged 36 commits into main from
Conversation
…_conversion Moves the HF-to-OLMo-core numpy mmap conversion logic out of `scripts/data/convert_sft_data_for_olmocore.py` and into a new module `open_instruct/numpy_dataset_conversion.py` so it can be imported by downstream callers (e.g. the upcoming OLMo-core SFT main). The CLI script keeps its argument surface and just delegates to the library. Split out of #1620 (match-sft). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 62dc0868a2
Code Review
This pull request introduces incremental checkpointing for the numpy dataset conversion process by storing token IDs, labels, and document boundaries in binary files and appending new data at each checkpoint. Feedback focuses on improving the robustness and performance of this system:
- handling potential file corruption in `_truncate_to` by raising an error when a file is smaller than expected,
- reducing memory usage during checkpointing by avoiding list slicing, and
- a broader refactor to use numpy arrays instead of Python lists, eliminating the performance bottleneck caused by `.tolist()` conversions.
Replaces the single-file JSON checkpoint in `numpy_dataset_conversion` with an incremental binary format: `_checkpoint_token_ids.bin`, `_checkpoint_labels_mask.bin`, and `_checkpoint_document_boundaries.bin` for the array data, and `_checkpoint.json` for scalar metadata (`tokens_written`, `samples_written`, counters, etc.). On each checkpoint, only the newly collected tokens/labels/boundaries are appended to the `.bin` files; the JSON is written atomically last, so it pins the valid prefix length of each binary file. On resume, binary files are truncated back to the recorded prefix before being loaded (in case of a preemption during the write). `load_checkpoint` still understands the legacy JSON-only format, so in-flight runs that started on the previous format will continue to resume.

Removed `open_instruct/test_checkpoint.py`, as it tested the old single-file JSON checkpoint API and was broken under the incremental binary format.

This eliminates the O(N²) cost of re-serializing the growing token list on every checkpoint. Measured end-to-end speedup: 4.6x on the production Olmo 3 7B think SFT mixer.
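The append-then-pin protocol described above can be sketched in a few lines. This is a simplified illustration for a single binary file, not the PR's actual API; `append_checkpoint` and `resume` are hypothetical names, and the real format tracks three binaries plus extra metadata:

```python
import json
import os

import numpy as np


def append_checkpoint(ckpt_dir: str, new_token_ids: list[int], tokens_written: int) -> int:
    """Append only the newly collected tokens, then atomically write the JSON
    last so it pins the valid prefix length of the binary file."""
    bin_path = os.path.join(ckpt_dir, "_checkpoint_token_ids.bin")
    with open(bin_path, "ab") as f:
        np.asarray(new_token_ids, dtype=np.int32).tofile(f)
    total = tokens_written + len(new_token_ids)
    tmp = os.path.join(ckpt_dir, "_checkpoint.json.tmp")
    with open(tmp, "w") as f:
        json.dump({"tokens_written": total}, f)
    os.replace(tmp, os.path.join(ckpt_dir, "_checkpoint.json"))  # atomic rename
    return total


def resume(ckpt_dir: str) -> np.ndarray:
    """Truncate the binary back to the recorded prefix (dropping any partial
    append left by a preemption mid-write), then load it."""
    with open(os.path.join(ckpt_dir, "_checkpoint.json")) as f:
        meta = json.load(f)
    bin_path = os.path.join(ckpt_dir, "_checkpoint_token_ids.bin")
    valid_bytes = meta["tokens_written"] * np.dtype(np.int32).itemsize
    with open(bin_path, "r+b") as f:
        f.truncate(valid_bytes)
    return np.fromfile(bin_path, dtype=np.int32)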
Validation
To prove the incremental format produces byte-identical output to `origin/main`, Claude is running a two-stage A/B (tracked in #1622):

1. 50k controlled run (done, PASSED). Ran the production mixer with `--num_examples 50000` on both `origin/main` and this stack, then sha256'd every output artifact via `scripts/train/olmo-hybrid/_compare_tokenization.sh`. Result: `=== PASSED: byte-for-byte match ===` on all 7 artifacts.
2. Full-scale origin/main repro (in progress). Running the full ~2.94M-sample production mixer on `origin/main` to establish a permanent byte-for-byte reference that this branch's output will be diffed against: `/weka/oe-adapt-default/finbarrt/dataset/olmo-hybrid-main-repro`

See `docs/verify-tokenization.md` for the procedure. Unit-test coverage for the incremental format (append correctness, truncation on resume, checkpoint metadata round-trip) lives in `open_instruct/test_numpy_dataset_conversion.py::TestIncrementalCheckpoint`.
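The actual comparison is done by `_compare_tokenization.sh`; for readers who want the gist without the shell script, here is a hypothetical Python equivalent of the hash-every-artifact check (`sha256_file` and `compare_artifacts` are illustrative names, not part of the repo):

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through sha256 in 1 MiB chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_artifacts(dir_a: str, dir_b: str, names: list[str]) -> str:
    """Report whether every named artifact is byte-for-byte identical
    between the two output directories."""
    mismatches = [
        n for n in names
        if sha256_file(Path(dir_a) / n) != sha256_file(Path(dir_b) / n)
    ]
    if not mismatches:
        return "PASSED: byte-for-byte match"
    return f"FAILED: {mismatches}"
```

Hashing rather than diffing keeps the comparison cheap even for multi-gigabyte `.npy` outputs, and a single mismatched digest pinpoints which artifact diverged.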