Use incremental binary checkpoint for tokenization resume #1633
finbarrtimbers merged 36 commits into main from
Conversation
…_conversion Moves the HF-to-OLMo-core numpy mmap conversion logic out of `scripts/data/convert_sft_data_for_olmocore.py` and into a new module `open_instruct/numpy_dataset_conversion.py` so it can be imported by downstream callers (e.g. the upcoming OLMo-core SFT main). The CLI script keeps its argument surface and just delegates to the library. Split out of #1620 (match-sft). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 62dc0868a2
Code Review
This pull request introduces incremental checkpointing for the numpy dataset conversion process by storing token IDs, labels, and document boundaries in binary files and appending new data at each checkpoint. Feedback focuses on improving the robustness and performance of this system:
- handling potential file corruption in `_truncate_to` by raising an error when a file is smaller than expected,
- reducing memory usage during checkpointing by avoiding list slicing, and
- a broader refactor to use numpy arrays instead of Python lists, eliminating the performance bottleneck caused by `.tolist()` conversions.
Replaces the single-file JSON checkpoint in `numpy_dataset_conversion` with an incremental binary format: `_checkpoint_token_ids.bin`, `_checkpoint_labels_mask.bin`, and `_checkpoint_document_boundaries.bin` for the array data, and `_checkpoint.json` for scalar metadata (`tokens_written`, `samples_written`, counters, etc.). On each checkpoint, only the newly collected tokens/labels/boundaries are appended to the `.bin` files; the JSON is written atomically last, so it pins the valid prefix length of each binary file. On resume, binary files are truncated back to the recorded prefix before being loaded (in case of a preemption during the write). `load_checkpoint` still understands the legacy JSON-only format, so in-flight runs that started on the previous format will continue to resume.

Removed `open_instruct/test_checkpoint.py`, as it tested the old single-file JSON checkpoint API and was broken under the incremental binary format.

This eliminates the O(N²) cost of re-serializing the growing token list on every checkpoint. Measured end-to-end speedup: 4.6x on the production Olmo 3 7B think SFT mixer.
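The append-then-pin protocol described above can be sketched in a few lines. This is a simplified illustration for a single binary file, not the PR's actual API; `append_checkpoint` and `resume` are hypothetical names, and the real format tracks three binaries plus extra metadata:

```python
import json
import os

import numpy as np


def append_checkpoint(ckpt_dir: str, new_token_ids: list[int], tokens_written: int) -> int:
    """Append only the newly collected tokens, then atomically write the JSON
    last so it pins the valid prefix length of the binary file."""
    bin_path = os.path.join(ckpt_dir, "_checkpoint_token_ids.bin")
    with open(bin_path, "ab") as f:
        np.asarray(new_token_ids, dtype=np.int32).tofile(f)
    total = tokens_written + len(new_token_ids)
    tmp = os.path.join(ckpt_dir, "_checkpoint.json.tmp")
    with open(tmp, "w") as f:
        json.dump({"tokens_written": total}, f)
    os.replace(tmp, os.path.join(ckpt_dir, "_checkpoint.json"))  # atomic rename
    return total


def resume(ckpt_dir: str) -> np.ndarray:
    """Truncate the binary back to the recorded prefix (dropping any partial
    append left by a preemption mid-write), then load it."""
    with open(os.path.join(ckpt_dir, "_checkpoint.json")) as f:
        meta = json.load(f)
    bin_path = os.path.join(ckpt_dir, "_checkpoint_token_ids.bin")
    valid_bytes = meta["tokens_written"] * np.dtype(np.int32).itemsize
    with open(bin_path, "r+b") as f:
        f.truncate(valid_bytes)
    return np.fromfile(bin_path, dtype=np.int32)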
Validation
To prove the incremental format produces byte-identical output to `origin/main`, Claude is running a two-stage A/B (tracked in #1622):

1. 50k controlled run (done, PASSED). Ran the production mixer with `--num_examples 50000` on both `origin/main` and this stack, then sha256'd every output artifact via `scripts/train/olmo-hybrid/_compare_tokenization.sh`. Result: `=== PASSED: byte-for-byte match ===` on all 7 artifacts.
2. Full-scale origin/main repro (in progress). Running the full ~2.94M-sample production mixer on `origin/main` to establish a permanent byte-for-byte reference that this branch's output will be diffed against: `/weka/oe-adapt-default/finbarrt/dataset/olmo-hybrid-main-repro`

See `docs/verify-tokenization.md` for the procedure. Unit-test coverage for the incremental format (append correctness, truncation on resume, checkpoint metadata round-trip) lives in `open_instruct/test_numpy_dataset_conversion.py::TestIncrementalCheckpoint`.
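The actual comparison is done by `_compare_tokenization.sh`; for readers who want the gist without the shell script, here is a hypothetical Python equivalent of the hash-every-artifact check (`sha256_file` and `compare_artifacts` are illustrative names, not part of the repo):

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through sha256 in 1 MiB chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_artifacts(dir_a: str, dir_b: str, names: list[str]) -> str:
    """Report whether every named artifact is byte-for-byte identical
    between the two output directories."""
    mismatches = [
        n for n in names
        if sha256_file(Path(dir_a) / n) != sha256_file(Path(dir_b) / n)
    ]
    if not mismatches:
        return "PASSED: byte-for-byte match"
    return f"FAILED: {mismatches}"
```

Hashing rather than diffing keeps the comparison cheap even for multi-gigabyte `.npy` outputs, and a single mismatched digest pinpoints which artifact diverged.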