Use incremental binary checkpoint for tokenization resume #1633

Merged
finbarrtimbers merged 36 commits into main from finbarr/incremental-tokenization
Apr 23, 2026

Conversation

@finbarrtimbers (Collaborator) commented Apr 21, 2026

Replaces the single-file JSON checkpoint in numpy_dataset_conversion with an incremental binary format: _checkpoint_token_ids.bin, _checkpoint_labels_mask.bin, _checkpoint_document_boundaries.bin for the array data, and _checkpoint.json for scalar metadata (tokens_written, samples_written, counters, etc.).

On each checkpoint, only the newly collected tokens/labels/boundaries are appended to the .bin files; the JSON is written atomically last so it pins the valid prefix length of each binary file. On resume, binary files are truncated back to the recorded prefix before being loaded (in case of a preemption during the write).
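The append/truncate protocol above can be sketched for a single binary stream (the `.bin`/JSON file names come from this PR; the function shapes, the `struct` packing, and the single-array simplification are illustrative, not the actual open_instruct code):

```python
import json
import os
import struct
import tempfile

CHECKPOINT_JSON = "_checkpoint.json"
TOKEN_BIN = "_checkpoint_token_ids.bin"  # labels_mask / document_boundaries work the same way

def save_checkpoint(out_dir, new_tokens, meta):
    """Append only the newly collected tokens, then atomically pin the valid prefix."""
    bin_path = os.path.join(out_dir, TOKEN_BIN)
    with open(bin_path, "ab") as f:
        f.write(struct.pack(f"<{len(new_tokens)}I", *new_tokens))
    meta = dict(meta)
    meta["token_bytes"] = os.path.getsize(bin_path)  # valid prefix length
    # The JSON is written last, via rename, so a crash mid-append leaves
    # only an unreferenced suffix that resume can truncate away.
    fd, tmp = tempfile.mkstemp(dir=out_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(meta, f)
    os.replace(tmp, os.path.join(out_dir, CHECKPOINT_JSON))

def load_checkpoint(out_dir):
    """Truncate the binary back to the recorded prefix, then load it."""
    with open(os.path.join(out_dir, CHECKPOINT_JSON)) as f:
        meta = json.load(f)
    bin_path = os.path.join(out_dir, TOKEN_BIN)
    with open(bin_path, "r+b") as f:
        f.truncate(meta["token_bytes"])  # drop any partially written suffix
    with open(bin_path, "rb") as f:
        data = f.read()
    tokens = list(struct.unpack(f"<{len(data) // 4}I", data))
    return tokens, meta
```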

load_checkpoint still understands the legacy JSON-only format so in-flight runs that started on the previous format will continue to resume.

Removed open_instruct/test_checkpoint.py, as it tested the old single-file JSON checkpoint API and was broken under the incremental binary format.

Eliminates the O(N²) cost of re-serializing the growing token list on every checkpoint. Measured end-to-end speedup: 4.6x on the production Olmo 3 7B think SFT mixer.
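Back-of-the-envelope, this is why full re-serialization is quadratic in the number of checkpoints (the counts below are made up for illustration, not measurements from this run):

```python
def tokens_serialized(num_checkpoints, tokens_per_checkpoint, rewrite_all):
    """Total tokens written to disk across a run."""
    total = 0
    for i in range(1, num_checkpoints + 1):
        # Old format: re-serialize the entire growing list; new format: append the delta.
        total += i * tokens_per_checkpoint if rewrite_all else tokens_per_checkpoint
    return total

# 1000 checkpoints of 1M tokens each:
old = tokens_serialized(1000, 1_000_000, rewrite_all=True)   # sum(1..1000) * 1M
new = tokens_serialized(1000, 1_000_000, rewrite_all=False)  # 1000 * 1M
```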

Validation

To prove the incremental format produces byte-identical output to origin/main, Claude is running a two-stage A/B (tracked in #1622):

  1. 50k controlled run (done, PASSED). Ran the production mixer with --num_examples 50000 on both origin/main and this stack, then sha256'd every output artifact via scripts/train/olmo-hybrid/_compare_tokenization.sh. Result: === PASSED: byte-for-byte match === on all 7 artifacts.

  2. Full-scale origin/main repro (in progress). Running the full ~2.94M-sample production mixer on origin/main to establish a permanent byte-for-byte reference that this branch's output will be diffed against.

See docs/verify-tokenization.md for the procedure. Unit-test coverage for the incremental format (append correctness, truncation on resume, checkpoint metadata round-trip) lives in open_instruct/test_numpy_dataset_conversion.py::TestIncrementalCheckpoint.
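The comparison step amounts to hashing every artifact in both output directories and failing on any mismatch. A minimal sketch (the real logic lives in scripts/train/olmo-hybrid/_compare_tokenization.sh, whose internals may differ):

```shell
# Compare two artifact directories byte-for-byte via sha256.
# Sketch only; directory layout and output format are assumptions.
compare_dirs() {
  local baseline_dir="$1" candidate_dir="$2" status=0
  local f name a b
  for f in "$baseline_dir"/*; do
    name="$(basename "$f")"
    a="$(sha256sum "$f" | awk '{print $1}')"
    b="$(sha256sum "$candidate_dir/$name" | awk '{print $1}')"
    if [ "$a" != "$b" ]; then
      echo "MISMATCH: $name"
      status=1
    fi
  done
  [ "$status" -eq 0 ] && echo "=== PASSED: byte-for-byte match ==="
  return "$status"
}
```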

finbarrtimbers and others added 26 commits April 17, 2026 19:03
…_conversion

Moves the HF-to-OLMo-core numpy mmap conversion logic out of
scripts/data/convert_sft_data_for_olmocore.py and into a new module
open_instruct/numpy_dataset_conversion.py so it can be imported by
downstream callers (e.g. the upcoming OLMo-core SFT main). The CLI
script keeps its argument surface and just delegates to the library.

Split out of #1620 (match-sft).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 62dc0868a2


Comment thread open_instruct/numpy_dataset_conversion.py
@gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces incremental checkpointing for the numpy dataset conversion process: token IDs, labels, and document boundaries are stored in binary files, and only new data is appended at each checkpoint. Feedback focuses on the robustness and performance of this scheme: have _truncate_to raise an error when a file is smaller than the recorded prefix (a sign of corruption rather than a torn append), avoid list slicing during checkpointing to reduce memory usage, and consider a broader refactor from Python lists to numpy arrays to eliminate the performance bottleneck caused by .tolist() conversions.

Comment thread open_instruct/numpy_dataset_conversion.py Outdated
Comment thread open_instruct/numpy_dataset_conversion.py Outdated
Comment thread open_instruct/numpy_dataset_conversion.py
Base automatically changed from finbarr/numpy-dataset-refactor to main April 23, 2026 17:50
…okenization

# Conflicts:
#	CHANGELOG.md
#	open_instruct/numpy_dataset_conversion.py
#	open_instruct/test_checkpoint.py
#	open_instruct/test_numpy_dataset_conversion.py
@github-actions (Contributor)

Documentation Changes Detected

📄 404.html
--- site-base/404.html	2026-04-23 17:56:04.615185591 +0000
+++ site-pr/404.html	2026-04-23 17:56:01.474865880 +0000
@@ -846,6 +846,34 @@
   
   
     <li class="md-nav__item">
+      <a href="/allenai/open-instruct/verify-tokenization/" class="md-nav__link">
+        
+  
+  
📄 DGX_SPARK/index.html
--- site-base/DGX_SPARK/index.html	2026-04-23 17:56:04.615366729 +0000
+++ site-pr/DGX_SPARK/index.html	2026-04-23 17:56:01.474993318 +0000
@@ -853,6 +853,34 @@
   
   
     <li class="md-nav__item">
+      <a href="../verify-tokenization/" class="md-nav__link">
+        
+  
+  
📄 ai2_internal/index.html
--- site-base/ai2_internal/index.html	2026-04-23 17:56:04.615446659 +0000
+++ site-pr/ai2_internal/index.html	2026-04-23 17:56:01.475077305 +0000
@@ -848,6 +848,34 @@
   
   
     <li class="md-nav__item">
+      <a href="../verify-tokenization/" class="md-nav__link">
+        
+  
+  
📄 algorithms/dataset_transformation/index.html
--- site-base/algorithms/dataset_transformation/index.html	2026-04-23 17:56:04.610725106 +0000
+++ site-pr/algorithms/dataset_transformation/index.html	2026-04-23 17:56:01.470376109 +0000
@@ -915,6 +915,34 @@
   
   
     <li class="md-nav__item">
+      <a href="../../verify-tokenization/" class="md-nav__link">
+        
+  
+  
📄 algorithms/dpo/index.html
--- site-base/algorithms/dpo/index.html	2026-04-23 17:56:04.600191903 +0000
+++ site-pr/algorithms/dpo/index.html	2026-04-23 17:56:01.459663441 +0000
@@ -1048,6 +1048,34 @@
   
   
     <li class="md-nav__item">
+      <a href="../../verify-tokenization/" class="md-nav__link">
+        
+  
+  

Showing first 10 lines of diff for each changed file (up to 5 files, excluding search indices).



@finbarrtimbers finbarrtimbers added this pull request to the merge queue Apr 23, 2026
Merged via the queue into main with commit 17bfd1b Apr 23, 2026
6 of 7 checks passed
@finbarrtimbers finbarrtimbers deleted the finbarr/incremental-tokenization branch April 23, 2026 18:13
