Merged
Changes from 26 commits
Commits
36 commits
0af8a0b
Extract numpy SFT conversion helpers into open_instruct.numpy_dataset…
finbarrtimbers Apr 18, 2026
eadf9da
Add CHANGELOG entry for numpy dataset refactor PR #1622 Co-Authored-B…
finbarrtimbers Apr 18, 2026
5a0808c
Add numpy_dataset_conversion tests and byte-for-byte Beaker verificat…
finbarrtimbers Apr 18, 2026
0e11036
Reduce verify script to 1 GPU to avoid Jupiter queue backlog Co-Autho…
finbarrtimbers Apr 18, 2026
23850d8
Use uv run prefix for huggingface-cli and python in verify script Co-…
finbarrtimbers Apr 18, 2026
2419efe
Use python snapshot_download instead of missing huggingface-cli in ve…
finbarrtimbers Apr 18, 2026
93dbecd
Pass tokenizer repo ID directly to avoid separate download step Co-Au…
finbarrtimbers Apr 18, 2026
2c4ca8a
Add download_hf_repo.py helper and restore local tokenizer path in ve…
finbarrtimbers Apr 18, 2026
7a3ee56
Temporarily point 7b_think_sft_tokenization.sh to olmo-hybrid-fresh d…
finbarrtimbers Apr 18, 2026
592814a
Point tokenization script to open-instruct-dev workspace Co-Authored-…
finbarrtimbers Apr 18, 2026
896a089
Make tokenization script preemptible Co-Authored-By: Claude Opus 4.7 …
finbarrtimbers Apr 18, 2026
eff1064
Use download_hf_repo.py instead of missing huggingface-cli in tokeniz…
finbarrtimbers Apr 18, 2026
3f209e6
Fix get_tokenizer_tulu_v2_2 for transformers v5 by using path substri…
finbarrtimbers Apr 18, 2026
065ca63
Switch hybrid SFT tokenization to CPU-only and add --resume Co-Author…
finbarrtimbers Apr 20, 2026
b192a91
Add incremental binary checkpoint + local byte-for-byte verify/benchm…
finbarrtimbers Apr 20, 2026
8ab864c
Add small-scale tokenization verify launch script, checkpoint timing …
finbarrtimbers Apr 21, 2026
2620bea
Updated code with verification
finbarrtimbers Apr 21, 2026
21e5538
Added doc
finbarrtimbers Apr 21, 2026
b44b25d
Minimize diff in convert_sft_data_for_olmocore.py docstring formattin…
finbarrtimbers Apr 21, 2026
be09651
Replace download_hf_repo.py helper with hf download CLI Co-Authored-B…
finbarrtimbers Apr 21, 2026
deb9282
cleaned up pr
finbarrtimbers Apr 21, 2026
befe8b3
cleaned up pr
finbarrtimbers Apr 21, 2026
8900b77
Trim test_numpy_dataset_conversion.py to core regression tests Co-Aut…
finbarrtimbers Apr 21, 2026
7d1d4d3
cleaned up PR
finbarrtimbers Apr 21, 2026
82e9c2e
Revert checkpoint format to single-file JSON; incremental binary move…
finbarrtimbers Apr 21, 2026
62dc086
Use incremental binary checkpoint for tokenization resume Co-Authored…
finbarrtimbers Apr 21, 2026
e0a59ff
Remove legacy test_checkpoint.py (broken under new incremental checkp…
finbarrtimbers Apr 21, 2026
4474223
Refactor save_checkpoint for readability; switch checkpoint API to pa…
finbarrtimbers Apr 21, 2026
d518baf
Raise in _truncate_to on missing/undersized checkpoint files to catch…
finbarrtimbers Apr 21, 2026
4c97b1d
Added doc
finbarrtimbers Apr 23, 2026
b789749
Added doc
finbarrtimbers Apr 23, 2026
02ed574
Merge remote-tracking branch 'origin/main' into finbarr/incremental-t…
finbarrtimbers Apr 23, 2026
e327191
removed docs
finbarrtimbers Apr 23, 2026
fb53ea2
Use np.fromiter + itertools.islice to avoid list slice copy in save_c…
finbarrtimbers Apr 23, 2026
da76a70
Restore docs/verify-tokenization.md Co-Authored-By: Claude Opus 4.7 <…
finbarrtimbers Apr 23, 2026
06863a3
Add CHANGELOG entry for #1633 Co-Authored-By: Claude Opus 4.7 <norepl…
finbarrtimbers Apr 23, 2026
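
Commit fb53ea2 in the list above mentions using `np.fromiter` with `itertools.islice` to avoid a list slice copy. The following is only a minimal sketch of that general technique, not the code from the PR: the function name `window_to_array`, its parameters, and the `uint32` dtype are hypothetical choices made for illustration.

```python
import itertools

import numpy as np


def window_to_array(token_ids, start, stop, dtype=np.uint32):
    """Copy token_ids[start:stop] into a NumPy array without an intermediate list.

    Hypothetical helper for illustration; not taken from the PR.
    """
    # Clamp the window so `count` matches what islice will actually yield.
    stop = min(stop, len(token_ids))
    count = max(0, stop - start)
    # islice walks the list lazily, whereas token_ids[start:stop] would
    # first materialize a second Python list of the same elements.
    window = itertools.islice(token_ids, start, stop)
    # Passing `count` lets NumPy allocate the output buffer once.
    return np.fromiter(window, dtype=dtype, count=count)


if __name__ == "__main__":
    print(window_to_array([10, 11, 12, 13, 14], 1, 4))
```

Note that `islice` still steps over the first `start` elements, so this trades the temporary list allocation for a linear skip; whether that is a net win depends on the sizes involved.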
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -4,6 +4,7 @@ All notable changes to this project will be documented in this file.


### Changed
+- Extract numpy SFT conversion helpers into `open_instruct.numpy_dataset_conversion` (https://github.com/allenai/open-instruct/pull/1622).
- Pass `attention_mask=None` in GRPO `forward_for_logprobs` calls — HF constructs the correct 3D intra-document mask from `position_ids` internally (https://github.com/allenai/open-instruct/pull/1617).
- Migrate GRPO trainer→vLLM weight sync to vLLM 0.16.0's native weight transfer API (`NCCLWeightTransferEngine`), replacing custom NCCL process-group and broadcast code (https://github.com/allenai/open-instruct/pull/1515).
- Extend pre-commit hook to also ban `nonlocal` keyword (https://github.com/allenai/open-instruct/pull/1613).
12 changes: 2 additions & 10 deletions open_instruct/dataset_transformation.py
@@ -60,14 +60,7 @@
 from huggingface_hub import ModelCard, revision_exists
 from rich.console import Console
 from rich.text import Text
-from transformers import (
-    AutoConfig,
-    AutoTokenizer,
-    GPTNeoXTokenizerFast,
-    LlamaTokenizer,
-    LlamaTokenizerFast,
-    PreTrainedTokenizer,
-)
+from transformers import AutoTokenizer, GPTNeoXTokenizerFast, LlamaTokenizer, LlamaTokenizerFast, PreTrainedTokenizer
 from transformers.utils.hub import extract_commit_hash

 from open_instruct import launch_utils, logger_utils
@@ -788,9 +781,8 @@ def get_tokenizer_tulu_v2_1(tc: "TokenizerConfig"):


 def get_tokenizer_tulu_v2_2(tc: "TokenizerConfig"):
-    config = AutoConfig.from_pretrained(tc.tokenizer_name_or_path, revision=tc.tokenizer_revision)
     # @vwxyzjn: "olmo" handles both `olmo2` and `olmoe`.
-    if "olmo" in config.model_type:
+    if "olmo" in str(tc.tokenizer_name_or_path).lower():
         if tc.chat_template_name is None:
             pass  # just assume the user knows what they're doing
         elif "olmo" in tc.chat_template_name:
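
The hunk above swaps a model-config lookup for a case-insensitive substring test on the tokenizer name or path. A small standalone sketch of how such a test behaves may help reviewers reason about edge cases; the helper name `looks_like_olmo` and every example path below are hypothetical and do not come from the PR.

```python
from pathlib import Path


def looks_like_olmo(tokenizer_name_or_path):
    """Mirror the substring check from the diff (hypothetical helper name)."""
    # str() lets the check accept both plain strings and Path objects;
    # lower() makes it match "OLMo", "olmo", "Olmo", and so on.
    return "olmo" in str(tokenizer_name_or_path).lower()


if __name__ == "__main__":
    # Illustrative inputs only; none of these are taken from the PR.
    candidates = [
        "some-org/OLMo-example-7B",
        Path("/models/olmo-hybrid/tokenizer"),
        "some-org/llama-example-8b",
        "/checkpoints/step_1000",
    ]
    for candidate in candidates:
        print(candidate, looks_like_olmo(candidate))
```

One consequence of this design is that detection now depends on naming rather than on the model's config: a local checkpoint directory whose path does not contain "olmo" is not detected, even if the underlying model is an OLMo variant.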