
Commit 32f51c8

Rename Marin tokenizer repository and fix chat template expectation (#4977)
Update configs, docs, and tests to use marin-community/marin-tokenizer. Fix the Levanter chat dataset test to assert against the tokenizer's rendered chat template instead of a stale hardcoded string. Fixes #4974
1 parent 2460770 commit 32f51c8

15 files changed: 30 additions & 28 deletions


docs/model-cards/marin-8b.md

Lines changed: 2 additions & 2 deletions
@@ -124,7 +124,7 @@ work out-of-the-box with the [Hugging Face Transformers](https://huggingface.co/
 and any other library that supports the Llama architecture.
 
-We use a variant of the Llama 3 tokenizer: [stanford-crfm/marin-tokenizer](https://huggingface.co/stanford-crfm/marin-tokenizer/).
+We use a variant of the Llama 3 tokenizer: [marin-community/marin-tokenizer](https://huggingface.co/marin-community/marin-tokenizer/).
 
 ## Inference

@@ -200,7 +200,7 @@ Please see [our technical retrospective](https://marin.readthedocs.io/en/latest/
 
 ### Tokenizer Details
 
-Marin 8B uses a variant of the Llama 3 tokenizer: [stanford-crfm/marin-tokenizer](https://huggingface.co/stanford-crfm/marin-tokenizer/). It has the same vocabulary but bundles a chat template into the base tokenizer for convenience.
+Marin 8B uses a variant of the Llama 3 tokenizer: [marin-community/marin-tokenizer](https://huggingface.co/marin-community/marin-tokenizer/). It has the same vocabulary but bundles a chat template into the base tokenizer for convenience.
 
 ### Training Phases

experiments/scaling_law_sweeps/c_adamc.py

Lines changed: 1 addition & 1 deletion
@@ -68,7 +68,7 @@ class CAdamCHeuristic:
     """C-AdamC scaling heuristic using CautiousConfig optimizer."""
 
     name: str = "c-adamc"
-    tokenizer: str = "stanford-crfm/marin-tokenizer"
+    tokenizer: str = "marin-community/marin-tokenizer"
 
     @property
     def vocab_size(self) -> int:

experiments/scaling_law_sweeps/completed_adamh.py

Lines changed: 1 addition & 1 deletion
@@ -102,7 +102,7 @@ class CompletedAdamHHeuristic:
     """
 
     name: str = "completed-adamh"
-    tokenizer: str = "stanford-crfm/marin-tokenizer"
+    tokenizer: str = "marin-community/marin-tokenizer"
 
     @property
     def vocab_size(self) -> int:

lib/levanter/config/gpt2_small_fast_mix_chat.yaml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ data:
     owt: 0.6
     wikitext: 0.3
     tulu: 0.1
-  tokenizer: stanford-crfm/marin-tokenizer
+  tokenizer: marin-community/marin-tokenizer
   cache_dir: gs://marin-us-central2/scratch/dlwh/marin_small_fast_mix
   components:
     owt:

lib/levanter/config/train_lm_llama3_tulu_sft.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 data:
   train_weights:
     tulu: 1.0
-  tokenizer: stanford-crfm/marin-tokenizer
+  tokenizer: marin-community/marin-tokenizer
   cache_dir: gs://marin-us-central2/tokenized/marin-tokenizer/tulu-3-sft-mixture
   shuffle: true
   components:

lib/levanter/docs/guides/Training-Data-Guide.md

Lines changed: 4 additions & 4 deletions
@@ -62,7 +62,7 @@ data:
     type: prebuilt
     input_ids_key: input_ids
     loss_weights_key: loss_weights
-  tokenizer: stanford-crfm/marin-tokenizer
+  tokenizer: marin-community/marin-tokenizer
   cache_dir: gs://bucket/cache
 ```

@@ -94,7 +94,7 @@ data:
     owt: 0.5
     alpaca: 0.3
     tulu: 0.2
-  tokenizer: stanford-crfm/marin-tokenizer
+  tokenizer: marin-community/marin-tokenizer
   cache_dir: gs://bucket/cache
 ```

@@ -107,7 +107,7 @@ data:
 
 To use a chat format, your tokenizer must have a `chat_template`, or you must provide one in the config.
 This template must be formatted to work for training (which most are not, and it is not well documented in Hugging Face).
-The `stanford-crfm/marin-tokenizer` has a default template that works. See our [chat template docs](../reference/Data-Formats.md#chat-templates) for more details.
+The `marin-community/marin-tokenizer` has a default template that works. See our [chat template docs](../reference/Data-Formats.md#chat-templates) for more details.
 
 https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L1530

@@ -186,7 +186,7 @@ data:
   train_weights:
     - [0, {"owt": 0.5, "alpaca": 0.3, "tulu": 0.2}]
     - [1000, {"owt": 0.2, "alpaca": 0.4, "tulu": 0.4}]
-  tokenizer: stanford-crfm/marin-tokenizer
+  tokenizer: marin-community/marin-tokenizer
 ```
 
 (Again, the weights need not sum to 1.)
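The chat-format caveat in the guide above is ultimately about loss masking: a training-ready template has to let the pipeline tell which rendered tokens came from assistant turns, so user tokens can be excluded from the loss. A minimal sketch of that bookkeeping, using a hypothetical whitespace "tokenizer" and a hand-rolled `role: content` rendering in place of a real template (neither is a Levanter API):

```python
def assistant_loss_mask(messages, tokenize=str.split):
    # Render each turn, tokenize it, and flag tokens from assistant turns.
    # A real chat template adds special tokens and headers; this whitespace
    # stand-in only illustrates how a per-token mask lines up with the render.
    tokens, mask = [], []
    for msg in messages:
        turn_tokens = tokenize(f"{msg['role']}: {msg['content']}")
        tokens.extend(turn_tokens)
        mask.extend([msg["role"] == "assistant"] * len(turn_tokens))
    return tokens, mask


messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there, how can I help?"},
]
tokens, mask = assistant_loss_mask(messages)
# Loss weights are nonzero only where mask is True, i.e. on assistant tokens.
```

A template that cannot be aligned with token spans this way is the kind that "works for chat but not for training" that the guide warns about.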

lib/levanter/docs/reference/Data-Formats.md

Lines changed: 1 addition & 1 deletion
@@ -101,7 +101,7 @@ We need this tag to construct the `loss_weight` for training, unless `mask_user_
 
 Unfortunately, almost no tokenizers use this format, so you will need to write your own.
 
-Here is an example we use in the [stanford-crfm/marin-tokenizer](https://huggingface.co/stanford-crfm/marin-tokenizer)
+Here is an example we use in the [marin-community/marin-tokenizer](https://huggingface.co/marin-community/marin-tokenizer)
 tokenizer:
 
 ```

lib/levanter/tests/test_dpo.py

Lines changed: 1 addition & 1 deletion
@@ -43,7 +43,7 @@
 from levanter.utils.tree_utils import inference_mode
 
 
-MODEL_NAME = "stanford-crfm/marin-tokenizer"
+MODEL_NAME = "marin-community/marin-tokenizer"
 
 
 @pytest.fixture(scope="module")

lib/levanter/tests/test_eval_harness.py

Lines changed: 2 additions & 2 deletions
@@ -14,7 +14,7 @@ def test_iterate_tokenized_requests_with_chat_template():
     from lm_eval.api.instance import Instance
 
     # Load a tokenizer with chat template - Llama 3 has one
-    hf_tokenizer = AutoTokenizer.from_pretrained("stanford-crfm/marin-tokenizer")
+    hf_tokenizer = AutoTokenizer.from_pretrained("marin-community/marin-tokenizer")
     if hf_tokenizer.pad_token is None:
         hf_tokenizer.pad_token = hf_tokenizer.eos_token

@@ -98,7 +98,7 @@ def test_iterate_tokenized_requests_with_chat_template():
 def test_iterate_tokenized_requests():
     from lm_eval.api.instance import Instance
 
-    hf_tokenizer = AutoTokenizer.from_pretrained("stanford-crfm/marin-tokenizer")
+    hf_tokenizer = AutoTokenizer.from_pretrained("marin-community/marin-tokenizer")
     if hf_tokenizer.pad_token is None:
         hf_tokenizer.pad_token = hf_tokenizer.eos_token

lib/levanter/tests/test_text.py

Lines changed: 8 additions & 5 deletions
@@ -426,7 +426,7 @@ def test_chat_dataset_build_and_pack(dummy_chat_data):
     with tempfile.TemporaryDirectory() as tmpdir:
         cache_dir = tmpdir
 
-        tokenizer = load_tokenizer("stanford-crfm/marin-tokenizer")
+        tokenizer = load_tokenizer("marin-community/marin-tokenizer")
 
         component = DatasetComponent(
             source=UrlDatasetSourceConfig(train_urls=[dummy_chat_data]),

@@ -454,11 +454,14 @@ def test_chat_dataset_build_and_pack(dummy_chat_data):
     assert sample["assistant_masks"].shape == sample["input_ids"].shape
     assert 8 < sample["assistant_masks"].sum() <= 10
     # assert sample["input_ids"].shape[0] > 20
-    assert (
-        tokenizer.decode(sample["input_ids"], skip_special_tokens=False)
-        == "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\nHello!<|eot_id|>\n<|start_header_id|>assistant"
-        "<|end_header_id|>\nHi there, how can I help?<|eot_id|>\n"
+    expected_rendered = tokenizer.apply_chat_template(
+        [
+            {"role": "user", "content": "Hello!"},
+            {"role": "assistant", "content": "Hi there, how can I help?"},
+        ],
+        tokenize=False,
     )
+    assert tokenizer.decode(sample["input_ids"], skip_special_tokens=False) == expected_rendered
 
     # now test packing
     Pos = hax.Axis("position", 100)