Rename Marin tokenizer repository and fix chat template expectation (#4977)
Update configs, docs, and tests to use marin-community/marin-tokenizer.
Fix the Levanter chat dataset test to assert against the tokenizer's
rendered chat template instead of a stale hardcoded newline.
Fixes #4974
docs/model-cards/marin-8b.md
2 additions & 2 deletions
@@ -124,7 +124,7 @@ work out-of-the-box with the [Hugging Face Transformers](https://huggingface.co/
 and any other library that supports the Llama architecture.
 
 
-We use a variant of the Llama 3 tokenizer: [stanford-crfm/marin-tokenizer](https://huggingface.co/stanford-crfm/marin-tokenizer/).
+We use a variant of the Llama 3 tokenizer: [marin-community/marin-tokenizer](https://huggingface.co/marin-community/marin-tokenizer/).
 
 ## Inference
 
@@ -200,7 +200,7 @@ Please see [our technical retrospective](https://marin.readthedocs.io/en/latest/
 
 ### Tokenizer Details
 
-Marin 8B uses a variant of the Llama 3 tokenizer: [stanford-crfm/marin-tokenizer](https://huggingface.co/stanford-crfm/marin-tokenizer/). It has the same vocabulary but bundles a chat template into the base tokenizer for convenience.
+Marin 8B uses a variant of the Llama 3 tokenizer: [marin-community/marin-tokenizer](https://huggingface.co/marin-community/marin-tokenizer/). It has the same vocabulary but bundles a chat template into the base tokenizer for convenience.
lib/levanter/docs/guides/Training-Data-Guide.md
4 additions & 4 deletions
@@ -62,7 +62,7 @@ data:
 type: prebuilt
 input_ids_key: input_ids
 loss_weights_key: loss_weights
-tokenizer: stanford-crfm/marin-tokenizer
+tokenizer: marin-community/marin-tokenizer
 cache_dir: gs://bucket/cache
 ```
 
@@ -94,7 +94,7 @@ data:
 owt: 0.5
 alpaca: 0.3
 tulu: 0.2
-tokenizer: stanford-crfm/marin-tokenizer
+tokenizer: marin-community/marin-tokenizer
 cache_dir: gs://bucket/cache
 ```
 
@@ -107,7 +107,7 @@ data:
 
 To use a chat format, your tokenizer must have a `chat_template`, or you must provide one in the config.
 This template must be formatted to work for training (which most are not, and it is not well documented in Hugging Face).
-The `stanford-crfm/marin-tokenizer` has a default template that works. See our [chat template docs](../reference/Data-Formats.md#chat-templates) for more details.
+The `marin-community/marin-tokenizer` has a default template that works. See our [chat template docs](../reference/Data-Formats.md#chat-templates) for more details.
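
The rule stated in this hunk (the tokenizer must carry a `chat_template`, or the config must supply one) can be checked before a run starts. Below is a minimal sketch, assuming a Hugging Face-style tokenizer object that exposes a `chat_template` attribute. `resolve_chat_template` and `StubTokenizer` are hypothetical names for illustration and are not part of Levanter. The precedence shown (config value first, tokenizer second) is also an assumption.

```python
from typing import Optional


def resolve_chat_template(tokenizer, config_template: Optional[str] = None) -> str:
    """Return the chat template to use, or raise if none is available.

    Hypothetical helper. Assumed precedence: an explicit config value wins,
    otherwise fall back to the template bundled with the tokenizer.
    """
    if config_template:
        return config_template
    # Hugging Face-style tokenizers store the template in `chat_template`;
    # it is None (or absent) when the tokenizer does not bundle one.
    template = getattr(tokenizer, "chat_template", None)
    if not template:
        raise ValueError(
            "No chat template found: set one in the config or use a "
            "tokenizer that bundles one."
        )
    return template


class StubTokenizer:
    """Offline stand-in for a real tokenizer (illustration only)."""

    def __init__(self, chat_template: Optional[str] = None):
        self.chat_template = chat_template
```

With `transformers` installed and network access, a tokenizer loaded via `AutoTokenizer.from_pretrained("marin-community/marin-tokenizer")` could be passed in place of the stub. Note that this check only confirms a template exists, not that it is formatted for training.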