Fix tokenizer loading for lm-eval on HF checkpoints #4677
base: main
```diff
@@ -42,7 +42,7 @@
 from jax.sharding import PartitionSpec

 import levanter.tracker
-from levanter.compat.hf_checkpoints import HFCheckpointConverter, load_tokenizer
+from levanter.compat.hf_checkpoints import HFCheckpointConverter
 from levanter.data.packing import (
     PromptCompletion,
     greedy_pack_prompt_completions,
@@ -56,7 +56,7 @@
 from levanter.models.gpt2 import Gpt2Config
 from levanter.models.loss import fused_cross_entropy_loss_and_logsumexp_penalty
 from levanter.utils.background_iterable import BackgroundIterator
-from levanter.tokenizers import MarinTokenizer
+from levanter.tokenizers import MarinTokenizer, load_tokenizer
 from levanter.utils.py_utils import set_global_rng_seeds

 try:
```
Importing `load_tokenizer` from `levanter.tokenizers` changes `EvalHarnessMainConfig.the_tokenizer` to return `HfMarinTokenizer`, which is a frozen dataclass without a `pad_token_id` setter. In this same module, both `loglikelihood` and `generate_until` do `self.tokenizer.pad_token_id = self.tokenizer.eos_token_id` when padding is missing, so models whose tokenizer has no pad token (common for Llama-family checkpoints) will now raise at runtime instead of evaluating. The previous loader from `compat.hf_checkpoints` returned a mutable HF tokenizer, so this regression is introduced by the import swap.