Fix tokenizer loading for lm-eval on HF checkpoints #4677
eric-czech wants to merge 2 commits into main
Conversation
Import load_tokenizer from levanter.tokenizers instead of levanter.compat.hf_checkpoints. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1c23f655ac
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```diff
 from levanter.models.loss import fused_cross_entropy_loss_and_logsumexp_penalty
 from levanter.utils.background_iterable import BackgroundIterator
-from levanter.tokenizers import MarinTokenizer
+from levanter.tokenizers import MarinTokenizer, load_tokenizer
```
Preserve mutable tokenizer for pad token fallback
Importing `load_tokenizer` from `levanter.tokenizers` changes `EvalHarnessMainConfig.the_tokenizer` to return `HfMarinTokenizer`, which is a frozen dataclass without a `pad_token_id` setter. In this same module, both `loglikelihood` and `generate_until` do `self.tokenizer.pad_token_id = self.tokenizer.eos_token_id` when padding is missing, so models whose tokenizer has no pad token (common for Llama-family checkpoints) will now raise at runtime instead of evaluating. The previous loader from `compat.hf_checkpoints` returned a mutable HF tokenizer, so this regression is introduced by the import swap.
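The failure mode described here can be reproduced in isolation. Below is a minimal sketch with a hypothetical `FrozenTokenizer` standing in for `HfMarinTokenizer` (whose real definition lives in `levanter.tokenizers` and has more fields); only `frozen=True` matters for the bug:

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional

# Hypothetical stand-in for a frozen tokenizer wrapper; the real
# HfMarinTokenizer is richer, but the failure depends only on frozen=True.
@dataclass(frozen=True)
class FrozenTokenizer:
    eos_token_id: int = 2
    pad_token_id: Optional[int] = None

tok = FrozenTokenizer()
try:
    # The fallback that loglikelihood/generate_until perform when the
    # tokenizer has no pad token:
    tok.pad_token_id = tok.eos_token_id
except FrozenInstanceError as e:
    print(f"pad-token fallback fails: {e}")
```

A mutable HF tokenizer accepts the same assignment silently, which is why the swap only surfaces at runtime on pad-token-less checkpoints.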
Verifies that EvalHarnessMainConfig.the_tokenizer returns a MarinTokenizer, which validates the correct load_tokenizer import source. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Reviewed commit: fc5a81fe6f
```diff
 from levanter.models.loss import fused_cross_entropy_loss_and_logsumexp_penalty
 from levanter.utils.background_iterable import BackgroundIterator
-from levanter.tokenizers import MarinTokenizer
+from levanter.tokenizers import MarinTokenizer, load_tokenizer
```
Keep tokenizer callable in generation paths
Switching `load_tokenizer` to `levanter.tokenizers.load_tokenizer` now returns `HfMarinTokenizer`, but `generate_until()` still relies on HF-style call semantics via `tok_encode()` (`self.tokenizer(...)` in `eval_harness.py`). `HfMarinTokenizer` does not implement `__call__`, so any lm-eval task that hits `generate_until` will now fail at runtime with a `TypeError` instead of generating outputs. This regression is introduced by the import swap because the previous loader returned a callable HF tokenizer.
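The call-semantics mismatch is easy to see with two toy classes (class names and the fake encoding are illustrative, not Levanter's): HF tokenizers implement `__call__`, while a wrapper exposing only named methods does not:

```python
# Illustrative only: mimics the shape of the interfaces, not real tokenizers.
class HFStyleTokenizer:
    def __call__(self, text):
        # HF tokenizers are callable and return a dict of encodings
        return {"input_ids": [ord(c) % 100 for c in text]}

class WrapperWithoutCall:
    def encode(self, text):
        return [ord(c) % 100 for c in text]

hf = HFStyleTokenizer()
print(hf("hi")["input_ids"])  # HF-style __call__ works

wrapper = WrapperWithoutCall()
try:
    wrapper(
        "hi"
    )  # roughly what tok_encode does via self.tokenizer(...)
except TypeError as e:
    print("TypeError:", e)
```

So tasks that go through `loglikelihood` alone may pass while anything hitting `generate_until` fails, which makes the regression easy to miss without generation-path tests.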
I ran into an issue where configuring an eval this way results in later calls to `MarinTokenizer` methods like `tokenizer.encode_batch(combined_batch)` here: `marin/lib/levanter/src/levanter/eval_harness.py`, line 1668 (at `f8d4889`). `encode_batch` doesn't exist on HF tokenizers, so a script run this way fails at that call.
I think this should be failing on existing tests in `lib/levanter/tests/test_eval_harness.py`, but they're currently skipped in CI. Are these tests being skipped intentionally because `lm_eval` isn't installed? Is there some CI workflow in which they should be running?
Regardless, this PR replaces the tokenizer loading with `levanter.tokenizers.load_tokenizer` instead of `levanter.compat.hf_checkpoints.load_tokenizer`, which has worked for me. I'm not sure where to go next with adding better test coverage, or whether that's worth it. This may not be the approach we want versus sticking to HF tokenizers in `eval_harness.py`.
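One option for cheap coverage, independent of `lm_eval` being installed, would be an interface check on whatever `the_tokenizer` returns. This is a hedged sketch, not code from the repo: `check_tokenizer_interface` and `Dummy` are hypothetical names, and the checked members (`__call__`, `encode_batch`, assignable `pad_token_id`) mirror what the review comments above say `eval_harness.py` actually uses:

```python
# Hypothetical helper: flags the tokenizer-interface gaps discussed in
# this thread before they surface mid-eval.
def check_tokenizer_interface(tok):
    problems = []
    if not callable(tok):
        problems.append("not callable (generate_until needs __call__)")
    if not hasattr(tok, "encode_batch"):
        problems.append("missing encode_batch")
    try:
        # pad-token fallback must be assignable (see frozen-dataclass issue)
        tok.pad_token_id = getattr(tok, "pad_token_id", None) or 0
    except Exception:
        problems.append("pad_token_id is not assignable")
    return problems

class Dummy:  # hypothetical conforming tokenizer for demonstration
    pad_token_id = None
    def __call__(self, text):
        return {"input_ids": []}
    def encode_batch(self, batch):
        return [[] for _ in batch]

print(check_tokenizer_interface(Dummy()))  # → []
```

A check like this could run in CI without pulling in `lm_eval`, catching both regressions flagged by Codex at import time rather than deep inside an eval run.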