
Fix tokenizer loading for lm-eval on HF checkpoints#4677

Draft

eric-czech wants to merge 2 commits into main from eac/lm-eval-hf-tok

Conversation

Contributor

@eric-czech eric-czech commented Apr 12, 2026

I ran into an issue where configuring an eval like this:

eval_harness.EvalHarnessMainConfig(
  ...
  tokenizer="stanford-crfm/marin-tokenizer",
  checkpoint_path=CHECKPOINT_PATH,
  checkpoint_is_hf=True,
)

results in later calls to MarinTokenizer-only methods like tokenizer.encode_batch(combined_batch), e.g. here:

combined_encodings = {"input_ids": tokenizer.encode_batch(combined_batch)}

That method doesn't exist on HF tokenizers, so a script run this way fails with:

AttributeError: PreTrainedTokenizerFast has no attribute encode_batch
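
For illustration, here is a minimal sketch of the API mismatch behind that error, using hypothetical stub classes (not the real libraries): the Rust-backed `tokenizers.Tokenizer` exposes `encode_batch`, while the `transformers` `PreTrainedTokenizerFast` wrapper expects call semantics (`tokenizer(texts)`) and has no `encode_batch` method.

```python
class RustBackedTokenizer:
    """Stand-in for tokenizers.Tokenizer (the Rust-backed tokenizer)."""

    def encode_batch(self, texts):
        # Dummy token ids: one id per character.
        return [[ord(c) for c in t] for t in texts]


class HFStyleTokenizer:
    """Stand-in for transformers.PreTrainedTokenizerFast."""

    def __call__(self, texts):
        return {"input_ids": [[ord(c) for c in t] for t in texts]}


batch = ["ab", "c"]
print(RustBackedTokenizer().encode_batch(batch))  # works: [[97, 98], [99]]

try:
    HFStyleTokenizer().encode_batch(batch)  # no such method, as in the report
except AttributeError as e:
    print("AttributeError:", e)
```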

I think this should have been caught by the existing tests in lib/levanter/tests/test_eval_harness.py, but they're currently skipped in CI (e.g. see here):

lib/levanter/tests/test_eval_harness.py::test_iterate_tokenized_requests_with_chat_template SKIPPED [ 40%]
lib/levanter/tests/test_eval_harness.py::test_iterate_tokenized_requests SKIPPED [ 40%]

Are these tests being skipped intentionally because lm_eval isn't installed in CI? Is there some CI workflow in which they should be running?

Regardless, this PR replaces the tokenizer loading with levanter.tokenizers.load_tokenizer instead of levanter.compat.hf_checkpoints.load_tokenizer, which has worked for me. I'm not sure where to go next with adding better test coverage for it, or if that's worth it. This may not be the approach we want versus sticking to HF tokenizers in eval_harness.py.

Import load_tokenizer from levanter.tokenizers instead of
levanter.compat.hf_checkpoints.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@eric-czech eric-czech added the agent-generated label (Created by automation/agent) Apr 12, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1c23f655ac

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

 from levanter.models.loss import fused_cross_entropy_loss_and_logsumexp_penalty
 from levanter.utils.background_iterable import BackgroundIterator
-from levanter.tokenizers import MarinTokenizer
+from levanter.tokenizers import MarinTokenizer, load_tokenizer

P1: Preserve mutable tokenizer for pad token fallback

Importing load_tokenizer from levanter.tokenizers changes EvalHarnessMainConfig.the_tokenizer to return HfMarinTokenizer, which is a frozen dataclass without a pad_token_id setter. In this same module, both loglikelihood and generate_until do self.tokenizer.pad_token_id = self.tokenizer.eos_token_id when padding is missing, so models whose tokenizer has no pad token (common for Llama-family checkpoints) will now raise at runtime instead of evaluating. The previous loader from compat.hf_checkpoints returned a mutable HF tokenizer, so this regression is introduced by the import swap.
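
A minimal sketch of the failure mode the reviewer describes, using a hypothetical frozen-dataclass stand-in for the tokenizer wrapper (not the real HfMarinTokenizer): assignment to a frozen instance raises, so a runtime fallback like `self.tokenizer.pad_token_id = self.tokenizer.eos_token_id` would fail instead of patching in a pad token.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional


@dataclass(frozen=True)
class FrozenTokenizer:
    """Hypothetical stand-in for a frozen tokenizer wrapper."""

    eos_token_id: int = 2
    pad_token_id: Optional[int] = None


tok = FrozenTokenizer()
try:
    # The eos->pad fallback pattern from eval_harness.py's loglikelihood/
    # generate_until paths: fails on a frozen dataclass.
    tok.pad_token_id = tok.eos_token_id
except FrozenInstanceError as e:
    print("FrozenInstanceError:", e)
```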


@eric-czech eric-czech marked this pull request as draft April 12, 2026 12:46
Verifies that EvalHarnessMainConfig.the_tokenizer returns a MarinTokenizer,
which validates the correct load_tokenizer import source.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@eric-czech eric-czech marked this pull request as ready for review April 12, 2026 13:08

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fc5a81fe6f


 from levanter.models.loss import fused_cross_entropy_loss_and_logsumexp_penalty
 from levanter.utils.background_iterable import BackgroundIterator
-from levanter.tokenizers import MarinTokenizer
+from levanter.tokenizers import MarinTokenizer, load_tokenizer

P1: Keep tokenizer callable in generation paths

Switching load_tokenizer to levanter.tokenizers.load_tokenizer now returns HfMarinTokenizer, but generate_until() still relies on HF-style call semantics via tok_encode() (self.tokenizer(...) in eval_harness.py). HfMarinTokenizer does not implement __call__, so any lm-eval task that hits generate_until will now fail at runtime with a TypeError instead of generating outputs. This regression is introduced by the import swap because the previous loader returned a callable HF tokenizer.
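
A minimal sketch of this second failure mode, again with a hypothetical stub class: tok_encode-style code invokes the tokenizer directly (self.tokenizer(...)), and an object without `__call__` raises TypeError at that point even though its other methods work.

```python
class NonCallableTokenizer:
    """Hypothetical stand-in for a tokenizer wrapper lacking __call__."""

    def encode_batch(self, texts):
        # Dummy encoding: a single placeholder id per text.
        return [[0] for _ in texts]


tok = NonCallableTokenizer()
print(tok.encode_batch(["hello"]))  # works: [[0]]

try:
    tok(["hello"])  # HF-style call semantics, as in tok_encode()
except TypeError as e:
    print("TypeError:", e)
```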

