
Fix tokenizer loading for lm-eval on HF checkpoints#4677

Draft

eric-czech wants to merge 2 commits into main from eac/lm-eval-hf-tok

Conversation

Contributor

@eric-czech eric-czech commented Apr 12, 2026

I ran into an issue where configuring an eval like this:

eval_harness.EvalHarnessMainConfig(
  ...
  tokenizer="stanford-crfm/marin-tokenizer",
  checkpoint_path=CHECKPOINT_PATH,
  checkpoint_is_hf=True,
)

results in later calls to MarinTokenizer-only methods like tokenizer.encode_batch(combined_batch), e.g. here:

combined_encodings = {"input_ids": tokenizer.encode_batch(combined_batch)}

That method doesn't exist on HF tokenizers, so a script run this way fails with:

AttributeError: PreTrainedTokenizerFast has no attribute encode_batch
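
For illustration, here is a minimal sketch of the API mismatch behind that error, using hypothetical stub classes (not the real libraries): the Rust-backed `tokenizers.Tokenizer` exposes `encode_batch`, while the `transformers` `PreTrainedTokenizerFast` wrapper expects call semantics (`tokenizer(texts)`) and has no `encode_batch` method.

```python
class RustBackedTokenizer:
    """Stand-in for tokenizers.Tokenizer (the Rust-backed tokenizer)."""

    def encode_batch(self, texts):
        # Dummy token ids: one id per character.
        return [[ord(c) for c in t] for t in texts]


class HFStyleTokenizer:
    """Stand-in for transformers.PreTrainedTokenizerFast."""

    def __call__(self, texts):
        return {"input_ids": [[ord(c) for c in t] for t in texts]}


batch = ["ab", "c"]
print(RustBackedTokenizer().encode_batch(batch))  # works: [[97, 98], [99]]

try:
    HFStyleTokenizer().encode_batch(batch)  # no such method, as in the report
except AttributeError as e:
    print("AttributeError:", e)
```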

I think this should have been caught by the existing tests in lib/levanter/tests/test_eval_harness.py, but they're currently skipped in CI (e.g. see here):

lib/levanter/tests/test_eval_harness.py::test_iterate_tokenized_requests_with_chat_template SKIPPED [ 40%]
lib/levanter/tests/test_eval_harness.py::test_iterate_tokenized_requests SKIPPED [ 40%]

Are these tests being skipped intentionally because lm_eval isn't installed in CI? Is there some CI workflow in which they should be running?

Regardless, this PR replaces the tokenizer loading with levanter.tokenizers.load_tokenizer instead of levanter.compat.hf_checkpoints.load_tokenizer, which has worked for me. I'm not sure where to go next with adding better test coverage for it, or if that's worth it. This may not be the approach we want versus sticking to HF tokenizers in eval_harness.py.

Import load_tokenizer from levanter.tokenizers instead of
levanter.compat.hf_checkpoints.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@eric-czech eric-czech added the agent-generated label (Created by automation/agent) Apr 12, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1c23f655ac

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

 from levanter.models.loss import fused_cross_entropy_loss_and_logsumexp_penalty
 from levanter.utils.background_iterable import BackgroundIterator
-from levanter.tokenizers import MarinTokenizer
+from levanter.tokenizers import MarinTokenizer, load_tokenizer

P1: Preserve mutable tokenizer for pad token fallback

Importing load_tokenizer from levanter.tokenizers changes EvalHarnessMainConfig.the_tokenizer to return HfMarinTokenizer, which is a frozen dataclass without a pad_token_id setter. In this same module, both loglikelihood and generate_until do self.tokenizer.pad_token_id = self.tokenizer.eos_token_id when padding is missing, so models whose tokenizer has no pad token (common for Llama-family checkpoints) will now raise at runtime instead of evaluating. The previous loader from compat.hf_checkpoints returned a mutable HF tokenizer, so this regression is introduced by the import swap.
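
A minimal sketch of the failure mode the reviewer describes, using a hypothetical frozen-dataclass stand-in for the tokenizer wrapper (not the real HfMarinTokenizer): assignment to a frozen instance raises, so a runtime fallback like `self.tokenizer.pad_token_id = self.tokenizer.eos_token_id` would fail instead of patching in a pad token.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional


@dataclass(frozen=True)
class FrozenTokenizer:
    """Hypothetical stand-in for a frozen tokenizer wrapper."""

    eos_token_id: int = 2
    pad_token_id: Optional[int] = None


tok = FrozenTokenizer()
try:
    # The eos->pad fallback pattern from eval_harness.py's loglikelihood/
    # generate_until paths: fails on a frozen dataclass.
    tok.pad_token_id = tok.eos_token_id
except FrozenInstanceError as e:
    print("FrozenInstanceError:", e)
```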


@eric-czech eric-czech marked this pull request as draft April 12, 2026 12:46
Verifies that EvalHarnessMainConfig.the_tokenizer returns a MarinTokenizer,
which validates the correct load_tokenizer import source.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@eric-czech eric-czech marked this pull request as ready for review April 12, 2026 13:08

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fc5a81fe6f


 from levanter.models.loss import fused_cross_entropy_loss_and_logsumexp_penalty
 from levanter.utils.background_iterable import BackgroundIterator
-from levanter.tokenizers import MarinTokenizer
+from levanter.tokenizers import MarinTokenizer, load_tokenizer

P1: Keep tokenizer callable in generation paths

Switching load_tokenizer to levanter.tokenizers.load_tokenizer now returns HfMarinTokenizer, but generate_until() still relies on HF-style call semantics via tok_encode() (self.tokenizer(...) in eval_harness.py). HfMarinTokenizer does not implement __call__, so any lm-eval task that hits generate_until will now fail at runtime with a TypeError instead of generating outputs. This regression is introduced by the import swap because the previous loader returned a callable HF tokenizer.
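
A minimal sketch of this second failure mode, again with a hypothetical stub class: tok_encode-style code invokes the tokenizer directly (self.tokenizer(...)), and an object without `__call__` raises TypeError at that point even though its other methods work.

```python
class NonCallableTokenizer:
    """Hypothetical stand-in for a tokenizer wrapper lacking __call__."""

    def encode_batch(self, texts):
        # Dummy encoding: a single placeholder id per text.
        return [[0] for _ in texts]


tok = NonCallableTokenizer()
print(tok.encode_batch(["hello"]))  # works: [[0]]

try:
    tok(["hello"])  # HF-style call semantics, as in tok_encode()
except TypeError as e:
    print("TypeError:", e)
```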

