[levanter] eval_harness.py mixes use of Marin and HF tokenizers #4678

@eric-czech

Description

Describe the bug

At least one tokenizer call in eval_harness.py requires MarinTokenizer methods, e.g.:

combined_encodings = {"input_ids": tokenizer.encode_batch(combined_batch)}

That encode_batch method is not supported by HF tokenizers, while other functions in the same module rely on mutating the tokenizer:

self.tokenizer.pad_token_id = self.tokenizer.eos_token_id

or on the tokenizer being callable:

encoding: Union[List[List[int]], List[int]] = self.tokenizer(

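A minimal sketch of one possible direction (not existing levanter code; `HFStubTokenizer` and `EncodeBatchAdapter` are hypothetical names for illustration) is a thin adapter that layers the `encode_batch` method on top of a callable HF-style tokenizer, so all three usage patterns above work against one object:

```python
from typing import List


class HFStubTokenizer:
    """Hypothetical stand-in for an HF tokenizer: callable, mutable
    attributes, but no encode_batch method."""

    eos_token_id = 0
    pad_token_id = None

    def __call__(self, texts):
        # Toy whitespace "tokenization" for illustration only.
        if isinstance(texts, str):
            texts = [texts]
        vocab: dict = {}
        ids = [[vocab.setdefault(w, len(vocab) + 1) for w in t.split()] for t in texts]
        return {"input_ids": ids}


class EncodeBatchAdapter:
    """Wraps a callable HF-style tokenizer and adds the encode_batch
    method that eval_harness.py expects from MarinTokenizer."""

    def __init__(self, tokenizer):
        self._tok = tokenizer

    def __getattr__(self, name):
        # Delegate attribute reads (eos_token_id, etc.) to the wrapped tokenizer.
        return getattr(self._tok, name)

    def __call__(self, *args, **kwargs):
        return self._tok(*args, **kwargs)

    def encode_batch(self, texts: List[str]) -> List[List[int]]:
        return self._tok(texts)["input_ids"]


tokenizer = EncodeBatchAdapter(HFStubTokenizer())
combined_encodings = {"input_ids": tokenizer.encode_batch(["a b", "a c d"])}
```

Note that attribute mutation (`tokenizer.pad_token_id = ...`) would set the attribute on the adapter rather than the wrapped tokenizer, which is fine for later reads through the adapter but is the kind of subtlety a real fix would need to handle deliberately.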
AFAICT these should all be the same tokenizer provided to _LmEvalHarnessWorker, which has no type annotation, making the intended interface hard to infer.
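One way to make that intent explicit (a hypothetical sketch, not existing levanter code; the member names are assumptions drawn from the snippets above) would be a typing.Protocol that the worker could use to annotate its tokenizer argument:

```python
from typing import Any, List, Optional, Protocol, Union, runtime_checkable


@runtime_checkable
class EvalTokenizer(Protocol):
    """Hypothetical protocol listing the tokenizer operations
    eval_harness.py actually relies on."""

    pad_token_id: Optional[int]
    eos_token_id: Optional[int]

    def encode_batch(self, texts: List[str]) -> List[List[int]]:
        ...

    def __call__(self, text: Union[str, List[str]], **kwargs: Any) -> Any:
        ...
```

Anything passed to _LmEvalHarnessWorker (Marin or HF) would then have to satisfy every member, which would have surfaced this mismatch at type-check time rather than at runtime.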

Should everything in eval_harness.py be ported to work through the MarinTokenizer interface? This came up first for me as an error in trying to run lm_eval on HF checkpoints in #4677.

I'm unclear on what the right way to fix this is, especially since test_iterate_tokenized_requests in test_eval_harness.py creates an HF tokenizer and explicitly passes it to _iterate_tokenized_requests, which should fail. These tests are currently being skipped in CI (see the PR).

Labels: bug (Something isn't working)