Rename Marin tokenizer repository and fix chat template expectation#4977

Merged
dlwh merged 2 commits into main from codex/rename-marin-tokenizer-repo
Apr 21, 2026

Conversation

@dlwh dlwh (Member) commented Apr 20, 2026

Update configs, docs, and tests to use marin-community/marin-tokenizer. Fix the Levanter chat dataset test to assert against the tokenizer's rendered chat template instead of a stale hardcoded newline.

Fixes #4974

@dlwh dlwh added the agent-generated Created by automation/agent label Apr 20, 2026
@claude claude Bot (Contributor) commented Apr 20, 2026

Claude finished @dlwh's task in 1m 41s.


Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Verification performed:

  • No remaining stanford-crfm/marin-tokenizer references in the repo (grep clean).
  • All other stanford-crfm/* references are unrelated (levanter, haliax, helm, lm-evaluation-harness repos) and correctly left untouched.
  • The vocab-size dedup in lib/marin/src/marin/processing/tokenize/data_configs.py keeps marin-community/marin-tokenizer: 128_256 and drops the duplicate; _EQUIVALENT_TOKENIZERS already points at the new name.
  • Unchanged references that mention marin-tokenizer without the org prefix are intentional (temp-dir prefix in lm_evaluation_harness_evaluator.py, the existing gs://marin-us-central2/tokenized/marin-tokenizer/… cache path, and the marin-community--marin-tokenizer% GCS protect-rule pattern).
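The "grep clean" check above can be approximated with a short script. This is a minimal sketch, not the tooling the review bot actually ran: the function name `find_stale_references` and the file-suffix filter are assumptions made for illustration. It matches only the full old repo id, so other `stanford-crfm/*` references and bare `marin-tokenizer` mentions are not flagged.

```python
from pathlib import Path

# The old Hub repo id that the PR replaces.
OLD_REPO_ID = "stanford-crfm/marin-tokenizer"


def find_stale_references(root, suffixes=(".py", ".md", ".yaml", ".yml", ".toml")):
    """Return (path, line_number, line) for every line that contains the old repo id.

    Hypothetical helper: the suffix list is an assumption, not the repo's policy.
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip unreadable or binary files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if OLD_REPO_ID in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Calling `find_stale_references(".")` from the repository root and getting an empty list corresponds to the "grep clean" result reported above.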

Note: the PR is currently marked as a draft.

test_chat_dataset_build_and_pack hardcoded a trailing newline after the final <|eot_id|>, but both the Hub tokenizer and the locally generated tokenizer render no newline.
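The pattern behind this fix can be sketched as follows. This is an illustration under stated assumptions, not the actual contents of `test_chat_dataset_build_and_pack`: `StubTokenizer`, its template layout, `STALE_EXPECTED`, and `expected_rendering` are all hypothetical. Only the `apply_chat_template(..., tokenize=False)` call mirrors the Hugging Face tokenizer method the review mentions.

```python
class StubTokenizer:
    """Hypothetical stand-in for a Hugging Face tokenizer (not the Marin tokenizer)."""

    def apply_chat_template(self, messages, tokenize=False):
        # Assumed layout for illustration; no newline follows the last <|eot_id|>.
        if tokenize:
            raise NotImplementedError("the stub only renders text")
        return "".join(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
            for m in messages
        )


# A stale expectation of the kind the test used to hardcode: trailing newline included.
STALE_EXPECTED = (
    "<|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\nHello!<|eot_id|>\n"
)


def expected_rendering(tokenizer, messages):
    """Derive the expected text from the tokenizer instead of hardcoding it."""
    return tokenizer.apply_chat_template(messages, tokenize=False)


messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
expected = expected_rendering(StubTokenizer(), messages)
```

Because `expected` is computed from the tokenizer, the assertion keeps tracking the template if it changes, whereas a hardcoded string such as `STALE_EXPECTED` silently drifts out of date.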
@dlwh dlwh changed the title Rename Marin tokenizer repository Rename Marin tokenizer repository and fix chat template expectation Apr 20, 2026
@dlwh dlwh marked this pull request as ready for review April 20, 2026 23:36
@dlwh dlwh requested a review from yonromai April 20, 2026 23:36
@claude claude Bot (Contributor) commented Apr 20, 2026

Claude finished @dlwh's task in 2m 44s.


Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Verification performed:

  • No remaining stanford-crfm/marin-tokenizer references in the repo (grep clean).
  • _EQUIVALENT_TOKENIZERS in lib/marin/src/marin/processing/tokenize/data_configs.py correctly points at marin-community/marin-tokenizer, and the VOCAB_SIZES dict now only lists the new name (no dangling references to the old key).
  • Other stanford-crfm/* references (levanter, haliax, helm, lm-evaluation-harness repos) are unrelated and correctly left untouched.
  • The test refactor in lib/levanter/tests/test_text.py — replacing the hardcoded chat-rendered string with a value computed by tokenizer.apply_chat_template(...) — is a correctness improvement consistent with the repo's testing guidance (assert against externally-observable behavior, not implementation details).
  • Unchanged references that mention marin-tokenizer without the org prefix are intentional (temp-dir prefix in lm_evaluation_harness_evaluator.py, existing gs://marin-us-central2/tokenized/marin-tokenizer/… cache path, marin-community--marin-tokenizer% GCS protect-rule pattern).
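The `VOCAB_SIZES` point above touches a general hazard of renaming by search-and-replace: if a dict literal listed both the old and the new name, the rename leaves two identical keys, and Python silently keeps only the last one. A minimal sketch of a check for that, using the standard `ast` module, is shown below. The helper `duplicate_dict_keys` and the sample source string are hypothetical and are not taken from `data_configs.py`.

```python
import ast
from collections import Counter


def duplicate_dict_keys(source):
    """Return the constant keys that occur more than once within a single dict literal."""
    dupes = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Dict):
            # `**mapping` entries appear as None keys and are skipped by the isinstance check.
            keys = [k.value for k in node.keys if isinstance(k, ast.Constant)]
            dupes.update(key for key, count in Counter(keys).items() if count > 1)
    return sorted(dupes, key=repr)


# Hypothetical state of a config right after a naive rename, before deduplication.
AFTER_NAIVE_RENAME = '''
VOCAB_SIZES = {
    "marin-community/marin-tokenizer": 128_256,
    "marin-community/marin-tokenizer": 128_256,
}
'''

duplicates = duplicate_dict_keys(AFTER_NAIVE_RENAME)
```

Here `duplicates` contains the repeated key; on a deduplicated dict the function returns an empty list.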


@yonromai yonromai (Contributor) left a comment

:shipit:

@dlwh dlwh merged commit 32f51c8 into main Apr 21, 2026
41 checks passed
@dlwh dlwh deleted the codex/rename-marin-tokenizer-repo branch April 21, 2026 00:32

Labels

agent-generated Created by automation/agent

Development

Successfully merging this pull request may close these issues.

CI: levanter-ray-tests fails with HF 401 on stanford-crfm/marin-tokenizer

2 participants