
[levanter] Add generation_config.json support for chat model checkpoints#4160

Merged
ahmeda14960 merged 15 commits into main from instruct_tokenizer on Mar 26, 2026
Conversation

ahmeda14960 (Contributor) commented on Mar 26, 2026

Summary

  • Add hf_generation_eos_token_ids config field to SimpleDPOConfig, SimpleSFTConfig, SimpleTrainConfig, TrainDpoConfig, and TrainLmConfig
  • When set (e.g. [128001, 128009]), write a validated generation_config.json alongside HF checkpoints so vLLM stops on the right tokens for chat models
  • config.json is unchanged — pretraining checkpoints are unaffected
  • New shared helper levanter/utils/hf_export.py with build_generation_config() for validation/normalization
  • LLAMA3_CHAT_STOP_TOKEN_IDS constant in experiments/llama.py

Replaces #4154 (closed). Does not modify the tokenizer's eos_token or override eos_token_id in config.json.

Fixes #4153
Fixes #4159
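
The validation and normalization behavior described above (dedup, sort, auto-add EOS, error cases) can be sketched roughly as follows. The real helper lives in `levanter/utils/hf_export.py`; the exact signature and return type here are assumptions (the actual code likely returns a `transformers.GenerationConfig` rather than a plain dict):

```python
def build_generation_config(tokenizer, eos_token_ids):
    """Sketch: return a generation-config dict, or None when the field is unset."""
    if eos_token_ids is None:
        return None
    ids = []
    for tid in eos_token_ids:
        # reject non-integer or negative ids (bool is a subclass of int in Python)
        if not isinstance(tid, int) or isinstance(tid, bool) or tid < 0:
            raise ValueError(f"invalid EOS token id: {tid!r}")
        ids.append(tid)
    # auto-add the tokenizer's own EOS so pretraining-style stops still work
    if getattr(tokenizer, "eos_token_id", None) is not None:
        ids.append(tokenizer.eos_token_id)
    # dedup + sort for a deterministic generation_config.json
    return {"eos_token_id": sorted(set(ids))}
```

With a Llama 3 tokenizer whose `eos_token_id` is 128001, passing `[128009]` yields `{"eos_token_id": [128001, 128009]}`.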

Test plan

  • 14 unit tests in test_hf_export.py — validation, dedup, sort, auto-add EOS, error cases
  • ./infra/pre-commit.py --all-files --fix passes
  • Pre-commit hooks pass on commit
  • Verify generation_config.json is written when hf_generation_eos_token_ids=[128001, 128009] is set on a DPO run
  • Verify no generation_config.json when field is None (default)

🤖 Generated with Claude Code

Chat models need vLLM to stop on <|eot_id|> (128009), but the tokenizer's
eos_token is <|end_of_text|> (128001) for pretraining. Add explicit
hf_generation_eos_token_ids config field that writes a generation_config.json
alongside saved checkpoints with the validated stop token IDs.

- New helper module levanter/utils/hf_export.py with build_generation_config()
- save_pretrained() and save_hf_checkpoint_callback() accept generation_config
- Config field threaded through SimpleDPOConfig, SimpleSFTConfig,
  SimpleTrainConfig, TrainDpoConfig, TrainLmConfig, and defaults.py
- LLAMA3_CHAT_STOP_TOKEN_IDS constant in experiments/llama.py
- 14 unit tests for validation and normalization
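
For concreteness, a minimal sketch of writing the file next to a saved checkpoint. The path layout follows the standard HF convention, and `write_generation_config` is a hypothetical helper for illustration, not the PR's actual callback code:

```python
import json
import pathlib

# Llama 3 chat stop tokens from the PR: <|end_of_text|> and <|eot_id|>
LLAMA3_CHAT_STOP_TOKEN_IDS = [128001, 128009]

def write_generation_config(checkpoint_dir, eos_token_ids):
    """Write generation_config.json alongside an HF checkpoint (sketch)."""
    path = pathlib.Path(checkpoint_dir) / "generation_config.json"
    payload = {"eos_token_id": sorted(set(eos_token_ids))}
    path.write_text(json.dumps(payload, indent=2))
    return path
```

vLLM reads `generation_config.json` from the checkpoint directory, so no `config.json` change is needed.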

Fixes #4153

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ahmeda14960 added the agent-generated (Created by automation/agent) label on Mar 26, 2026

chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bb197956fc



```python
tokenizer = config.data.the_tokenizer

_generation_config = build_generation_config(tokenizer, config.hf_generation_eos_token_ids)
```

P2: Gate DPO generation-config validation on HF export

build_generation_config(...) is executed at startup even when HF checkpoint export is disabled (hf_save_path is None or hf_save_steps is None), so a malformed hf_generation_eos_token_ids value can fail the entire DPO run even though no generation_config.json will be written. This is an avoidable regression in behavior (and inconsistent with train_lm, which computes this only inside the HF-save block), so the call should be deferred until export is actually enabled.
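
The deferral the review asks for could look roughly like this. Attribute names mirror the review text, and `maybe_build_generation_config` is a hypothetical wrapper for illustration, not the PR's actual fix:

```python
def maybe_build_generation_config(config, tokenizer, build_generation_config):
    """Only build (and validate) the generation config when HF export is enabled."""
    export_enabled = (
        getattr(config, "hf_save_path", None) is not None
        and getattr(config, "hf_save_steps", None) is not None
    )
    if not export_enabled:
        # a malformed hf_generation_eos_token_ids can no longer fail a run
        # that would never write generation_config.json anyway
        return None
    return build_generation_config(tokenizer, config.hf_generation_eos_token_ids)
```

This matches the train_lm behavior the review cites, where the config is only computed inside the HF-save block.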


ahmeda14960 and others added 14 commits March 25, 2026 18:06
No need for a separate module — put it alongside the checkpoint
serialization code it supports.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Explain how to set chat stop tokens for generation_config.json,
how to determine the right token for a given chat template,
and that Llama 3 defaults are available via LLAMA3_CHAT_STOP_TOKEN_IDS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2-step DPO run on v5p-8 with marin-8b-instruct to verify
hf_generation_eos_token_ids writes generation_config.json.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
lm_data_config needs ExecutorStep objects, not raw GCS paths.
Use default_tokenize which will find existing caches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
v5p-8 slots all occupied in us-central1-a.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tor)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ensures DPO checkpoints include generation_config.json with
eos_token_id: [128001, 128009] so vLLM stops on <|eot_id|>.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ahmeda14960 ahmeda14960 merged commit 180cb79 into main Mar 26, 2026
42 checks passed
@ahmeda14960 ahmeda14960 deleted the instruct_tokenizer branch March 26, 2026 16:52
ravwojdyla pushed a commit that referenced this pull request Mar 26, 2026
Helw150 pushed a commit that referenced this pull request Apr 8, 2026
