[levanter] Add generation_config.json support for chat model checkpoints (#4160)
ahmeda14960 merged 15 commits into main
Conversation
Chat models need vLLM to stop on `<|eot_id|>` (128009), but the tokenizer's `eos_token` is `<|end_of_text|>` (128001) for pretraining. Add an explicit `hf_generation_eos_token_ids` config field that writes a `generation_config.json` alongside saved checkpoints with the validated stop token IDs.

- New helper module `levanter/utils/hf_export.py` with `build_generation_config()`
- `save_pretrained()` and `save_hf_checkpoint_callback()` accept `generation_config`
- Config field threaded through `SimpleDPOConfig`, `SimpleSFTConfig`, `SimpleTrainConfig`, `TrainDpoConfig`, `TrainLmConfig`, and `defaults.py`
- `LLAMA3_CHAT_STOP_TOKEN_IDS` constant in `experiments/llama.py`
- 14 unit tests for validation and normalization

Fixes #4153

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
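The test plan mentions validation, dedup, sort, and auto-add-EOS behavior. A minimal sketch of what a helper like `build_generation_config()` might do, based only on the PR description (the signature and exact semantics here are assumptions, not the actual Levanter code):

```python
def build_generation_config(tokenizer, eos_token_ids):
    """Validate user-supplied stop-token IDs and return a generation-config dict.

    Hypothetical sketch: validates, dedups, sorts, and auto-adds the
    tokenizer's own EOS, mirroring the behaviors named in the test plan.
    """
    if eos_token_ids is None:
        return None  # field unset: no generation_config.json is written
    if not all(isinstance(t, int) and t >= 0 for t in eos_token_ids):
        raise ValueError(f"eos token ids must be non-negative ints, got {eos_token_ids!r}")
    ids = set(eos_token_ids)  # dedup
    # Auto-add the tokenizer's own EOS so pretraining-style stopping still works.
    if tokenizer.eos_token_id is not None:
        ids.add(tokenizer.eos_token_id)
    return {"eos_token_id": sorted(ids)}
```

For a Llama 3 chat tokenizer (EOS 128001) with `[128009]` supplied, this yields `{"eos_token_id": [128001, 128009]}`, matching the IDs quoted later in the thread.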
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bb197956fc
```python
tokenizer = config.data.the_tokenizer
...
_generation_config = build_generation_config(tokenizer, config.hf_generation_eos_token_ids)
```
Gate DPO generation-config validation on HF export
build_generation_config(...) is executed at startup even when HF checkpoint export is disabled (hf_save_path is None or hf_save_steps is None), so a malformed hf_generation_eos_token_ids value can fail the entire DPO run even though no generation_config.json will be written. This is an avoidable regression in behavior (and inconsistent with train_lm, which computes this only inside the HF-save block), so the call should be deferred until export is actually enabled.
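The fix Codex suggests can be sketched as a guard that defers the call until export is actually enabled. The function and field names below mirror those mentioned in the review (`hf_save_path`, `hf_save_steps`, `hf_generation_eos_token_ids`); this is illustrative, not the real `train_dpo` code:

```python
def maybe_build_generation_config(config, tokenizer, build_fn):
    """Only validate/build the generation config when HF export is enabled,
    matching train_lm, which computes this inside the HF-save block."""
    if config.hf_save_path is None or config.hf_save_steps is None:
        return None  # export disabled: a malformed field cannot fail the run
    return build_fn(tokenizer, config.hf_generation_eos_token_ids)
```

With this gating, a bad `hf_generation_eos_token_ids` value only raises when a `generation_config.json` would actually be written.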
No need for a separate module — put it alongside the checkpoint serialization code it supports. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Explain how to set chat stop tokens for generation_config.json, how to determine the right token for a given chat template, and that Llama 3 defaults are available via LLAMA3_CHAT_STOP_TOKEN_IDS. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
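To determine the right stop token for a given chat template, one approach is to look the token up in the tokenizer's vocabulary. `chat_stop_token_id` below is a hypothetical helper (not part of the PR); `convert_tokens_to_ids` is the standard `transformers` tokenizer API, and in practice the tokenizer would be loaded with `AutoTokenizer.from_pretrained(...)`:

```python
def chat_stop_token_id(tokenizer, stop_token="<|eot_id|>"):
    """Look up the ID of a chat template's turn-end token in the vocabulary.

    Raises if the string is not a single known token, which usually means
    the template uses a different stop marker.
    """
    tid = tokenizer.convert_tokens_to_ids(stop_token)
    if tid is None or tid == getattr(tokenizer, "unk_token_id", None):
        raise ValueError(f"{stop_token!r} is not a single token in this vocabulary")
    return tid
```

For a Llama 3 tokenizer this returns 128009 for `<|eot_id|>`, the value baked into `LLAMA3_CHAT_STOP_TOKEN_IDS`.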
2-step DPO run on v5p-8 with marin-8b-instruct to verify hf_generation_eos_token_ids writes generation_config.json. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
lm_data_config needs ExecutorStep objects, not raw GCS paths. Use default_tokenize which will find existing caches. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
v5p-8 slots all occupied in us-central1-a. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ensures DPO checkpoints include generation_config.json with eos_token_id: [128001, 128009] so vLLM stops on <|eot_id|>. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nts (#4160)

## Summary

- Add `hf_generation_eos_token_ids` config field to `SimpleDPOConfig`, `SimpleSFTConfig`, `SimpleTrainConfig`, `TrainDpoConfig`, and `TrainLmConfig`
- When set (e.g. `[128001, 128009]`), write a validated `generation_config.json` alongside HF checkpoints so vLLM stops on the right tokens for chat models
- `config.json` is unchanged — pretraining checkpoints are unaffected
- New shared helper `levanter/utils/hf_export.py` with `build_generation_config()` for validation/normalization
- `LLAMA3_CHAT_STOP_TOKEN_IDS` constant in `experiments/llama.py`

Replaces #4154 (closed). Does **not** modify the tokenizer's `eos_token` or override `eos_token_id` in `config.json`.

Fixes #4153
Fixes #4159

## Test plan

- [x] 14 unit tests in `test_hf_export.py` — validation, dedup, sort, auto-add EOS, error cases
- [x] `./infra/pre-commit.py --all-files --fix` passes
- [x] Pre-commit hooks pass on commit
- [x] Verify `generation_config.json` is written when `hf_generation_eos_token_ids=[128001, 128009]` is set on a DPO run
- [ ] Verify no `generation_config.json` when field is `None` (default)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
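The end-to-end effect of the feature can be sketched as writing one small JSON file next to the checkpoint. `write_generation_config` is a hypothetical stand-in for the path through `save_pretrained()` in the actual PR; the file contents follow the standard HF `generation_config.json` schema with the IDs quoted in the summary:

```python
import json
import os

def write_generation_config(checkpoint_dir, eos_token_ids):
    """Write a minimal generation_config.json next to the HF checkpoint files."""
    path = os.path.join(checkpoint_dir, "generation_config.json")
    with open(path, "w") as f:
        json.dump({"eos_token_id": sorted(set(eos_token_ids))}, f, indent=2)
    return path
```

Calling it with `[128001, 128009]` produces the `"eos_token_id": [128001, 128009]` file that vLLM reads, so generation stops on `<|eot_id|>` as well as `<|end_of_text|>`.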