
[levanter] Add generation_config.json support for chat model checkpoints#4160

Merged
ahmeda14960 merged 15 commits into main from instruct_tokenizer on Mar 26, 2026
Conversation

ahmeda14960 (Contributor) commented on Mar 26, 2026

Summary

  • Add hf_generation_eos_token_ids config field to SimpleDPOConfig, SimpleSFTConfig, SimpleTrainConfig, TrainDpoConfig, and TrainLmConfig
  • When set (e.g. [128001, 128009]), write a validated generation_config.json alongside HF checkpoints so vLLM stops on the right tokens for chat models
  • config.json is unchanged — pretraining checkpoints are unaffected
  • New shared helper levanter/utils/hf_export.py with build_generation_config() for validation/normalization
  • LLAMA3_CHAT_STOP_TOKEN_IDS constant in experiments/llama.py

Replaces #4154 (closed). Does not modify the tokenizer's eos_token or override eos_token_id in config.json.

Fixes #4153
Fixes #4159
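
The validation and normalization behavior described above (dedup, sort, auto-add EOS, error cases) can be sketched roughly as follows. The real helper lives in `levanter/utils/hf_export.py`; the exact signature and return type here are assumptions (the actual code likely returns a `transformers.GenerationConfig` rather than a plain dict):

```python
def build_generation_config(tokenizer, eos_token_ids):
    """Sketch: return a generation-config dict, or None when the field is unset."""
    if eos_token_ids is None:
        return None
    ids = []
    for tid in eos_token_ids:
        # reject non-integer or negative ids (bool is a subclass of int in Python)
        if not isinstance(tid, int) or isinstance(tid, bool) or tid < 0:
            raise ValueError(f"invalid EOS token id: {tid!r}")
        ids.append(tid)
    # auto-add the tokenizer's own EOS so pretraining-style stops still work
    if getattr(tokenizer, "eos_token_id", None) is not None:
        ids.append(tokenizer.eos_token_id)
    # dedup + sort for a deterministic generation_config.json
    return {"eos_token_id": sorted(set(ids))}
```

With a Llama 3 tokenizer whose `eos_token_id` is 128001, passing `[128009]` yields `{"eos_token_id": [128001, 128009]}`.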

Test plan

  • 14 unit tests in test_hf_export.py — validation, dedup, sort, auto-add EOS, error cases
  • ./infra/pre-commit.py --all-files --fix passes
  • Pre-commit hooks pass on commit
  • Verify generation_config.json is written when hf_generation_eos_token_ids=[128001, 128009] is set on a DPO run
  • Verify no generation_config.json when field is None (default)

🤖 Generated with Claude Code

Chat models need vLLM to stop on <|eot_id|> (128009), but the tokenizer's
eos_token is <|end_of_text|> (128001) for pretraining. Add explicit
hf_generation_eos_token_ids config field that writes a generation_config.json
alongside saved checkpoints with the validated stop token IDs.

- New helper module levanter/utils/hf_export.py with build_generation_config()
- save_pretrained() and save_hf_checkpoint_callback() accept generation_config
- Config field threaded through SimpleDPOConfig, SimpleSFTConfig,
  SimpleTrainConfig, TrainDpoConfig, TrainLmConfig, and defaults.py
- LLAMA3_CHAT_STOP_TOKEN_IDS constant in experiments/llama.py
- 14 unit tests for validation and normalization
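
For concreteness, a minimal sketch of writing the file next to a saved checkpoint. The path layout follows the standard HF convention, and `write_generation_config` is a hypothetical helper for illustration, not the PR's actual callback code:

```python
import json
import pathlib

# Llama 3 chat stop tokens from the PR: <|end_of_text|> and <|eot_id|>
LLAMA3_CHAT_STOP_TOKEN_IDS = [128001, 128009]

def write_generation_config(checkpoint_dir, eos_token_ids):
    """Write generation_config.json alongside an HF checkpoint (sketch)."""
    path = pathlib.Path(checkpoint_dir) / "generation_config.json"
    payload = {"eos_token_id": sorted(set(eos_token_ids))}
    path.write_text(json.dumps(payload, indent=2))
    return path
```

vLLM reads `generation_config.json` from the checkpoint directory, so no `config.json` change is needed.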

Fixes #4153

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ahmeda14960 added the agent-generated (Created by automation/agent) label on Mar 26, 2026

chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bb197956fc



```python
tokenizer = config.data.the_tokenizer

_generation_config = build_generation_config(tokenizer, config.hf_generation_eos_token_ids)
```

P2: Gate DPO generation-config validation on HF export

build_generation_config(...) is executed at startup even when HF checkpoint export is disabled (hf_save_path is None or hf_save_steps is None), so a malformed hf_generation_eos_token_ids value can fail the entire DPO run even though no generation_config.json will be written. This is an avoidable regression in behavior (and inconsistent with train_lm, which computes this only inside the HF-save block), so the call should be deferred until export is actually enabled.
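
The deferral the review asks for could look roughly like this. Attribute names mirror the review text, and `maybe_build_generation_config` is a hypothetical wrapper for illustration, not the PR's actual fix:

```python
def maybe_build_generation_config(config, tokenizer, build_generation_config):
    """Only build (and validate) the generation config when HF export is enabled."""
    export_enabled = (
        getattr(config, "hf_save_path", None) is not None
        and getattr(config, "hf_save_steps", None) is not None
    )
    if not export_enabled:
        # a malformed hf_generation_eos_token_ids can no longer fail a run
        # that would never write generation_config.json anyway
        return None
    return build_generation_config(tokenizer, config.hf_generation_eos_token_ids)
```

This matches the train_lm behavior the review cites, where the config is only computed inside the HF-save block.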


ahmeda14960 and others added 14 commits March 25, 2026 18:06
No need for a separate module — put it alongside the checkpoint
serialization code it supports.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Explain how to set chat stop tokens for generation_config.json,
how to determine the right token for a given chat template,
and that Llama 3 defaults are available via LLAMA3_CHAT_STOP_TOKEN_IDS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2-step DPO run on v5p-8 with marin-8b-instruct to verify
hf_generation_eos_token_ids writes generation_config.json.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
lm_data_config needs ExecutorStep objects, not raw GCS paths.
Use default_tokenize which will find existing caches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
v5p-8 slots all occupied in us-central1-a.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tor)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ensures DPO checkpoints include generation_config.json with
eos_token_id: [128001, 128009] so vLLM stops on <|eot_id|>.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ahmeda14960 ahmeda14960 merged commit 180cb79 into main Mar 26, 2026
42 checks passed
@ahmeda14960 ahmeda14960 deleted the instruct_tokenizer branch March 26, 2026 16:52
ravwojdyla pushed a commit that referenced this pull request Mar 26, 2026
Helw150 pushed a commit that referenced this pull request Apr 8, 2026
