Make default checkpoint retention conservative #4457
Conversation
Default `keep` for `default_train`/`default_sft`/`default_dpo` is now empty, so only the final checkpoint (plus rolling temp checkpoints for resumption) is retained. Users can still set `steps_per_export` / `steps_per_checkpoint` to keep periodic permanent checkpoints. Closes #4456
Co-authored-by: William Held <Helw150@users.noreply.github.com>
Code review: No issues found. Checked for bugs and CLAUDE.md compliance.

@claude Just sanity check me that we'll still keep doing rolling checkpoints for resumption with this change.
Claude finished @Helw150's task in 25s. Sanity check: rolling resumption checkpoints. Yes, rolling temp checkpoints for resumption are preserved.
So with this PR: every 10 minutes a single rolling temp checkpoint is maintained for preemption/resumption, and only the final checkpoint is retained permanently. Matches the PR description.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ebb85095e3
```diff
- steps_per_checkpoint: int = 1000
  """How often to save checkpoints."""
+ steps_per_checkpoint: int | None = None
```
Disable default HF exports in SFT no-checkpoint mode
Setting `steps_per_checkpoint` to `None` here makes SFT look like it will keep only a final checkpoint, but `steps_per_hf_export` still defaults to 500. `default_sft` forwards that value into `default_train` (experiments/defaults.py), which always gets an `hf_save_path` (lib/marin/src/marin/training/training.py) and registers periodic HF saves whenever `hf_save_steps` is non-null (lib/levanter/src/levanter/main/train_lm.py). In long SFT runs this still creates many permanent HF snapshots, so retention is not actually conservative and storage usage can grow unexpectedly.
```diff
  steps_per_eval: int = 1000
- steps_per_checkpoint: int = 1000
+ steps_per_checkpoint: int | None = None
```
Disable default HF exports in DPO no-checkpoint mode
This default now disables periodic Levanter checkpoints but leaves `steps_per_hf_export` at 500, and `default_dpo` still resolves and passes that interval through (experiments/defaults.py), with HF save hooks enabled whenever `hf_save_steps` is set (lib/levanter/src/levanter/main/train_dpo.py). As a result, DPO runs with defaults continue to emit frequent permanent HF checkpoints, which undercuts the new conservative-retention behavior and can cause avoidable artifact bloat.
Closes #4456. Default `keep` for `default_train`/`default_sft`/`default_dpo` is now empty, so only the final checkpoint (plus rolling temp checkpoints for resumption) is retained.

Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: William Held <Helw150@users.noreply.github.com>