Skip to content

Make default checkpoint retention conservative#4457

Merged
Helw150 merged 1 commit intomainfrom
claude/issue-4456-20260406-2314
Apr 6, 2026
Merged

Make default checkpoint retention conservative#4457
Helw150 merged 1 commit intomainfrom
claude/issue-4456-20260406-2314

Conversation

@Helw150
Copy link
Copy Markdown
Member

@Helw150 Helw150 commented Apr 6, 2026

Closes #4456. Default keep for default_train/default_sft/default_dpo is now empty, so only the final checkpoint (plus rolling temp checkpoints for resumption) is retained.

Generated with Claude Code

Default 'keep' for default_train/default_sft/default_dpo is now empty,
so only the final checkpoint (plus rolling temp checkpoints for resumption)
is retained. Users can still set steps_per_export / steps_per_checkpoint
to keep periodic permanent checkpoints.

Closes #4456

Co-authored-by: William Held <Helw150@users.noreply.github.com>
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 6, 2026

Claude finished @Helw150's task in 30s —— View job


I'll analyze this and get back to you.

@Helw150 Helw150 requested a review from rjpower April 6, 2026 23:21
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 6, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

View job run

@Helw150
Copy link
Copy Markdown
Member Author

Helw150 commented Apr 6, 2026

@claude Just sanity check me that we'll still keep doing rolling checkpoints for resumption with this change.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 6, 2026

Claude finished @Helw150's task in 25s —— View job


Sanity check: rolling resumption checkpoints

Yes — rolling temp checkpoints for resumption are preserved.

In experiments/defaults.py the CheckpointerConfig still sets save_interval=timedelta(minutes=10), and in Levanter's checkpointer (lib/levanter/src/levanter/checkpoint.py) save_interval and keep are independent:

  • save_interval drives the time-based temporary checkpoint (is_temporary=True); the previous temp checkpoint is deleted after the next one is written — that's the rolling behavior (see ~lines 184–242).
  • keep only controls permanent step-interval checkpoints. With the new _checkpoint_keep(None) -> [], no permanent intermediate checkpoints are written.
  • The final end-of-training checkpoint is still saved as permanent regardless of keep.

So with this PR: every 10 minutes a single rolling temp checkpoint is maintained for preemption/resumption, and only the final checkpoint is retained permanently. Matches the PR description.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ebb85095e3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


steps_per_checkpoint: int = 1000
"""How often to save checkpoints."""
steps_per_checkpoint: int | None = None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Disable default HF exports in SFT no-checkpoint mode

Setting steps_per_checkpoint to None here makes SFT look like it will keep only a final checkpoint, but steps_per_hf_export still defaults to 500, and default_sft forwards that value into default_train (experiments/defaults.py), which always gets an hf_save_path (lib/marin/src/marin/training/training.py) and registers periodic HF saves when hf_save_steps is non-null (lib/levanter/src/levanter/main/train_lm.py). In long SFT runs this still creates many permanent HF snapshots, so retention is not actually conservative and storage usage can grow unexpectedly.

Useful? React with 👍 / 👎.


steps_per_eval: int = 1000
steps_per_checkpoint: int = 1000
steps_per_checkpoint: int | None = None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Disable default HF exports in DPO no-checkpoint mode

This default now disables periodic Levanter checkpoints, but leaves steps_per_hf_export at 500, and default_dpo still resolves and passes that interval through (experiments/defaults.py), with HF save hooks enabled whenever hf_save_steps is set (lib/levanter/src/levanter/main/train_dpo.py). As a result, DPO runs with defaults continue to emit frequent permanent HF checkpoints, which undercuts the new conservative-retention behavior and can cause avoidable artifact bloat.

Useful? React with 👍 / 👎.

@Helw150 Helw150 merged commit 1cd337a into main Apr 6, 2026
49 checks passed
@Helw150 Helw150 deleted the claude/issue-4456-20260406-2314 branch April 6, 2026 23:36
Helw150 added a commit that referenced this pull request Apr 8, 2026
Closes #4456. Default `keep` for
`default_train`/`default_sft`/`default_dpo` is now empty, so only the
final checkpoint (plus rolling temp checkpoints for resumption) is
retained.

Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: William Held <Helw150@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make Default Checkpointing Behavior Extremely Conservative

2 participants