Make default checkpoint retention conservative by Helw150 · Pull Request #4457 · marin-community/marin

Helw150 · 2026-04-06T23:20:54Z

Closes #4456. Default keep for default_train/default_sft/default_dpo is now empty, so only the final checkpoint (plus rolling temp checkpoints for resumption) is retained.

Generated with Claude Code

Default 'keep' for default_train/default_sft/default_dpo is now empty, so only the final checkpoint (plus rolling temp checkpoints for resumption) is retained. Users can still set steps_per_export / steps_per_checkpoint to keep periodic permanent checkpoints. Closes #4456 Co-authored-by: William Held <Helw150@users.noreply.github.com>

claude · 2026-04-06T23:21:16Z

Claude finished @Helw150's task in 30s —— View job

I'll analyze this and get back to you.

claude · 2026-04-06T23:22:00Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

View job run

Helw150 · 2026-04-06T23:22:51Z

@claude Just sanity check me that we'll still keep doing rolling checkpoints for resumption with this change.

claude · 2026-04-06T23:23:20Z

Claude finished @Helw150's task in 25s —— View job

Sanity check: rolling resumption checkpoints

Yes — rolling temp checkpoints for resumption are preserved.

In experiments/defaults.py the CheckpointerConfig still sets save_interval=timedelta(minutes=10), and in Levanter's checkpointer (lib/levanter/src/levanter/checkpoint.py) save_interval and keep are independent:

save_interval drives the time-based temporary checkpoint (is_temporary=True); the previous temp checkpoint is deleted after the next one is written — that's the rolling behavior (see ~lines 184–242).
keep only controls permanent step-interval checkpoints. With the new _checkpoint_keep(None) -> [], no permanent intermediate checkpoints are written.
The final end-of-training checkpoint is still saved as permanent regardless of keep.

So with this PR: every 10 minutes a single rolling temp checkpoint is maintained for preemption/resumption, and only the final checkpoint is retained permanently. Matches the PR description.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ebb85095e3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-06T23:26:52Z


-    steps_per_checkpoint: int = 1000
-    """How often to save checkpoints."""
+    steps_per_checkpoint: int | None = None


Disable default HF exports in SFT no-checkpoint mode

Setting steps_per_checkpoint to None here makes SFT look like it will keep only a final checkpoint, but steps_per_hf_export still defaults to 500, and default_sft forwards that value into default_train (experiments/defaults.py), which always gets an hf_save_path (lib/marin/src/marin/training/training.py) and registers periodic HF saves when hf_save_steps is non-null (lib/levanter/src/levanter/main/train_lm.py). In long SFT runs this still creates many permanent HF snapshots, so retention is not actually conservative and storage usage can grow unexpectedly.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-06T23:26:52Z


    steps_per_eval: int = 1000
-    steps_per_checkpoint: int = 1000
+    steps_per_checkpoint: int | None = None


Disable default HF exports in DPO no-checkpoint mode

This default now disables periodic Levanter checkpoints, but leaves steps_per_hf_export at 500, and default_dpo still resolves and passes that interval through (experiments/defaults.py), with HF save hooks enabled whenever hf_save_steps is set (lib/levanter/src/levanter/main/train_dpo.py). As a result, DPO runs with defaults continue to emit frequent permanent HF checkpoints, which undercuts the new conservative-retention behavior and can cause avoidable artifact bloat.

Useful? React with 👍 / 👎.

Closes #4456. Default `keep` for `default_train`/`default_sft`/`default_dpo` is now empty, so only the final checkpoint (plus rolling temp checkpoints for resumption) is retained. Generated with [Claude Code](https://claude.ai/code) Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com> Co-authored-by: William Held <Helw150@users.noreply.github.com>

Helw150 requested a review from rjpower April 6, 2026 23:21

chatgpt-codex-connector Bot reviewed Apr 6, 2026

View reviewed changes

rjpower approved these changes Apr 6, 2026

View reviewed changes

Helw150 merged commit 1cd337a into main Apr 6, 2026
49 checks passed

Helw150 deleted the claude/issue-4456-20260406-2314 branch April 6, 2026 23:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make default checkpoint retention conservative#4457

Make default checkpoint retention conservative#4457
Helw150 merged 1 commit intomainfrom
claude/issue-4456-20260406-2314

Helw150 commented Apr 6, 2026

Uh oh!

claude Bot commented Apr 6, 2026 •

edited

Loading

Uh oh!

claude Bot commented Apr 6, 2026

Uh oh!

Helw150 commented Apr 6, 2026

Uh oh!

claude Bot commented Apr 6, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Apr 6, 2026

Uh oh!

chatgpt-codex-connector Bot Apr 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Helw150 commented Apr 6, 2026

Uh oh!

claude Bot commented Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude Bot commented Apr 6, 2026

Code review

Uh oh!

Helw150 commented Apr 6, 2026

Uh oh!

claude Bot commented Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Sanity check: rolling resumption checkpoints

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude Bot commented Apr 6, 2026 •

edited

Loading

claude Bot commented Apr 6, 2026 •

edited

Loading