[levanter] Separate temporary checkpoint base path and use Marin temp buckets by claude[bot] · Pull Request #4387 · marin-community/marin

claude · 2026-04-03T05:49:06Z

Add temporary_base_path to CheckpointerConfig and Checkpointer so time-policy checkpoints route separately while step-policy checkpoints stay durable. Marin launchers now derive run-specific region-local temp checkpoint roots, and Trainer/Grug restore plus direct Levanter load sites discover concrete checkpoint paths before calling load_checkpoint.

Fixes #4386

Add temporary_base_path to CheckpointerConfig and Checkpointer so time-policy checkpoints route to a separate directory (e.g. region-local temp buckets with lifecycle TTL) while step-policy checkpoints stay on the durable base_path. Update discover_latest_checkpoint to search across multiple roots, update grug restore to merge candidates from both paths, and wire Marin training wrapper to use marin_temp_bucket. Fixes #4386
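The routing rule described above can be sketched in a few lines. This is only an illustration, not Levanter's implementation: the `CheckpointRouting` class, its `root_for` and `save_path` methods, the `step-<N>` directory naming, and the bucket names in the usage note are all assumptions made for the example. Only the field names `base_path` and `temporary_base_path` come from the PR description.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointRouting:
    """Hypothetical stand-in for the two checkpoint roots named in the PR."""

    base_path: str
    temporary_base_path: str | None = None

    def root_for(self, *, is_time_policy: bool) -> str:
        # Time-policy saves go to the temporary root when one is configured.
        # Step-policy saves always stay on the durable base path.
        if is_time_policy and self.temporary_base_path is not None:
            return self.temporary_base_path
        return self.base_path

    def save_path(self, step: int, *, is_time_policy: bool) -> str:
        # The "step-<N>" layout is an assumption made for this sketch.
        root = self.root_for(is_time_policy=is_time_policy).rstrip("/")
        return f"{root}/step-{step}"
```

For example, with hypothetical roots `gs://durable/run` and `gs://temp/run`, a time-policy save at step 100 would land under the temp root, while a step-policy save at the same step would land under the durable root. If no temporary root is configured, both kinds of save fall back to `base_path`.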

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f3f7cf1bb1


chatgpt-codex-connector · 2026-04-03T05:53:59Z

-    fs, plain_path = _get_fs_and_plain_path(checkpoint_path)
-    base_path_protocol = urllib.parse.urlparse(checkpoint_path).scheme
+def _checkpoint_candidates(checkpoint_path: str, *, additional_paths: list[str] | None = None) -> list[str]:
+    all_roots = [checkpoint_path] + (additional_paths or [])


Respect explicit checkpoint path when temp roots are provided

When checkpoint_path is a concrete checkpoint directory (supported by trainer.load_checkpoint_path), this function now mixes it with additional_paths and globally ranks all candidates by step. That allows a newer temp checkpoint to be loaded instead of the explicitly requested checkpoint, which silently changes resume behavior and breaks reproducibility for users pinning a specific step. This was introduced by adding additional_paths into the same candidate pool without a guard for explicit checkpoint paths.


🤖 Fixed in 415a3e0 by treating any path with its own metadata.json as an explicit checkpoint directory. Additional temporary roots are only considered when checkpoint_path is a parent/root directory, and tests now cover the pinned-checkpoint case.
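A minimal sketch of that guard, assuming a local filesystem and hypothetical helper names (`is_explicit_checkpoint` and `candidate_roots` are not taken from the PR; the actual code works through fsspec-style paths):

```python
from __future__ import annotations

from pathlib import Path


def is_explicit_checkpoint(path: Path) -> bool:
    # A directory that carries its own metadata.json is treated as a
    # concrete checkpoint rather than a root containing checkpoints.
    return (path / "metadata.json").is_file()


def candidate_roots(checkpoint_path: Path, temporary_roots: list[Path]) -> list[Path]:
    # A pinned checkpoint is searched alone, so a newer temporary
    # checkpoint can never displace it.
    if is_explicit_checkpoint(checkpoint_path):
        return [checkpoint_path]
    # Otherwise checkpoint_path is a parent/root directory and the
    # temporary roots join the search.
    return [checkpoint_path, *temporary_roots]
```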

🤖 Reworked this into the cleaner API: restore_grug_state_from_checkpoint now takes checkpoint_search_paths directly. Callers pass [explicit_checkpoint_path] when a checkpoint is pinned, or [permanent_root, temporary_root] for normal resume discovery, so the restore helper no longer has to infer intent from an additional_paths parameter.
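The caller-side contract can be illustrated with a small discovery helper. This is a sketch under stated assumptions, not the code merged in this PR: `discover_latest` and `_read_step` are hypothetical names, the example uses the local filesystem via `pathlib`, and it assumes each checkpoint's `metadata.json` holds an integer `step` field.

```python
from __future__ import annotations

import json
from pathlib import Path


def _read_step(checkpoint_dir: Path) -> int | None:
    # Assumption: metadata.json stores the training step under "step".
    meta = checkpoint_dir / "metadata.json"
    if not meta.is_file():
        return None
    try:
        return int(json.loads(meta.read_text())["step"])
    except (KeyError, TypeError, ValueError):
        return None


def discover_latest(search_paths: list[Path]) -> Path | None:
    """Return the highest-step checkpoint found across all search paths."""
    candidates: list[tuple[int, Path]] = []
    for path in search_paths:
        if not path.is_dir():
            continue
        step = _read_step(path)
        if step is not None:
            # The path is itself a concrete checkpoint (the pinned case).
            candidates.append((step, path))
            continue
        # Otherwise treat the path as a root and scan its children.
        for child in path.iterdir():
            child_step = _read_step(child) if child.is_dir() else None
            if child_step is not None:
                candidates.append((child_step, child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```

A caller that pins a checkpoint passes a single-element list, so only that directory can be returned. A caller resuming normally passes the permanent and temporary roots, and the newest checkpoint across both wins. The helper never has to guess which of the two the caller meant.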

ravwojdyla

Approving, but do we need to update the trainer?

claude Bot added the agent-generated label (Created by automation/agent) Apr 3, 2026

claude Bot mentioned this pull request Apr 3, 2026

[levanter] Separate temporary checkpoint base path and use Marin temp buckets #4386

Closed

chatgpt-codex-connector Bot reviewed Apr 3, 2026


dlwh added 4 commits April 21, 2026 14:56

Merge origin/main into checkpoint temp path branch

0eecf1c

Respect explicit grug checkpoint paths

415a3e0

Use explicit grug checkpoint search paths

a93bf7a

Centralize checkpoint search paths

72ffd63

ravwojdyla approved these changes Apr 22, 2026


Comment thread lib/levanter/src/levanter/checkpoint.py

Comment thread lib/levanter/src/levanter/trainer.py Outdated

dlwh added 2 commits April 22, 2026 10:55

Search temporary checkpoints from trainer restore

45935c8

Separate checkpoint discovery from loading

5dcbd85

dlwh merged commit 534544b into main Apr 22, 2026
37 checks passed

dlwh deleted the agent/20260403-fix-4386 branch April 22, 2026 18:32

dlwh merged 7 commits into main from agent/20260403-fix-4386
