[levanter] Separate temporary checkpoint base path and use Marin temp buckets #4387

Merged
dlwh merged 7 commits into main from agent/20260403-fix-4386
Apr 22, 2026

@claude claude Bot commented Apr 3, 2026

Add temporary_base_path to CheckpointerConfig and Checkpointer so time-policy checkpoints route separately while step-policy checkpoints stay durable. Marin launchers now derive run-specific region-local temp checkpoint roots, and Trainer/Grug restore plus direct Levanter load sites discover concrete checkpoint paths before calling load_checkpoint.

Fixes #4386

Add temporary_base_path to CheckpointerConfig and Checkpointer so
time-policy checkpoints route to a separate directory (e.g. region-local
temp buckets with lifecycle TTL) while step-policy checkpoints stay on
the durable base_path. Update discover_latest_checkpoint to search
across multiple roots, update grug restore to merge candidates from
both paths, and wire Marin training wrapper to use marin_temp_bucket.

Fixes #4386
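
As a rough illustration of the routing described above, the sketch below picks a root per checkpoint policy. It is not the Levanter implementation: `CheckpointRoots`, `root_for`, `checkpoint_path`, and the `step-<N>` directory naming are hypothetical names chosen for this example; only `base_path` and `temporary_base_path` come from the PR description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CheckpointRoots:
    """Hypothetical stand-in for the two roots named in the PR description."""

    base_path: str  # durable root, used for step-policy checkpoints
    temporary_base_path: Optional[str] = None  # optional root for time-policy checkpoints

    def root_for(self, *, is_temporary: bool) -> str:
        # Time-policy (temporary) checkpoints use the temp root when one is configured;
        # everything else, including the unconfigured case, stays on base_path.
        if is_temporary and self.temporary_base_path is not None:
            return self.temporary_base_path
        return self.base_path

    def checkpoint_path(self, step: int, *, is_temporary: bool) -> str:
        # Assumption: checkpoints live in "step-<N>" directories under the chosen root.
        root = self.root_for(is_temporary=is_temporary).rstrip("/")
        return f"{root}/step-{step}"
```

With `CheckpointRoots("gs://durable/run", "gs://temp/run")`, a time-policy save at step 100 would resolve under `gs://temp/run`, while a step-policy save at the same step would resolve under `gs://durable/run`. Leaving `temporary_base_path` unset keeps both on the durable root.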

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f3f7cf1bb1

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread experiments/grug/checkpointing.py Outdated
fs, plain_path = _get_fs_and_plain_path(checkpoint_path)
base_path_protocol = urllib.parse.urlparse(checkpoint_path).scheme
def _checkpoint_candidates(checkpoint_path: str, *, additional_paths: list[str] | None = None) -> list[str]:
    all_roots = [checkpoint_path] + (additional_paths or [])

P1: Respect explicit checkpoint path when temp roots are provided

When checkpoint_path is a concrete checkpoint directory (supported by trainer.load_checkpoint_path), this function now mixes it with additional_paths and globally ranks all candidates by step. That allows a newer temp checkpoint to be loaded instead of the explicitly requested checkpoint, which silently changes resume behavior and breaks reproducibility for users pinning a specific step. This was introduced by adding additional_paths into the same candidate pool without a guard for explicit checkpoint paths.

Useful? React with 👍 / 👎.

@dlwh dlwh Apr 21, 2026

🤖 Fixed in 415a3e0 by treating any path with its own metadata.json as an explicit checkpoint directory. Additional temporary roots are only considered when checkpoint_path is a parent/root directory, and tests now cover the pinned-checkpoint case.
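
The guard described above might look roughly like the following sketch. It uses the local filesystem via `pathlib` so it can run standalone (the real code works through a filesystem abstraction). All function names here are hypothetical, and the `"step"` key in `metadata.json` is an assumption made for the example.

```python
import json
import tempfile
from pathlib import Path


def has_metadata(path: Path) -> bool:
    """A directory counts as a concrete checkpoint if it holds its own metadata.json."""
    return (path / "metadata.json").is_file()


def read_step(checkpoint: Path) -> int:
    # Assumption: metadata.json records the training step under a "step" key.
    return int(json.loads((checkpoint / "metadata.json").read_text())["step"])


def checkpoint_candidates(checkpoint_path: str, temporary_roots: tuple[str, ...] = ()) -> list[Path]:
    """Return candidates, newest step first, without overriding a pinned checkpoint."""
    requested = Path(checkpoint_path)
    if has_metadata(requested):
        # Pinned checkpoint: temporary roots are ignored entirely.
        return [requested]

    found: list[Path] = []
    for root in (requested, *(Path(r) for r in temporary_roots)):
        if root.is_dir():
            found.extend(child for child in root.iterdir() if has_metadata(child))
    return sorted(found, key=read_step, reverse=True)


def write_fake_checkpoint(root: Path, step: int) -> Path:
    """Demo helper: create root/step-<step>/metadata.json."""
    ckpt = root / f"step-{step}"
    ckpt.mkdir(parents=True)
    (ckpt / "metadata.json").write_text(json.dumps({"step": step}))
    return ckpt


def demo() -> tuple[str, str]:
    """Resolve once from a root directory and once from a pinned checkpoint."""
    with tempfile.TemporaryDirectory() as tmp:
        durable, temp = Path(tmp) / "durable", Path(tmp) / "temp"
        pinned = write_fake_checkpoint(durable, 100)
        write_fake_checkpoint(temp, 150)  # newer temporary checkpoint
        from_root = checkpoint_candidates(str(durable), (str(temp),))[0].name
        from_pin = checkpoint_candidates(str(pinned), (str(temp),))[0].name
        return from_root, from_pin
```

In `demo()`, resolving from the durable root picks the newer temporary checkpoint, while resolving from the pinned step-100 directory returns that directory even though a newer temporary checkpoint exists.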

🤖 Reworked this into the cleaner API: restore_grug_state_from_checkpoint now takes checkpoint_search_paths directly. Callers pass [explicit_checkpoint_path] when a checkpoint is pinned, or [permanent_root, temporary_root] for normal resume discovery, so the restore helper no longer has to infer intent from an additional_paths parameter.
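
A caller following this convention might assemble the list as in the sketch below. `build_checkpoint_search_paths` and its parameters are hypothetical and not part of the PR. Only the two list shapes, `[explicit_checkpoint_path]` and `[permanent_root, temporary_root]`, come from the comment above, and the real signature of `restore_grug_state_from_checkpoint` is not shown here.

```python
from typing import Optional


def build_checkpoint_search_paths(
    permanent_root: str,
    temporary_root: Optional[str] = None,
    explicit_checkpoint_path: Optional[str] = None,
) -> list[str]:
    """Hypothetical caller-side helper that makes the resume intent explicit."""
    if explicit_checkpoint_path is not None:
        # Pinned checkpoint: search only that path so discovery cannot pick a newer one.
        return [explicit_checkpoint_path]

    # Normal resume: search the durable root and, if configured, the temporary root.
    paths = [permanent_root]
    if temporary_root is not None:
        paths.append(temporary_root)
    return paths
```

The resulting list would then be passed as `checkpoint_search_paths`, which keeps the decision between pinned and discovered resume with the caller rather than inside the restore helper.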

@ravwojdyla ravwojdyla left a comment

Approving, but do we need to update the trainer?

Comment thread lib/levanter/src/levanter/checkpoint.py
Comment thread lib/levanter/src/levanter/trainer.py Outdated
@dlwh dlwh merged commit 534544b into main Apr 22, 2026
37 checks passed
@dlwh dlwh deleted the agent/20260403-fix-4386 branch April 22, 2026 18:32

Labels

agent-generated Created by automation/agent

Development

Successfully merging this pull request may close these issues.

[levanter] Separate temporary checkpoint base path and use Marin temp buckets

2 participants