[levanter] Separate temporary checkpoint base path and use Marin temp buckets#4387
[levanter] Separate temporary checkpoint base path and use Marin temp buckets#4387
Conversation
Add temporary_base_path to CheckpointerConfig and Checkpointer so time-policy checkpoints route to a separate directory (e.g. region-local temp buckets with lifecycle TTL) while step-policy checkpoints stay on the durable base_path. Update discover_latest_checkpoint to search across multiple roots, update grug restore to merge candidates from both paths, and wire Marin training wrapper to use marin_temp_bucket. Fixes #4386
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f3f7cf1bb1
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| fs, plain_path = _get_fs_and_plain_path(checkpoint_path) | ||
| base_path_protocol = urllib.parse.urlparse(checkpoint_path).scheme | ||
| def _checkpoint_candidates(checkpoint_path: str, *, additional_paths: list[str] | None = None) -> list[str]: | ||
| all_roots = [checkpoint_path] + (additional_paths or []) |
There was a problem hiding this comment.
Respect explicit checkpoint path when temp roots are provided
When checkpoint_path is a concrete checkpoint directory (supported by trainer.load_checkpoint_path), this function now mixes it with additional_paths and globally ranks all candidates by step. That allows a newer temp checkpoint to be loaded instead of the explicitly requested checkpoint, which silently changes resume behavior and breaks reproducibility for users pinning a specific step. This was introduced by adding additional_paths into the same candidate pool without a guard for explicit checkpoint paths.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
🤖 Fixed in 415a3e0 by treating any path with its own metadata.json as an explicit checkpoint directory. Additional temporary roots are only considered when checkpoint_path is a parent/root directory, and tests now cover the pinned-checkpoint case.
There was a problem hiding this comment.
🤖 Reworked this into the cleaner API: restore_grug_state_from_checkpoint now takes checkpoint_search_paths directly. Callers pass [explicit_checkpoint_path] when a checkpoint is pinned, or [permanent_root, temporary_root] for normal resume discovery, so the restore helper no longer has to infer intent from an additional_paths parameter.
ravwojdyla
left a comment
There was a problem hiding this comment.
Approving, but do we need to update the trainer?
Add temporary_base_path to CheckpointerConfig and Checkpointer so time-policy checkpoints route separately while step-policy checkpoints stay durable. Marin launchers now derive run-specific region-local temp checkpoint roots, and Trainer/Grug restore plus direct Levanter load sites discover concrete checkpoint paths before calling load_checkpoint.
Fixes #4386