Description
Levanter currently writes step checkpoints and time-policy temporary checkpoints under the same base path (lib/levanter/src/levanter/checkpoint.py:195-221, lib/levanter/src/levanter/checkpoint.py:585-615). Marin only rewrites the main checkpoint path (lib/marin/src/marin/training/training.py:106-111), and grug launchers also point CheckpointerConfig.base_path at <output_path>/checkpoints (experiments/grug/base/launch.py:100-105, experiments/grug/moe/launch.py:110-115, experiments/grug/modular_opt/launch.py:206-211).
We should split temporary checkpoint writes onto a separate base path and have Marin route that path to region-local temp buckets (marin_temp_bucket(...)) with lifecycle TTL.
Proposed plan:
- Extend Levanter checkpointer config/API with temporary_base_path (default None).
- Route time-policy saves to temporary_base_path and keep step-policy/permanent saves on base_path.
- Update resume/discovery logic so default restore can consider both roots (newest valid checkpoint wins) while keeping explicit load_checkpoint_path behavior unchanged.
- Update Marin training wrapper to set temporary_base_path to marin_temp_bucket(ttl_days=14, prefix="checkpoints-temp") when normalizing training output paths.
- Update grug launch wiring and grug restore path selection (experiments/grug/base/train.py:374-378, experiments/grug/checkpointing.py:72-103, and moe/modular mirrors) so it also restores from temp-root candidates.
- Update docs and comments that currently claim checkpoints only live under <output_path>/checkpoints (for example experiments/grug/README.md).
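The routing and discovery rules in the plan above can be sketched roughly as follows. This is a minimal illustration, not Levanter's actual code: SplitCheckpointPaths, Candidate, and pick_resume_checkpoint are hypothetical names, and "newest" is assumed here to mean the highest training step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SplitCheckpointPaths:
    """Hypothetical stand-in for the relevant CheckpointerConfig fields."""

    base_path: str
    temporary_base_path: Optional[str] = None  # proposed field, default None

    def root_for(self, is_temporary: bool) -> str:
        # Time-policy (temporary) saves go to the temp root when one is set.
        # Step-policy saves, and all saves when no temp root is configured,
        # stay on base_path, which preserves today's behavior by default.
        if is_temporary and self.temporary_base_path is not None:
            return self.temporary_base_path
        return self.base_path

    def discovery_roots(self) -> list[str]:
        # Roots that a default (non-explicit) restore would scan.
        roots = [self.base_path]
        if self.temporary_base_path is not None:
            roots.append(self.temporary_base_path)
        return roots


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one checkpoint found under either root."""

    path: str
    step: int
    is_valid: bool  # e.g. the checkpoint finished writing


def pick_resume_checkpoint(candidates: list[Candidate]) -> Optional[Candidate]:
    # Assumption: "newest valid checkpoint wins" is decided by step number.
    valid = [c for c in candidates if c.is_valid]
    if not valid:
        return None
    return max(valid, key=lambda c: c.step)
```

An explicit load_checkpoint_path would bypass pick_resume_checkpoint entirely, so that path stays unchanged.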
Definition of Done
- CheckpointerConfig supports a separate temporary checkpoint base path, with tests proving temporary checkpoints are written there while permanent checkpoints remain under the main base path.
- Default checkpoint discovery for resume works across permanent + temporary roots and prefers the newest valid checkpoint.
- Marin LM/DPO wrapper configures the temporary checkpoint path via marin_temp_bucket(...) (region-local) with TTL=14 days.
- Grug launch + restore code uses the same temp-path mechanism and can resume from temporary checkpoints.
- Existing checkpoint tests are updated and new coverage is added for split-path save/delete/discovery behavior.
- Relevant docs are updated to describe permanent vs temporary checkpoint locations.
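One possible shape for the split-path save/delete coverage is sketched below using local temp directories. FakeSplitCheckpointer is a hypothetical test double, not Levanter's Checkpointer, and it assumes that the previous temporary checkpoint is discarded once any newer checkpoint has been written; the real test should assert whatever retention rule the implementation ends up with.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


class FakeSplitCheckpointer:
    """Hypothetical test double that mimics split-root saves on local disk."""

    def __init__(self, base_path: str, temporary_base_path: str):
        self.base = Path(base_path)
        self.temp = Path(temporary_base_path)
        self._last_temp: Optional[Path] = None

    def save(self, step: int, *, temporary: bool) -> Path:
        root = self.temp if temporary else self.base
        target = root / f"step-{step}"
        target.mkdir(parents=True)
        (target / "metadata.json").write_text(f'{{"step": {step}}}')
        # Assumed retention rule: a newer checkpoint (of either kind)
        # supersedes the previous temporary one, which is then deleted.
        if self._last_temp is not None:
            shutil.rmtree(self._last_temp)
        self._last_temp = target if temporary else None
        return target


def run_split_path_scenario() -> tuple[list[str], list[str]]:
    with tempfile.TemporaryDirectory() as perm, tempfile.TemporaryDirectory() as temp:
        ckpt = FakeSplitCheckpointer(perm, temp)
        ckpt.save(10, temporary=True)
        ckpt.save(20, temporary=True)   # removes step-10 from the temp root
        ckpt.save(30, temporary=False)  # permanent; removes step-20
        ckpt.save(40, temporary=True)
        permanent = sorted(p.name for p in Path(perm).iterdir())
        temporary = sorted(p.name for p in Path(temp).iterdir())
        return permanent, temporary
```

In this scenario the permanent root should end up holding only step-30 and the temp root only step-40, which is the property the real tests would check against the actual checkpointer.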