[levanter] Separate temporary checkpoint base path and use Marin temp buckets #4386

@dlwh

Description

Levanter currently writes step checkpoints and time-policy temporary checkpoints under the same base path (lib/levanter/src/levanter/checkpoint.py:195-221, lib/levanter/src/levanter/checkpoint.py:585-615). Marin only rewrites the main checkpoint path (lib/marin/src/marin/training/training.py:106-111), and grug launchers also point CheckpointerConfig.base_path at <output_path>/checkpoints (experiments/grug/base/launch.py:100-105, experiments/grug/moe/launch.py:110-115, experiments/grug/modular_opt/launch.py:206-211).

We should split temporary checkpoint writes onto a separate base path and have Marin route that path to region-local temp buckets (marin_temp_bucket(...)) with lifecycle TTL.

Proposed plan:

  1. Extend Levanter checkpointer config/API with temporary_base_path (default None).
  2. Route time-policy saves to temporary_base_path and keep step-policy/permanent saves on base_path.
  3. Update resume/discovery logic so default restore can consider both roots (newest valid checkpoint wins) while keeping explicit load_checkpoint_path behavior unchanged.
  4. Update Marin training wrapper to set temporary_base_path to marin_temp_bucket(ttl_days=14, prefix="checkpoints-temp") when normalizing training output paths.
  5. Update grug launch wiring and grug restore path selection (experiments/grug/base/train.py:374-378, experiments/grug/checkpointing.py:72-103, and moe/modular mirrors) so it also restores from temp-root candidates.
  6. Update docs and comments that currently claim checkpoints only live under <output_path>/checkpoints (for example experiments/grug/README.md).
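
Steps 1 to 3 of the plan can be sketched roughly as follows. This is a minimal illustration and not Levanter's actual code: `SplitCheckpointPaths`, `write_checkpoint`, and `discover_latest` are hypothetical names, the sketch uses the local filesystem only, and it assumes each checkpoint directory holds a `metadata.json` with a `step` field and that "newest" means the highest step.

```python
import json
import os
import tempfile
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SplitCheckpointPaths:
    """Hypothetical stand-in for the proposed config fields (not Levanter's class)."""

    base_path: str
    temporary_base_path: Optional[str] = None  # None keeps single-root behavior

    def root_for(self, is_temporary: bool) -> str:
        # Time-policy (temporary) saves go to the temp root when one is configured;
        # step-policy (permanent) saves always stay under base_path.
        if is_temporary and self.temporary_base_path is not None:
            return self.temporary_base_path
        return self.base_path


def write_checkpoint(paths: SplitCheckpointPaths, step: int, is_temporary: bool) -> str:
    """Write a metadata-only placeholder checkpoint and return its directory."""
    ckpt_dir = os.path.join(paths.root_for(is_temporary), f"step-{step}")
    os.makedirs(ckpt_dir, exist_ok=True)
    with open(os.path.join(ckpt_dir, "metadata.json"), "w") as f:
        json.dump({"step": step, "is_temporary": is_temporary}, f)
    return ckpt_dir


def discover_latest(paths: SplitCheckpointPaths) -> Optional[str]:
    """Return the newest valid checkpoint directory across both roots, or None."""
    roots = [paths.base_path]
    if paths.temporary_base_path is not None:
        roots.append(paths.temporary_base_path)
    best = None  # (step, directory)
    for root in roots:
        if not os.path.isdir(root):
            continue
        for name in os.listdir(root):
            meta_path = os.path.join(root, name, "metadata.json")
            try:
                with open(meta_path) as f:
                    step = int(json.load(f)["step"])
            except (OSError, ValueError, KeyError, TypeError):
                continue  # missing or corrupt metadata: not a valid candidate
            # Strict ">" means the permanent root wins a tie on step.
            if best is None or step > best[0]:
                best = (step, os.path.join(root, name))
    return None if best is None else best[1]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        paths = SplitCheckpointPaths(
            base_path=os.path.join(root, "checkpoints"),
            temporary_base_path=os.path.join(root, "checkpoints-temp"),
        )
        write_checkpoint(paths, step=1000, is_temporary=False)
        write_checkpoint(paths, step=1200, is_temporary=True)
        print(discover_latest(paths))  # path under checkpoints-temp
```

An explicit `load_checkpoint_path` would bypass `discover_latest` entirely, which matches the plan's requirement that explicit loads keep their current behavior.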

Definition of Done

  • CheckpointerConfig supports a separate temporary checkpoint base path, with tests proving temporary checkpoints are written there while permanent checkpoints remain under the main base path.
  • Default checkpoint discovery for resume works across permanent + temporary roots and prefers the newest valid checkpoint.
  • Marin LM/DPO wrapper configures the temporary checkpoint path via marin_temp_bucket(...) (region-local) with TTL=14 days.
  • Grug launch + restore code uses the same temp-path mechanism and can resume from temporary checkpoints.
  • Existing checkpoint tests are updated and new coverage is added for split-path save/delete/discovery behavior.
  • Relevant docs are updated to describe permanent vs temporary checkpoint locations.
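
The split-path delete behavior named in the test-coverage item could be exercised along these lines. This is a sketch under stated assumptions: `prune_temporary` and `step_of` are hypothetical helpers, the `step-<n>` directory naming is assumed, and only the local filesystem is covered. The property it illustrates is that cleanup of superseded temporary checkpoints touches only the temporary root.

```python
import os
import shutil
import tempfile
from typing import List, Optional


def step_of(name: str) -> Optional[int]:
    """Parse 'step-<n>' directory names; return None for anything else."""
    prefix = "step-"
    suffix = name[len(prefix):]
    if name.startswith(prefix) and suffix.isdecimal():
        return int(suffix)
    return None


def prune_temporary(temporary_root: str, newest_step: int) -> List[int]:
    """Delete temp checkpoints strictly older than newest_step; return removed steps."""
    removed = []
    if not os.path.isdir(temporary_root):
        return removed
    for name in os.listdir(temporary_root):
        step = step_of(name)
        # Only directories that look like checkpoints are candidates for deletion.
        if step is not None and step < newest_step:
            shutil.rmtree(os.path.join(temporary_root, name))
            removed.append(step)
    return sorted(removed)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        permanent = os.path.join(root, "checkpoints")
        temporary = os.path.join(root, "checkpoints-temp")
        os.makedirs(os.path.join(permanent, "step-100"))
        os.makedirs(os.path.join(temporary, "step-110"))
        os.makedirs(os.path.join(temporary, "step-120"))

        removed = prune_temporary(temporary, newest_step=120)
        assert removed == [110]
        # The permanent root is never passed to prune_temporary, so it is untouched.
        assert os.path.isdir(os.path.join(permanent, "step-100"))
        assert os.path.isdir(os.path.join(temporary, "step-120"))
```

Because the function only ever receives the temporary root, a permanent checkpoint cannot be deleted by mistake, even when it shares a step number with a temporary one.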

Metadata

    Labels

    agent-generated (Created by automation/agent)
    p2 (Do before next release)
