[levanter] Separate temporary checkpoint base path and use Marin temp buckets

## Description
Levanter currently writes step checkpoints and time-policy temporary checkpoints under the same base path (`lib/levanter/src/levanter/checkpoint.py:195-221`, `lib/levanter/src/levanter/checkpoint.py:585-615`). Marin rewrites only the main checkpoint path (`lib/marin/src/marin/training/training.py:106-111`), and the grug launchers likewise point `CheckpointerConfig.base_path` at `<output_path>/checkpoints` (`experiments/grug/base/launch.py:100-105`, `experiments/grug/moe/launch.py:110-115`, `experiments/grug/modular_opt/launch.py:206-211`).

We should split temporary checkpoint writes onto a separate base path and have Marin route that path to region-local temp buckets (`marin_temp_bucket(...)`) that carry a lifecycle TTL.

Proposed plan:
1. Extend Levanter checkpointer config/API with `temporary_base_path` (default `None`).
2. Route time-policy saves to `temporary_base_path` and keep step-policy/permanent saves on `base_path`.
3. Update resume/discovery logic so default restore can consider both roots (newest valid checkpoint wins) while keeping explicit `load_checkpoint_path` behavior unchanged.
4. Update Marin training wrapper to set `temporary_base_path` to `marin_temp_bucket(ttl_days=14, prefix="checkpoints-temp")` when normalizing training output paths.
5. Update grug launch wiring and grug restore path selection (`experiments/grug/base/train.py:374-378`, `experiments/grug/checkpointing.py:72-103`, and moe/modular mirrors) so it also restores from temp-root candidates.
6. Update docs and comments that currently claim checkpoints only live under `<output_path>/checkpoints` (for example `experiments/grug/README.md`).
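
Steps 1 and 2 could look roughly like the sketch below. It is a standalone illustration, not Levanter code: `CheckpointPaths` and `root_for` are hypothetical names standing in for the new `CheckpointerConfig` field and whatever routing hook the checkpointer ends up using, and the bucket URLs are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CheckpointPaths:
    """Hypothetical stand-in for the relevant CheckpointerConfig fields."""

    base_path: str
    # None keeps today's behavior: everything is written under base_path.
    temporary_base_path: Optional[str] = None

    def root_for(self, is_temporary: bool) -> str:
        # Time-policy (temporary) saves go to the temp root when one is set;
        # step-policy (permanent) saves always stay on base_path.
        if is_temporary and self.temporary_base_path is not None:
            return self.temporary_base_path
        return self.base_path


# Illustrative paths only; a real temp root would come from marin_temp_bucket(...).
paths = CheckpointPaths(
    base_path="gs://example-main/run-1/checkpoints",
    temporary_base_path="gs://example-temp/checkpoints-temp/run-1",
)
permanent_root = paths.root_for(is_temporary=False)
temporary_root = paths.root_for(is_temporary=True)
```

Defaulting `temporary_base_path` to `None` and falling back to `base_path` keeps existing configs working unchanged, which is the reason for the default proposed in step 1.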

### Definition of Done
- `CheckpointerConfig` supports a separate temporary checkpoint base path, with tests proving temporary checkpoints are written there while permanent checkpoints remain under the main base path.
- Default checkpoint discovery for resume works across permanent + temporary roots and prefers the newest valid checkpoint.
- Marin LM/DPO wrapper configures the temporary checkpoint path via `marin_temp_bucket(...)` (region-local) with TTL=14 days.
- Grug launch + restore code uses the same temp-path mechanism and can resume from temporary checkpoints.
- Existing checkpoint tests are updated and new coverage is added for split-path save/delete/discovery behavior.
- Relevant docs are updated to describe permanent vs temporary checkpoint locations.
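
A test for the cross-root discovery item might be shaped like the following sketch. It uses local temp directories instead of buckets and makes several assumptions that are not taken from the Levanter code: `write_fake_checkpoint` and `discover_newest` are hypothetical helpers, "newest" is taken to mean highest step, and a checkpoint is treated as valid only if it contains a `metadata.json` with a `step` field.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def write_fake_checkpoint(root: Path, name: str, step: int, complete: bool = True) -> Path:
    """Create a checkpoint directory; metadata.json marks it complete (assumed layout)."""
    ckpt = root / name
    ckpt.mkdir(parents=True)
    if complete:
        (ckpt / "metadata.json").write_text(json.dumps({"step": step}))
    return ckpt


def discover_newest(roots: Iterable[Path]) -> Optional[Path]:
    """Return the highest-step checkpoint across all roots, skipping incomplete ones."""
    best: Optional[tuple[int, Path]] = None
    for root in roots:
        if not root.is_dir():
            continue  # a root that was never written to is simply ignored
        for ckpt in root.iterdir():
            meta = ckpt / "metadata.json"
            if not meta.is_file():
                continue  # no metadata: treat as an interrupted write
            step = json.loads(meta.read_text())["step"]
            if best is None or step > best[0]:
                best = (step, ckpt)
    return None if best is None else best[1]


with tempfile.TemporaryDirectory() as tmp:
    permanent = Path(tmp) / "checkpoints"
    temporary = Path(tmp) / "checkpoints-temp"
    write_fake_checkpoint(permanent, "step-1000", 1000)
    write_fake_checkpoint(temporary, "step-1200", 1200)
    # Simulates a temporary save that was interrupted before it finished.
    write_fake_checkpoint(temporary, "step-1300", 1300, complete=False)
    newest = discover_newest([permanent, temporary])
    newest_name = newest.name if newest is not None else None
```

Here the complete temporary checkpoint at step 1200 is selected over both the older permanent one and the incomplete step-1300 directory. How ties between roots at the same step should resolve is left open by this sketch.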

