Commit b860d5b

wip

1 parent b90a1eb commit b860d5b

2 files changed

Lines changed: 134 additions & 3 deletions

.agents/logbooks/midtraining_delphi.md

Lines changed: 92 additions & 0 deletions

@@ -394,6 +394,98 @@ Notes captured while writing `experiments/exp_delphi_math_10b_midtrain.py`. Upda
| 2026-04-21 22:35Z | v8 got past cache-copy (succeeded this time), dispatched the 6 TPU `train_lm` jobs on v5p-64 → all hosts crashed with `ValueError: Unsupported URI scheme for tensorstore: 'mirror' in ...`. | **Lesson:** `mirror://` is an fsspec protocol. Levanter's checkpoint loader uses TensorStore directly (native GCS paths, no fsspec). So `mirror://` works for data loading but NOT for `initialize_from_checkpoint_path`. |
| 2026-04-21 22:38Z | Reverted ckpt fields in `experiments/exp_delphi_math_10b_midtrain.py` from `mirror://<path>` → `gs://marin-us-central2/<path>`. Cross-region reads are fine: TensorStore doesn't consult the fsspec `CrossRegionGuardedFS`, so `MARIN_I_WILL_PAY_FOR_ALL_FEES=1` isn't even strictly needed for the ckpt read (kept it for the data fsspec paths). Committed `c13560c3f`, pushed. | Normalize/tokenize caches from v7 remain in us-central1; v9 should skip straight to cache-copy then training. |
| 2026-04-21 22:39Z | Submitted `/ahmed/delphi-math-10b-sweep-v9`. | Job accepted. |
| 2026-04-21 22:40Z | v9 FAILED in 33s with `ValueError: initialize_from_checkpoint_path is not in the same region (us-central2) as the VM (us-central1)` for all 6 sweep steps. Marin's `rigging.filesystem.check_gcs_paths_same_region` (invoked via `_doublecheck_paths` in `lib/marin/src/marin/training/training.py`) hard-fails any `gs://` path whose bucket region doesn't match the VM's region — no env-var override exists. | Fix: pre-copy the two base ckpts into us-central1 so they're co-located with the pinned `--region us-central1` coordinator. |
| 2026-04-21 22:42Z | Server-side `gcloud storage cp --recursive` copy of both ckpts, us-central2 → us-central1 (23 GB + 41 GB ≈ 64 GB). Updated `BASES[*]["ckpt"]` in the experiment file to `gs://marin-us-central1/...` paths. Committed `56b1b1c86`, pushed. | Copies ran in the background in parallel and finished at ~350–620 MiB/s server-side. |
| 2026-04-21 22:44Z | Submitted `/ahmed/delphi-math-10b-sweep-v10` with `--region us-central1` (dropping us-east5 because the ckpts are now only in us-central1). | Coordinator up in ~5 s. |
| 2026-04-21 22:45–22:50Z | v10 walked the dep graph (skipping the already-cached normalize/tokenize steps from v7), dispatched the train_lm sub-task, sat pending-on-coscheduling for a v5p-64 slice (waiting for 8 worker VMs to come up), then went `running`. | Expected TPU spin-up delay. |
| 2026-04-21 22:53Z | All 8 `train_lm/[0-7]` replicas restored the checkpoint successfully (`jax.experimental.array_serialization.serialization Error check finished successfully`). Train-step tracing + HLO lowering completed in a few seconds. **First training step** 39.3 s (compile-heavy), then 4.4–4.5 s/step steady-state. | MFU ≈ 36 % on v5p-64 (32 chips × 459 TFLOPS bf16 peak; achieved ≈ 5.3 PFLOP/s). |
| 2026-04-21 22:53 – 2026-04-22 05:16Z | v10 training ran for ~6 h 22 m. Loss dropped 1.58 → 1.12 over warm-up (500 steps), plateaued 1.12–1.20 through decay, **final train_loss 0.958 (tqdm) / 0.962 (W&B summary at step 4767)**. 3 mid-run evals at steps 200/400/600/800/... ran cleanly. Periodic save at step 1000. | Successful run: `delphi-1e20-iso-d2048-L21-math-10b-lr0.5-ba7b7f` (wandb + 155 GB at `gs://marin-us-central1/checkpoints/delphi-1e20-iso-d2048-L21-math-10b-lr0.5-ba7b7f/`). |
| 2026-04-22 05:16Z | v10 coordinator `state=succeeded`. But iris shows only ONE `train_lm` child under the coordinator, and wandb has only ONE v10-era run (the `lr0.5-ba7b7f` one). GCS sweep-output audit: `lr0.5-ba7b7f` → 155 GB with `checkpoints/`, `hf/`, `tracker_metrics.jsonl`; `lr0.67-e3be0c` → **65 KB, `.executor_status=SUCCESS`, `.artifact=null`, no training output**; `lr0.83-e3de76` / `1e21-lr0.5-ccce18` / `1e21-lr0.67-e5b5df` / `1e21-lr0.83-ece889` → 65 KB each, `.executor_status=FAILED`. | **Only 1 of 6 sweep points actually trained.** The other 5 marked terminal states (SUCCESS or FAILED) without producing training artifacts. |
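
Quick arithmetic check on the MFU figure in the 22:53Z row (all numbers taken from that row; nothing new assumed):

```python
# Sanity-check the MFU claim: 32 chips x 459 TFLOPS bf16 peak, ~5.3 PFLOP/s achieved.
chips = 32
peak_tflops_per_chip = 459          # v5p bf16 peak, per the row above
achieved_pflops = 5.3               # observed throughput, per the row above
mfu = achieved_pflops * 1000 / (chips * peak_tflops_per_chip)
print(f"MFU ≈ {mfu:.0%}")           # -> MFU ≈ 36%
```
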
## 2026-04-22 post-v10 analysis: the `train_lm` name-collision pitfall
`lib/marin/src/marin/training/training.py:307` pins every dispatched iris sub-job to the **literal name** `"train_lm"`. When an `executor_main` invocation runs 6 training `ExecutorStep`s whose dependencies are all already satisfied (tokenize cache warm, base ckpts local), `step_runner`'s default `max_concurrent=8` `ThreadPoolExecutor` dispatches all 6 `run_levanter_train_lm` calls **in parallel**. All 6 call `_submit_training_job(job_name="train_lm", ...)` under the same coordinator parent → same full iris path `/ahmed/<coord>/train_lm`.

The iris controller's `EXISTING_JOB_POLICY_KEEP` (what fray hands it for `adopt_existing=True`) then, per `lib/iris/src/iris/cluster/controller/service.py:1113-1117`:

```python
elif policy == job_pb2.EXISTING_JOB_POLICY_KEEP:
    if not is_job_finished(existing_job.state):
        return controller_pb2.Controller.LaunchJobResponse(job_id=job_id.to_wire())
    # Job finished, replace it (KEEP only preserves running jobs)
    self._transitions.remove_finished_job(job_id)
```
→ racing submits 2…6 see the still-running first job and **adopt its handle without creating a new job**. All six Python threads then `.wait()` on the same handle, which completes when the first (and only) config's training finishes. The adopted-handle threads see "SUCCEEDED" and return, so `step_runner` marks their steps as `STATUS_SUCCESS` despite the fn never actually running Levanter training.
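
For intuition, here is a self-contained sketch of that collapse — hypothetical stand-ins only (`launch_keep`, a `threading.Event` as the job handle), not the iris or Marin APIs:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

_jobs: dict[str, threading.Event] = {}  # full job path -> completion handle
_lock = threading.Lock()

def launch_keep(full_path: str) -> tuple[threading.Event, bool]:
    """KEEP semantics: adopt a still-running job of the same name instead of creating one."""
    with _lock:
        existing = _jobs.get(full_path)
        if existing is not None and not existing.is_set():
            return existing, False          # adopt the running job's handle
        handle = threading.Event()          # absent or finished -> create a fresh job
        _jobs[full_path] = handle
        return handle, True

def train_lm_stub(i: int) -> str:
    # All 6 sweep points submit under the same parent with the same literal name.
    handle, created = launch_keep("/ahmed/coord/train_lm")
    if created:
        time.sleep(0.2)                     # stand-in for the real training run
        handle.set()
    handle.wait()                           # adopters just block on the first job's handle
    return f"config {i}: {'trained' if created else 'adopted handle, never ran'}"

with ThreadPoolExecutor(max_workers=8) as pool:  # mirrors step_runner's max_concurrent=8
    for line in pool.map(train_lm_stub, range(6)):
        print(line)                          # exactly one "trained"; five false successes
```
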
Why I confused myself with the PR 4591 seed-sweep precedent: the seed-sweep pattern *defines* many ExecutorSteps but in practice each seed ran as a **separate top-level iris job** on a different day (wandb `created_at` for 1e21 seed{0,42,62746} were 2026-03-04, 2026-03-18 01:47Z, 2026-03-18 06:11Z; 1e22 seeds were 03-04, 03-22, 03-26). Different top-level coordinators → different parent paths → no `train_lm` collision. The seed PR itself didn't fix the collision; it just enabled enumerating the variants, and the human operator launched each variant separately.
## Fix applied to `experiments/exp_delphi_math_10b_midtrain.py` (not yet run)
Added env-var-driven filtering so a single invocation of the script can build a single sweep point:
```python
_SELECT_BASE = os.environ.get("MIDTRAIN_SELECT_BASE")  # e.g. "1e21-v5"
_SELECT_LR = os.environ.get("MIDTRAIN_SELECT_LR")  # e.g. "0.67"

def _build_runs():
    for base_tag, base in BASES.items():
        if _SELECT_BASE is not None and base_tag != _SELECT_BASE:
            continue
        ...
        for lr_factor in LR_FACTORS:
            if _SELECT_LR is not None and _lr_str(lr_factor) != _SELECT_LR:
                continue
            ...
```
Verified:
- Unset: builds all 6 steps (as before). Useful for dry-run/introspection.
- `MIDTRAIN_SELECT_BASE=1e21-v5 MIDTRAIN_SELECT_LR=0.67`: builds just `delphi-1e21-v5-math-10b-lr0.67`.
**Step hashes are unchanged by filtering** (filtering only affects which steps `_build_runs` returns; each step's config is byte-identical to before). So the already-succeeded `delphi-1e20-iso-d2048-L21-math-10b-lr0.5-ba7b7f` entry stays cached — any future invocation that includes it will see `STATUS_SUCCESS` and skip. The 5 remaining steps currently have a `.executor_status` of `SUCCESS` (lr0.67 1e20) or `FAILED` (other 4). Before relaunching:
- **`STATUS_SUCCESS` with no training output** (the `lr0.67-e3be0c` case, and possibly others): the cache check will treat them as succeeded → skip → no retraining. Workaround: delete `.executor_status` at those output paths so the next run re-does the step. (Do NOT delete the `lr0.5-ba7b7f` one — that one really did train.)
- **`STATUS_FAILED`**: `step_runner` will raise `PreviousTaskFailedError` unless you pass `force_run_failed=True` or delete the status file. (A sketch of this gate follows below.)
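
A minimal sketch of that status gate, assuming only the `.executor_status` semantics described in the two bullets above (this is not marin's actual implementation):

```python
class PreviousTaskFailedError(RuntimeError):
    pass

def should_run(executor_status: str | None, force_run_failed: bool = False) -> bool:
    """executor_status is the step's .executor_status content, or None if the file is absent."""
    if executor_status == "SUCCESS":
        return False  # cache hit -> skip, even if the step produced no artifacts
    if executor_status == "FAILED" and not force_run_failed:
        raise PreviousTaskFailedError("delete .executor_status or pass force_run_failed=True")
    return True       # no status (or FAILED + force) -> run the step
```
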
## Launch recipe for the 5 remaining sweep points (copy-paste)
Each variant goes as its own iris coordinator so `/ahmed/<coord-N>/train_lm` is a unique path per sweep point. Run from the repo root:
```bash
# 1. (one-time) clean up the stale STATUS files so the 5 remaining steps don't
#    short-circuit on cache hit / fail-as-previous-failure:
for target in \
    'delphi-1e20-iso-d2048-L21-math-10b-lr0.67-e3be0c' \
    'delphi-1e20-iso-d2048-L21-math-10b-lr0.83-e3de76' \
    'delphi-1e21-v5-math-10b-lr0.5-ccce18' \
    'delphi-1e21-v5-math-10b-lr0.67-e5b5df' \
    'delphi-1e21-v5-math-10b-lr0.83-ece889'; do
  gcloud storage rm "gs://marin-us-central1/checkpoints/${target}/.executor_status" 2>/dev/null || true
done

# 2. launch each as its own iris job (each gets a unique --job-name)
launch() {
  local base="$1" lr="$2" short
  short=$(echo "$base" | sed 's/-iso-d2048-L21//')
  uv run iris --cluster=marin job run \
    --cpu 1 --memory 3GB --disk 9GB \
    --region us-central1 \
    --job-name "delphi-math-10b-${short}-lr${lr}" \
    --no-wait \
    -e MARIN_I_WILL_PAY_FOR_ALL_FEES 1 \
    -e WANDB_API_KEY "${WANDB_API_KEY}" \
    -e MIDTRAIN_SELECT_BASE "$base" \
    -e MIDTRAIN_SELECT_LR "$lr" \
    -- python experiments/exp_delphi_math_10b_midtrain.py
}
launch 1e20-iso-d2048-L21 0.67
launch 1e20-iso-d2048-L21 0.83
launch 1e21-v5 0.5
launch 1e21-v5 0.67
launch 1e21-v5 0.83
```
All 5 can run in parallel (v5p-64 pool has `max_slices: 256`). Expected per-run wall-time: 1.9 B base ≈ 6 h, 3.4 B base ≈ 10 h (larger model, same BS=512). W&B runs appear under `marin-community/marin` with names `delphi-<base>-math-10b-lr<factor>-<hash>`.
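
Back-of-envelope on the wall-time estimates, assuming step time scales roughly linearly with parameter count at fixed batch size (a naive check, not from the logbook; the measured figure is v10's 6 h 22 m):

```python
measured_hours_1p9b = 6 + 22 / 60   # v10 wall-time on the 1.9 B base
param_ratio = 3.4 / 1.9             # 3.4 B vs 1.9 B, same BS=512
estimate = measured_hours_1p9b * param_ratio
print(f"{estimate:.1f} h")          # ≈ 11.4 h — same ballpark as the ~10 h estimate above
```
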
## Expected cross-region transfers (FYI)

experiments/exp_delphi_math_10b_midtrain.py

Lines changed: 42 additions & 3 deletions

```diff
@@ -10,11 +10,34 @@
 https://oa.williamheld.com/blog/delphi/ and ``experiments.scaling_law_sweeps
 .completed_adamh``.
 
-This file produces ``len(BASES) * len(LR_FACTORS) = 6`` :class:`ExecutorStep`
-runs. See ``.agents/logbooks/midtraining_delphi.md`` for the full rationale,
+This file enumerates ``len(BASES) * len(LR_FACTORS) = 6`` :class:`ExecutorStep`
+runs. **Each sweep point must be launched as its own top-level ``iris job
+run`` coordinator** — do NOT put all 6 under a single coordinator, because
+Marin's ``run_levanter_train_lm`` submits its iris child with the hardcoded
+name ``train_lm`` (``lib/marin/src/marin/training/training.py:307``) and
+concurrent same-name submits collapse onto one handle via the iris
+``EXISTING_JOB_POLICY_KEEP`` policy (``lib/iris/src/iris/cluster/controller
+/service.py:1113``). Symptom on v10: 1 of 6 actually trained; the other 5
+marked SUCCESS with empty artifacts.
+
+To launch a single sweep point, set the env vars below so this script
+builds only the matching step:
+
+    iris --cluster=marin job run --cpu 1 --memory 3GB --disk 9GB \\
+        --region us-central1 --job-name delphi-math-10b-1e21-lr0.67 --no-wait \\
+        -e MARIN_I_WILL_PAY_FOR_ALL_FEES 1 -e WANDB_API_KEY "$WANDB_API_KEY" \\
+        -e MIDTRAIN_SELECT_BASE 1e21-v5 -e MIDTRAIN_SELECT_LR 0.67 \\
+        -- python experiments/exp_delphi_math_10b_midtrain.py
+
+With no env vars set, all 6 steps are generated (useful for dry-runs /
+introspection; do NOT actually ``executor_main`` on the full list).
+
+See ``.agents/logbooks/midtraining_delphi.md`` for the full rationale,
 numbers, and verification plan.
 """
 
+import os
+
 from levanter.optim import AdamHConfig
 
 from experiments.defaults import default_train
@@ -123,9 +146,23 @@ def _build_adamh(base: dict, lr_factor: float) -> AdamHConfig:
     )
 
 
+# Env-var filters: set these to restrict the generated sweep to a single
+# point so each can be launched as its own iris coordinator job. Step hashes
+# are unchanged by filtering — already-succeeded outputs (e.g. the v10
+# `lr0.5-ba7b7f` run) stay cached and will be skipped automatically.
+_SELECT_BASE = os.environ.get("MIDTRAIN_SELECT_BASE")  # e.g. "1e21-v5"
+_SELECT_LR = os.environ.get("MIDTRAIN_SELECT_LR")  # e.g. "0.67"
+
+
+def _lr_str(lr_factor: float) -> str:
+    return f"{lr_factor:.2f}".rstrip("0").rstrip(".")
+
+
 def _build_runs() -> list[ExecutorStep]:
     runs: list[ExecutorStep] = []
     for base_tag, base in BASES.items():
+        if _SELECT_BASE is not None and base_tag != _SELECT_BASE:
+            continue
         # Reconstruct the Qwen3Config exactly as the pretrain run built it,
         # so TensorStore weight restore matches every array shape.
         # Private method is intentional: it's the single source of truth for
@@ -136,6 +173,8 @@ def _build_runs() -> list[ExecutorStep]:
         )
 
         for lr_factor in LR_FACTORS:
+            if _SELECT_LR is not None and _lr_str(lr_factor) != _SELECT_LR:
+                continue
             optimizer = _build_adamh(base, lr_factor)
 
             train_cfg = SimpleTrainConfig(
@@ -158,7 +197,7 @@ def _build_runs() -> list[ExecutorStep]:
                 steps_per_hf_export=STEPS_PER_HF_EXPORT,
             )
 
-            lr_str = f"{lr_factor:.2f}".rstrip("0").rstrip(".")
+            lr_str = _lr_str(lr_factor)
             name = f"delphi-{base_tag}-math-10b-lr{lr_str}"
 
             runs.append(
```
