Document where the scaling-law / plotting infra lives
(`lib/marin/src/marin/scaling_laws/` + `experiments/isoflop_sweep.py`,
authored by William Held). Note that `tracker_metrics.jsonl` is the
only source of truth for the two runs whose W&B history is polluted by the
step-monotonic rejection bug, and that the isoflop/scaling-fit machinery is the
wrong abstraction for a 3-point × 2-base LR sweep.
- `gs://marin-us-central1/checkpoints/delphi-1e20-iso-d2048-L21-math-10b-lr0.67-e3be0c/` (the GCS checkpoint is from the healthy fresh training, but the W&B run is polluted with broken data)
- The corresponding W&B runs at those names display misleading flat-min-lr curves; they are safe to delete once the `-v2` runs are locked in.
---
## Analysis + plotting utilities in this repo (found 2026-04-23)
Will Held owns most of the scaling-law / analysis infra. If you need to produce plots, fits, or sweep-wide comparisons, these are the code paths to study first rather than rolling your own.
| File | Key exports | Purpose |
|---|---|---|
|`scaling_plots.py`|`create_isoflop_plot`, `create_scaling_plot`, `save_plots`, `upload_plots_to_wandb`| Plotly figure builders for isoflop curves + scaling fits; GCS save + W&B artifact upload |
|`isoflop_analysis.py`|`fit_scaling_laws`, `predict_optimal_config`, `robust_quad_logx` (Huber-loss quadratic fit), `ScalingFit`, `QuadraticFitCoeffs`, `IsoFlopRecord`, `MinimaRecord`, `CandidateConfig`| Scaling-law math and data structures |
|`eval_metrics_reader.py`|`read_eval_records` (+ W&B backfill via `_backfill_metrics_from_wandb`)| Pulls per-step eval metrics from GCS runs and the W&B API, unifying the two sources |
|`tpu_utils.py`|`pick_v5p_type`, `pick_v4_type`, `V5P_SPEC`, `V4_SPEC`| Choose the smallest TPU slice that fits a given model |
|`__init__.py`| Re-exports the above | Public API entry point |
### Callers / end-to-end wiring
| File | Purpose |
|---|---|
|`experiments/isoflop_sweep.py`|**The canonical ExecutorStep wiring** — reads eval metrics, fits scaling laws, emits plots, uploads to W&B. Pattern-match against this when building a new analysis step. |
|`experiments/exp1337_delphi_suite.py`| Delphi-specific sweep runner using `predict_optimal_config`. Source of the `(H, L, B)`-heuristic pipeline. |
|`experiments/exp2166_scaling_ladder_analysis.py`| Most recent ladder analysis (~2026-02). |
|`experiments/scaling_law_sweeps/completed_adamh.py`| The AdamH heuristic that drove our 1e20 base-model choice — `completed_adamh_heuristic._build_model_config(hidden_size, seq_len)` is the source of truth for Delphi architecture. |
### Per-run training-loss (no dedicated Marin tool)
For single-run or small-sweep train-loss plots (what this midtraining sweep wants), the options are:
- `lib/levanter/scripts/loss_history.py` — ~30-line example that hits `wandb.Api().runs().scan_history()` for `train/loss` by git SHA. Good template.
- Read `tracker_metrics.jsonl` at each run's GCS output path directly (Levanter writes it independently of W&B; **this is our only source of truth for the `e3be0c` / `db9de7` runs whose W&B is polluted**). One JSON per step, columns include `train/loss`, `optim/learning_rate`, `optim/adam_lr`, and all `eval/paloma/*/loss` + `eval/uncheatable_eval/*/loss` series. Just `pd.read_json(..., lines=True)`.
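A minimal sketch of the `tracker_metrics.jsonl` read, assuming the column names listed above; the file written here is a synthetic stand-in for a run's real GCS output (reading `gs://` paths with pandas additionally requires `gcsfs`):

```python
import json
import pandas as pd

# Synthetic stand-in for a run's tracker_metrics.jsonl: one JSON object per step.
# In practice, point pd.read_json at the run's GCS output path, e.g.
#   pd.read_json("gs://<bucket>/<run>/tracker_metrics.jsonl", lines=True)
# (gs:// URLs need gcsfs installed).
rows = [
    {"step": 0,  "train/loss": 10.5, "optim/learning_rate": 3.0e-4},
    {"step": 10, "train/loss": 9.8,  "optim/learning_rate": 2.9e-4},
    {"step": 20, "train/loss": 9.1,  "optim/learning_rate": 2.8e-4},
]
with open("tracker_metrics.jsonl", "w") as f:
    for r in rows:
        f.write(json.dumps(r) + "\n")

df = pd.read_json("tracker_metrics.jsonl", lines=True)
loss = df.set_index("step")["train/loss"]  # step-indexed train-loss series
final_loss = loss.iloc[-1]
```

The same pattern works for the `optim/*` and `eval/paloma/*/loss` columns; each is just another column in the frame.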
### For this midtraining sweep specifically
The scaling-laws infra is overkill for a 3-point × 2-base LR sweep (no scaling fit is meaningful with one token budget + one parameter count per base). Appropriate plots:
- Train-loss vs step, EMA-smoothed, one line per `(base, lr_factor)`.
- Paloma validation loss vs step (`eval/paloma/c4_en/loss`, `eval/paloma/dolma-v1_5/loss`, etc.), same overlay.
- Final-loss bar chart per sweep point to pick the winner.
A ~50-line script that loads the 6 `tracker_metrics.jsonl` files (3 × 1e20 + 3 × 1e21 once they land) and renders these with matplotlib or plotly is enough. **Do not** build it on top of `isoflop_analysis.py` — wrong abstraction. **Do** reuse `eval_metrics_reader.read_eval_records` for the GCS + W&B unification logic if the runs have the right shape (check its filters first).
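A sketch of that script under stated assumptions: the synthetic loss curves stand in for the six real runs (load each via `pd.read_json(..., lines=True)` as above), the `(base, lr_factor)` keys and output filename are hypothetical, and EMA smoothing is done with pandas `ewm`:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def ema(series: pd.Series, alpha: float = 0.02) -> pd.Series:
    """Exponential moving average for smoothing noisy train-loss curves."""
    return series.ewm(alpha=alpha).mean()

# In practice: one DataFrame per run, loaded from its tracker_metrics.jsonl
# and keyed by (base, lr_factor); synthetic curves keep the sketch self-contained.
steps = np.arange(200)
runs = {
    ("1e20", 0.67): pd.DataFrame({"step": steps, "train/loss": 2 + 8 * np.exp(-0.020 * steps)}),
    ("1e20", 1.0):  pd.DataFrame({"step": steps, "train/loss": 2 + 8 * np.exp(-0.025 * steps)}),
}

# Train-loss vs step, EMA-smoothed, one line per (base, lr_factor).
fig, ax = plt.subplots()
for (base, lr), df in runs.items():
    ax.plot(df["step"], ema(df["train/loss"]), label=f"base={base}, lr x{lr}")
ax.set_xlabel("step")
ax.set_ylabel("train/loss (EMA)")
ax.legend()
fig.savefig("train_loss_vs_step.png")

# Final-loss comparison per sweep point to pick the winner.
finals = {k: df["train/loss"].iloc[-1] for k, df in runs.items()}
winner = min(finals, key=finals.get)
```

The same loop structure covers the Paloma overlays: swap `train/loss` for `eval/paloma/c4_en/loss` and friends.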
### Authorship / blame-walk
- `scaling_plots.py` / `isoflop_analysis.py` / `eval_metrics_reader.py` / `isoflop_sweep.py` — William Held, PR #2243 "Scaling Plots & Analysis as an Executor Step".
- Delphi pipeline (`exp1337_delphi_suite.py`) — William Held, PR #3292 "Delphi Scaling Setup", plus PR #4591 "exp1337: add seed sweep".
- AdamH heuristic — William Held, PR #2447 "Beta2 gets a bit wacky with very large batch sizes...".
When in doubt on scaling/analysis decisions, run `git log --format='%an %s' -- <file>` and look for Will.