Commit 921e389

ClassicLarry and claude committed

Update moe/README.md and agent.md: compute-optimal baseline, gate structure

- README.md: split into v16 isoflop sweep (scaling law fit) and compute-optimal baseline (per-dim optimal budget runs), add tokens / tok_s / runtime columns, update promotion criteria to reference agent.md gates
- agent.md: full rewrite with gate 1 / gate 2 structure, effective speedup calculation, GraphQL sub-issue commands, macro_loss as primary metric, branch off main, include user prompt in issue body

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent c928d1c commit 921e389

File tree

2 files changed (+248, -33 lines)

experiments/grug/moe/README.md

Lines changed: 39 additions & 33 deletions
```diff
@@ -61,51 +61,57 @@ entry point — `launch.py` uses it to produce the baseline step. Callers that
 want full manual control pass `GrugModelConfig` and `GrugMoeAdamHConfig`
 directly to `GrugMoeLaunchConfig`.
 
-## v16 isoflop sweep: best runs per compute budget
+## v16 isoflop sweep
 
 From the v16 sweep (`group=isoflop-moe-v16` on wandb, project `dial_moe`).
 See [issue #4447](https://github.com/marin-community/marin/issues/4447) for
-the full sweep context, per-cell results, and extrapolation tables. Rankings
-below are by **Paloma macro loss** at the final eval step. All runs use the
-architecture described above, QB routing, shared expert, GQA 4:1, seq_len
-4096. Budget → best run:
-
-| Budget | Best dim | Layers | Paloma macro | c4_en BPB | Run |
-|--------|----------|--------|-------------|-----------|-----|
-| 1e18 | d768 | 8 | **3.5273** | 1.0658 | [isoflop-moe-v16-1e+18-d768](https://wandb.ai/marin-community/dial_moe/runs/isoflop-moe-v16-1e+18-d768) |
-| 3e18 | d768 | 8 | **3.3398** | 1.0122 | [isoflop-moe-v16-3e+18-d768](https://wandb.ai/marin-community/dial_moe/runs/isoflop-moe-v16-3e+18-d768) |
-| 1e19 | d1024 | 11 | **3.1494** | 0.9541 | [isoflop-moe-v16-1e+19-d1024](https://wandb.ai/marin-community/dial_moe/runs/isoflop-moe-v16-1e+19-d1024) |
-| 3e19 | d1536 | 16 | **3.0066** | 0.9123 | [isoflop-moe-v16-3e+19-d1536-v2](https://wandb.ai/marin-community/dial_moe/runs/isoflop-moe-v16-3e+19-d1536-v2) |
-| 1e20 | d1536 | 16 | **2.8509** | 0.8665 | [isoflop-moe-v16-1e+20-d1536-v2](https://wandb.ai/marin-community/dial_moe/runs/isoflop-moe-v16-1e+20-d1536-v2) |
-| 3e20 | d2048 | 21 | **2.7222** | 0.8289 | [isoflop-moe-v16-3e+20-d2048](https://wandb.ai/marin-community/dial_moe/runs/isoflop-moe-v16-3e+20-d2048) |
-
-Derived scaling laws (fit on 1e18–3e20 optima):
+the full sweep context, per-cell results, and extrapolation tables. All runs
+use the architecture described above, QB routing, shared expert, GQA 4:1,
+seq_len 4096. The sweep tested multiple hidden dims at each compute budget
+(1e18–3e20) to find the optimal model size per budget.
+
+Scaling laws fit on the v16 sweep optima:
 
 - `N*(C) = 1.09e-2 · C^0.535`
 - `T*(C) = 1.60e+1 · C^0.464`
-- Paloma macro: `1.6 + 95.18 · C^(-0.0941)` (irreducible L∞ pinned to 1.6)
-- c4_en BPB: `0.4814 + 25.97 · C^(-0.0915)` (free-fit asymptote)
+- Paloma macro: `1.6 + 95.18 · C^(-0.0941)` (L∞ pinned at 1.6)
 
 Projections:
 
-| Budget | Projected macro | Projected c4_en BPB |
-|--------|-----------------|---------------------|
-| 1e21 | 2.606 | 0.7923 |
-| 1e23 | 2.252 | 0.6854 |
+| Budget | Projected macro |
+|--------|-----------------|
+| 1e21 | 2.606 |
+| 1e23 | 2.252 |
+
+The **measured** 1e21 d2560-v2 run came in at macro **2.599**.
+
+## Compute-optimal baseline
+
+Using `N*(C)` from the isoflop sweep, we inverted to find the optimal compute
+budget for each hidden dim, then ran each at its predicted optimal budget. These
+are the baseline runs that ablation experiments compare against.
 
-The **measured** 1e21 d2560-v2 run came in at macro **2.599** / bpb **0.7923**.
+| Budget | Dim | Layers | Paloma macro | Tokens | v5p-8 avg tok/s | v5p-8 runtime | Run |
+|----------|----------|--------|-------------|---------|-----------------|---------------|-----|
+| 2.19e17 | d512 | 6 | **3.8104** | 8.37e8 | 405,630 | 0.6h | [moe-v16-compute-opt-d512-2.19e+17](https://wandb.ai/marin-community/dial_moe/runs/moe-v16-compute-opt-d512-2.19e+17) |
+| 1.70e18 | d768 | 8 | **3.4339** | 2.71e9 | 273,532 | 2.8h | [moe-v16-compute-opt-d768-1.70e+18](https://wandb.ai/marin-community/dial_moe/runs/moe-v16-compute-opt-d768-1.70e+18) |
+| 9.00e18 | d1024 | 11 | **3.1605** | 6.63e9 | 175,165 | 10.5h | [moe-v16-compute-opt-d1024-9.00e+18](https://wandb.ai/marin-community/dial_moe/runs/moe-v16-compute-opt-d1024-9.00e+18) |
+| 3e19 | d1536 | 16 | **3.0066** | 7.83e9 | | | [isoflop-moe-v16-3e+19-d1536-v2](https://wandb.ai/marin-community/dial_moe/runs/isoflop-moe-v16-3e+19-d1536-v2) |
 
 ## Promotion criteria
 
-Changes can be promoted to this recipe when they demonstrate:
+Changes can be promoted to this recipe when they demonstrate some combination
+of the following. Typically point 1 is sufficient.
 
-1. **Lower loss at the same runtime** on the rungs of the 1e18 – 3e20 compute
-   ladder (measured on the optima above, at the same token count / step count).
-2. **Lower projected c4_en BPB at 1e21 and 1e23 FLOPs**, using the scaling-law
-   fit above (L∞ pinned at 1.6 for Paloma macro). Re-fit the power law on the
-   candidate's ladder and compare projections head-to-head.
-3. **Low curvature around the minimum of each isoflop curve** — stable
-   behavior across under- and over-trained regimes.
+1. **Passes gate 1 and gate 2** as defined in [`agent.md`](./agent.md):
+   effective speedup > 1 at all compute-optimal baseline points, and lower
+   projected macro_loss at 1e21 and 1e23.
+2. **Low curvature around the minimum of each isoflop curve** — stable
+   behavior across under- and over-trained regimes, in particular the
+   overtrained regime.
+3. **Stability and scaling improvements** — better routing balance, controlled
+   norm growth, fewer activation outliers. Anything that makes the recipe more
+   robust to scaling, even if loss is neutral at small scale.
 
 Most promotable changes will land in one of three files:
 
```

```diff
@@ -118,8 +124,7 @@ Most promotable changes will land in one of three files:
 
 Some discretionary factors may influence the promotion decision even when the
 loss criteria are met — for example, impact on training memory footprint,
-inference latency / KV-cache size, serving compatibility, or interactions
-with unrelated in-flight work.
+inference latency / KV-cache size, serving compatibility, or interaction effects with other promotable changes.
 
 ## Files
 
```

```diff
@@ -133,3 +138,4 @@ with unrelated in-flight work.
 - [`launch.py`](./launch.py) — `GrugMoeLaunchConfig`, baseline `ExecutorStep`,
   and `executor_main` wiring.
 - [`adamh.py`](./adamh.py) — shared AdamH utilities.
+- [`agent.md`](./agent.md) — agent guide for running ablation experiments on Iris.
```
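The README's scaling-law fits can be sanity-checked numerically. A minimal sketch (coefficients copied from the README; the `C ≈ 6·N·T` cross-check is the standard dense-transformer FLOP accounting, an assumption not stated in the README itself):

```python
def n_opt(C):
    """Fitted compute-optimal parameter count: N*(C) = 1.09e-2 * C^0.535."""
    return 1.09e-2 * C ** 0.535

def t_opt(C):
    """Fitted compute-optimal token count: T*(C) = 1.60e+1 * C^0.464."""
    return 1.60e1 * C ** 0.464

def macro_loss(C):
    """Paloma macro loss fit, irreducible loss pinned at 1.6."""
    return 1.6 + 95.18 * C ** (-0.0941)

# Reproduce the Projections table.
for C in (1e21, 1e23):
    print(f"C={C:.0e}  projected macro={macro_loss(C):.3f}")

# The N* and T* fits should be roughly self-consistent under C ~ 6*N*T.
C = 1e21
print(f"6*N*T / C = {6 * n_opt(C) * t_opt(C) / C:.2f}")  # close to 1
```

Evaluating `macro_loss` at 1e21 and 1e23 recovers the projected values in the table above to three decimals.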

experiments/grug/moe/agent.md

Lines changed: 209 additions & 0 deletions

New file (`@@ -0,0 +1,209 @@`); full contents:
# Agent Guide: experiments/grug/moe

## Autonomy

This workflow is designed to run end-to-end without human confirmation. The
agent is authorized to:

- Create branches, commit, and push without asking
- Create GitHub experiment issues and post comments
- Submit Iris jobs, and kill only jobs it submitted itself
- Run experiments through both gates autonomously

Do not stop to ask for confirmation at any step. If something fails, diagnose
and retry or report the failure — do not block waiting for input.
## Objective

Determine whether a proposed change outperforms the baseline. Baseline results
are in `experiments/grug/moe/README.md` — compare against the tables there.

**Metrics (from wandb):**

- `eval/paloma/macro_loss` (final value)
- `throughput/tokens_per_second` (averaged over the last 100 steps)
- `throughput/total_tokens` (final value)

**Baseline scaling law** (L∞ pinned at 1.6):

```
loss(C) = 1.6 + 95.18 · C^(-0.0941)
```
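Reducing a run's step history to these three metrics might look like the sketch below (pure Python; `history` stands in for rows pulled from the run via `wandb.Api` — the fetch itself is omitted, and the row shape is an assumption):

```python
def summarize(history, avg_window=100):
    """Reduce per-step wandb rows to the three comparison metrics.

    `history` is a list of dicts keyed by the metric names above; rows that
    lack a key (eval rows vs. throughput rows) are skipped per metric.
    """
    tps = [r["throughput/tokens_per_second"] for r in history
           if "throughput/tokens_per_second" in r]
    loss = [r["eval/paloma/macro_loss"] for r in history
            if "eval/paloma/macro_loss" in r]
    toks = [r["throughput/total_tokens"] for r in history
            if "throughput/total_tokens" in r]
    window = tps[-avg_window:]
    return {
        "macro_loss": loss[-1],                          # final value
        "tokens_per_second": sum(window) / len(window),  # last-N average
        "total_tokens": toks[-1],                        # final value
    }
```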
### Gate 1: effective speedup at two small scales

Run the variant at `d512` (2.19e17 FLOPs) and `d768` (1.70e18 FLOPs).

For each scale, compute the **effective speedup** at a fixed macro_loss target
(use the baseline's final macro_loss at that scale as the target). The variant
passes gate 1 if it shows an effective speedup at **both** scales.
### Gate 2: scaling law projection

Run the variant at the two larger scales: `d1024` (9.00e18) and `d1280`
(2.83e19). Combine with the gate 1 results (d512, d768) for four total points.

The variant passes gate 2 if:

1. It shows an effective speedup at **all four** scales.
2. A scaling law `loss(C) = 1.6 + A · C^(-alpha)` (asymptote pinned at 1.6),
   fit on the variant's four optima and projected to 1e21 and 1e23 FLOPs,
   yields lower projected loss than the baseline at both budgets (baseline:
   2.606 at 1e21, 2.252 at 1e23).
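With the asymptote pinned, the gate 2 fit is linear in log space, so plain least squares suffices. A sketch (numpy only; the sample points are synthetic, generated from the baseline law rather than real variant runs):

```python
import numpy as np

def fit_pinned_power_law(C, loss, L_inf=1.6):
    """Fit loss(C) = L_inf + A * C^(-alpha) with the asymptote pinned.

    log(loss - L_inf) = log(A) - alpha * log(C), so a least-squares line
    in log space recovers A and alpha.
    """
    C = np.asarray(C, dtype=float)
    loss = np.asarray(loss, dtype=float)
    slope, intercept = np.polyfit(np.log(C), np.log(loss - L_inf), 1)
    return float(np.exp(intercept)), float(-slope)  # A, alpha

# Synthetic check: points lying on the baseline law are recovered.
budgets = [2.19e17, 1.70e18, 9.00e18, 2.83e19]
losses = [1.6 + 95.18 * C ** (-0.0941) for C in budgets]
A, alpha = fit_pinned_power_law(budgets, losses)

def project(C):
    return 1.6 + A * C ** (-alpha)

print(A, alpha)                      # ~95.18, ~0.0941
print(project(1e21), project(1e23))  # ~2.605, ~2.252
```

For a real gate 2 check, replace `budgets`/`losses` with the variant's four measured optima and compare the two projections against 2.606 and 2.252.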
### Effective speedup calculation

Given baseline and variant results at the same compute budget:

```python
def effective_speedup(baseline_loss, baseline_tps, variant_loss, variant_tps, budget):
    """Compute the effective wall-clock speedup at the baseline's final loss.

    Returns > 1 if the variant is faster to reach the same loss.
    """
    target_loss = baseline_loss

    # Invert the baseline scaling law: compute needed to reach target_loss.
    # loss(C) = 1.6 + 95.18 * C^(-0.0941)  =>  C = (95.18 / (loss - 1.6))^(1/0.0941)
    C_baseline = (95.18 / (target_loss - 1.6)) ** (1 / 0.0941)

    # The variant achieves variant_loss at the same budget. Assume the same
    # power-law shape shifted vertically: the variant's curve passes through
    # (budget, variant_loss) with the same exponent.
    # variant_loss = 1.6 + A_var * budget^(-0.0941)
    A_var = (variant_loss - 1.6) / budget ** (-0.0941)
    C_variant = (A_var / (target_loss - 1.6)) ** (1 / 0.0941)

    # C / tokens-per-second is proportional to wall-clock time as long as
    # FLOPs-per-token is comparable between the two runs; only the ratio
    # matters here.
    wall_baseline = C_baseline / baseline_tps
    wall_variant = C_variant / variant_tps
    return wall_baseline / wall_variant
```
### Example: effective speedup at a fixed loss target

Suppose at d768 / 1.70e18 FLOPs:

- **Baseline**: macro_loss = 3.43, tok/s = 200,000
- **Variant A**: macro_loss = 3.40, tok/s = 180,000 (better loss, 10% slower)

To reach macro_loss = 3.43 (the baseline's final loss), how much compute does
each method need?

```python
# Invert the scaling law: C(L) = (95.18 / (L - 1.6))^(1/0.0941)
target_loss = 3.43
C_baseline = (95.18 / (target_loss - 1.6)) ** (1 / 0.0941)  # ≈ 1.7e18

# Variant A reaches 3.40 at 1.70e18 FLOPs. It would have hit 3.43 at some
# smaller C. Assume the same scaling-law shape, shifted by the improvement:
# variant_loss = 1.6 + A_var * budget^(-0.0941)
A_var = (3.40 - 1.6) / (1.70e18) ** (-0.0941)
C_variant = (A_var / (target_loss - 1.6)) ** (1 / 0.0941)
```

But compute alone isn't wall-clock time — variant A is 10% slower per step.
The wall-clock to reach the target is proportional to `C / tok_per_sec`:

```python
wall_baseline = C_baseline / 200_000
wall_variant = C_variant / 180_000
speedup = wall_baseline / wall_variant
```

If `speedup > 1`, variant A reaches the target loss faster in real time despite
being slower per step. Report this as "X% effective speedup (or slowdown) at
macro_loss = Y". This is the key number for deciding whether to promote a
change.
## Implementation

Most promotable changes will land in one of three files:

- `model.py` — architecture tweaks (routing, norms, attention, activation functions, expert layout, etc.).
- `heuristic.py` — scaling heuristics (LR formula coefficients, depth/width formula, GQA ratio, per-batch-size epsilon/beta2 scaling).
- `optimizer.py` — optimizer internals (AdamH components, parameter-group partitioning, per-group learning rates, weight decay).
## Documentation & GitHub Issues

Create a new branch for each experiment issue. Branch off `main`.

Follow `.agents/skills/agent-research/SKILL.md` for all documentation, logbooks,
W&B tracking, and GitHub experiment issue management tied to work in this
directory. Read that file carefully.

Experiment issues should be titled `Agent MoE Experiment: [description]`.
Include the exact prompt from the user that initiated the experiment in the
issue body.

After creating the issue, **add it as a sub-issue of #4281** (April 2026 MoE
scaling tracking issue) using the GitHub GraphQL API. This is required — do
not skip it. First get the node IDs, then call `addSubIssue`:

```bash
# 1. Get node IDs for the parent and the new issue
gh api graphql -f query='
query {
  repository(owner: "marin-community", name: "marin") {
    parent: issue(number: 4281) { id }
    child: issue(number: <NEW_ISSUE_NUMBER>) { id }
  }
}'

# 2. Add the sub-issue relationship
gh api graphql -f query='
mutation {
  addSubIssue(input: {issueId: "<PARENT_ID>", subIssueId: "<CHILD_ID>"}) {
    issue { number }
    subIssue { number }
  }
}'
```
## Authentication

Assume the user has already completed these before job submission:

- `WANDB_API_KEY` set in the environment
- `gcloud auth login` and `gcloud auth application-default login`
## Job Submission

Jobs in this directory are submitted to **Iris** on a **v5p-8**.

### Submission command

```bash
.venv/bin/iris --config lib/iris/examples/marin.yaml job run \
  --no-wait \
  --reserve v5p-8 \
  -e WANDB_API_KEY "$WANDB_API_KEY" \
  -- python -m experiments.grug.moe.launch
```

Swap the module path (`experiments.grug.moe.launch`) for whichever launch
script in this directory you are running.
### Monitoring

Runs may take time to find a TPU, and 5–10 minutes to start once scheduled.
After confirming the run is progressing on wandb, jobs typically take over an
hour to complete. Sleep at reasonable intervals (e.g. 15 minutes) before
checking status — do not poll in a tight loop.

Reconnect to logs:
```bash
.venv/bin/iris --config lib/iris/examples/marin.yaml job logs -f JOB_ID
```

List your jobs:
```bash
.venv/bin/iris --config lib/iris/examples/marin.yaml job list | grep "$(whoami)"
```

Check runs in wandb (match `<PROJECT>` and `<PREFIX>` to `launch.py`):
```python
import wandb

api = wandb.Api()
runs = api.runs(
    "marin-community/<PROJECT>",
    filters={"displayName": {"$regex": "^<PREFIX>"}},
    order="-created_at",
)
for r in runs:
    print(f'{r.name:<50} state={r.state:<10} step={r.summary.get("global_step", "n/a")}')
```
