# Agent Guide: experiments/grug/moe

## Autonomy

This workflow is designed to run end-to-end without human confirmation. The
agent is authorized to:

- Create branches, commit, and push without asking
- Create GitHub experiment issues and post comments
- Submit Iris jobs and kill only jobs submitted by self
- Run experiments through both gates autonomously

Do not stop to ask for confirmation at any step. If something fails, diagnose
and retry or report the failure — do not block waiting for input.

## Objective

Determine whether a proposed change outperforms the baseline. Baseline results
are in `experiments/grug/moe/README.md` — compare against the table there.

**Metrics (from wandb):**
- `eval/paloma/macro_loss` (final value)
- `throughput/tokens_per_second` (averaged over the last 100 steps)
- `throughput/total_tokens` (final value)
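
A minimal sketch of reducing a run's history to these three numbers. It assumes
history rows are dicts like those yielded by wandb's `run.scan_history()`; the
helper names here are illustrative, not part of any existing module:

```python
def last_value(history, key):
    """Last non-null logged value for `key` (metrics log at different cadences)."""
    for row in reversed(history):
        if row.get(key) is not None:
            return row[key]
    raise KeyError(key)

def extract_metrics(history):
    """Reduce a list of wandb history rows to the three report metrics.

    Note: this averages throughput over the last 100 logged rows, which only
    equals "last 100 steps" if throughput is logged every step.
    """
    tps = [row["throughput/tokens_per_second"] for row in history[-100:]
           if row.get("throughput/tokens_per_second") is not None]
    return {
        "macro_loss": last_value(history, "eval/paloma/macro_loss"),
        "mean_tps": sum(tps) / len(tps),
        "total_tokens": last_value(history, "throughput/total_tokens"),
    }
```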

**Baseline scaling law** (L∞ pinned at 1.6):

```
loss(C) = 1.6 + 95.18 · C^(-0.0941)
```
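
For quick sanity checks, the law and its inverse can be written out directly.
A sketch with the constants copied from the formula above (`baseline_loss` and
`baseline_compute` are throwaway helper names, not existing code):

```python
A, ALPHA, L_INF = 95.18, 0.0941, 1.6

def baseline_loss(C):
    """Predicted final macro_loss at compute budget C (FLOPs)."""
    return L_INF + A * C ** -ALPHA

def baseline_compute(loss):
    """Inverse: FLOPs the baseline needs to reach a given macro_loss."""
    return (A / (loss - L_INF)) ** (1 / ALPHA)
```

For example, `baseline_loss(1.70e18)` comes out near 3.43, consistent with the
worked d768 example below.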
| 31 | + |
### Gate 1: effective speedup at two small scales

Run the variant at `d512` (2.19e17 FLOPs) and `d768` (1.70e18 FLOPs).

For each scale, compute the **effective speedup** at a fixed macro_loss target
(use the baseline's final macro_loss at that scale as the target). The variant
passes gate 1 if it shows an effective speedup at **both** scales.

### Gate 2: scaling law projection

Run the variant at the two larger scales: `d1024` (9.00e18) and `d1280`
(2.83e19). Combine with the gate 1 results (d512, d768) for four total points.

The variant passes gate 2 if:
1. It shows an effective speedup at **all four** scales.
2. A new scaling law `loss(C) = 1.6 + A · C^(-alpha)` (asymptote pinned at
   1.6), fit on the variant's four optima and projected to 1e21 and 1e23
   FLOPs, gives a lower projected loss than the baseline's at both budgets
   (baseline: 2.606 at 1e21, 2.252 at 1e23).
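
Because the asymptote is pinned, the fit is linear in log space and needs
nothing beyond a degree-one polyfit. A sketch, assuming the four
(FLOPs, final macro_loss) optima are already collected (function names are
illustrative):

```python
import numpy as np

def fit_pinned(flops, losses, l_inf=1.6):
    """Fit loss(C) = l_inf + A * C**(-alpha) with the asymptote pinned.

    log(loss - l_inf) = log(A) - alpha * log(C), i.e. a line in log space.
    """
    slope, intercept = np.polyfit(np.log(flops), np.log(np.asarray(losses) - l_inf), 1)
    return float(np.exp(intercept)), float(-slope)  # A, alpha

def project(A, alpha, flops, l_inf=1.6):
    """Projected loss at a larger compute budget."""
    return l_inf + A * flops ** -alpha
```

With the baseline constants, `project(95.18, 0.0941, 1e21)` and
`project(95.18, 0.0941, 1e23)` reproduce the 2.606 and 2.252 reference values
above.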
| 51 | + |
### Effective speedup calculation

Given baseline and variant results at the same compute budget:

```python
def effective_speedup(baseline_loss, baseline_tps, variant_loss, variant_tps, budget):
    """Compute effective wall-clock speedup at the baseline's final loss.

    Returns > 1 if the variant is faster to reach the same loss.
    """
    target_loss = baseline_loss

    # Invert the scaling law: C needed to reach target_loss
    # loss(C) = 1.6 + 95.18 * C^(-0.0941)  =>  C = (95.18 / (loss - 1.6))^(1/0.0941)
    C_baseline = (95.18 / (target_loss - 1.6)) ** (1 / 0.0941)

    # The variant achieves variant_loss at the same budget. Fit the same
    # power-law shape shifted vertically: the variant's curve passes through
    # (budget, variant_loss) with the same exponent.
    # variant_loss = 1.6 + A_var * budget^(-0.0941)
    A_var = (variant_loss - 1.6) / budget ** (-0.0941)
    C_variant = (A_var / (target_loss - 1.6)) ** (1 / 0.0941)

    # Wall-clock = compute / throughput
    wall_baseline = C_baseline / baseline_tps
    wall_variant = C_variant / variant_tps
    return wall_baseline / wall_variant
```

### Example: effective speedup at a fixed loss target

Suppose at d768 / 1.70e18 FLOPs:
- **Baseline**: macro_loss = 3.43, tok/s = 200,000
- **Variant A**: macro_loss = 3.40, tok/s = 180,000 (better loss, 10% slower)

To reach macro_loss = 3.43 (the baseline's final loss), how much compute does
each method need?

```python
# Invert the scaling law: C(L) = (95.18 / (L - 1.6))^(1/0.0941)
target_loss = 3.43
C_baseline = (95.18 / (target_loss - 1.6)) ** (1 / 0.0941)  # ≈ 1.7e18

# Variant A reaches 3.40 at 1.70e18 FLOPs. It would have hit 3.43 at some
# smaller C. Assume the same scaling law shape, shifted by the improvement:
# variant_loss = 1.6 + A_var * budget^(-0.0941)
A_var = (3.40 - 1.6) / (1.70e18) ** (-0.0941)
C_variant = (A_var / (target_loss - 1.6)) ** (1 / 0.0941)
```

But compute alone isn't wall-clock time — variant A is 10% slower per step.
The wall-clock to reach the target is `C / tok_per_sec`:

```python
wall_baseline = C_baseline / 200_000
wall_variant = C_variant / 180_000
speedup = wall_baseline / wall_variant
```

If `speedup > 1`, variant A reaches the target loss faster in real time despite
being slower per step. Report this as "X% effective speedup (or slowdown) at
macro_loss = Y". This is the key number for deciding whether to promote a
change.
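
Putting the pieces together in one runnable snippet (same arithmetic as the
snippets above, numbers from the bullets; for these inputs the result lands
near a 9% effective speedup):

```python
target, budget = 3.43, 1.70e18

# Compute each method needs to reach the target loss
C_baseline = (95.18 / (target - 1.6)) ** (1 / 0.0941)
A_var = (3.40 - 1.6) / budget ** (-0.0941)
C_variant = (A_var / (target - 1.6)) ** (1 / 0.0941)

# Wall-clock ratio, accounting for the variant's lower throughput
speedup = (C_baseline / 200_000) / (C_variant / 180_000)
print(f"{(speedup - 1) * 100:+.1f}% effective speedup at macro_loss = {target}")
```

So variant A would count as an effective speedup at this scale despite being
10% slower per step.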

## Implementation

Most promotable changes will land in one of three files:

- `model.py` — architecture tweaks (routing, norms, attention, activation functions, expert layout, etc.).
- `heuristic.py` — scaling heuristics (LR formula coefficients, depth/width formula, GQA ratio, per-batch-size epsilon/beta2 scaling).
- `optimizer.py` — optimizer internals (AdamH components, parameter-group partitioning, per-group learning rates, weight decay).

## Documentation & GitHub Issues

Create a new branch for each experiment issue. Branch off `main`.

Follow `.agents/skills/agent-research/SKILL.md` for all documentation, logbooks,
W&B tracking, and GitHub experiment issue management tied to work in this
directory. Read that file carefully.

Experiment issues should be titled `Agent MoE Experiment: [description]`.
Include the exact prompt from the user that initiated the experiment in the
issue body.

After creating the issue, **add it as a sub-issue of #4281** (April 2026 MoE
scaling tracking issue) using the GitHub GraphQL API. This is required — do
not skip it. First get the node IDs, then call `addSubIssue`:

```bash
# 1. Get node IDs for the parent and the new issue
gh api graphql -f query='
query {
  repository(owner: "marin-community", name: "marin") {
    parent: issue(number: 4281) { id }
    child: issue(number: <NEW_ISSUE_NUMBER>) { id }
  }
}'

# 2. Add the sub-issue relationship
gh api graphql -f query='
mutation {
  addSubIssue(input: {issueId: "<PARENT_ID>", subIssueId: "<CHILD_ID>"}) {
    issue { number }
    subIssue { number }
  }
}'
```

## Authentication

Assume the user has already completed these before job submission:
- `WANDB_API_KEY` set in the environment
- `gcloud auth login` and `gcloud auth application-default login`

## Job Submission

Jobs in this directory are submitted to **Iris** on a **v5p-8**.

### Submission command

```bash
.venv/bin/iris --config lib/iris/examples/marin.yaml job run \
  --no-wait \
  --reserve v5p-8 \
  -e WANDB_API_KEY "$WANDB_API_KEY" \
  -- python -m experiments.grug.moe.launch
```

Swap the module path (`experiments.grug.moe.launch`) for whichever launch
script in this directory you are running.

### Monitoring

Runs may take time to find a TPU, and 5–10 minutes to start once scheduled.
After confirming the run is progressing on wandb, jobs typically take over an
hour to complete. Sleep at reasonable intervals (e.g. 15 minutes) before
checking status — do not poll in a tight loop.

Reconnect to logs:
```bash
.venv/bin/iris --config lib/iris/examples/marin.yaml job logs -f JOB_ID
```

List your jobs:
```bash
.venv/bin/iris --config lib/iris/examples/marin.yaml job list | grep "$(whoami)"
```

Check runs in wandb (match `<PROJECT>` and `<PREFIX>` to `launch.py`):
```python
import wandb

api = wandb.Api()
runs = api.runs(
    'marin-community/<PROJECT>',
    filters={'displayName': {'$regex': '^<PREFIX>'}},
    order='-created_at',
)
for r in runs:
    print(f'{r.name:<50} state={r.state:<10} step={r.summary.get("global_step", "n/a")}')
```