Commit 921e389

ClassicLarry and claude committed

Update moe/README.md and agent.md: compute-optimal baseline, gate structure

- README.md: split into v16 isoflop sweep (scaling law fit) and compute-optimal baseline (per-dim optimal budget runs), add tokens / tok_s / runtime columns, update promotion criteria to reference agent.md gates
- agent.md: full rewrite with gate 1 / gate 2 structure, effective speedup calculation, GraphQL sub-issue commands, macro_loss as primary metric, branch off main, include user prompt in issue body

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent c928d1c commit 921e389

File tree

2 files changed (+248, -33 lines)

experiments/grug/moe/README.md

Lines changed: 39 additions & 33 deletions
```diff
@@ -61,51 +61,57 @@ entry point — `launch.py` uses it to produce the baseline step. Callers that
 want full manual control pass `GrugModelConfig` and `GrugMoeAdamHConfig`
 directly to `GrugMoeLaunchConfig`.
 
-## v16 isoflop sweep: best runs per compute budget
+## v16 isoflop sweep
 
 From the v16 sweep (`group=isoflop-moe-v16` on wandb, project `dial_moe`).
 See [issue #4447](https://github.com/marin-community/marin/issues/4447) for
-the full sweep context, per-cell results, and extrapolation tables. Rankings
-below are by **Paloma macro loss** at the final eval step. All runs use the
-architecture described above, QB routing, shared expert, GQA 4:1, seq_len
-4096. Budget → best run:
-
-| Budget | Best dim | Layers | Paloma macro | c4_en BPB | Run |
-|--------|----------|--------|-------------|-----------|-----|
-| 1e18 | d768 | 8 | **3.5273** | 1.0658 | [isoflop-moe-v16-1e+18-d768](https://wandb.ai/marin-community/dial_moe/runs/isoflop-moe-v16-1e+18-d768) |
-| 3e18 | d768 | 8 | **3.3398** | 1.0122 | [isoflop-moe-v16-3e+18-d768](https://wandb.ai/marin-community/dial_moe/runs/isoflop-moe-v16-3e+18-d768) |
-| 1e19 | d1024 | 11 | **3.1494** | 0.9541 | [isoflop-moe-v16-1e+19-d1024](https://wandb.ai/marin-community/dial_moe/runs/isoflop-moe-v16-1e+19-d1024) |
-| 3e19 | d1536 | 16 | **3.0066** | 0.9123 | [isoflop-moe-v16-3e+19-d1536-v2](https://wandb.ai/marin-community/dial_moe/runs/isoflop-moe-v16-3e+19-d1536-v2) |
-| 1e20 | d1536 | 16 | **2.8509** | 0.8665 | [isoflop-moe-v16-1e+20-d1536-v2](https://wandb.ai/marin-community/dial_moe/runs/isoflop-moe-v16-1e+20-d1536-v2) |
-| 3e20 | d2048 | 21 | **2.7222** | 0.8289 | [isoflop-moe-v16-3e+20-d2048](https://wandb.ai/marin-community/dial_moe/runs/isoflop-moe-v16-3e+20-d2048) |
-
-Derived scaling laws (fit on 1e18–3e20 optima):
+the full sweep context, per-cell results, and extrapolation tables. All runs
+use the architecture described above, QB routing, shared expert, GQA 4:1,
+seq_len 4096. The sweep tested multiple hidden dims at each compute budget
+(1e18–3e20) to find the optimal model size per budget.
+
+Scaling laws fit on the v16 sweep optima:
 
 - `N*(C) = 1.09e-2 · C^0.535`
 - `T*(C) = 1.60e+1 · C^0.464`
-- Paloma macro: `1.6 + 95.18 · C^(-0.0941)` (irreducible L∞ pinned to 1.6)
-- c4_en BPB: `0.4814 + 25.97 · C^(-0.0915)` (free-fit asymptote)
+- Paloma macro: `1.6 + 95.18 · C^(-0.0941)` (L∞ pinned at 1.6)
 
 Projections:
 
-| Budget | Projected macro | Projected c4_en BPB |
-|--------|-----------------|---------------------|
-| 1e21 | 2.606 | 0.7923 |
-| 1e23 | 2.252 | 0.6854 |
+| Budget | Projected macro |
+|--------|-----------------|
+| 1e21 | 2.606 |
+| 1e23 | 2.252 |
+
+The **measured** 1e21 d2560-v2 run came in at macro **2.599**.
+
+## Compute-optimal baseline
+
+Using `N*(C)` from the isoflop sweep, we inverted to find the optimal compute
+budget for each hidden dim, then ran each at its predicted optimal budget. These
+are the baseline runs that ablation experiments compare against.
 
-The **measured** 1e21 d2560-v2 run came in at macro **2.599** / bpb **0.7923**.
+| Budget | Dim | Layers | Paloma macro | Tokens | v5p-8 avg tok/s | v5p-8 runtime | Run |
+|----------|----------|--------|-------------|---------|-----------------|---------------|-----|
+| 2.19e17 | d512 | 6 | **3.8104** | 8.37e8 | 405,630 | 0.6h | [moe-v16-compute-opt-d512-2.19e+17](https://wandb.ai/marin-community/dial_moe/runs/moe-v16-compute-opt-d512-2.19e+17) |
+| 1.70e18 | d768 | 8 | **3.4339** | 2.71e9 | 273,532 | 2.8h | [moe-v16-compute-opt-d768-1.70e+18](https://wandb.ai/marin-community/dial_moe/runs/moe-v16-compute-opt-d768-1.70e+18) |
+| 9.00e18 | d1024 | 11 | **3.1605** | 6.63e9 | 175,165 | 10.5h | [moe-v16-compute-opt-d1024-9.00e+18](https://wandb.ai/marin-community/dial_moe/runs/moe-v16-compute-opt-d1024-9.00e+18) |
+| 3e19 | d1536 | 16 | **3.0066** | 7.83e9 | | | [isoflop-moe-v16-3e+19-d1536-v2](https://wandb.ai/marin-community/dial_moe/runs/isoflop-moe-v16-3e+19-d1536-v2) |
 
 ## Promotion criteria
 
-Changes can be promoted to this recipe when they demonstrate:
+Changes can be promoted to this recipe when they demonstrate some combination
+of the following. Typically point 1 is sufficient.
 
-1. **Lower loss at the same runtime** on the rungs of the 1e18 – 3e20 compute
-   ladder (measured on the optima above, at the same token count / step count).
-2. **Lower projected c4_en BPB at 1e21 and 1e23 FLOPs**, using the scaling-law
-   fit above (L∞ pinned at 1.6 for Paloma macro). Re-fit the power law on the
-   candidate's ladder and compare projections head-to-head.
-3. **Low curvature around the minimum of each isoflop curve** — stable
-   behavior across under- and over-trained regimes.
+1. **Passes gate 1 and gate 2** as defined in [`agent.md`](./agent.md):
+   effective speedup > 1 at all compute-optimal baseline points, and lower
+   projected macro_loss at 1e21 and 1e23.
+2. **Low curvature around the minimum of each isoflop curve** — stable
+   behavior across under- and over-trained regimes, in particular the
+   overtrained regime.
+3. **Stability and scaling improvements** — better routing balance, controlled
+   norm growth, fewer activation outliers. Anything that makes the recipe more
+   robust to scaling, even if loss is neutral at small scale.
 
 Most promotable changes will land in one of three files:
 
```

```diff
@@ -118,8 +124,7 @@ Most promotable changes will land in one of three files:
 
 Some discretionary factors may influence the promotion decision even when the
 loss criteria are met — for example, impact on training memory footprint,
-inference latency / KV-cache size, serving compatibility, or interactions
-with unrelated in-flight work.
+inference latency / KV-cache size, serving compatibility, or interaction effects with other promotable changes.
 
 ## Files
 
```

```diff
@@ -133,3 +138,4 @@ with unrelated in-flight work.
 - [`launch.py`](./launch.py) — `GrugMoeLaunchConfig`, baseline `ExecutorStep`,
   and `executor_main` wiring.
 - [`adamh.py`](./adamh.py) — shared AdamH utilities.
+- [`agent.md`](./agent.md) — agent guide for running ablation experiments on Iris.
```
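The README's scaling-law fits can be sanity-checked numerically. A minimal sketch (coefficients copied from the README; the `C ≈ 6·N·T` cross-check is the standard dense-transformer FLOP accounting, an assumption not stated in the README itself):

```python
def n_opt(C):
    """Fitted compute-optimal parameter count: N*(C) = 1.09e-2 * C^0.535."""
    return 1.09e-2 * C ** 0.535

def t_opt(C):
    """Fitted compute-optimal token count: T*(C) = 1.60e+1 * C^0.464."""
    return 1.60e1 * C ** 0.464

def macro_loss(C):
    """Paloma macro loss fit, irreducible loss pinned at 1.6."""
    return 1.6 + 95.18 * C ** (-0.0941)

# Reproduce the Projections table.
for C in (1e21, 1e23):
    print(f"C={C:.0e}  projected macro={macro_loss(C):.3f}")

# The N* and T* fits should be roughly self-consistent under C ~ 6*N*T.
C = 1e21
print(f"6*N*T / C = {6 * n_opt(C) * t_opt(C) / C:.2f}")  # close to 1
```

Evaluating `macro_loss` at 1e21 and 1e23 recovers the projected values in the table above to three decimals.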

experiments/grug/moe/agent.md

Lines changed: 209 additions & 0 deletions

New file (`@@ -0,0 +1,209 @@`); full contents:
# Agent Guide: experiments/grug/moe

## Autonomy

This workflow is designed to run end-to-end without human confirmation. The
agent is authorized to:

- Create branches, commit, and push without asking
- Create GitHub experiment issues and post comments
- Submit Iris jobs, and kill only jobs it submitted itself
- Run experiments through both gates autonomously

Do not stop to ask for confirmation at any step. If something fails, diagnose
and retry or report the failure — do not block waiting for input.
## Objective

Determine whether a proposed change outperforms the baseline. Baseline results
are in `experiments/grug/moe/README.md` — compare against the tables there.

**Metrics (from wandb):**

- `eval/paloma/macro_loss` (final value)
- `throughput/tokens_per_second` (averaged over the last 100 steps)
- `throughput/total_tokens` (final value)

**Baseline scaling law** (L∞ pinned at 1.6):

```
loss(C) = 1.6 + 95.18 · C^(-0.0941)
```
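Reducing a run's step history to these three metrics might look like the sketch below (pure Python; `history` stands in for rows pulled from the run via `wandb.Api` — the fetch itself is omitted, and the row shape is an assumption):

```python
def summarize(history, avg_window=100):
    """Reduce per-step wandb rows to the three comparison metrics.

    `history` is a list of dicts keyed by the metric names above; rows that
    lack a key (eval rows vs. throughput rows) are skipped per metric.
    """
    tps = [r["throughput/tokens_per_second"] for r in history
           if "throughput/tokens_per_second" in r]
    loss = [r["eval/paloma/macro_loss"] for r in history
            if "eval/paloma/macro_loss" in r]
    toks = [r["throughput/total_tokens"] for r in history
            if "throughput/total_tokens" in r]
    window = tps[-avg_window:]
    return {
        "macro_loss": loss[-1],                          # final value
        "tokens_per_second": sum(window) / len(window),  # last-N average
        "total_tokens": toks[-1],                        # final value
    }
```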
### Gate 1: effective speedup at two small scales

Run the variant at `d512` (2.19e17 FLOPs) and `d768` (1.70e18 FLOPs).

For each scale, compute the **effective speedup** at a fixed macro_loss target
(use the baseline's final macro_loss at that scale as the target). The variant
passes gate 1 if it shows an effective speedup at **both** scales.
### Gate 2: scaling law projection

Run the variant at the two larger scales: `d1024` (9.00e18) and `d1280`
(2.83e19). Combine with the gate 1 results (d512, d768) for four total points.

The variant passes gate 2 if:

1. It shows an effective speedup at **all four** scales.
2. A scaling law `loss(C) = 1.6 + A · C^(-alpha)` (asymptote pinned at 1.6),
   fit on the variant's four optima and projected to 1e21 and 1e23 FLOPs,
   yields lower projected loss than the baseline at both budgets (baseline:
   2.606 at 1e21, 2.252 at 1e23).
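With the asymptote pinned, the gate 2 fit is linear in log space, so plain least squares suffices. A sketch (numpy only; the sample points are synthetic, generated from the baseline law rather than real variant runs):

```python
import numpy as np

def fit_pinned_power_law(C, loss, L_inf=1.6):
    """Fit loss(C) = L_inf + A * C^(-alpha) with the asymptote pinned.

    log(loss - L_inf) = log(A) - alpha * log(C), so a least-squares line
    in log space recovers A and alpha.
    """
    C = np.asarray(C, dtype=float)
    loss = np.asarray(loss, dtype=float)
    slope, intercept = np.polyfit(np.log(C), np.log(loss - L_inf), 1)
    return float(np.exp(intercept)), float(-slope)  # A, alpha

# Synthetic check: points lying on the baseline law are recovered.
budgets = [2.19e17, 1.70e18, 9.00e18, 2.83e19]
losses = [1.6 + 95.18 * C ** (-0.0941) for C in budgets]
A, alpha = fit_pinned_power_law(budgets, losses)

def project(C):
    return 1.6 + A * C ** (-alpha)

print(A, alpha)                      # ~95.18, ~0.0941
print(project(1e21), project(1e23))  # ~2.605, ~2.252
```

For a real gate 2 check, replace `budgets`/`losses` with the variant's four measured optima and compare the two projections against 2.606 and 2.252.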
### Effective speedup calculation

Given baseline and variant results at the same compute budget:

```python
def effective_speedup(baseline_loss, baseline_tps, variant_loss, variant_tps, budget):
    """Compute the effective wall-clock speedup at the baseline's final loss.

    Returns > 1 if the variant is faster to reach the same loss.
    """
    target_loss = baseline_loss

    # Invert the baseline scaling law: compute needed to reach target_loss.
    # loss(C) = 1.6 + 95.18 * C^(-0.0941)  =>  C = (95.18 / (loss - 1.6))^(1/0.0941)
    C_baseline = (95.18 / (target_loss - 1.6)) ** (1 / 0.0941)

    # The variant achieves variant_loss at the same budget. Assume the same
    # power-law shape shifted vertically: the variant's curve passes through
    # (budget, variant_loss) with the same exponent.
    # variant_loss = 1.6 + A_var * budget^(-0.0941)
    A_var = (variant_loss - 1.6) / budget ** (-0.0941)
    C_variant = (A_var / (target_loss - 1.6)) ** (1 / 0.0941)

    # C / tokens-per-second is proportional to wall-clock time as long as
    # FLOPs-per-token is comparable between the two runs; only the ratio
    # matters here.
    wall_baseline = C_baseline / baseline_tps
    wall_variant = C_variant / variant_tps
    return wall_baseline / wall_variant
```
### Example: effective speedup at a fixed loss target

Suppose at d768 / 1.70e18 FLOPs:

- **Baseline**: macro_loss = 3.43, tok/s = 200,000
- **Variant A**: macro_loss = 3.40, tok/s = 180,000 (better loss, 10% slower)

To reach macro_loss = 3.43 (the baseline's final loss), how much compute does
each method need?

```python
# Invert the scaling law: C(L) = (95.18 / (L - 1.6))^(1/0.0941)
target_loss = 3.43
C_baseline = (95.18 / (target_loss - 1.6)) ** (1 / 0.0941)  # ≈ 1.7e18

# Variant A reaches 3.40 at 1.70e18 FLOPs. It would have hit 3.43 at some
# smaller C. Assume the same scaling-law shape, shifted by the improvement:
# variant_loss = 1.6 + A_var * budget^(-0.0941)
A_var = (3.40 - 1.6) / (1.70e18) ** (-0.0941)
C_variant = (A_var / (target_loss - 1.6)) ** (1 / 0.0941)
```

But compute alone isn't wall-clock time — variant A is 10% slower per step.
The wall-clock to reach the target is proportional to `C / tok_per_sec`:

```python
wall_baseline = C_baseline / 200_000
wall_variant = C_variant / 180_000
speedup = wall_baseline / wall_variant
```

If `speedup > 1`, variant A reaches the target loss faster in real time despite
being slower per step. Report this as "X% effective speedup (or slowdown) at
macro_loss = Y". This is the key number for deciding whether to promote a
change.
## Implementation

Most promotable changes will land in one of three files:

- `model.py` — architecture tweaks (routing, norms, attention, activation functions, expert layout, etc.).
- `heuristic.py` — scaling heuristics (LR formula coefficients, depth/width formula, GQA ratio, per-batch-size epsilon/beta2 scaling).
- `optimizer.py` — optimizer internals (AdamH components, parameter-group partitioning, per-group learning rates, weight decay).
## Documentation & GitHub Issues

Create a new branch for each experiment issue. Branch off `main`.

Follow `.agents/skills/agent-research/SKILL.md` for all documentation, logbooks,
W&B tracking, and GitHub experiment issue management tied to work in this
directory. Read that file carefully.

Experiment issues should be titled `Agent MoE Experiment: [description]`.
Include the exact prompt from the user that initiated the experiment in the
issue body.

After creating the issue, **add it as a sub-issue of #4281** (April 2026 MoE
scaling tracking issue) using the GitHub GraphQL API. This is required — do
not skip it. First get the node IDs, then call `addSubIssue`:

```bash
# 1. Get node IDs for the parent and the new issue
gh api graphql -f query='
query {
  repository(owner: "marin-community", name: "marin") {
    parent: issue(number: 4281) { id }
    child: issue(number: <NEW_ISSUE_NUMBER>) { id }
  }
}'

# 2. Add the sub-issue relationship
gh api graphql -f query='
mutation {
  addSubIssue(input: {issueId: "<PARENT_ID>", subIssueId: "<CHILD_ID>"}) {
    issue { number }
    subIssue { number }
  }
}'
```
## Authentication

Assume the user has already completed these before job submission:

- `WANDB_API_KEY` set in the environment
- `gcloud auth login` and `gcloud auth application-default login`
## Job Submission

Jobs in this directory are submitted to **Iris** on a **v5p-8**.

### Submission command

```bash
.venv/bin/iris --config lib/iris/examples/marin.yaml job run \
  --no-wait \
  --reserve v5p-8 \
  -e WANDB_API_KEY "$WANDB_API_KEY" \
  -- python -m experiments.grug.moe.launch
```

Swap the module path (`experiments.grug.moe.launch`) for whichever launch
script in this directory you are running.
### Monitoring

Runs may take time to find a TPU, and 5–10 minutes to start once scheduled.
After confirming the run is progressing on wandb, jobs typically take over an
hour to complete. Sleep at reasonable intervals (e.g. 15 minutes) before
checking status — do not poll in a tight loop.

Reconnect to logs:
```bash
.venv/bin/iris --config lib/iris/examples/marin.yaml job logs -f JOB_ID
```

List your jobs:
```bash
.venv/bin/iris --config lib/iris/examples/marin.yaml job list | grep "$(whoami)"
```

Check runs in wandb (match `<PROJECT>` and `<PREFIX>` to `launch.py`):
```python
import wandb

api = wandb.Api()
runs = api.runs(
    "marin-community/<PROJECT>",
    filters={"displayName": {"$regex": "^<PREFIX>"}},
    order="-created_at",
)
for r in runs:
    print(f'{r.name:<50} state={r.state:<10} step={r.summary.get("global_step", "n/a")}')
```
