31 commits:

- 6b7db1e Add MoE depth MuP LR sweep (Apr 25, 2026)
- 6e76368 Clarify MoE Iris submission workflow (Apr 25, 2026)
- 06f3223 Record MoE Iris submission blocker (Apr 25, 2026)
- 94fb32a Clarify MoE Iris entrypoint resources (Apr 25, 2026)
- 99bd5b7 Record MoE depth MuP Iris submission (Apr 25, 2026)
- e51255b Record MoE depth MuP startup check (Apr 25, 2026)
- 33936f8 Record MoE depth MuP first progress (Apr 25, 2026)
- 9caeffc Record MoE depth MuP d512 results (Apr 25, 2026)
- 72f9a43 Record MoE depth MuP d768 early eval (Apr 25, 2026)
- 095efa6 Record MoE depth MuP d512 completion (Apr 25, 2026)
- 2f1ad40 Record MoE depth MuP d768 partial completion (Apr 25, 2026)
- 45b2c8a Record MoE depth MuP d1024 early eval (Apr 26, 2026)
- 2fe66ba Record MoE depth MuP d1024 second eval (Apr 26, 2026)
- 42d0861 Record MoE depth MuP d1024 third eval (Apr 26, 2026)
- 117f0f4 Record MoE depth MuP recovery resubmission (Apr 26, 2026)
- 67a6f60 Record MoE depth MuP recovery heartbeat (Apr 26, 2026)
- 1cfd049 Record MoE depth MuP d768 completion (Apr 26, 2026)
- 511e976 Record MoE depth MuP d1024 checkpoint (Apr 26, 2026)
- 7004c30 Record MoE depth MuP d1024 5k checkpoint (Apr 26, 2026)
- 79d65d4 Record MoE depth MuP d1024 6k checkpoint (Apr 26, 2026)
- 1f63b74 Record MoE depth MuP d1024 7k checkpoint (Apr 26, 2026)
- 0e3d8b6 Record MoE depth MuP d1024 8k checkpoint (Apr 26, 2026)
- c8b39cb Record MoE depth MuP d1024 9k checkpoint (Apr 26, 2026)
- 05768c5 Record MoE depth MuP d1024 10k checkpoint (Apr 26, 2026)
- 734a666 Record MoE depth MuP d1024 11k checkpoint (Apr 26, 2026)
- aa63435 Record MoE depth MuP d1024 12k checkpoint (Apr 26, 2026)
- c6989db Record MoE depth MuP central d1024 finals (Apr 26, 2026)
- dfa4f4b Record MoE depth MuP d1024 tail checkpoint (Apr 26, 2026)
- 9721631 Record MoE depth MuP d1280 checkpoint (Apr 26, 2026)
- 575b9af Record MoE depth MuP d1024 completion (Apr 26, 2026)
- 9e467fb Record MoE depth MuP d1280 checkpoint (Apr 27, 2026)
1,040 changes: 1,040 additions & 0 deletions .agents/logbooks/moe-depth-mup-lr-sweep.md


130 changes: 130 additions & 0 deletions .agents/ops/2026-04-25-iris-controller-discovery-permission.md
@@ -0,0 +1,130 @@
---
date: 2026-04-25
system: iris
severity: diagnostic-only
resolution: investigating
pr: https://github.com/marin-community/marin/pull/5179
issue: https://github.com/marin-community/marin/issues/5178
---

## TL;DR

- A MoE depth MuP LR sweep was ready to submit through Iris, but job submission
failed before job creation.
- The active GCP account was `kaiyuew@stanford.edu` on project
`hai-gcp-models`.
- Iris could not discover the Marin controller because GCP returned
`GCP API error 403: Required 'compute.instances.list' permission for
'projects/hai-gcp-models'`.
- No Iris job was created. No cluster or controller state was changed.
- Retrying requires an account with controller VM list permission or an explicit
`--controller-url` to an existing controller tunnel.

## Original problem report

The user requested that MoE experiment runs always be submitted to Iris and
carried through until the full MoE procedure is finished. The attempted command was:

```bash
.venv/bin/iris --config lib/iris/examples/marin.yaml job run \
--no-wait \
--reserve v5p-8 \
-e WANDB_API_KEY "$WANDB_API_KEY" \
-- python -m experiments.grug.moe.depth_mup_lr_sweep
```

The command failed with:

```text
GCP API error 403: Required 'compute.instances.list' permission for 'projects/hai-gcp-models'
RuntimeError: No controller VM found (label=iris-marin-controller=true, project=hai-gcp-models)
```

## Investigation path

1. The workflow first verified local GitHub auth as `WhenWen`, created issue
#5178, pushed PR #5179, and prepared the depth MuP sweep module.

2. The Iris preflight confirmed `.venv/bin/iris` was executable and
`WANDB_API_KEY` was present.

3. `gcloud auth list --filter=status:ACTIVE --format='value(account)'` showed
the active account was `kaiyuew@stanford.edu`.

4. `iris --config lib/iris/examples/marin.yaml job list` failed during
controller discovery with a GCP 403 on `compute.instances.list`. This meant
the CLI could not find the controller VM and could not open its config-based
tunnel.

5. `lib/iris/OPS.md` confirmed the two normal connection modes:
`--config=PATH` for auto-tunnels, or `--controller-url=URL` for an existing
manual tunnel.

6. The required production submission command was attempted anyway. It failed
before job creation with the same GCP permission error.

7. The dev config, `lib/iris/examples/marin-dev.yaml`, was tried as a fallback.
It also failed before job creation with the same permission error, but with
the dev controller label.

8. The environment had no `IRIS_*` or controller URL variables, no listener on
localhost ports 10000 or 10001, and `curl -sf http://localhost:10000/health`
returned nothing. There was no existing tunnel to reuse.
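The port probe in step 8 can be sketched as a small helper. The ports and the localhost host come from this report; the helper itself is an illustration, not part of the Iris CLI:

```python
import socket

def tunnel_listening(port: int, host: str = "localhost", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Ports checked in this incident; both were closed, so no tunnel existed.
for port in (10000, 10001):
    print(port, tunnel_listening(port))
```

A `True` here only means something is listening; the follow-up `curl` against `/health` is still needed to confirm it is actually a controller tunnel.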

## User course corrections

- The user instructed that future GitHub issues and PRs must not use the
connector and should be submitted as `whenwen`. The MoE guide was updated to
require local GitHub auth as `whenwen`.
- The user then instructed that future MoE work must always submit the run to
Iris and continue through the full MoE procedure. The MoE guide was updated
to make Iris submission mandatory unless a hard blocker prevents it.

## Root cause

The active local GCP account lacked permission to list Compute Engine instances
in `hai-gcp-models`. Iris config-based controller discovery depends on listing
the controller VM by label. Without `compute.instances.list`, the CLI cannot
discover or tunnel to either the production or dev Marin controller.

This was an authentication and project permission blocker, not a code or
scheduler failure. The submission failed before any Iris job was created.

## Fix

No infrastructure fix was applied. The code workflow was updated in
`experiments/grug/moe/agent.md` to require Iris submission and full procedure
completion for MoE experiments.

Operationally, one of these is needed before retrying:

```bash
gcloud auth login
gcloud auth application-default login
gcloud auth list --filter=status:ACTIVE --format='value(account)'
```

The active account must have enough access on `hai-gcp-models` to discover the
Iris controller, or the caller must provide an explicit controller URL:

```bash
.venv/bin/iris --controller-url=http://localhost:10000 job run ...
```

## How OPS.md could have shortened this

- In `lib/iris/OPS.md` under "GCP Operations / Connecting", add a preflight
command for controller discovery permissions:
`gcloud compute instances list --project=hai-gcp-models --filter="labels.iris-marin-controller=true" --format="value(name)"`.
This would distinguish missing controller VMs from missing GCP permissions
before running `iris job run`.
- In `lib/iris/OPS.md` under "Troubleshooting", add a row for controller
discovery failures that says a `compute.instances.list` 403 is an auth
blocker and should be fixed by switching GCP account or using an explicit
`--controller-url`.
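That troubleshooting row can be sketched as a tiny classifier over the two error shapes quoted earlier in this report. The substring patterns are assumptions based on those quoted messages; adjust them if Iris changes its wording:

```python
# Classify an Iris controller-discovery failure before retrying.
def classify_discovery_error(stderr: str) -> str:
    if "compute.instances.list" in stderr:
        # 403 on instance listing: fix the active gcloud account,
        # or bypass discovery with an explicit --controller-url.
        return "auth-blocker"
    if "No controller VM found" in stderr:
        # Permissions were fine, but no VM carried the controller label.
        return "no-controller"
    return "unknown"

print(classify_discovery_error(
    "GCP API error 403: Required 'compute.instances.list' permission"
))  # → auth-blocker
```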

## Artifacts

- PR: https://github.com/marin-community/marin/pull/5179
- Experiment issue: https://github.com/marin-community/marin/issues/5178
- Research logbook: `.agents/logbooks/moe-depth-mup-lr-sweep.md`
20 changes: 20 additions & 0 deletions experiments/grug/moe/README.md
@@ -25,6 +25,11 @@ z-loss only). The architecture choices are hardcoded in
experts (not softmax).
- **Shared expert**: one always-on dense MLP per block in parallel with the
routed experts (contributes to every token).
- **Depth MuP residual scaling**: optional, controlled by
`depth_mup_residual_scaling` in `GrugModelConfig`. When enabled, each
attention and MLP residual update is scaled by `1 / sqrt(num_layers)`.
The baseline recipe leaves this off; `depth_mup_lr_sweep.py` enables it for
the depth MuP ablation.
- **GatedNorm**: rank-128 low-rank gating on RMS-normalized input pre-attention
and pre-MLP. Acts as a learned per-token gate over the hidden dimension.
- **XSA (Exclusive Self-Attention)**: after attention, subtract the component
@@ -61,6 +66,19 @@ entry point — `launch.py` uses it to produce the baseline step. Callers that
want full manual control pass `GrugModelConfig` and `GrugMoeAdamHConfig`
directly to `GrugMoeLaunchConfig`.

## Depth MuP LR sweep

`depth_mup_lr_sweep.py` launches the depth MuP ablation. It uses the same
compute-optimal d512, d768, d1024, and d1280 budgets as the baseline table
below, enables `depth_mup_residual_scaling=True`, and sweeps a log-spaced set of
LR multipliers around the v16 Adam/AdamH formula:

`0.25x, 0.354x, 0.5x, 0.707x, 1x, 1.414x, 2x, 2.828x, 4x`

Both the Adam LR and AdamH LR are multiplied by the same factor. The intended
readout is whether the fitted LR optimum has lower scale sensitivity than the
v16 fit from #4225.
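The multiplier grid above is a half-octave ladder, 2^(k/2) for k from -4 to 4. A minimal sketch of how such a grid is generated (the canonical constants live in `depth_mup_lr_sweep.py`):

```python
# Half-octave LR multiplier grid: 2**(k/2) for k = -4..4,
# i.e. 0.25x up to 4x with sqrt(2) spacing between neighbors.
multipliers = tuple(2 ** (k / 2) for k in range(-4, 5))
print([round(m, 3) for m in multipliers])
# → [0.25, 0.354, 0.5, 0.707, 1.0, 1.414, 2.0, 2.828, 4.0]
```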

## v16 isoflop sweep

From the v16 sweep (`group=isoflop-moe-v16` on wandb, project `dial_moe`).
@@ -164,5 +182,7 @@ Predicted macro uses `loss(C) = 1.6 + 95.18 · C^(-0.0941)`.
`build_from_heuristic` entry point.
- [`launch.py`](./launch.py) — `GrugMoeLaunchConfig`, baseline `ExecutorStep`,
and `executor_main` wiring.
- [`depth_mup_lr_sweep.py`](./depth_mup_lr_sweep.py) — depth MuP residual
scaling LR sweep across the compute-optimal MoE scales.
- [`adamh.py`](./adamh.py) — shared AdamH utilities.
- [`agent.md`](./agent.md) — agent guide for running ablation experiments on Iris.
27 changes: 23 additions & 4 deletions experiments/grug/moe/agent.md
@@ -6,12 +6,14 @@ This workflow is designed to run end-to-end without human confirmation. The
agent is authorized to:

- Create branches, commit, and push without asking
- Create GitHub experiment issues and post comments
- Create GitHub experiment issues and post comments as `whenwen`
- Submit Iris jobs and kill only jobs submitted by self
- Run experiments through both gates autonomously

Do not stop to ask for confirmation at any step. If something fails, diagnose
and retry or report the failure — do not block waiting for input.
Do not stop to ask for confirmation at any step. Do not stop at a code change,
commit, or PR: submit the run to Iris and continue through the full gate
procedure below. If something fails, diagnose and retry or report the failure —
do not block waiting for input.

## Objective

@@ -118,6 +120,13 @@ Most promotable changes will land in one of three files:

Create a new branch for each experiment issue. Branch off `main`.

For GitHub write actions in this workflow, **do not use the GitHub connector**.
Create experiment issues, PRs, labels, issue comments, and review-thread updates
through local GitHub auth as `whenwen` (for example `gh`, or GitHub
REST/GraphQL with the local token if `gh` is unavailable). Before the first
write action, verify the authenticated user is `whenwen`; if it is not, stop and
report the mismatch.

Follow `.agents/skills/agent-research/SKILL.md` for all documentation, logbooks,
W&B tracking, and GitHub experiment issue management tied to work in this
directory. Pay attention to this file carefully.
@@ -159,13 +168,23 @@ Assume the user has already completed these before job submission:
## Job Submission

Jobs in this directory are submitted to **Iris** on a **v5p-8**.
Always submit the relevant MoE experiment run to Iris as part of this workflow.
Do not leave a MoE experiment at "ready to run" unless a hard blocker prevents
submission, such as missing authentication, unavailable required environment
variables, or an Iris outage.

The Iris entrypoint runs `executor_main`, so keep that parent job CPU-only. The
`ExecutorStep` configs in the launch modules request `ResourceConfig.with_tpu("v5p-8")`
for the training children.

### Submission command

```bash
.venv/bin/iris --config lib/iris/examples/marin.yaml job run \
--no-wait \
--reserve v5p-8 \
--cpu 1 \
--memory 2G \
--extra cpu \
-e WANDB_API_KEY "$WANDB_API_KEY" \
-- python -m experiments.grug.moe.launch
```
157 changes: 157 additions & 0 deletions experiments/grug/moe/depth_mup_lr_sweep.py
@@ -0,0 +1,157 @@
# Copyright The Marin Authors
# SPDX-License-Identifier: Apache-2.0

"""Depth MuP LR sweep for the current Grug MoE recipe."""

import dataclasses
from dataclasses import dataclass

from fray.cluster import ResourceConfig
from levanter.tracker.wandb import WandbConfig
from marin.execution.executor import ExecutorStep, executor_main, this_output_path, versioned

from experiments.grug.moe.heuristic import MoeAdamHHeuristic, build_from_heuristic
from experiments.grug.moe.launch import (
NEMOTRON_MIX_WITH_DEFAULT_VALIDATION,
GrugMoeLaunchConfig,
run_grug_moe_trial,
)
from experiments.grug.moe.optimizer import GrugMoeAdamHConfig
from experiments.grug.moe.train import GrugEvalConfig, GrugTrainerConfig

DEPTH_MUP_TARGET_STEPS: int = 2**14
DEPTH_MUP_WANDB_GROUP: str = "moe-depth-mup-lr-sweep"


@dataclass(frozen=True)
class DepthMupSweepScale:
label: str
budget: float
hidden_dim: int


DEPTH_MUP_SWEEP_SCALES: tuple[DepthMupSweepScale, ...] = (
DepthMupSweepScale(label="d512", budget=2.19e17, hidden_dim=512),
DepthMupSweepScale(label="d768", budget=1.70e18, hidden_dim=768),
DepthMupSweepScale(label="d1024", budget=9.00e18, hidden_dim=1024),
DepthMupSweepScale(label="d1280", budget=2.83e19, hidden_dim=1280),
)

DEPTH_MUP_LR_MULTIPLIERS: tuple[float, ...] = (
0.25,
0.3535533905932738,
0.5,
0.7071067811865476,
1.0,
1.4142135623730951,
2.0,
2.8284271247461903,
4.0,
)


def _format_lr_multiplier(multiplier: float) -> str:
if multiplier <= 0:
raise ValueError(f"lr_multiplier must be positive, got {multiplier}")
if multiplier.is_integer():
return f"{int(multiplier)}x"
return f"{multiplier:.3g}".replace(".", "p") + "x"


def _scale_optimizer_lrs(optimizer: GrugMoeAdamHConfig, lr_multiplier: float) -> GrugMoeAdamHConfig:
if lr_multiplier <= 0:
raise ValueError(f"lr_multiplier must be positive, got {lr_multiplier}")
expert_lr = optimizer.expert_lr * lr_multiplier if optimizer.expert_lr is not None else None
return dataclasses.replace(
optimizer,
learning_rate=optimizer.learning_rate * lr_multiplier,
adam_lr=optimizer.adam_lr * lr_multiplier,
expert_lr=expert_lr,
)


def build_depth_mup_lr_sweep_config(
scale: DepthMupSweepScale,
lr_multiplier: float,
*,
output_path: str,
seed: int = 0,
) -> GrugMoeLaunchConfig:
heuristic = MoeAdamHHeuristic(depth_mup_residual_scaling=True)
model, optimizer, batch_size, steps = build_from_heuristic(
budget=scale.budget,
hidden_dim=scale.hidden_dim,
heuristic=heuristic,
target_steps=DEPTH_MUP_TARGET_STEPS,
)
optimizer = _scale_optimizer_lrs(optimizer, lr_multiplier)
run_id = f"moe-depth-mup-lr-{scale.label}-lr{_format_lr_multiplier(lr_multiplier)}"

return GrugMoeLaunchConfig(
model=model,
data=NEMOTRON_MIX_WITH_DEFAULT_VALIDATION,
output_path=output_path,
run_id=run_id,
resources=ResourceConfig.with_tpu("v5p-8"),
steps=steps,
batch_size=batch_size,
seed=seed,
mp="params=float32,compute=bfloat16,output=bfloat16",
tracker=WandbConfig(
project="marin_moe",
tags=["moe", "depth-mup", "lr-sweep"],
group=DEPTH_MUP_WANDB_GROUP,
name=None,
),
optimizer=optimizer,
grug_trainer=GrugTrainerConfig(
z_loss_weight=1e-4,
ema_beta=None,
log_every=1,
),
eval=GrugEvalConfig(
eval_batch_size=512,
steps_per_eval=1000,
max_eval_batches=8,
eval_current=True,
eval_ema=False,
),
)


def _versioned_launch_config(config: GrugMoeLaunchConfig) -> GrugMoeLaunchConfig:
return dataclasses.replace(
config,
model=versioned(config.model),
resources=versioned(config.resources),
steps=versioned(config.steps),
batch_size=versioned(config.batch_size),
seed=versioned(config.seed),
mp=versioned(config.mp),
optimizer=versioned(config.optimizer),
grug_trainer=versioned(config.grug_trainer),
eval=versioned(config.eval) if config.eval is not None else None,
)


def build_depth_mup_lr_sweep_step(scale: DepthMupSweepScale, lr_multiplier: float) -> ExecutorStep:
config = build_depth_mup_lr_sweep_config(scale, lr_multiplier, output_path=this_output_path())
return ExecutorStep(
name=f"grug/moe_depth_mup_lr/{config.run_id}",
fn=run_grug_moe_trial,
config=_versioned_launch_config(config),
)


depth_mup_lr_sweep_steps: tuple[ExecutorStep, ...] = tuple(
build_depth_mup_lr_sweep_step(scale, lr_multiplier)
for scale in DEPTH_MUP_SWEEP_SCALES
for lr_multiplier in DEPTH_MUP_LR_MULTIPLIERS
)


if __name__ == "__main__":
executor_main(
steps=list(depth_mup_lr_sweep_steps),
description="Depth MuP residual scaling LR sweep for Grug MoE.",
)