31 commits:

- 6b7db1e Add MoE depth MuP LR sweep (Apr 25, 2026)
- 6e76368 Clarify MoE Iris submission workflow (Apr 25, 2026)
- 06f3223 Record MoE Iris submission blocker (Apr 25, 2026)
- 94fb32a Clarify MoE Iris entrypoint resources (Apr 25, 2026)
- 99bd5b7 Record MoE depth MuP Iris submission (Apr 25, 2026)
- e51255b Record MoE depth MuP startup check (Apr 25, 2026)
- 33936f8 Record MoE depth MuP first progress (Apr 25, 2026)
- 9caeffc Record MoE depth MuP d512 results (Apr 25, 2026)
- 72f9a43 Record MoE depth MuP d768 early eval (Apr 25, 2026)
- 095efa6 Record MoE depth MuP d512 completion (Apr 25, 2026)
- 2f1ad40 Record MoE depth MuP d768 partial completion (Apr 25, 2026)
- 45b2c8a Record MoE depth MuP d1024 early eval (Apr 26, 2026)
- 2fe66ba Record MoE depth MuP d1024 second eval (Apr 26, 2026)
- 42d0861 Record MoE depth MuP d1024 third eval (Apr 26, 2026)
- 117f0f4 Record MoE depth MuP recovery resubmission (Apr 26, 2026)
- 67a6f60 Record MoE depth MuP recovery heartbeat (Apr 26, 2026)
- 1cfd049 Record MoE depth MuP d768 completion (Apr 26, 2026)
- 511e976 Record MoE depth MuP d1024 checkpoint (Apr 26, 2026)
- 7004c30 Record MoE depth MuP d1024 5k checkpoint (Apr 26, 2026)
- 79d65d4 Record MoE depth MuP d1024 6k checkpoint (Apr 26, 2026)
- 1f63b74 Record MoE depth MuP d1024 7k checkpoint (Apr 26, 2026)
- 0e3d8b6 Record MoE depth MuP d1024 8k checkpoint (Apr 26, 2026)
- c8b39cb Record MoE depth MuP d1024 9k checkpoint (Apr 26, 2026)
- 05768c5 Record MoE depth MuP d1024 10k checkpoint (Apr 26, 2026)
- 734a666 Record MoE depth MuP d1024 11k checkpoint (Apr 26, 2026)
- aa63435 Record MoE depth MuP d1024 12k checkpoint (Apr 26, 2026)
- c6989db Record MoE depth MuP central d1024 finals (Apr 26, 2026)
- dfa4f4b Record MoE depth MuP d1024 tail checkpoint (Apr 26, 2026)
- 9721631 Record MoE depth MuP d1280 checkpoint (Apr 26, 2026)
- 575b9af Record MoE depth MuP d1024 completion (Apr 26, 2026)
- 9e467fb Record MoE depth MuP d1280 checkpoint (Apr 27, 2026)
1,040 changes: 1,040 additions & 0 deletions .agents/logbooks/moe-depth-mup-lr-sweep.md


130 changes: 130 additions & 0 deletions .agents/ops/2026-04-25-iris-controller-discovery-permission.md
@@ -0,0 +1,130 @@
---
date: 2026-04-25
system: iris
severity: diagnostic-only
resolution: investigating
pr: https://github.com/marin-community/marin/pull/5179
issue: https://github.com/marin-community/marin/issues/5178
---

## TL;DR

- A MoE depth MuP LR sweep was ready to submit through Iris, but job submission
failed before job creation.
- The active GCP account was `kaiyuew@stanford.edu` on project
`hai-gcp-models`.
- Iris could not discover the Marin controller because GCP returned
`GCP API error 403: Required 'compute.instances.list' permission for
'projects/hai-gcp-models'`.
- No Iris job was created. No cluster or controller state was changed.
- Retrying requires an account with controller VM list permission or an explicit
`--controller-url` to an existing controller tunnel.

## Original problem report

The user requested that MoE experiment runs always be submitted to Iris and
carried through until the full MoE procedure is finished. The attempted command was:

```bash
.venv/bin/iris --config lib/iris/examples/marin.yaml job run \
--no-wait \
--reserve v5p-8 \
-e WANDB_API_KEY "$WANDB_API_KEY" \
-- python -m experiments.grug.moe.depth_mup_lr_sweep
```

The command failed with:

```text
GCP API error 403: Required 'compute.instances.list' permission for 'projects/hai-gcp-models'
RuntimeError: No controller VM found (label=iris-marin-controller=true, project=hai-gcp-models)
```

## Investigation path

1. The workflow first verified local GitHub auth as `WhenWen`, created issue
#5178, pushed PR #5179, and prepared the depth MuP sweep module.

2. The Iris preflight confirmed `.venv/bin/iris` was executable and
`WANDB_API_KEY` was present.

3. `gcloud auth list --filter=status:ACTIVE --format='value(account)'` showed
the active account was `kaiyuew@stanford.edu`.

4. `iris --config lib/iris/examples/marin.yaml job list` failed during
controller discovery with a GCP 403 on `compute.instances.list`. This meant
the CLI could not find the controller VM and could not open its config-based
tunnel.

5. `lib/iris/OPS.md` confirmed the two normal connection modes:
`--config=PATH` for auto-tunnels, or `--controller-url=URL` for an existing
manual tunnel.

6. The required production submission command was attempted anyway. It failed
before job creation with the same GCP permission error.

7. The dev config, `lib/iris/examples/marin-dev.yaml`, was tried as a fallback.
It also failed before job creation with the same permission error, but with
the dev controller label.

8. The environment had no `IRIS_*` or controller URL variables, no listener on
localhost ports 10000 or 10001, and `curl -sf http://localhost:10000/health`
returned nothing. There was no existing tunnel to reuse.
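The port probe in step 8 can be sketched as a small helper. The ports and the localhost host come from this report; the helper itself is an illustration, not part of the Iris CLI:

```python
import socket

def tunnel_listening(port: int, host: str = "localhost", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Ports checked in this incident; both were closed, so no tunnel existed.
for port in (10000, 10001):
    print(port, tunnel_listening(port))
```

A `True` here only means something is listening; the follow-up `curl` against `/health` is still needed to confirm it is actually a controller tunnel.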

## User course corrections

- The user instructed that future GitHub issues and PRs must not use the
connector and should be submitted as `whenwen`. The MoE guide was updated to
require local GitHub auth as `whenwen`.
- The user then instructed that future MoE work must always submit the run to
Iris and continue through the full MoE procedure. The MoE guide was updated
to make Iris submission mandatory unless a hard blocker prevents it.

## Root cause

The active local GCP account lacked permission to list Compute Engine instances
in `hai-gcp-models`. Iris config-based controller discovery depends on listing
the controller VM by label. Without `compute.instances.list`, the CLI cannot
discover or tunnel to either the production or dev Marin controller.

This was an authentication and project permission blocker, not a code or
scheduler failure. The submission failed before any Iris job was created.

## Fix

No infrastructure fix was applied. The code workflow was updated in
`experiments/grug/moe/agent.md` to require Iris submission and full procedure
completion for MoE experiments.

Operationally, one of these is needed before retrying:

```bash
gcloud auth login
gcloud auth application-default login
gcloud auth list --filter=status:ACTIVE --format='value(account)'
```

The active account must have enough access on `hai-gcp-models` to discover the
Iris controller, or the caller must provide an explicit controller URL:

```bash
.venv/bin/iris --controller-url=http://localhost:10000 job run ...
```

## How OPS.md could have shortened this

- In `lib/iris/OPS.md` under "GCP Operations / Connecting", add a preflight
command for controller discovery permissions:
`gcloud compute instances list --project=hai-gcp-models --filter="labels.iris-marin-controller=true" --format="value(name)"`.
This would distinguish missing controller VMs from missing GCP permissions
before running `iris job run`.
- In `lib/iris/OPS.md` under "Troubleshooting", add a row for controller
discovery failures that says a `compute.instances.list` 403 is an auth
blocker and should be fixed by switching GCP account or using an explicit
`--controller-url`.
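That troubleshooting row can be sketched as a tiny classifier over the two error shapes quoted earlier in this report. The substring patterns are assumptions based on those quoted messages; adjust them if Iris changes its wording:

```python
# Classify an Iris controller-discovery failure before retrying.
def classify_discovery_error(stderr: str) -> str:
    if "compute.instances.list" in stderr:
        # 403 on instance listing: fix the active gcloud account,
        # or bypass discovery with an explicit --controller-url.
        return "auth-blocker"
    if "No controller VM found" in stderr:
        # Permissions were fine, but no VM carried the controller label.
        return "no-controller"
    return "unknown"

print(classify_discovery_error(
    "GCP API error 403: Required 'compute.instances.list' permission"
))  # → auth-blocker
```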

## Artifacts

- PR: https://github.com/marin-community/marin/pull/5179
- Experiment issue: https://github.com/marin-community/marin/issues/5178
- Research logbook: `.agents/logbooks/moe-depth-mup-lr-sweep.md`
20 changes: 20 additions & 0 deletions experiments/grug/moe/README.md
@@ -25,6 +25,11 @@ z-loss only). The architecture choices are hardcoded in
experts (not softmax).
- **Shared expert**: one always-on dense MLP per block in parallel with the
routed experts (contributes to every token).
- **Depth MuP residual scaling**: optional, controlled by
`depth_mup_residual_scaling` in `GrugModelConfig`. When enabled, each
attention and MLP residual update is scaled by `1 / sqrt(num_layers)`.
The baseline recipe leaves this off; `depth_mup_lr_sweep.py` enables it for
the depth MuP ablation.
- **GatedNorm**: rank-128 low-rank gating on RMS-normalized input pre-attention
and pre-MLP. Acts as a learned per-token gate over the hidden dimension.
- **XSA (Exclusive Self-Attention)**: after attention, subtract the component
@@ -61,6 +66,19 @@ entry point — `launch.py` uses it to produce the baseline step. Callers that
want full manual control pass `GrugModelConfig` and `GrugMoeAdamHConfig`
directly to `GrugMoeLaunchConfig`.

## Depth MuP LR sweep

`depth_mup_lr_sweep.py` launches the depth MuP ablation. It uses the same
compute-optimal d512, d768, d1024, and d1280 budgets as the baseline table
below, enables `depth_mup_residual_scaling=True`, and sweeps a log-spaced set of
LR multipliers around the v16 Adam/AdamH formula:

`0.25x, 0.354x, 0.5x, 0.707x, 1x, 1.414x, 2x, 2.828x, 4x`

Both the Adam LR and AdamH LR are multiplied by the same factor. The intended
readout is whether the fitted LR optimum has lower scale sensitivity than the
v16 fit from #4225.
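The multiplier grid above is a half-octave ladder, 2^(k/2) for k from -4 to 4. A minimal sketch of how such a grid is generated (the canonical constants live in `depth_mup_lr_sweep.py`):

```python
# Half-octave LR multiplier grid: 2**(k/2) for k = -4..4,
# i.e. 0.25x up to 4x with sqrt(2) spacing between neighbors.
multipliers = tuple(2 ** (k / 2) for k in range(-4, 5))
print([round(m, 3) for m in multipliers])
# → [0.25, 0.354, 0.5, 0.707, 1.0, 1.414, 2.0, 2.828, 4.0]
```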

## v16 isoflop sweep

From the v16 sweep (`group=isoflop-moe-v16` on wandb, project `dial_moe`).
@@ -164,5 +182,7 @@ Predicted macro uses `loss(C) = 1.6 + 95.18 · C^(-0.0941)`.
`build_from_heuristic` entry point.
- [`launch.py`](./launch.py) — `GrugMoeLaunchConfig`, baseline `ExecutorStep`,
and `executor_main` wiring.
- [`depth_mup_lr_sweep.py`](./depth_mup_lr_sweep.py) — depth MuP residual
scaling LR sweep across the compute-optimal MoE scales.
- [`adamh.py`](./adamh.py) — shared AdamH utilities.
- [`agent.md`](./agent.md) — agent guide for running ablation experiments on Iris.
27 changes: 23 additions & 4 deletions experiments/grug/moe/agent.md
@@ -6,12 +6,14 @@ This workflow is designed to run end-to-end without human confirmation. The
agent is authorized to:

- Create branches, commit, and push without asking
- Create GitHub experiment issues and post comments
- Create GitHub experiment issues and post comments as `whenwen`
- Submit Iris jobs and kill only jobs submitted by self
- Run experiments through both gates autonomously

Do not stop to ask for confirmation at any step. If something fails, diagnose
and retry or report the failure — do not block waiting for input.
Do not stop to ask for confirmation at any step. Do not stop at a code change,
commit, or PR: submit the run to Iris and continue through the full gate
procedure below. If something fails, diagnose and retry or report the failure —
do not block waiting for input.

## Objective

@@ -118,6 +120,13 @@ Most promotable changes will land in one of three files:

Create a new branch for each experiment issue. Branch off `main`.

For GitHub write actions in this workflow, **do not use the GitHub connector**.
Create experiment issues, PRs, labels, issue comments, and review-thread updates
through local GitHub auth as `whenwen` (for example `gh`, or GitHub
REST/GraphQL with the local token if `gh` is unavailable). Before the first
write action, verify the authenticated user is `whenwen`; if it is not, stop and
report the mismatch.

Follow `.agents/skills/agent-research/SKILL.md` for all documentation, logbooks,
W&B tracking, and GitHub experiment issue management tied to work in this
directory. Pay attention to this file carefully.
@@ -159,13 +168,23 @@ Assume the user has already completed these before job submission:
## Job Submission

Jobs in this directory are submitted to **Iris** on a **v5p-8**.
Always submit the relevant MoE experiment run to Iris as part of this workflow.
Do not leave a MoE experiment at "ready to run" unless a hard blocker prevents
submission, such as missing authentication, unavailable required environment
variables, or an Iris outage.

The Iris entrypoint runs `executor_main`, so keep that parent job CPU-only. The
`ExecutorStep` configs in the launch modules request `ResourceConfig.with_tpu("v5p-8")`
for the training children.

### Submission command

```bash
.venv/bin/iris --config lib/iris/examples/marin.yaml job run \
--no-wait \
--reserve v5p-8 \
--cpu 1 \
--memory 2G \
--extra cpu \
-e WANDB_API_KEY "$WANDB_API_KEY" \
-- python -m experiments.grug.moe.launch
```
157 changes: 157 additions & 0 deletions experiments/grug/moe/depth_mup_lr_sweep.py
@@ -0,0 +1,157 @@
# Copyright The Marin Authors
# SPDX-License-Identifier: Apache-2.0

"""Depth MuP LR sweep for the current Grug MoE recipe."""

import dataclasses
from dataclasses import dataclass

from fray.cluster import ResourceConfig
from levanter.tracker.wandb import WandbConfig
from marin.execution.executor import ExecutorStep, executor_main, this_output_path, versioned

from experiments.grug.moe.heuristic import MoeAdamHHeuristic, build_from_heuristic
from experiments.grug.moe.launch import (
NEMOTRON_MIX_WITH_DEFAULT_VALIDATION,
GrugMoeLaunchConfig,
run_grug_moe_trial,
)
from experiments.grug.moe.optimizer import GrugMoeAdamHConfig
from experiments.grug.moe.train import GrugEvalConfig, GrugTrainerConfig

DEPTH_MUP_TARGET_STEPS: int = 2**14
DEPTH_MUP_WANDB_GROUP: str = "moe-depth-mup-lr-sweep"


@dataclass(frozen=True)
class DepthMupSweepScale:
label: str
budget: float
hidden_dim: int


DEPTH_MUP_SWEEP_SCALES: tuple[DepthMupSweepScale, ...] = (
DepthMupSweepScale(label="d512", budget=2.19e17, hidden_dim=512),
DepthMupSweepScale(label="d768", budget=1.70e18, hidden_dim=768),
DepthMupSweepScale(label="d1024", budget=9.00e18, hidden_dim=1024),
DepthMupSweepScale(label="d1280", budget=2.83e19, hidden_dim=1280),
)

DEPTH_MUP_LR_MULTIPLIERS: tuple[float, ...] = (
0.25,
0.3535533905932738,
0.5,
0.7071067811865476,
1.0,
1.4142135623730951,
2.0,
2.8284271247461903,
4.0,
)


def _format_lr_multiplier(multiplier: float) -> str:
if multiplier <= 0:
raise ValueError(f"lr_multiplier must be positive, got {multiplier}")
if multiplier.is_integer():
return f"{int(multiplier)}x"
return f"{multiplier:.3g}".replace(".", "p") + "x"


def _scale_optimizer_lrs(optimizer: GrugMoeAdamHConfig, lr_multiplier: float) -> GrugMoeAdamHConfig:
if lr_multiplier <= 0:
raise ValueError(f"lr_multiplier must be positive, got {lr_multiplier}")
expert_lr = optimizer.expert_lr * lr_multiplier if optimizer.expert_lr is not None else None
return dataclasses.replace(
optimizer,
learning_rate=optimizer.learning_rate * lr_multiplier,
adam_lr=optimizer.adam_lr * lr_multiplier,
expert_lr=expert_lr,
)


def build_depth_mup_lr_sweep_config(
scale: DepthMupSweepScale,
lr_multiplier: float,
*,
output_path: str,
seed: int = 0,
) -> GrugMoeLaunchConfig:
heuristic = MoeAdamHHeuristic(depth_mup_residual_scaling=True)
model, optimizer, batch_size, steps = build_from_heuristic(
budget=scale.budget,
hidden_dim=scale.hidden_dim,
heuristic=heuristic,
target_steps=DEPTH_MUP_TARGET_STEPS,
)
optimizer = _scale_optimizer_lrs(optimizer, lr_multiplier)
run_id = f"moe-depth-mup-lr-{scale.label}-lr{_format_lr_multiplier(lr_multiplier)}"

return GrugMoeLaunchConfig(
model=model,
data=NEMOTRON_MIX_WITH_DEFAULT_VALIDATION,
output_path=output_path,
run_id=run_id,
resources=ResourceConfig.with_tpu("v5p-8"),
steps=steps,
batch_size=batch_size,
seed=seed,
mp="params=float32,compute=bfloat16,output=bfloat16",
tracker=WandbConfig(
project="marin_moe",
tags=["moe", "depth-mup", "lr-sweep"],
group=DEPTH_MUP_WANDB_GROUP,
name=None,
),
optimizer=optimizer,
grug_trainer=GrugTrainerConfig(
z_loss_weight=1e-4,
ema_beta=None,
log_every=1,
),
eval=GrugEvalConfig(
eval_batch_size=512,
steps_per_eval=1000,
max_eval_batches=8,
eval_current=True,
eval_ema=False,
),
)


def _versioned_launch_config(config: GrugMoeLaunchConfig) -> GrugMoeLaunchConfig:
return dataclasses.replace(
config,
model=versioned(config.model),
resources=versioned(config.resources),
steps=versioned(config.steps),
batch_size=versioned(config.batch_size),
seed=versioned(config.seed),
mp=versioned(config.mp),
optimizer=versioned(config.optimizer),
grug_trainer=versioned(config.grug_trainer),
eval=versioned(config.eval) if config.eval is not None else None,
)


def build_depth_mup_lr_sweep_step(scale: DepthMupSweepScale, lr_multiplier: float) -> ExecutorStep:
config = build_depth_mup_lr_sweep_config(scale, lr_multiplier, output_path=this_output_path())
return ExecutorStep(
name=f"grug/moe_depth_mup_lr/{config.run_id}",
fn=run_grug_moe_trial,
config=_versioned_launch_config(config),
)


depth_mup_lr_sweep_steps: tuple[ExecutorStep, ...] = tuple(
build_depth_mup_lr_sweep_step(scale, lr_multiplier)
for scale in DEPTH_MUP_SWEEP_SCALES
for lr_multiplier in DEPTH_MUP_LR_MULTIPLIERS
)


if __name__ == "__main__":
executor_main(
steps=list(depth_mup_lr_sweep_steps),
description="Depth MuP residual scaling LR sweep for Grug MoE.",
)