Commit 06f3223

Record MoE Iris submission blocker

Author: Kaiyue Wen
1 parent 6e76368 commit 06f3223

2 files changed

Lines changed: 151 additions & 0 deletions

.agents/logbooks/moe-depth-mup-lr-sweep.md

Lines changed: 21 additions & 0 deletions
@@ -65,3 +65,24 @@
- Interpretation: local behavior is wired correctly; no TPU jobs launched yet.
- Next action: run broader contract checks, then decide whether to submit the
  sweep jobs.

### 2026-04-25 11:33 - Iris submission blocked

- Hypothesis: the depth MuP LR sweep can be submitted through the standard MoE
  Iris command once the branch is pushed.
- Command:
  - `.venv/bin/iris --config lib/iris/examples/marin.yaml job run --no-wait --reserve v5p-8 -e WANDB_API_KEY "$WANDB_API_KEY" -- python -m experiments.grug.moe.depth_mup_lr_sweep`
  - `.venv/bin/iris --config lib/iris/examples/marin-dev.yaml job run --no-wait --reserve v5p-8 -e WANDB_API_KEY "$WANDB_API_KEY" -- python -m experiments.grug.moe.depth_mup_lr_sweep`
- Config:
  - active GCP account: `kaiyuew@stanford.edu`
  - GCP project: `hai-gcp-models`
  - `WANDB_API_KEY`: present
  - controller tunnel: none found on `localhost:10000`
- Result: both production and dev configs failed before creating a job:
  `GCP API error 403: Required 'compute.instances.list' permission for
  'projects/hai-gcp-models'`.
- Interpretation: the run submission is blocked by local GCP permissions, not
  by the experiment code or Iris job configuration.
- Next action: retry submission after authenticating with an account that can
  list controller VMs in `hai-gcp-models`, or provide an explicit
  `--controller-url` for an existing Iris tunnel.
Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
---
date: 2026-04-25
system: iris
severity: diagnostic-only
resolution: investigating
pr: https://github.com/marin-community/marin/pull/5179
issue: https://github.com/marin-community/marin/issues/5178
---

## TL;DR

- A MoE depth MuP LR sweep was ready to submit through Iris, but job submission
  failed before job creation.
- The active GCP account was `kaiyuew@stanford.edu` on project
  `hai-gcp-models`.
- Iris could not discover the Marin controller because GCP returned
  `GCP API error 403: Required 'compute.instances.list' permission for
  'projects/hai-gcp-models'`.
- No Iris job was created. No cluster or controller state was changed.
- Retrying requires an account with controller VM list permission or an explicit
  `--controller-url` to an existing controller tunnel.

## Original problem report

The user requested that MoE experiment work always submit the run to Iris and
continue until the full MoE procedure is finished. The attempted command was:

```bash
.venv/bin/iris --config lib/iris/examples/marin.yaml job run \
  --no-wait \
  --reserve v5p-8 \
  -e WANDB_API_KEY "$WANDB_API_KEY" \
  -- python -m experiments.grug.moe.depth_mup_lr_sweep
```

The command failed with:

```text
GCP API error 403: Required 'compute.instances.list' permission for 'projects/hai-gcp-models'
RuntimeError: No controller VM found (label=iris-marin-controller=true, project=hai-gcp-models)
```

## Investigation path

1. The workflow first verified local GitHub auth as `WhenWen`, created issue
   #5178, pushed PR #5179, and prepared the depth MuP sweep module.

2. The Iris preflight confirmed `.venv/bin/iris` was executable and
   `WANDB_API_KEY` was present.

3. `gcloud auth list --filter=status:ACTIVE --format='value(account)'` showed
   the active account was `kaiyuew@stanford.edu`.

4. `iris --config lib/iris/examples/marin.yaml job list` failed during
   controller discovery with a GCP 403 on `compute.instances.list`. This meant
   the CLI could not find the controller VM and could not open its config-based
   tunnel.

5. `lib/iris/OPS.md` confirmed the two normal connection modes:
   `--config=PATH` for auto-tunnels, or `--controller-url=URL` for an existing
   manual tunnel.

6. The required production submission command was attempted anyway. It failed
   before job creation with the same GCP permission error.

7. The dev config, `lib/iris/examples/marin-dev.yaml`, was tried as a fallback.
   It also failed before job creation with the same permission error, but with
   the dev controller label.

8. The environment had no `IRIS_*` or controller URL variables, no listener on
   localhost ports 10000 or 10001, and `curl -sf http://localhost:10000/health`
   returned nothing. There was no existing tunnel to reuse.
74+
## User course corrections
75+
76+
- The user instructed that future GitHub issues and PRs must not use the
77+
connector and should be submitted as `whenwen`. The MoE guide was updated to
78+
require local GitHub auth as `whenwen`.
79+
- The user then instructed that future MoE work must always submit the run to
80+
Iris and continue through the full MoE procedure. The MoE guide was updated
81+
to make Iris submission mandatory unless a hard blocker prevents it.
82+
83+
## Root cause
84+
85+
The active local GCP account lacked permission to list Compute Engine instances
86+
in `hai-gcp-models`. Iris config-based controller discovery depends on listing
87+
the controller VM by label. Without `compute.instances.list`, the CLI cannot
88+
discover or tunnel to either the production or dev Marin controller.
89+
90+
This was an authentication and project permission blocker, not a code or
91+
scheduler failure. The submission failed before any Iris job was created.
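
To make the distinction above mechanical, captured CLI output could be classified by the two messages quoted in this report. The sketch below is an assumption-laden illustration: `classify_iris_failure` and its three labels are hypothetical, and it matches only the exact strings seen here.

```bash
#!/usr/bin/env bash
# Sketch: classify captured Iris CLI output using the two messages from this report.
# classify_iris_failure and its labels are hypothetical, not part of Iris.

classify_iris_failure() {
  local output="$1"
  case "$output" in
    # The 403 is checked first: in this incident it appeared together with
    # "No controller VM found", and the permission error was the real cause.
    *"403"*"compute.instances.list"*)
      echo "auth-blocker" ;;
    *"No controller VM found"*)
      echo "controller-missing" ;;
    *)
      echo "unknown" ;;
  esac
}
```

Checking the permission error before the missing-controller message matters because the `RuntimeError` alone would suggest the VM does not exist, when the listing call was actually never allowed to run.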

## Fix

No infrastructure fix was applied. The code workflow was updated in
`experiments/grug/moe/agent.md` to require Iris submission and full procedure
completion for MoE experiments.

Operationally, one of two things is needed before retrying. The first option is
to re-authenticate and confirm which account is active:

```bash
gcloud auth login
gcloud auth application-default login
gcloud auth list --filter=status:ACTIVE --format='value(account)'
```

The active account must have enough access on `hai-gcp-models` to discover the
Iris controller. The failed call shows this includes at least
`compute.instances.list`. Otherwise, the caller must provide an explicit
controller URL:

```bash
.venv/bin/iris --controller-url=http://localhost:10000 job run ...
```

## How OPS.md could have shortened this

- In `lib/iris/OPS.md` under "GCP Operations / Connecting", add a preflight
  command for controller discovery permissions:
  `gcloud compute instances list --project=hai-gcp-models --filter="labels.iris-marin-controller=true" --format="value(name)"`.
  This would distinguish missing controller VMs from missing GCP permissions
  before running `iris job run`.
- In `lib/iris/OPS.md` under "Troubleshooting", add a row for controller
  discovery failures that says a `compute.instances.list` 403 is an auth
  blocker and should be fixed by switching GCP account or using an explicit
  `--controller-url`.
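
The proposed preflight could be wrapped as follows. This is a sketch under stated assumptions: the `gcloud` invocation is the one quoted above, but the function name, its default arguments, and the three result labels are hypothetical. It also assumes `gcloud` exits non-zero when the listing call is denied.

```bash
#!/usr/bin/env bash
# Sketch of the proposed preflight. The gcloud command is the one quoted above;
# preflight_controller_discovery and its result labels are hypothetical.

preflight_controller_discovery() {
  local project="${1:-hai-gcp-models}"
  local label="${2:-iris-marin-controller}"
  local names
  if ! names=$(gcloud compute instances list \
      --project="$project" \
      --filter="labels.${label}=true" \
      --format="value(name)" 2>/dev/null); then
    # Non-zero exit: the listing itself failed, which is assumed to
    # include a 403 on compute.instances.list.
    echo "list-failed"
    return 2
  fi
  if [ -z "$names" ]; then
    # The call was permitted, but no VM carries the controller label.
    echo "no-controller"
    return 1
  fi
  echo "ok"
  return 0
}
```

Running `preflight_controller_discovery` before `iris job run` would separate the two cases the bullet describes: `list-failed` points at account or permission problems, while `no-controller` means the listing worked and the controller VM is genuinely absent. A different label can be passed as the second argument for the dev controller.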

## Artifacts

- PR: https://github.com/marin-community/marin/pull/5179
- Experiment issue: https://github.com/marin-community/marin/issues/5178
- Research logbook: `.agents/logbooks/moe-depth-mup-lr-sweep.md`
