Commit abc4c9e

Merge branch 'main' into grug_moe_heuristic
2 parents 921e389 + 5eb5b60 commit abc4c9e

133 files changed (+5535, -3210 lines)

.agents/skills/agent-research/SKILL.md

Lines changed: 1 addition & 2 deletions
@@ -58,8 +58,7 @@ When using W&B:
 - iteration is quick,
 - you are tuning kernels or benchmarks,
 - full pipeline apparatus is unnecessary.
-- Use `.agents/skills/dev-tpu/SKILL.md` for the standard Iris-backed workflow.
-- Use `.agents/skills/dev-tpu-ray/SKILL.md` only when you specifically need the legacy Ray-backed workflow.
+- Use `.agents/skills/dev-tpu/SKILL.md` for the Iris-backed workflow.
 
 Rule of thumb:
 - Start with dev TPU for fast hillclimbing.

.agents/skills/babysit-job/SKILL.md

Lines changed: 1 addition & 4 deletions
@@ -7,10 +7,7 @@ description: Monitor/babysit a job continuously and recover on failure. Use when
 
 Monitor a job continuously and recover on failure. For **Zephyr pipelines**,
 delegate to **babysit-zephyr** instead. Otherwise, follow this skill — Iris is
-the default execution backend.
-
-**Ray is deprecated.** If the user asks to run or babysit a Ray job, tell them
-Ray is no longer supported and they should use Iris instead.
+the execution backend.
 
 ## Required Info
 

.agents/skills/canary-triage/SKILL.md

Lines changed: 2 additions & 1 deletion
@@ -28,8 +28,9 @@ write a Slack summary. Diagnosis and reporting only — no code changes, no PRs.
 The cluster is still live. Collect signal now — it will be torn down after you.
 
 - Iris job state via `.venv/bin/iris --config=$IRIS_CONFIG job list --json`
-- **GPU lane:** you have kubectl at `~/.kube/coreweave-iris`, namespace `$IRIS_NAMESPACE`.
+- **GPU lane:** you have kubectl at `~/.kube/coreweave-iris`, namespace `$IRIS_NAMESPACE` (defaults to `iris-ci` — the canary shares this namespace with PR CI).
 Get pod status, controller logs, task pod logs, warning events, pod describe.
+**Filter by `iris.job_id=<CANARY_JOB_ID with '/' replaced by '.'>`** so you only see this canary's pods, not co-tenant CI pods. Example: `kubectl -n iris-ci get pods -l iris.job_id=runner.iris-run-job-abc123`.
 - **TPU lane:** use `iris process logs` and `iris job list`.
 - Re-run `scripts/canary/validate_canary_metrics.py` if you need the validation output.
 
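
The filter rule added in this hunk can be sketched as a small helper. This is only an illustration, not code from the repository: the function names are hypothetical, and the slash-separated input form `runner/iris-run-job-abc123` is inferred from the example in the hunk rather than confirmed by it.

```python
from typing import List


def job_id_to_label_selector(job_id: str) -> str:
    """Build the kubectl label selector for one Iris job's pods.

    Per the skill text, every '/' in the job ID is replaced by '.'.
    (Hypothetical helper; the name is not from the repository.)
    """
    return "iris.job_id=" + job_id.replace("/", ".")


def kubectl_get_pods_args(job_id: str, namespace: str = "iris-ci") -> List[str]:
    """Assemble the argv for listing only this job's pods in the shared namespace."""
    return [
        "kubectl", "-n", namespace,
        "get", "pods",
        "-l", job_id_to_label_selector(job_id),
    ]


if __name__ == "__main__":
    # Assumed input form: a slash-separated job ID.
    print(" ".join(kubectl_get_pods_args("runner/iris-run-job-abc123")))
```

Passing the argv as a list (for example to `subprocess.run`) avoids shell quoting issues if a job ID ever contains unusual characters.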

.agents/skills/dev-tpu-ray/SKILL.md

Lines changed: 0 additions & 212 deletions
This file was deleted.

.agents/skills/dev-tpu/SKILL.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ Use this skill when you want the standard fast TPU debugging loop without wiring
 
 `scripts/iris/dev_tpu.py` reserves a TPU-backed worker through Iris, waits for the worker VM to come up, and lets you SSH into it or run commands directly against it.
 
-This is the preferred dev TPU workflow. Unlike the legacy Ray path, it does not create a persistent `~/.ssh/config` alias. It uses `gcloud` SSH and SCP against the worker that Iris assigned to the holder job.
+It uses `gcloud` SSH and SCP against the worker that Iris assigned to the holder job. There is no persistent `~/.ssh/config` alias.
 
 ## Critical concurrency rule
 

.agents/skills/pull-request/SKILL.md

Lines changed: 5 additions & 0 deletions
@@ -85,6 +85,11 @@ A specification contains:
 
 ## Creating the PR
 
+Unless the user says otherwise, and when permissions allow, push directly to a
+branch on the main repository and open the PR from that branch. Do not default
+to pushing to a fork. Use a fork only when direct push to the main repository
+is not available or the user explicitly asks for it.
+
 Use `gh pr create` with these flags:
 
 ```bash

.github/workflows/iris-coreweave-ci.yaml

Lines changed: 8 additions & 6 deletions
@@ -15,10 +15,12 @@ permissions:
   pull-requests: read # needed for issue_comment to access PR metadata
   statuses: write # post commit status from issue_comment trigger
 
-# Single concurrency group — only one CW CI run at a time across all PRs.
-# The warm cluster is shared; concurrent runs would conflict.
+# Shared concurrency group with marin-canary-ferry-cw.yaml — both rebuild/roll
+# the shared iris-ci controller and submit against the shared H100 in
+# US-WEST-04A. Only one run cluster-wide at a time. cancel-in-progress=false
+# so a mid-flight canary is not killed by a PR firing.
 concurrency:
-  group: iris-coreweave-ci
+  group: iris-coreweave-ci-shared
   cancel-in-progress: false
 
 jobs:
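
The effect of the shared group with `cancel-in-progress: false` can be illustrated with a toy model. This is a simplified sketch of GitHub's documented concurrency behavior (the in-progress run is left alone, and at most one run stays pending per group), not code from this repository; the class and the run names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConcurrencyGroup:
    """Toy model of one Actions concurrency group with cancel-in-progress: false."""

    running: Optional[str] = None
    pending: Optional[str] = None
    cancelled: List[str] = field(default_factory=list)

    def submit(self, run: str) -> None:
        if self.running is None:
            self.running = run
            return
        # The in-progress run is never cancelled. Only an older *pending*
        # run is displaced by the newly queued one.
        if self.pending is not None:
            self.cancelled.append(self.pending)
        self.pending = run

    def finish(self) -> None:
        """The running job completes; the pending one (if any) starts."""
        self.running, self.pending = self.pending, None


group = ConcurrencyGroup()
group.submit("canary")   # starts immediately
group.submit("pr-101")   # waits behind the canary
group.submit("pr-102")   # replaces pr-101 in the pending slot
```

Under this model the mid-flight canary survives, as the new comment intends, but a queued PR run can still be superseded by a later one.
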
@@ -173,9 +175,9 @@ jobs:
           AWS_ENDPOINT_URL: https://74981a43be0de7712369306c7b19133d.r2.cloudflarestorage.com
           FSSPEC_S3: '{"endpoint_url": "https://74981a43be0de7712369306c7b19133d.r2.cloudflarestorage.com"}'
         run: |
-          IRIS_CONTROLLER_URL="http://localhost:${LOCAL_PORT}"
-          timeout 600 uv run tests/integration_test.py \
-            --controller-url "$IRIS_CONTROLLER_URL"
+          export IRIS_CONTROLLER_URL="http://localhost:${LOCAL_PORT}"
+          timeout 600 uv run pytest tests/test_integration_test.py \
+            -m integration -o "addopts=" --timeout=600 -v -s
 
       - name: Stop port-forward
         if: always()
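
The switch from a plain assignment to `export` matters because the URL is no longer passed as a command-line flag, so the pytest process presumably has to read it from its environment. The sketch below shows the shell behavior involved. It assumes a POSIX `sh` is available, and the port `10000` is a hypothetical stand-in for `${LOCAL_PORT}`.

```python
import os
import subprocess

# Child command: print the variable as seen by a *new* process, or "unset".
CHILD = "sh -c 'printf %s \"${IRIS_CONTROLLER_URL:-unset}\"'"


def child_view(assignment: str) -> str:
    """Run `assignment` in a shell, then report what a child process sees."""
    script = f"{assignment}; {CHILD}"
    result = subprocess.run(
        ["sh", "-c", script],
        capture_output=True,
        text=True,
        check=True,
        # Clean environment so an inherited IRIS_CONTROLLER_URL cannot leak in.
        env={"PATH": os.environ.get("PATH", "/usr/bin:/bin")},
    )
    return result.stdout


URL = "http://localhost:10000"  # hypothetical port
plain = child_view(f'IRIS_CONTROLLER_URL="{URL}"')
exported = child_view(f'export IRIS_CONTROLLER_URL="{URL}"')
```

With the plain assignment the child sees no value. With `export` it sees the URL, which is what a test reading `os.environ` would need.
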
Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
+name: Iris - IAP Proxy
+on:
+  push:
+    branches:
+      - main
+    paths:
+      - 'infra/iris-iap-proxy/**'
+      - '.github/workflows/iris-iap-proxy.yaml'
+  pull_request:
+    paths:
+      - 'infra/iris-iap-proxy/**'
+      - '.github/workflows/iris-iap-proxy.yaml'
+
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}
+  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    timeout-minutes: 15
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Build container image
+        uses: docker/build-push-action@v6
+        with:
+          context: infra/iris-iap-proxy
+          push: false
+          cache-from: type=gha
+          cache-to: type=gha,mode=max
+
+  deploy:
+    needs: build
+    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
+    runs-on: ubuntu-latest
+    timeout-minutes: 20
+    strategy:
+      fail-fast: false
+      matrix:
+        cluster: [marin, marin-dev]
+    concurrency:
+      group: iris-iap-proxy-deploy-${{ matrix.cluster }}
+      cancel-in-progress: false
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Authenticate to Google Cloud
+        uses: google-github-actions/auth@v2
+        with:
+          credentials_json: ${{ secrets.MARIN_CD_CLOUD_RUN_SA_KEY }}
+
+      - name: Set up Google Cloud SDK
+        uses: google-github-actions/setup-gcloud@v2
+        with:
+          project_id: ${{ secrets.GCP_PROJECT_ID }}
+          install_components: beta
+
+      - name: Deploy to Cloud Run
+        run: ./infra/iris-iap-proxy/deploy.sh ${{ matrix.cluster }}
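
How the workflow-level expressions above evaluate for the two triggers can be sketched in a few lines. The functions are hypothetical mirrors of the YAML expressions, and the PR number `42` is made up for illustration. The `||` fallback is modeled on the Actions expression semantics, where a missing `pull_request.number` on a push event is falsy.

```python
from typing import Optional


def concurrency_group(workflow: str, pr_number: Optional[int], sha: str) -> str:
    """Mirror `${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}`."""
    # On push events there is no pull_request payload, so the SHA is used.
    key = pr_number if pr_number else sha
    return f"{workflow}-{key}"


def cancel_in_progress(event_name: str) -> bool:
    """Mirror `${{ github.event_name == 'pull_request' }}`."""
    return event_name == "pull_request"


def should_deploy(event_name: str, ref: str) -> bool:
    """Mirror the deploy job's `if:` condition."""
    return event_name == "push" and ref == "refs/heads/main"


pr_group = concurrency_group("Iris - IAP Proxy", 42, "abc4c9e")
push_group = concurrency_group("Iris - IAP Proxy", None, "abc4c9e")
```

Under these assumptions, successive pushes to one PR share a group and cancel each other, while each push to `main` gets its own SHA-keyed group and is never cancelled at the workflow level. The per-cluster `deploy` group then serializes deployments.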
