Commit fa9c454

yonromai and claude committed
Iris/CW: namespace-qualify cluster-scoped RBAC to support isolated lifecycles
ClusterRole and ClusterRoleBinding names were hardcoded to "iris-controller", causing collisions when multiple Iris instances shared a CKS cluster. Key on the namespace instead (e.g. "iris-controller-iris", "iris-controller-iris-canary") so teardown of one cluster doesn't break another.

Adds a dedicated coreweave-canary.yaml config with namespace/label_prefix "iris-canary" and points the canary workflow at it, so its nightly teardown no longer interferes with persistent workloads in the default "iris" namespace.

Closes #3698

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 6e702d8 commit fa9c454
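
The naming scheme at the heart of the change can be sketched standalone. This is an illustrative copy of the logic the commit adds as `_rbac_cluster_role_name()` in `coreweave.py` (the free-function form here is an assumption for the sketch; the shipped version is a method):

```python
def rbac_cluster_role_name(namespace: str) -> str:
    """Namespace-qualified name for the ClusterRole and ClusterRoleBinding.

    ClusterRoles are cluster-scoped, so a fixed name ("iris-controller") is
    shared by every Iris instance on the CKS cluster; keying the name on the
    namespace gives each instance its own pair of RBAC objects.
    """
    return f"iris-controller-{namespace}"


# Two instances on the same cluster no longer collide:
print(rbac_cluster_role_name("iris"))         # iris-controller-iris
print(rbac_cluster_role_name("iris-canary"))  # iris-controller-iris-canary
```

Because the name is a pure function of the namespace, deleting one instance's RBAC at teardown can never touch another instance's objects.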

6 files changed: +169 −19

.github/workflows/marin-canary-ferry-cw.yaml (10 additions, 6 deletions)

@@ -23,7 +23,7 @@ jobs:
     runs-on: ubuntu-latest
     timeout-minutes: 180
     concurrency:
-      group: canary-ferry-cw
+      group: canary-ferry-cw-iris-canary
       cancel-in-progress: true
     env:
       RUN_ID: canary-gpu-${{ github.run_id }}-${{ github.run_attempt }}
@@ -35,7 +35,11 @@ jobs:
       CANARY_MAX_WALL_CLOCK: "7200"
       WANDB_ENTITY: marin-community
       WANDB_PROJECT: marin
-      IRIS_CONFIG: lib/iris/examples/coreweave.yaml
+      IRIS_CONFIG: lib/iris/examples/coreweave-canary.yaml
+      # Must match the label_prefix and namespace in IRIS_CONFIG so teardown
+      # targets only this cluster's resources.
+      IRIS_LABEL_PREFIX: iris-canary
+      IRIS_NAMESPACE: iris-canary

     steps:
       - name: Checkout code
@@ -156,11 +160,11 @@ jobs:
       - name: Capture failure diagnostics
         if: failure()
         run: |
-          kubectl --kubeconfig ~/.kube/coreweave-iris -n iris \
+          kubectl --kubeconfig ~/.kube/coreweave-iris -n ${{ env.IRIS_NAMESPACE }} \
            logs -l app=iris-controller --tail=500 || true
-          kubectl --kubeconfig ~/.kube/coreweave-iris -n iris \
+          kubectl --kubeconfig ~/.kube/coreweave-iris -n ${{ env.IRIS_NAMESPACE }} \
            describe pod -l app=iris-controller || true
-          kubectl --kubeconfig ~/.kube/coreweave-iris -n iris \
+          kubectl --kubeconfig ~/.kube/coreweave-iris -n ${{ env.IRIS_NAMESPACE }} \
            get events --sort-by='.lastTimestamp' --field-selector type!=Normal || true

       # `cluster stop` only deletes Pods; NodePools survive and rely on the
@@ -172,7 +176,7 @@ jobs:
           .venv/bin/iris -v --config=${{ env.IRIS_CONFIG }} cluster stop || true
           if [ "${{ inputs.keep_nodepool }}" != "true" ]; then
             kubectl --kubeconfig ~/.kube/coreweave-iris \
-              delete nodepool -l iris-iris-managed=true
+              delete nodepool -l iris-${{ env.IRIS_LABEL_PREFIX }}-managed=true
           else
             echo "Keeping node pool alive (keep_nodepool=true)"
           fi
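
The NodePool teardown step selects pools by label, and the selector is derived from the label prefix. A minimal sketch of that derivation (`managed_nodepool_selector` is a hypothetical helper name; the workflow inlines the expression directly in the `kubectl delete` command):

```python
def managed_nodepool_selector(label_prefix: str) -> str:
    """Build the label selector passed to `kubectl delete nodepool -l ...`.

    Before this change the selector was hardcoded for the default prefix
    "iris", which yields "iris-iris-managed=true"; with a per-cluster prefix,
    each cluster's teardown matches only its own NodePools.
    """
    return f"iris-{label_prefix}-managed=true"


print(managed_nodepool_selector("iris"))         # iris-iris-managed=true
print(managed_nodepool_selector("iris-canary"))  # iris-iris-canary-managed=true
```

Since the default and canary prefixes produce disjoint selectors, the canary's nightly NodePool deletion can no longer match the persistent cluster's pools.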

lib/iris/docs/coreweave.md (3 additions, 3 deletions)

@@ -173,8 +173,8 @@ in `CoreweavePlatform`):
 |----------|---------|
 | `iris` Namespace | Isolation for all Iris resources |
 | `iris-controller` ServiceAccount | In-cluster K8s API auth for controller and worker Pods |
-| `iris-controller` ClusterRole | API permissions (see below) |
-| `iris-controller` ClusterRoleBinding | Binds ServiceAccount to ClusterRole |
+| `iris-controller-{namespace}` ClusterRole | API permissions (see below). Namespace-qualified to support multiple Iris instances on the same CKS cluster. |
+| `iris-controller-{namespace}` ClusterRoleBinding | Binds ServiceAccount to ClusterRole. Namespace-qualified to avoid collisions. |

 **ClusterRole permissions**:

@@ -364,7 +364,7 @@ The platform detects fatal errors before the full timeout expires:
 `CoreweavePlatform.start_controller()` orchestrates the full startup sequence.
 See `lib/iris/src/iris/cluster/platform/coreweave.py`.

-1. Apply RBAC prerequisites (Namespace, ServiceAccount, ClusterRole, ClusterRoleBinding)
+1. Apply RBAC prerequisites (Namespace, ServiceAccount, ClusterRole `iris-controller-{ns}`, ClusterRoleBinding `iris-controller-{ns}`)
 2. Create S3 credentials Secret (if S3 storage configured)
 3. Apply ConfigMap with cluster config
 4. Create/reconcile all shared NodePools in parallel via `ensure_nodepools()`
lib/iris/examples/coreweave-canary.yaml (new file, 83 additions)

@@ -0,0 +1,83 @@
+# Iris configuration for the CoreWeave GPU canary ferry (CI).
+#
+# Minimal config: only the scale groups the canary actually uses (cpu-erapids
+# for the controller, h100-8x for training). Uses a dedicated namespace so
+# CI teardown doesn't interfere with persistent clusters in "iris".
+
+platform:
+  label_prefix: iris-canary
+  coreweave:
+    region: US-WEST-04A
+    namespace: iris-canary
+    kubeconfig_path: ~/.kube/coreweave-iris
+    object_storage_endpoint: https://74981a43be0de7712369306c7b19133d.r2.cloudflarestorage.com
+
+storage:
+  remote_state_dir: s3://marin-na/iris/state/canary
+
+controller:
+  image: ghcr.io/marin-community/iris-controller:latest
+  coreweave:
+    port: 10000
+    service_name: iris-controller-svc
+    scale_group: cpu-erapids
+
+defaults:
+  autoscaler:
+    evaluation_interval:
+      milliseconds: 10000
+    scale_up_delay:
+      milliseconds: 60000
+    scale_down_delay:
+      milliseconds: 300000
+    startup_grace_period:
+      milliseconds: 2400000  # 40 min — covers autoscaler node provisioning + Pod startup
+  worker:
+    docker_image: ghcr.io/marin-community/iris-worker:latest
+    port: 10001
+    cache_dir: /mnt/local/iris-cache
+    runtime: kubernetes
+    default_task_image: ghcr.io/marin-community/iris-task:latest
+
+scale_groups:
+  cpu-erapids:
+    num_vms: 1
+    resources:
+      cpu: 64
+      ram: 256GB
+      disk: 1TB
+      device_type: cpu
+    worker:
+      attributes:
+        region: US-WEST-04A
+        pool: cpu-erapids
+    min_slices: 0
+    max_slices: 1
+    priority: 50
+    slice_template:
+      num_vms: 1
+      coreweave:
+        region: US-WEST-04A
+        instance_type: cd-gp-i64-erapids
+
+  h100-8x:
+    num_vms: 1
+    resources:
+      cpu: 128
+      ram: 2048GB
+      disk: 1TB
+      device_type: gpu
+      device_variant: H100
+      device_count: 8
+    worker:
+      attributes:
+        region: US-WEST-04A
+        pool: h100-8x
+    min_slices: 0
+    max_slices: 1
+    priority: 100
+    slice_template:
+      num_vms: 1
+      coreweave:
+        region: US-WEST-04A
+        instance_type: gd-8xh100ib-i128
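
The workflow's comment says its `IRIS_LABEL_PREFIX`/`IRIS_NAMESPACE` env vars "must match" this config. Neither the workflow nor Iris ships a guard for that invariant; a hypothetical drift check could look like this (function name and dict shapes are assumptions for the sketch):

```python
def check_canary_env_matches_config(config: dict, env: dict) -> None:
    """Fail fast if the CI env vars drift from the Iris config.

    Teardown keys on these values, so a mismatch would make the canary
    delete (or fail to delete) the wrong cluster's resources.
    """
    platform = config["platform"]
    if env["IRIS_LABEL_PREFIX"] != platform["label_prefix"]:
        raise ValueError("IRIS_LABEL_PREFIX does not match platform.label_prefix")
    if env["IRIS_NAMESPACE"] != platform["coreweave"]["namespace"]:
        raise ValueError("IRIS_NAMESPACE does not match platform.coreweave.namespace")


# Mirrors the canary config above; passes silently when everything matches.
check_canary_env_matches_config(
    {"platform": {"label_prefix": "iris-canary", "coreweave": {"namespace": "iris-canary"}}},
    {"IRIS_LABEL_PREFIX": "iris-canary", "IRIS_NAMESPACE": "iris-canary"},
)
```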

lib/iris/examples/coreweave.yaml (3 additions, 3 deletions)

@@ -79,7 +79,7 @@ scale_groups:
     pool: cpu-erapids
     min_slices: 0
     max_slices: 1  # Cost-safe default; increase for production workloads
-    priority: 100
+    priority: 50
     slice_template:
       num_vms: 1
       coreweave:
@@ -102,7 +102,7 @@ scale_groups:
     pool: h100-8x
     min_slices: 0
     max_slices: 1  # Cost-safe default; increase for production workloads
-    priority: 50
+    priority: 100
     slice_template:
       num_vms: 1
       coreweave:
@@ -125,7 +125,7 @@ scale_groups:
     pool: h100-16x
     min_slices: 0
     max_slices: 1
-    priority: 50
+    priority: 100
     slice_template:
       num_vms: 2
       coreweave:

lib/iris/src/iris/cluster/platform/coreweave.py (21 additions, 6 deletions)

@@ -361,13 +361,23 @@ def __init__(

     # -- RBAC / Namespace Prerequisites ----------------------------------------

+    def _rbac_cluster_role_name(self) -> str:
+        """Namespace-qualified ClusterRole name to avoid collisions across Iris instances."""
+        return f"iris-controller-{self._namespace}"
+
     def ensure_rbac(self) -> None:
         """Create the namespace, ServiceAccount, ClusterRole, and ClusterRoleBinding.

         Idempotent (kubectl apply). These were previously manual operator
         prerequisites; now they're auto-applied at cluster start so a single
         ``iris cluster start`` is sufficient.
+
+        ClusterRole and ClusterRoleBinding names are qualified with the namespace
+        (e.g. ``iris-controller-iris``) so multiple Iris instances on the same
+        CKS cluster don't collide on these cluster-scoped resources.
         """
+        cluster_role_name = self._rbac_cluster_role_name()
+
         namespace_manifest = {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": self._namespace}}

         sa_manifest = {
@@ -379,7 +389,7 @@ def ensure_rbac(self) -> None:
         role_manifest = {
             "apiVersion": "rbac.authorization.k8s.io/v1",
             "kind": "ClusterRole",
-            "metadata": {"name": "iris-controller"},
+            "metadata": {"name": cluster_role_name},
             "rules": [
                 {
                     "apiGroups": ["compute.coreweave.com"],
@@ -412,21 +422,21 @@ def ensure_rbac(self) -> None:
         binding_manifest = {
             "apiVersion": "rbac.authorization.k8s.io/v1",
             "kind": "ClusterRoleBinding",
-            "metadata": {"name": "iris-controller"},
+            "metadata": {"name": cluster_role_name},
             "subjects": [
                 {"kind": "ServiceAccount", "name": "iris-controller", "namespace": self._namespace},
             ],
             "roleRef": {
                 "kind": "ClusterRole",
-                "name": "iris-controller",
+                "name": cluster_role_name,
                 "apiGroup": "rbac.authorization.k8s.io",
             },
         }

         for manifest in [namespace_manifest, sa_manifest, role_manifest, binding_manifest]:
             self._kubectl.apply_json(manifest)

-        logger.info("RBAC prerequisites applied (namespace=%s)", self._namespace)
+        logger.info("RBAC prerequisites applied (namespace=%s, clusterRole=%s)", self._namespace, cluster_role_name)

     # -- Storage Detection ----------------------------------------------------

@@ -1151,7 +1161,7 @@ def restart_controller(self, config: config_pb2.IrisClusterConfig) -> str:
         return self.start_controller(config)

     def stop_controller(self, config: config_pb2.IrisClusterConfig) -> None:
-        """Stop the controller by deleting its K8s resources."""
+        """Stop the controller and clean up its RBAC resources."""
         cw = config.controller.coreweave
         service_name = cw.service_name or "iris-controller-svc"

@@ -1160,7 +1170,12 @@ def stop_controller(self, config: config_pb2.IrisClusterConfig) -> None:
         self._kubectl.delete("configmap", "iris-cluster-config")
         if self._uses_s3_storage(config):
             self._kubectl.delete("secret", _S3_SECRET_NAME)
-        logger.info("Controller resources deleted")
+
+        # Clean up cluster-scoped RBAC resources created by ensure_rbac().
+        cluster_role_name = self._rbac_cluster_role_name()
+        self._kubectl.delete("clusterrolebinding", cluster_role_name, cluster_scoped=True)
+        self._kubectl.delete("clusterrole", cluster_role_name, cluster_scoped=True)
+        logger.info("Controller resources deleted (including RBAC %s)", cluster_role_name)

     def stop_all(
         self,
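
The shape of the binding that `ensure_rbac()` applies can be illustrated with a standalone sketch. The field layout mirrors the manifest in the diff above; this free function is illustrative, not the shipped code:

```python
def cluster_role_binding_manifest(namespace: str) -> dict:
    """Build a ClusterRoleBinding that grants the per-namespace ServiceAccount
    the matching namespace-qualified ClusterRole."""
    name = f"iris-controller-{namespace}"
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        # Cluster-scoped object: the name itself must carry the namespace.
        "metadata": {"name": name},
        # The ServiceAccount keeps its fixed name; it is a namespaced
        # resource, so it never collided in the first place.
        "subjects": [
            {"kind": "ServiceAccount", "name": "iris-controller", "namespace": namespace},
        ],
        "roleRef": {
            "kind": "ClusterRole",
            "name": name,
            "apiGroup": "rbac.authorization.k8s.io",
        },
    }


binding = cluster_role_binding_manifest("iris-canary")
print(binding["metadata"]["name"])  # iris-controller-iris-canary
```

The asymmetry is the point of the commit: only the two cluster-scoped objects (ClusterRole, ClusterRoleBinding) need namespace-qualified names, while the Namespace and ServiceAccount are already isolated by the namespace itself.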

lib/iris/tests/cluster/platform/test_coreweave_platform.py (49 additions, 1 deletion)

@@ -92,6 +92,8 @@ def __init__(self):
         self._services: dict[str, dict] = {}
         self._configmaps: dict[str, dict] = {}
         self._secrets: dict[str, dict] = {}
+        self._cluster_roles: dict[str, dict] = {}
+        self._cluster_role_bindings: dict[str, dict] = {}
         self._failures: dict[str, str] = {}
         self._pod_logs: dict[str, str] = {}
         self._events: list[dict] = []
@@ -191,6 +193,10 @@ def __call__(self, cmd: list[str], **kwargs) -> subprocess.CompletedProcess:
             return self._handle_delete_generic(clean_args, "configmap", self._configmaps)
         if "delete" in clean_args and "secret" in clean_args:
             return self._handle_delete_generic(clean_args, "secret", self._secrets)
+        if "delete" in clean_args and "clusterrolebinding" in clean_args:
+            return self._handle_delete_generic(clean_args, "clusterrolebinding", self._cluster_role_bindings)
+        if "delete" in clean_args and "clusterrole" in clean_args:
+            return self._handle_delete_generic(clean_args, "clusterrole", self._cluster_roles)
         if "set" in clean_args and "image" in clean_args:
             return self._handle_set_image(clean_args)
         if "rollout" in clean_args and "restart" in clean_args:
@@ -264,6 +270,19 @@ def _handle_apply(self, input_data: str, namespace: str) -> subprocess.Completed
                 "data": data.get("data", {}),
             }
             return _completed()
+        elif kind == "ClusterRole":
+            self._cluster_roles[name] = {
+                "metadata": data.get("metadata", {}),
+                "rules": data.get("rules", []),
+            }
+            return _completed()
+        elif kind == "ClusterRoleBinding":
+            self._cluster_role_bindings[name] = {
+                "metadata": data.get("metadata", {}),
+                "subjects": data.get("subjects", []),
+                "roleRef": data.get("roleRef", {}),
+            }
+            return _completed()

         return _completed()

@@ -1150,7 +1169,7 @@ def test_start_controller_reconciles_when_already_available(fake_kubectl: FakeKu


 def test_stop_controller_deletes_resources_except_nodepool(fake_kubectl: FakeKubectl):
-    """stop_controller deletes Deployment, Service, ConfigMap, and S3 secret but not NodePool."""
+    """stop_controller deletes Deployment, Service, ConfigMap, S3 secret, and RBAC but not NodePool."""
     platform = _make_platform()
     cluster_config = _make_cluster_config(remote_state_dir="s3://test-bucket/bundles")

@@ -1190,6 +1209,35 @@ def test_stop_controller_idempotent(fake_kubectl: FakeKubectl):
     platform.shutdown()


+def test_rbac_isolation_across_namespaces(fake_kubectl: FakeKubectl):
+    """Two Iris instances with different namespaces get isolated RBAC; teardown of one doesn't affect the other."""
+    platform_a = _make_platform(namespace="alpha")
+    platform_b = _make_platform(namespace="beta")
+
+    platform_a.ensure_rbac()
+    platform_b.ensure_rbac()
+
+    # Each gets a namespace-qualified ClusterRole and ClusterRoleBinding
+    assert "iris-controller-alpha" in fake_kubectl._cluster_roles
+    assert "iris-controller-beta" in fake_kubectl._cluster_roles
+
+    # Binding references the correct ClusterRole and namespace
+    binding_a = fake_kubectl._cluster_role_bindings["iris-controller-alpha"]
+    assert binding_a["roleRef"]["name"] == "iris-controller-alpha"
+    assert binding_a["subjects"][0]["namespace"] == "alpha"
+
+    # Stopping alpha cleans up its RBAC without affecting beta
+    platform_a.stop_controller(_make_cluster_config())
+
+    assert "iris-controller-alpha" not in fake_kubectl._cluster_roles
+    assert "iris-controller-alpha" not in fake_kubectl._cluster_role_bindings
+    assert "iris-controller-beta" in fake_kubectl._cluster_roles
+    assert "iris-controller-beta" in fake_kubectl._cluster_role_bindings
+
+    platform_a.shutdown()
+    platform_b.shutdown()
+
+
 def test_tunnel_parses_address():
     """tunnel() extracts service name and port from the address string."""
     config = config_pb2.CoreweavePlatformConfig(region="LGA1", namespace="iris")
