
Iris/CW: namespace-qualify RBAC and isolate canary lifecycle#3703

Merged

yonromai merged 1 commit into main from iris-cw-namespace-isolation on Mar 15, 2026

Conversation

@yonromai (Contributor) commented Mar 15, 2026

Summary

Fixes #3698 — multiple Iris instances on the same CKS cluster no longer interfere with each other's lifecycle.

Problem: ClusterRole/ClusterRoleBinding were hardcoded to iris-controller (cluster-scoped), so tearing down one Iris namespace destroyed RBAC for all others. The canary workflow also ran in the default iris namespace, so its nightly teardown would kill any persistent workloads there.

Fix:

  1. Qualify cluster-scoped RBAC names with namespace (iris-controller-{ns})
  2. Move the canary to its own namespace (iris-canary) via a dedicated config, freeing iris for persistent use

What changed

  • coreweave.py — ensure_rbac() creates the iris-controller-{namespace} ClusterRole/ClusterRoleBinding; stop_controller() cleans them up
  • coreweave-canary.yaml (new) — canary-specific config with namespace: iris-canary, label_prefix: iris-canary, separate remote_state_dir
  • marin-canary-ferry-cw.yaml — points at coreweave-canary.yaml; concurrency group, teardown label, and diagnostics namespace all scoped to iris-canary
  • coreweave.md — RBAC table updated
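A sketch of what the canary config adds (only namespace, label_prefix, and a separate remote_state_dir are named in this PR; the surrounding structure and placeholder values are assumed):

```yaml
# Hypothetical fragment of lib/iris/examples/coreweave-canary.yaml.
# Only these three keys are described in the PR; the value in
# <angle brackets> is a placeholder, not a real path.
namespace: iris-canary        # canary lifecycle runs here, not in "iris"
label_prefix: iris-canary     # scopes NodePool names/labels to the canary
remote_state_dir: <canary-specific state path>  # kept separate from the default cluster's state
```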
Why key on namespace, not label_prefix?

RBAC binds a ServiceAccount (namespaced) to a ClusterRole. Keying on namespace keeps the naming aligned with K8s's own isolation boundary. Two configs with the same namespace but different label_prefix should share one ClusterRole, not create two that grant the same SA identical permissions.
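To make the rule concrete, here is a minimal sketch of namespace-keyed naming (the helper name `rbac_name` is hypothetical; only the `iris-controller-{namespace}` pattern comes from this PR):

```python
def rbac_name(namespace: str) -> str:
    """ClusterRole/ClusterRoleBinding name, keyed on namespace.

    Hypothetical helper illustrating the PR's naming scheme: two configs
    that share a namespace (even with different label_prefix values) map
    to the same ClusterRole name, while distinct namespaces never collide.
    """
    return f"iris-controller-{namespace}"

# Same namespace, any label_prefix -> one shared ClusterRole name.
assert rbac_name("iris") == "iris-controller-iris"
# Distinct namespaces -> distinct, non-colliding names.
assert rbac_name("iris") != rbac_name("iris-canary")
```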

What already worked (no changes needed)
  • NodePool isolation — already uses label_prefix for names and labels
  • Namespaced resources (SA, Deployment, Service, ConfigMap, Secret) — already isolated by namespace
  • GCP platform — doesn't use K8s RBAC

Test plan

  • Existing test_coreweave_platform.py suite passes (47 tests)
  • New test_rbac_isolation_across_namespaces — two platforms create/teardown non-overlapping RBAC
  • ./infra/pre-commit.py --all-files --fix passes
  • Manual on CW — verified end-to-end (run 23119639900)
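The shape of the new isolation test can be sketched with an in-memory stand-in for the cluster (names and structure here are illustrative; the real suite lives in test_coreweave_platform.py):

```python
# Toy model of the property test_rbac_isolation_across_namespaces checks:
# tearing down one namespace's RBAC must not touch another's.
cluster_roles: set[str] = set()

def ensure_rbac(namespace: str) -> str:
    """Stand-in for the real ensure_rbac(): registers a namespace-keyed role."""
    name = f"iris-controller-{namespace}"
    cluster_roles.add(name)
    return name

def stop_controller(namespace: str) -> None:
    """Stand-in teardown: removes only this namespace's RBAC objects."""
    cluster_roles.discard(f"iris-controller-{namespace}")

ensure_rbac("iris")
ensure_rbac("iris-canary")
stop_controller("iris-canary")  # nightly canary teardown...
assert "iris-controller-iris" in cluster_roles          # ...leaves the default cluster intact
assert "iris-controller-iris-canary" not in cluster_roles
```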
Manual test details

Started default cluster (namespace: iris) from laptop, then triggered the canary workflow on this branch with target_tokens=65536.

During the canary run, both ClusterRoles coexisted:

```
iris-controller-iris          2026-03-15T20:52:22Z
iris-controller-iris-canary   2026-03-15T21:29:19Z
```

After canary teardown:

| Check | Result |
| --- | --- |
| iris-controller-iris ClusterRole | Still exists |
| Default controller pod (iris namespace) | 1/1 Running |
| iris-controller-iris-canary ClusterRole | Gone (cleaned up) |
| Canary NodePools (iris-canary-*) | Gone (cleaned up) |
| Default NodePools (iris-*) | Unaffected |

The canary ferry itself failed at the training step (unrelated — likely 65k token edge case), but Start CoreWeave cluster, Submit canary ferry, and Tear down CoreWeave cluster all succeeded, which is what matters for the isolation test.

Note: a stale iris-controller ClusterRole from a pre-PR run (Feb 19) was found on the cluster — confirms the problem this PR fixes.

🤖 Generated with Claude Code

@yonromai yonromai added the agent-generated Created by automation/agent label Mar 15, 2026
@yonromai yonromai marked this pull request as ready for review March 15, 2026 22:27
@yonromai yonromai requested a review from rjpower March 15, 2026 22:27
@yonromai
Contributor Author

@rjpower I have a hunch you'll have opinions about this one 😇

Comment thread on lib/iris/examples/coreweave-canary.yaml (outdated):

```yaml
pool: cpu-erapids
min_slices: 0
max_slices: 1
priority: 100
```
Collaborator:

nit: priorities are "backwards" in Iris (you and Rav can feel free to fix). groups are sorted in ascending order, so this means that we'll use GPU in preference to CPU.


@rjpower rjpower left a comment


Seems reasonable to me!? If it's possible to use the label to infer the namespace, I think I'd prefer that, but that would mean we couldn't run multiple controllers in the same namespace (would we ever want to do that?)

@yonromai yonromai force-pushed the iris-cw-namespace-isolation branch 3 times, most recently from 73fc1d2 to 0822477 Compare March 15, 2026 23:04

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 08224774d2


Comment thread on lib/iris/src/iris/cluster/platform/coreweave.py (outdated)
@yonromai yonromai force-pushed the iris-cw-namespace-isolation branch from 0822477 to 6af1b20 Compare March 15, 2026 23:13
Commit message:

…fecycles

ClusterRole and ClusterRoleBinding names were hardcoded to "iris-controller",
causing collisions when multiple Iris instances shared a CKS cluster. Key on
namespace instead (e.g. "iris-controller-iris", "iris-controller-iris-canary")
so teardown of one cluster doesn't break another.

Adds a dedicated coreweave-canary.yaml config with namespace/label_prefix
"iris-canary" and points the canary workflow at it, so its nightly teardown
no longer interferes with persistent workloads in the default "iris" namespace.

Closes #3698

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yonromai yonromai force-pushed the iris-cw-namespace-isolation branch from 6af1b20 to fa9c454 Compare March 15, 2026 23:28
@yonromai yonromai enabled auto-merge (squash) March 15, 2026 23:34
@yonromai yonromai merged commit 73970cc into main Mar 15, 2026
22 checks passed
@yonromai yonromai deleted the iris-cw-namespace-isolation branch March 15, 2026 23:38

Linked issue (may be closed by this PR):
Iris/CW: support isolated cluster lifecycles (namespace or equivalent)
