Iris/CW: namespace-qualify RBAC and isolate canary lifecycle#3703
Conversation
@rjpower I have a hunch you'll have opinions about this one 😇
pool: cpu-erapids
min_slices: 0
max_slices: 1
priority: 100
nit: priorities are "backwards" in Iris (you and Rav can feel free to fix). groups are sorted in ascending order, so this means that we'll use GPU in preference to CPU.
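The ordering the nit describes can be sketched in a few lines (hypothetical group data only; this is not Iris's actual scheduler code). Groups are sorted ascending by `priority`, so the group with the lower number is tried first:

```python
# Hypothetical slice-group configs; "priority" semantics per the nit above:
# Iris sorts groups in ASCENDING priority order, so the LOWER number wins.
groups = [
    {"pool": "cpu-erapids", "priority": 100},
    {"pool": "gpu", "priority": 10},
]

ordered = sorted(groups, key=lambda g: g["priority"])
print([g["pool"] for g in ordered])  # GPU pool sorts first, so GPU is preferred
```

With the config above (`priority: 100` on the CPU pool), the CPU group sorts last, which is why GPU would be used in preference to CPU.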
rjpower left a comment
Seems reasonable to me!? If it's possible to use the label to infer the namespace, I think I'd prefer that, but that would mean we couldn't run multiple controllers in the same namespace (would we ever want to do that?)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 08224774d2
…fecycles

ClusterRole and ClusterRoleBinding names were hardcoded to "iris-controller", causing collisions when multiple Iris instances shared a CKS cluster. Key on namespace instead (e.g. "iris-controller-iris", "iris-controller-iris-canary") so teardown of one cluster doesn't break another.

Adds a dedicated coreweave-canary.yaml config with namespace/label_prefix "iris-canary" and points the canary workflow at it, so its nightly teardown no longer interferes with persistent workloads in the default "iris" namespace.

Closes #3698

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Fixes #3698 — multiple Iris instances on the same CKS cluster no longer interfere with each other's lifecycle.
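A toy before/after model of the naming change (illustrative Python only; the real logic lives in `coreweave.py`'s `ensure_rbac()`):

```python
namespaces = ["iris", "iris-canary"]

# Before: one hardcoded, cluster-scoped name shared by every Iris instance,
# so tearing down either namespace deletes the RBAC both depend on.
old_names = {ns: "iris-controller" for ns in namespaces}
assert len(set(old_names.values())) == 1  # collision: one shared object

# After: key the ClusterRole/ClusterRoleBinding name on the namespace,
# giving each Iris instance its own disjoint cluster-scoped objects.
new_names = {ns: f"iris-controller-{ns}" for ns in namespaces}
assert new_names == {
    "iris": "iris-controller-iris",
    "iris-canary": "iris-controller-iris-canary",
}
```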
Problem: ClusterRole/ClusterRoleBinding were hardcoded to `iris-controller` (cluster-scoped), so tearing down one Iris namespace destroyed RBAC for all others. The canary workflow also ran in the default `iris` namespace, so its nightly teardown would kill any persistent workloads there.

Fix:
- RBAC names are now keyed on namespace (`iris-controller-{ns}`)
- the canary runs in its own namespace (`iris-canary`) via a dedicated config, freeing `iris` for persistent use

What changed
- `coreweave.py` — `ensure_rbac()` creates `iris-controller-{namespace}` ClusterRole/Binding; `stop_controller()` cleans them up
- `coreweave-canary.yaml` (new) — canary-specific config with `namespace: iris-canary`, `label_prefix: iris-canary`, separate `remote_state_dir`
- `marin-canary-ferry-cw.yaml` — points at `coreweave-canary.yaml`; concurrency group, teardown label, and diagnostics namespace all scoped to `iris-canary`
- `coreweave.md` — RBAC table updated

Why key on namespace, not label_prefix?
RBAC binds a ServiceAccount (namespaced) to a ClusterRole. Keying on namespace keeps the naming aligned with K8s's own isolation boundary. Two configs with the same namespace but different `label_prefix` should share one ClusterRole, not create two that grant the same SA identical permissions.

What already worked (no changes needed)
- `label_prefix` for names and labels

Test plan
- `test_coreweave_platform.py` suite passes (47 tests)
- new `test_rbac_isolation_across_namespaces` — two platforms create/teardown non-overlapping RBAC
- `./infra/pre-commit.py --all-files --fix` passes

Manual test details
Started default cluster (`namespace: iris`) from laptop, then triggered the canary workflow on this branch with `target_tokens=65536`.

During the canary run, both ClusterRoles coexisted:
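(The actual listing isn't captured in this page; given the naming scheme described in this PR, the two coexisting ClusterRoles were:)

```
iris-controller-iris          # default cluster (namespace: iris)
iris-controller-iris-canary   # canary cluster (namespace: iris-canary)
```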
After canary teardown:
- `iris-controller-iris` ClusterRole: still present (as is the `iris` namespace)
- `iris-controller-iris-canary` ClusterRole: gone
- canary resources (`iris-canary-*`): cleaned up
- default cluster resources (`iris-*`): untouched

The canary ferry itself failed at the training step (unrelated — likely a 65k-token edge case), but the `Start CoreWeave cluster`, `Submit canary ferry`, and `Tear down CoreWeave cluster` steps all succeeded, which is what matters for the isolation test.

Note: a stale `iris-controller` ClusterRole from a pre-PR run (Feb 19) was found on the cluster — it confirms the problem this PR fixes.

🤖 Generated with Claude Code