[CASCL-1305] No-op kubectl-datadog install on a foreign Karpenter #2949
Conversation
Add a pre-flight guard to `kubectl datadog autoscaling cluster install` that recognises when Karpenter is already running on the cluster — be it a Helm install in another namespace, a raw `kubectl apply`, or anything in between — and short-circuits with a friendly success message and a link to the Datadog Autoscaling settings page, mirroring the existing EKS auto-mode handling.

Detection scans cluster-scoped ClusterRoles labelled `app.kubernetes.io/name=karpenter` for the absence of our `autoscaling.datadoghq.com/installed-by=kubectl-datadog` sentinel; CRDs are ignored because Helm leaves them behind on `helm uninstall`, and a re-install must still succeed in that case.

Switch the chart's `additionalLabels` from overriding the standard `app.kubernetes.io/managed-by` and `app.kubernetes.io/version` keys to Datadog-namespaced keys: the chart's `_helpers.tpl` emits the standard keys before `additionalLabels`, producing duplicate YAML keys whose deduplication at the API server is non-deterministic (verified empirically — `managed-by` resolves to the chart's `Helm` while `version` resolves to ours, on the very same resource). The new keys are exported as constants so writer and reader stay in sync.

Migration: existing plugin installs lack the new sentinel and will be flagged as foreign on the next `install`. Users have to run `kubectl datadog autoscaling cluster uninstall` followed by `kubectl datadog autoscaling cluster install` once with the new binary to migrate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
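For reference, a minimal sketch of the exported sentinel constants this commit describes — the label key/value and the identifier names `InstalledByLabel`/`InstalledByValue` both appear later in this thread; the surrounding declaration is assumed:

```go
package guess

// Sentinel marking resources installed by the plugin, exported so the
// writer (install.go) and the reader (the foreign-Karpenter detector)
// stay in sync.
const (
	InstalledByLabel = "autoscaling.datadoghq.com/installed-by"
	InstalledByValue = "kubectl-datadog"
)
```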
Codecov Report
❌ Patch coverage is …
Additional details and impacted files

```
@@ Coverage Diff @@
##             main    #2949      +/- ##
==========================================
+ Coverage   40.91%   41.21%   +0.30%
==========================================
  Files         324      325       +1
  Lines       28743    28965     +222
==========================================
+ Hits        11760    11939     +179
- Misses      16129    16170      +41
- Partials      854      856       +2
```
... and 3 files with indirect coverage changes
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fd65ec0c9d
```go
const (
	karpenterChartNameLabel      = "app.kubernetes.io/name"
	karpenterChartNameValue      = "karpenter"
	karpenterClusterRoleSelector = karpenterChartNameLabel + "=" + karpenterChartNameValue
```
Detect nameOverride-based Karpenter installs
This selector misses existing Karpenter releases installed with the chart's supported `nameOverride` value, because the upstream label `app.kubernetes.io/name` is rendered from that override rather than always being `karpenter`. In that scenario the list returns no ClusterRoles, so `kubectl datadog autoscaling cluster install` falls through and installs a second controller instead of no-oping; this affects Helm/Argo/raw-manifest installs that customize `nameOverride`.
Summary
- Updated foreign Karpenter detection to scan all ClusterRoles and identify Karpenter by RBAC rule content (`karpenter.sh` API group + `nodepools`/`nodeclaims` resources), instead of relying on `app.kubernetes.io/name=karpenter`. This makes detection robust to chart `nameOverride`. (cmd/kubectl-datadog/autoscaling/cluster/install/guess/foreignkarpenter.go L24-L48, L50-L79)
- Added helper matching functions (`hasKarpenterAPIGroupRule`, `hasKarpenterResource`, `contains`) to keep the detection logic explicit and testable. (foreignkarpenter.go L50-L79)
- Updated/expanded unit tests to model RBAC-rule-based detection and added a dedicated case for "foreign Karpenter installed with custom nameOverride" to verify renamed releases are still detected as foreign. (foreignkarpenter_test.go L16-L114, L75-L82)
- Committed the changes on the current branch (64986c1) and created a follow-up PR via the `make_pr` tool.
Testing
- ✅ `gofmt -w cmd/kubectl-datadog/autoscaling/cluster/install/guess/foreignkarpenter.go cmd/kubectl-datadog/autoscaling/cluster/install/guess/foreignkarpenter_test.go`
- ✅ `go test ./cmd/kubectl-datadog/autoscaling/cluster/install/guess/...`
Addressed in cdbb6c7: switched detection from the `app.kubernetes.io/name=karpenter` LabelSelector to a structural fingerprint on the `karpenter.sh` API group. We list every ClusterRole and look for a rule referencing that group, which the chart hard-codes for `nodepools`/`nodeclaims` regardless of `nameOverride` or other metadata customizations. Added a regression test ("foreign Karpenter installed with custom nameOverride") and a defensive case to make sure a ClusterRole with karpenter-looking labels but no `karpenter.sh` rules does not trigger the guard.
Same fix landed on this branch as cdbb6c7 (independently arrived at the same karpenter.sh API group fingerprint). No follow-up PR needed.
Datadog PR Gates flagged the patch coverage at 40.74%, well below the threshold. The new code in install.go (the Helm-values block carrying the Datadog ownership sentinel and the foreign-Karpenter no-op message) was unreached by tests.

Adds:
- TestKarpenterHelmValues: pins the `additionalLabels` contract with the `IsForeignKarpenterInstalled` detector — both sides go through the same exported constants — and covers both install modes (the IRSA ServiceAccount annotation only appears on Fargate).
- TestDisplayForeignKarpenterMessage: renders the box into a buffered cobra Command and asserts the user-visible strings, including the URL-encoded `kube_cluster_name` query that points the user at their cluster's Datadog autoscaling settings page.

The display test sets PATH to empty so github.com/pkg/browser's LookPath probe of xdg-open / x-www-browser / www-browser fails fast and browser.OpenURL returns ErrNotFound — without that, the spawned xdg-open holds the pipe writer for browser.Stdout and browser.Stderr open and exec.Cmd.Wait blocks indefinitely on the pipe-copy goroutine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
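A minimal sketch of that PATH trick in a test — `displayForeignKarpenterMessage` and its signature are assumptions here (mirroring `displayEKSAutoModeMessage(cmd, clusterName)` quoted later in this thread), not the PR's exact code:

```go
package install

import (
	"bytes"
	"testing"

	"github.com/spf13/cobra"
	"github.com/stretchr/testify/require"
)

func TestDisplayForeignKarpenterMessage(t *testing.T) {
	// Empty PATH makes pkg/browser's exec.LookPath probes for
	// xdg-open / x-www-browser / www-browser fail fast, so OpenURL
	// returns an error instead of spawning a child process that would
	// hold the Stdout/Stderr pipe writers open and block Wait.
	t.Setenv("PATH", "")

	var buf bytes.Buffer
	cmd := &cobra.Command{}
	cmd.SetOut(&buf)
	cmd.SetErr(&buf)

	// Hypothetical call; the real helper under test lives in the PR.
	displayForeignKarpenterMessage(cmd, "lenaic-karpenter-test")

	require.Contains(t, buf.String(), "Karpenter is already installed")
	require.Contains(t, buf.String(), "kube_cluster_name%3Alenaic-karpenter-test")
}
```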
Codex review on the PR head pointed out that the previous detection selector (`app.kubernetes.io/name=karpenter`) misses Karpenter installs deployed via the upstream Helm chart with a custom `nameOverride`: the chart renders both the label value and the ClusterRole name from that override, so the LabelSelector returns no items and the install command falls through to deploy a second controller side-by-side.

Switch to a structural fingerprint instead. The chart's clusterrole.yaml and clusterrole-core.yaml hard-code rules on the `karpenter.sh` API group (for nodepools/nodeclaims) — that group is independent of any metadata customization and is the canonical contract every Karpenter controller needs. We now list every ClusterRole and look for any rule referencing karpenter.sh, then skip the ones bearing our InstalledBy sentinel.

Adds a regression test that asserts a foreign install with `nameOverride` is detected, and a defensive case that ensures a ClusterRole carrying karpenter-looking labels but no real karpenter.sh rules does not trigger the guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codex review
Codex Review: Didn't find any major issues. 🚀
The foreign-Karpenter detector previously called `List` without a `Limit`, which means client-go does NOT auto-paginate and asks the API server for the entire ClusterRole collection in one shot. On a dense cluster with thousands of ClusterRoles that materialises the full slice in memory just to answer a yes/no question.

Switch to the same Limit + Continue pattern already used by `GetNodesProperties` in the same package, with `clusterRoleListChunkSize = 100`. Early exit on the first foreign match means we don't paginate to the end either; on a small cluster this is a single round-trip and strictly equivalent to the previous behaviour.

Add a focused pagination subtest that wires a reactor returning four synthetic pages and asserts: empty Continue on the first call, token forwarding on subsequent calls, fixed Limit, skipping our own ClusterRoles, foreign found on page 3 returns true, and page 4 never requested. A `require.Less` guards the reactor's slice access against over-fetch with a clear failure message.

Also refactor `hasKarpenterAPIGroupRule` to `lo.ContainsBy`, matching the predicate-search idiom already used by `IsEKSAutoModeEnabled` in the same package.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
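A minimal sketch of that Limit + Continue loop with early exit, assuming a `kubernetes.Interface` clientset and an `isForeign` predicate standing in for the real rule check (function name and shape are illustrative):

```go
package guess

import (
	"context"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const clusterRoleListChunkSize = 100

// anyForeignClusterRole pages through ClusterRoles in chunks and stops
// at the first match, so dense clusters are never materialised in full.
func anyForeignClusterRole(ctx context.Context, clientset kubernetes.Interface, isForeign func(rbacv1.ClusterRole) bool) (bool, error) {
	opts := metav1.ListOptions{Limit: clusterRoleListChunkSize}
	for {
		list, err := clientset.RbacV1().ClusterRoles().List(ctx, opts)
		if err != nil {
			return false, err
		}
		for _, cr := range list.Items {
			if isForeign(cr) {
				return true, nil // early exit: later pages never requested
			}
		}
		if list.Continue == "" {
			return false, nil
		}
		opts.Continue = list.Continue // forward the server's token
	}
}
```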
In case a Karpenter not installed by kubectl-datadog is already running on the cluster:

```console
$ kubectl datadog autoscaling cluster install
2026/04/28 12:15:49 Detected foreign Karpenter ClusterRole
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Karpenter is already installed on cluster lenaic-karpenter-test.                                          │
│                                                                                                           │
│ kubectl-datadog has nothing to install.                                                                   │
│                                                                                                           │
│ Navigate to the Autoscaling settings page                                                                 │
│ and select cluster to start generating recommendations:                                                   │
│ https://app.datadoghq.com/orchestration/scaling/settings?query=kube_cluster_name%3Alenaic-karpenter-test  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e0cef1fb57
```go
// looser check stays robust against upstream resource additions.
func hasKarpenterAPIGroupRule(rules []rbacv1.PolicyRule) bool {
	return lo.ContainsBy(rules, func(rule rbacv1.PolicyRule) bool {
		return slices.Contains(rule.APIGroups, karpenterAPIGroup)
```
Require a real Karpenter fingerprint before no-oping
On clusters that already have the Datadog Operator installed but no Karpenter controller, this API-group-only check returns true because the operator's own ClusterRole grants `karpenter.sh` permissions (config/rbac/role.yaml:378-389, generated from internal/controller/datadogagent_controller.go:103). Since install.go now exits when this helper reports true, those users are incorrectly told Karpenter is installed and cannot use the plugin to install autoscaling; please exclude the operator role or match a more Karpenter-specific RBAC shape.
Addressed in 68b5e73: tightened the predicate to require both the `karpenter.sh` API group AND at least one explicit controller resource (`nodepools` or `nodeclaims`). The Datadog Operator's wildcard rule (`apiGroups: [karpenter.sh]`, `resources: ['*']`) no longer matches because `slices.Contains(["*"], "nodepools")` is false. Added a regression test ("Datadog Operator role with karpenter.sh wildcard is not a controller") and a defensive split-rule case to make sure either resource alone is enough to trigger the guard.
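A minimal sketch of the tightened predicate, assuming the `karpenterAPIGroup` constant and the `karpenterControllerResources` slice named elsewhere in this thread (the exact rule structure in the PR may differ):

```go
import (
	"slices"

	"github.com/samber/lo"
	rbacv1 "k8s.io/api/rbac/v1"
)

const karpenterAPIGroup = "karpenter.sh"

var karpenterControllerResources = []string{"nodepools", "nodeclaims"}

// hasKarpenterControllerRule reports whether any rule grants explicit
// access to a Karpenter controller resource. Wildcard-only rules such
// as the Datadog Operator's (resources: ["*"]) deliberately do not
// match: slices.Contains([]string{"*"}, "nodepools") is false.
func hasKarpenterControllerRule(rules []rbacv1.PolicyRule) bool {
	return lo.ContainsBy(rules, func(rule rbacv1.PolicyRule) bool {
		if !slices.Contains(rule.APIGroups, karpenterAPIGroup) {
			return false
		}
		return lo.SomeBy(karpenterControllerResources, func(res string) bool {
			return slices.Contains(rule.Resources, res)
		})
	})
}
```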
```go
		return displayEKSAutoModeMessage(cmd, clusterName)
	}

	if foreign, err := guess.IsForeignKarpenterInstalled(ctx, o.Clientset); err != nil {
```
nit: can we get more info about the other installation (the namespace/name, for example) so we can display it to the user too?
clamoriniere left a comment
/lgtm
a small idea to improve the UX, but not a mandatory change
Codex caught a P1 false positive: the previous detector fired on any ClusterRole that referenced the `karpenter.sh` API group, which trips on the Datadog Operator's own role (`config/rbac/role.yaml:378-389`, `apiGroups: [karpenter.sh], resources: ['*']`). Every cluster running the operator would have been told "Karpenter is already installed" and denied the install.

Tighten the predicate to require both the API group AND at least one explicit Karpenter controller resource (`nodepools` or `nodeclaims`). The upstream v1 chart's clusterrole-core.yaml always names those resources; CRD-management roles like the operator's use wildcards on `karpenter.sh/*` and no longer match.

Add coverage for the regression and pin the broader contract:
- new case "Datadog Operator role with karpenter.sh wildcard is not a controller" with `expected: false`,
- new case "split rules with only nodepools or only nodeclaims still match" — the v1 chart splits write permissions across multiple rules, each carrying only one resource; either alone must trigger detection,
- contract test renamed to `TestKarpenterControllerFingerprintContract` and now also pins `karpenterControllerResources`.

Renamed `hasKarpenterAPIGroupRule` → `hasKarpenterControllerRule` and softened doc comments to scope the contract to the supported v1 chart (legacy v0 with `provisioners`/`machines` is intentionally out of scope).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
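A minimal sketch of the two regression cases, assuming the `hasKarpenterControllerRule` predicate sketched above (fixture shapes and this test name are illustrative, not the PR's exact table-driven layout):

```go
func TestHasKarpenterControllerRule(t *testing.T) {
	// The operator's CRD-management role: karpenter.sh group, but only
	// a wildcard resource — must NOT be treated as a running controller.
	operatorWildcard := []rbacv1.PolicyRule{{
		APIGroups: []string{"karpenter.sh"},
		Resources: []string{"*"},
	}}
	require.False(t, hasKarpenterControllerRule(operatorWildcard))

	// The v1 chart splits write permissions across rules, each naming a
	// single resource; either one alone must trigger detection.
	for _, res := range []string{"nodepools", "nodeclaims"} {
		split := []rbacv1.PolicyRule{{
			APIGroups: []string{"karpenter.sh"},
			Resources: []string{res},
		}}
		require.True(t, hasKarpenterControllerRule(split), res)
	}
}
```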
Any monitoring or management workload that consumes Karpenter resources
needs RBAC permissions on the `karpenter.sh` API group — including the
Datadog Agent Helm chart's cluster-agent ClusterRole, which grants
`karpenter.sh/*` to inspect Karpenter resources for autoscaling
recommendations. Looking at the rules of every ClusterRole turned out
to be too coarse: a tighter "specific resources, no wildcards" filter
was easy to defeat (a monitor that lists `nodepools` for metrics rather
than via `*`), and on top of that Cédric pointed out the user-facing
no-op message lacked the namespace/name of the foreign install.
Switch the signal to the controller's Deployment. A Deployment that
runs the Karpenter controller is the only object that distinguishes
"a Karpenter is actually managing nodes" from "something is allowed to
read its CRs". The Deployment also carries the namespace/name we want
to surface in the message — Cédric's request lands for free.
Two container-level signals, OR'd together:
- `KARPENTER_SERVICE` env var name. The upstream chart's
`templates/deployment.yaml` unconditionally sets this on the
controller container; image registry rewrites (Chainguard,
Docker Hardened, ECR pull-through caches) leave it intact, so this
catches distributions that swap the image repository entirely.
- Image repository tail equal to `karpenter/controller`, parsed
component-aware: digest is stripped first (everything after `@`),
then a tag is recognised only when the last `:` lies after the last
`/`, so ported registries like `registry.local:5000/karpenter/controller:1.12.0`
parse correctly. Catches forks that drop the env var but keep the
canonical image path, and rejects neighbours like
`team/karpenter/controllers` (plural) and
`team/karpenter/controller-something` (see the sketch below).
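A minimal sketch of that component-aware parse — the helper name matches `TestImageRepoEndsWith` below, but this body is an assumption, not the PR's code:

```go
import "strings"

// imageRepoEndsWith reports whether the repository part of an image
// reference ends with suffix on a path-component boundary.
func imageRepoEndsWith(image, suffix string) bool {
	// Digest is stripped first: everything after '@'.
	if i := strings.Index(image, "@"); i >= 0 {
		image = image[:i]
	}
	// A ':' introduces a tag only when the last ':' lies after the
	// last '/'; otherwise it is a registry port, as in
	// registry.local:5000/karpenter/controller.
	if i := strings.LastIndex(image, ":"); i > strings.LastIndex(image, "/") {
		image = image[:i]
	}
	// Component-aware suffix match: "team/karpenter/controllers" and
	// "team/karpenter/controller-something" do not match.
	return image == suffix || strings.HasSuffix(image, "/"+suffix)
}
```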
Function signature evolves to return location info: `IsForeignKarpenterInstalled
(bool, error)` → `FindForeignKarpenterInstallation (*ForeignKarpenter, error)`
where `ForeignKarpenter{Namespace, Name}`. The display message now reads
"Karpenter is already installed on cluster X: Deployment ns/name."
Test coverage:
- `TestKarpenterControllerFingerprintContract` pins both signal
constants against the chart's deployment.yaml.
- `TestImageRepoEndsWith` table covers default, mirror, digest,
tag+digest, ported registry (with and without tag), bare repo,
`controllers` plural, `controller-something`, hardened-shape,
single-component fallthrough.
- New `TestFindForeignKarpenterInstallation` cases: hardened image
matched via env, `controllers` plural not matching, mix of ours
and foreign, our cluster-agent (different image, no env), pagination
reactor unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2c87598da5
```go
	if !hasKarpenterControllerContainer(dep.Spec.Template.Spec.Containers) {
		continue
	}
	if dep.Labels[InstalledByLabel] == InstalledByValue {
```
Block Datadog-owned installs in other namespaces
When a cluster already has a kubectl-datadog Karpenter Deployment in namespace A and the user reruns `install --karpenter-namespace B`, this unconditional skip makes the foreign-install check return nil even though a controller is already present. The later Helm action is scoped to the requested namespace, so it won't find the release in namespace A and will install a second Karpenter controller in B, reintroducing the CRD race this guard is meant to prevent; only the Deployment in the target Helm namespace should be skipped as ours.
Addressed in 27f3cf0: `FindForeignKarpenterInstallation` now takes a `targetNamespace` and the skip rule tightens to `dep.Namespace == targetNamespace && dep.Labels[InstalledByLabel] == InstalledByValue`. A kubectl-datadog Deployment in any other namespace now surfaces as foreign — exactly the scenario you flagged. Added two test cases covering cross-namespace own install and the defensive empty-`targetNamespace` path.
Codex caught a P1 cross-namespace skip leak: the previous check skipped any Deployment carrying our `InstalledByLabel` sentinel regardless of namespace. So a user who already had kubectl-datadog Karpenter in namespace A and reran `install --karpenter-namespace=B` would have the foreign-install check return nil (own install in A skipped), the subsequent Helm action scoped to B would not find any release in A, and a second Karpenter controller would land in B — racing the first one on the cluster-scoped Karpenter CRDs.

Fix: thread the target namespace through to the detector, and skip only Deployments that are BOTH in the target namespace AND carry our sentinel. A kubectl-datadog Deployment in any other namespace surfaces as foreign — its controller would conflict with the new one regardless of who installed it. Empty `targetNamespace` (the CLI never sends that, but the detector should fail closed) surfaces every Karpenter controller it finds.

Updated the `ForeignKarpenter` doc to spell out the two failure modes (third-party install or our own in a different namespace) and added two test cases: cross-namespace own install treated as foreign, and defensive empty-`targetNamespace` behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
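A minimal sketch of the namespace-scoped skip rule, reusing the sentinel constants and the `imageRepoEndsWith` helper sketched earlier in this thread; the loop body mirrors the diff quoted above, the surrounding function shape is assumed:

```go
import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"

	"github.com/samber/lo"
)

// ForeignKarpenter names a Karpenter controller Deployment we must not
// race: a third-party install, or our own in a different namespace.
type ForeignKarpenter struct{ Namespace, Name string }

func findForeignKarpenter(deps []appsv1.Deployment, targetNamespace string) *ForeignKarpenter {
	for _, dep := range deps {
		if !hasKarpenterControllerContainer(dep.Spec.Template.Spec.Containers) {
			continue
		}
		// "Ours" only when BOTH in the Helm target namespace AND
		// carrying the sentinel; with an empty targetNamespace nothing
		// matches, so the detector fails closed.
		if dep.Namespace == targetNamespace && dep.Labels[InstalledByLabel] == InstalledByValue {
			continue
		}
		return &ForeignKarpenter{Namespace: dep.Namespace, Name: dep.Name}
	}
	return nil
}

// hasKarpenterControllerContainer ORs the two container-level signals.
func hasKarpenterControllerContainer(containers []corev1.Container) bool {
	return lo.SomeBy(containers, func(c corev1.Container) bool {
		hasEnv := lo.SomeBy(c.Env, func(e corev1.EnvVar) bool {
			return e.Name == "KARPENTER_SERVICE"
		})
		return hasEnv || imageRepoEndsWith(c.Image, "karpenter/controller")
	})
}
```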
@codex review
What does this PR do?
Adds a pre-flight guard to `kubectl datadog autoscaling cluster install` that recognises when Karpenter is already running on the cluster — be it a Helm install in another namespace, a raw `kubectl apply`, or anything in between — and short-circuits with a friendly success message and a link to the Datadog Autoscaling settings page, mirroring the existing EKS auto-mode handling (no error, exit 0). The message names the foreign install's `Deployment <namespace>/<name>` so the user can locate it.

Detection scans every Deployment cluster-wide and, for each container, looks for either:
- the `KARPENTER_SERVICE` env var name — the upstream chart's `templates/deployment.yaml` unconditionally sets this on the controller container, regardless of `controller.image.repository`; or
- an image repository whose tail equals `karpenter/controller`, parsed component-aware — strip digest first, then recognise a tag only when the last `:` lies after the last `/`, so ported registries (`registry.local:5000/...`) and digests are handled cleanly while neighbours like `team/karpenter/controllers` or `team/karpenter/controller-something` do not false-positive.
autoscaling.datadoghq.com/installed-by=kubectl-datadogsentinel are treated as ours and the install proceeds (Helm performs an upgrade). The list is paginated withLimit: 100+Continueand short-circuits on the first foreign match.The PR also switches the chart's
additionalLabelsfrom overriding standardapp.kubernetes.io/managed-byandapp.kubernetes.io/versionto Datadog-namespaced keys, exported as constants so the writer (install.go) and reader (FindForeignKarpenterInstallation) stay in sync.Motivation
PR #2717 (CASCL-1281) added a guard for EKS auto-mode clusters. We need a symmetric guard for any other cluster where Karpenter is already running, otherwise re-running `install` on such a cluster would deploy a second controller that would race with the first one on the same CRDs.

The first design tried RBAC — list ClusterRoles, look for `karpenter.sh` rules. That turned out fragile: monitoring and management workloads legitimately hold permissions on the `karpenter.sh` API group without running a controller. The Datadog Operator's own ClusterRole grants `karpenter.sh/*` (manager role generated from internal/controller/datadogagent_controller.go), and the Datadog Agent Helm chart's cluster-agent ClusterRole does the same to inspect Karpenter resources for autoscaling recommendations. Any tightening (require specific resources rather than `*`) is easy to defeat by a monitor that lists `nodepools` rather than `*`.

A Deployment running the Karpenter controller image — or setting the chart's `KARPENTER_SERVICE` env — is the only signal that distinguishes "Karpenter is actually running" from "something has read access to its CRs". It also carries the namespace/name we want to surface to the user.

The Datadog-namespaced label switch is required because the Karpenter chart's `_helpers.tpl` emits `app.kubernetes.io/managed-by: {{ .Release.Service }}` and `app.kubernetes.io/version: ...` before `additionalLabels`, producing duplicate YAML keys. Verified empirically on a real cluster: deduplication at the API server is non-deterministic across keys (`managed-by` resolves to the chart's `Helm` value while `version` resolves to ours, on the very same resource), so we cannot rely on those overrides as identity markers.

Additional Notes
Migration of existing kubectl-datadog installs. Plugin installs deployed before this change carry the old `app.kubernetes.io/managed-by=kubectl-datadog` value (silently dropped by the chart) and lack the new sentinel on their Deployment metadata. The new code will detect them as foreign and no-op. To migrate, run once:
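```console
kubectl datadog autoscaling cluster uninstall
kubectl datadog autoscaling cluster install
```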
app.kubernetes.io/managed-by=kubectl-datadogvalue (silently dropped by the chart) and lack the new sentinel on their Deployment metadata. The new code will detect them as foreign and no-op. To migrate, run once:Acceptable because the install command is recent and not yet broadly deployed.
Out of scope. Resources we apply directly (NodePool / EC2NodeClass via cmd/kubectl-datadog/.../install/k8s/) keep their existing `app.kubernetes.io/managed-by=kubectl-datadog` label — no chart helper to fight there, the label sticks, and uninstall.go continues to use it as a selector. Pre-v0.32 Karpenter (provisioners/machines) is also out of scope: the install command's chart reference is karpenter-1.12.0, which targets v1+.

Minimum Agent Versions
Describe your test plan
Unit tests: `go test ./cmd/kubectl-datadog/autoscaling/cluster/install/...` covers:

- `TestFindForeignKarpenterInstallation`: no Deployments, ours-only, foreign without sentinel, foreign on a private mirror image, foreign with `nameOverride`, mix of ours and foreign, the Datadog Cluster Agent (cluster-agent image, no `KARPENTER_SERVICE` env — must NOT match), foreign with hardened image (`cgr.dev/chainguard/karpenter`) matched via the env signal, `controllers` plural not false-positiving, foreign sentinel value, API list error, and a pagination reactor that asserts the next page is never requested after the first foreign match.
- `TestKarpenterControllerFingerprintContract` pins the literal `KARPENTER_SERVICE` and `karpenter/controller` strings so test fixtures using the constants cannot silently mask a typo.
- `TestImageRepoEndsWith` covers default upstream, ECR mirror, digest, tag+digest, ported registry (with and without tag), bare `karpenter/controller`, `controllers` plural, `controller-something`, hardened shape (`cgr.dev/chainguard/karpenter`), and a single-component image.
- `TestKarpenterHelmValues` pins the `additionalLabels` contract — both writer and reader go through the same exported constants — and covers both install modes (only Fargate annotates the ServiceAccount with the IRSA role ARN).
- `TestDisplayForeignKarpenterMessage` renders the box into a buffered cobra `Command` and asserts the user-visible strings, including the `Deployment <namespace>/<name>` line and the URL-encoded `kube_cluster_name` query that points the user at their cluster's Datadog autoscaling settings page.

End-to-end on a real EKS test cluster (`AWS_PROFILE=exec-sso-container-integrations-account-admin AWS_REGION=eu-west-3`):

- Foreign Helm install: `helm install karpenter oci://public.ecr.aws/karpenter/karpenter --namespace karpenter --create-namespace --set settings.clusterName=$CLUSTER`, then `kubectl datadog autoscaling cluster install` → the message names `Deployment karpenter/karpenter`, exit 0, and no `dd-karpenter-…` CFN stacks are created (`aws cloudformation list-stacks`).
- Foreign install with `nameOverride`: `helm install foo oci://... --set nameOverride=foo,settings.clusterName=$CLUSTER`, then `kubectl datadog autoscaling cluster install` → still detected and no-op'd via the env var; namespace and Deployment name surfaced.
- Residual CRDs: `helm uninstall karpenter -n karpenter`, then `kubectl datadog autoscaling cluster install` → succeeds in spite of the four residual CRDs (Helm `crds/` is install-once).

Checklist
Labels: `enhancement`, `qa/skip-qa`