Commit 0c25108
[Creator] [design-spec] k8s-karpenter-control-plane-health (#658)
* Add k8s-karpenter-control-plane-health CodeBundle

  Implements cluster-scoped checks for Karpenter controller readiness, admission webhooks, namespace Warning events, CRD groups, and metrics Services. Includes sli.robot with a four-dimension 0-1 score, generation rules for kubernetes cluster resources, and .test Taskfile with a sample namespace manifest.

  Made-with: Cursor

* Fix readiness check in Karpenter pod script and enhance local testing documentation

  - Updated readiness check condition in `check-karpenter-controller-pods.sh` to compare against "True" instead of "true".
  - Added comprehensive local testing instructions in the README, detailing the use of a self-contained harness for testing Karpenter checks against a Kind cluster.
  - Removed obsolete `manifest.yaml` file from the test directory.

  Made-with: Cursor

* Enhance local testing and discovery flow for Karpenter CodeBundles

  - Added detailed instructions for running the runbook locally via RunWhen Local in the README files for both k8s-karpenter-autoscaling-health and k8s-karpenter-control-plane-health.
  - Updated generation rules to ensure SLX is only emitted for clusters with Karpenter installed, refining match criteria for deployments.
  - Introduced new tasks in the Taskfile for generating kubeconfig and running discovery, improving the local testing workflow.
  - Enhanced cleanup tasks to remove generated artifacts from the discovery process.

  Made-with: Cursor

* Refactor variable import for Karpenter control plane health SLI

  - Changed the import method for `${RW_LOOKBACK_WINDOW}` from user variable to platform variable to improve consistency in variable management.
  - Removed the logging of the health score report to streamline output during metric pushing.

  Made-with: Cursor

* Enhance Karpenter health check scripts with detailed reporting

  - Added a `print_report` function to various Karpenter health check scripts to provide human-readable summaries of findings upon exit.
  - The report includes details on NodeClasses, NodePools, Pending Pods, Stuck NodeClaims, and webhook configurations, improving visibility into the health of Karpenter components.
  - Removed redundant output messages to streamline the reporting process.

  Made-with: Cursor

---------

Co-authored-by: rw-codebundle-agent[bot] <rw-codebundle-agent[bot]@users.noreply.github.com>
Co-authored-by: Shea Stewart <shea.stewart@runwhen.com>
1 parent 744e3d7 commit 0c25108

67 files changed

Lines changed: 7163 additions & 0 deletions

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+apiVersion: runwhen.com/v1
+kind: GenerationRules
+spec:
+  platform: kubernetes
+  generationRules:
+    # Only emit this SLX for clusters that actually have Karpenter installed.
+    #
+    # The upstream Helm chart (kubernetes-sigs/karpenter) and AWS's managed
+    # variants always produce a controller Deployment named `karpenter` with
+    # the standard chart labels (`app.kubernetes.io/name=karpenter`,
+    # `app.kubernetes.io/instance=karpenter`). We do not gate on namespace
+    # because the install location varies (`karpenter`, `kube-system` under
+    # EKS Auto Mode, or org-specific namespaces).
+    #
+    # Requiring BOTH signals (`name` contains `karpenter` AND at least one
+    # label value contains `karpenter`) keeps us out of false positives such
+    # as unrelated deployments that happen to mention the word `karpenter`.
+    # `qualifiers: ["cluster"]` dedupes to a single SLX per cluster even if
+    # multiple matching Deployments exist.
+    - resourceTypes:
+        - deployment
+      matchRules:
+        - type: and
+          matches:
+            - type: pattern
+              pattern: "karpenter"
+              properties: [name]
+              mode: substring
+            - type: pattern
+              pattern: "karpenter"
+              properties: [label-values]
+              mode: substring
+      slxs:
+        - baseName: karp-as-hc
+          shortenedBaseName: karp-as-hc
+          qualifiers: ["cluster"]
+          baseTemplateName: k8s-karpenter-autoscaling-health
+          levelOfDetail: basic
+          outputItems:
+            - type: slx
+            - type: sli
+            - type: runbook
+              templateName: k8s-karpenter-autoscaling-health-taskset.yaml
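
The `and` match rule in the generation rules can be read as two substring tests that must both pass. The following is a minimal Python sketch of that logic only, not the RunWhen Local matcher itself: the function name `is_karpenter_deployment` is hypothetical, and case-sensitive matching is an assumption.

```python
def is_karpenter_deployment(name: str, labels: dict[str, str]) -> bool:
    """Return True only when BOTH substring signals are present.

    Hypothetical helper illustrating the `type: and` rule; matching is
    assumed to be case-sensitive.
    """
    needle = "karpenter"
    # Signal 1: the Deployment name contains the pattern.
    name_matches = needle in name
    # Signal 2: at least one label VALUE contains the pattern.
    label_matches = any(needle in value for value in labels.values())
    return name_matches and label_matches


# Chart-style install: name and label values both mention karpenter.
chart_labels = {
    "app.kubernetes.io/name": "karpenter",
    "app.kubernetes.io/instance": "karpenter",
}
print(is_karpenter_deployment("karpenter", chart_labels))  # True

# Name mentions karpenter, but no label value does: rejected.
print(is_karpenter_deployment("karpenter-docs", {"app": "docs"}))  # False

# A label value mentions karpenter, but the name does not: rejected.
print(is_karpenter_deployment("web", {"team": "karpenter-owners"}))  # False
```

Requiring both signals trades a small risk of missing an unusually labelled install for fewer false positives, which is the trade-off the rule's comment describes.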
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+apiVersion: runwhen.com/v1
+kind: ServiceLevelIndicator
+metadata:
+  name: {{slx_name}}
+  labels:
+    {% include "common-labels.yaml" %}
+  annotations:
+    {% include "common-annotations.yaml" %}
+spec:
+  displayUnitsLong: OK
+  displayUnitsShort: ok
+  locations:
+    - {{default_location}}
+  description: Measures Karpenter NodePool or NodeClaim conditions, Pending capacity pressure, and stuck NodeClaims for the cluster.
+  codeBundle:
+    {% if repo_url %}
+    repoUrl: {{repo_url}}
+    {% else %}
+    repoUrl: https://github.com/runwhen-contrib/rw-cli-codecollection.git
+    {% endif %}
+    {% if ref %}
+    ref: {{ref}}
+    {% else %}
+    ref: main
+    {% endif %}
+    pathToRobot: codebundles/k8s-karpenter-autoscaling-health/sli.robot
+  intervalStrategy: intermezzo
+  intervalSeconds: 300
+  configProvided:
+    - name: CONTEXT
+      value: "{{cluster.context}}"
+    - name: KUBERNETES_DISTRIBUTION_BINARY
+      value: "{{custom.kubernetes_distribution_binary | default('kubectl')}}"
+    - name: SLI_PENDING_POD_MAX
+      value: "{{ custom.sli_pending_pod_max | default('5') }}"
+    - name: STUCK_NODECLAIM_THRESHOLD_MINUTES
+      value: "{{ custom.stuck_nodeclaim_threshold_minutes | default('30') }}"
+  secretsProvided:
+  {% if wb_version %}
+    {% include "kubernetes-auth.yaml" ignore missing %}
+  {% else %}
+    - name: kubeconfig
+      workspaceKey: {{custom.kubeconfig_secret_name | default("kubeconfig")}}
+  {% endif %}
+  alertConfig:
+    tasks:
+      persona: eager-edgar
+      sessionTTL: 10m
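
Each `configProvided` value in the SLI template falls back to a literal when the corresponding `custom` key is not set. The sketch below mirrors that fallback in plain Python rather than Jinja, purely as an illustration: `DEFAULTS` and `resolve_config` are hypothetical names, and the `custom` dictionary stands in for whatever the workspace supplies.

```python
# Fallback literals copied from the `default(...)` filters in the SLI template.
DEFAULTS = {
    "kubernetes_distribution_binary": "kubectl",
    "sli_pending_pod_max": "5",
    "stuck_nodeclaim_threshold_minutes": "30",
}


def resolve_config(custom: dict[str, str]) -> dict[str, str]:
    """Hypothetical helper: use the workspace override when the key is
    present, otherwise fall back to the template's default literal."""
    return {key: custom.get(key, fallback) for key, fallback in DEFAULTS.items()}


# No overrides: every value comes from the defaults.
print(resolve_config({}))

# One override: only that key changes.
print(resolve_config({"sli_pending_pod_max": "10"}))
```

Values stay as strings here because the template quotes them; any numeric conversion would happen in the consuming script.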
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+apiVersion: runwhen.com/v1
+kind: ServiceLevelX
+metadata:
+  name: {{slx_name}}
+  labels:
+    {% include "common-labels.yaml" %}
+  annotations:
+    {% include "common-annotations.yaml" %}
+spec:
+  imageURL: https://storage.googleapis.com/runwhen-nonprod-shared-images/icons/kubernetes-icon-color.svg
+  alias: {{ cluster.name }} Karpenter Autoscaling Health
+  asMeasuredBy: Karpenter NodePool or NodeClaim conditions, Pending capacity signals, and stuck NodeClaims versus SLI thresholds.
+  configProvided:
+    - name: OBJECT_NAME
+      value: {{ cluster.name }}
+  owners:
+    - {{ workspace.owner_email }}
+  statement: Karpenter should provision capacity without sustained unhealthy conditions, stuck NodeClaims, or controller errors.
+  additionalContext:
+    {% include "kubernetes-hierarchy.yaml" ignore missing %}
+    qualified_name: "{{ match_resource.qualified_name }}"
+  tags:
+    {% include "kubernetes-tags.yaml" ignore missing %}
+    - name: access
+      value: read-only
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+apiVersion: runwhen.com/v1
+kind: Runbook
+metadata:
+  name: {{slx_name}}
+  labels:
+    {% include "common-labels.yaml" %}
+  annotations:
+    {% include "common-annotations.yaml" %}
+spec:
+  location: {{default_location}}
+  codeBundle:
+    {% if repo_url %}
+    repoUrl: {{repo_url}}
+    {% else %}
+    repoUrl: https://github.com/runwhen-contrib/rw-cli-codecollection.git
+    {% endif %}
+    {% if ref %}
+    ref: {{ref}}
+    {% else %}
+    ref: main
+    {% endif %}
+    pathToRobot: codebundles/k8s-karpenter-autoscaling-health/runbook.robot
+  configProvided:
+    - name: CONTEXT
+      value: "{{cluster.context}}"
+    - name: KARPENTER_NAMESPACE
+      value: "{{ custom.karpenter_namespace | default('karpenter') }}"
+    - name: KUBERNETES_DISTRIBUTION_BINARY
+      value: "{{custom.kubernetes_distribution_binary | default('kubectl')}}"
+    - name: RW_LOOKBACK_WINDOW
+      value: "{{ custom.karpenter_lookback | default('30m') }}"
+    - name: KARPENTER_LOG_ERROR_THRESHOLD
+      value: "{{ custom.karpenter_log_error_threshold | default('1') }}"
+    - name: STUCK_NODECLAIM_THRESHOLD_MINUTES
+      value: "{{ custom.stuck_nodeclaim_threshold_minutes | default('30') }}"
+    - name: KARPENTER_LOG_MAX_LINES
+      value: "{{ custom.karpenter_log_max_lines | default('500') }}"
+  secretsProvided:
+  {% if wb_version %}
+    {% include "kubernetes-auth.yaml" ignore missing %}
+  {% else %}
+    - name: kubeconfig
+      workspaceKey: {{custom.kubeconfig_secret_name | default("kubeconfig")}}
+  {% endif %}
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# Generated by the local test harness / RunWhen Local discovery.
+kubeconfig
+kubeconfig.internal
+workspaceInfo.yaml
+output/
