Commit 3e0937b

Add ability to park disrupted karpenter nodes (#386)
add parked-reason label to make it more explicit why a node was parked
1 parent 09fd996 commit 3e0937b

20 files changed: +813 -65 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ kubeconfig*
 dist
 /k8s-shredder
 my-k8s-shredder-values.yaml
+/park-node
 
 # Test binary, build with `go test -c`
 *.test

Makefile

Lines changed: 3 additions & 2 deletions
@@ -124,16 +124,17 @@ local-test: build ## Test docker image in a kind cluster (with Karpenter drift a
 echo >&2 "[WARN] I require kind but it's not installed(see https://kind.sigs.k8s.io). Assuming a cluster is already accessible."; \
 }
 
-local-test-karpenter: build ## Test docker image in a kind cluster with Karpenter and drift detection enabled
+local-test-karpenter: build ## Test docker image in a kind cluster with Karpenter drift and disruption detection enabled
 @hash kind 2>/dev/null && { \
-echo "Test docker image in a kind cluster with Karpenter..."; \
+echo "Test docker image in a kind cluster with Karpenter drift and disruption detection..."; \
 ./internal/testing/local_env_prep_karpenter_helm.sh "${K8S_SHREDDER_VERSION}" "${KINDNODE_VERSION}" "${TEST_CLUSTERNAME_KARPENTER}" "${KUBECONFIG_KARPENTER}" && \
 ./internal/testing/cluster_upgrade_karpenter.sh "${TEST_CLUSTERNAME_KARPENTER}" "${KUBECONFIG_KARPENTER}" || \
 exit 1; \
 } || { \
 echo >&2 "[WARN] I require kind but it's not installed(see https://kind.sigs.k8s.io). Assuming a cluster is already accessible."; \
 }
 
+
 local-test-node-labels: build ## Test docker image in a kind cluster with node label detection enabled
 @hash kind 2>/dev/null && { \
 echo "Test docker image in a kind cluster with node label detection..."; \

README.md

Lines changed: 46 additions & 0 deletions
@@ -53,13 +53,15 @@ The following options can be used to customize the k8s-shredder controller:
 | ToBeDeletedTaint | "ToBeDeletedByClusterAutoscaler" | Node taint used for skipping a subset of parked nodes that are already handled by cluster-autoscaler |
 | ArgoRolloutsAPIVersion | "v1alpha1" | API version from `argoproj.io` API group to be used while handling Argo Rollouts objects |
 | EnableKarpenterDriftDetection | false | Controls whether to scan for drifted Karpenter NodeClaims and automatically label their nodes |
+| EnableKarpenterDisruptionDetection | false | Controls whether to scan for disrupted Karpenter NodeClaims and automatically label their nodes |
 | ParkedByLabel | "shredder.ethos.adobe.net/parked-by" | Label used to identify which component parked the node |
 | ParkedNodeTaint | "shredder.ethos.adobe.net/upgrade-status=parked:NoSchedule" | Taint to apply to parked nodes in format key=value:effect |
 | EnableNodeLabelDetection | false | Controls whether to scan for nodes with specific labels and automatically park them |
 | NodeLabelsToDetect | [] | List of node labels to detect. Supports both key-only and key=value formats |
 | MaxParkedNodes | 0 | Maximum number of nodes that can be parked simultaneously. Set to 0 (default) for no limit. |
 | ExtraParkingLabels | {} | (Optional) Map of extra labels to apply to nodes and pods during parking. Example: `{ "example.com/owner": "infrastructure" }` |
 | EvictionSafetyCheck | true | Controls whether to perform safety checks before force eviction. If true, nodes will be unparked if pods don't have required parking labels. |
+| ParkingReasonLabel | "shredder.ethos.adobe.net/parked-reason" | Label used to track why a node or pod was parked (values: node-label, karpenter-drifted, karpenter-disrupted) |
 
 ### How it works
 
@@ -89,6 +91,24 @@ k8s-shredder includes an optional feature for automatic detection of drifted Kar
 
 This integration allows k8s-shredder to automatically handle node lifecycle management for clusters using Karpenter, ensuring that drifted nodes are properly marked for eviction and eventual replacement.
 
+#### Karpenter Disruption Detection
+
+k8s-shredder includes an optional feature for automatic detection of disrupted Karpenter NodeClaims. This feature is disabled by default, but can be enabled by setting `EnableKarpenterDisruptionDetection` to `true`. When enabled, at the beginning of each eviction loop, the controller will:
+
+1. Scan the Kubernetes cluster for Karpenter NodeClaims that are marked as disrupted (e.g., "Disrupting", "Terminating", "Empty", "Underutilized")
+2. Identify the nodes associated with these disrupted NodeClaims
+3. Automatically process these nodes by:
+
+- **Labeling** nodes and their non-DaemonSet pods with:
+- `UpgradeStatusLabel` (set to "parked")
+- `ExpiresOnLabel` (set to current time + `ParkedNodeTTL`)
+- `ParkedByLabel` (set to "k8s-shredder")
+- Any labels specified in `ExtraParkingLabels`
+- **Cordoning** the nodes to prevent new pod scheduling
+- **Tainting** the nodes with the configured `ParkedNodeTaint`
+
+This integration ensures that nodes undergoing disruption as part of bin-packing operations have all pods evicted in a reasonable amount of time, preventing them from getting stuck due to blocking Pod Disruption Budgets (PDBs). It complements the drift detection feature by handling nodes that are actively being disrupted by Karpenter's consolidation and optimization processes.
+
 #### Labeled Node Detection
 
 k8s-shredder includes optional automatic detection of nodes with specific labels. This feature is disabled by default but can be enabled by setting `EnableNodeLabelDetection` to `true`. When enabled, at the beginning of each eviction loop, the application will:
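
Reviewer note: the labeling step in the Karpenter Disruption Detection text added above can be illustrated with a minimal Go sketch. This is not the code from this commit. The `ParkingConfig` struct and the `parkingLabels` and `defaultParkingConfig` helpers are hypothetical names, the default values are copied from the chart defaults shown in this diff, and encoding the expiry as Unix seconds is an assumption. Attaching the parked-reason label next to the other labels follows the Parking Reason Tracking section of the same commit.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// ParkingConfig holds only the configuration keys named in the README hunk.
// Hypothetical type for illustration; not the project's own config struct.
type ParkingConfig struct {
	UpgradeStatusLabel string
	ExpiresOnLabel     string
	ParkedByLabel      string
	ParkedByValue      string
	ParkingReasonLabel string
	ParkedNodeTTL      time.Duration
	ExtraParkingLabels map[string]string
}

// defaultParkingConfig returns the defaults listed in the chart values of this diff.
func defaultParkingConfig() ParkingConfig {
	return ParkingConfig{
		UpgradeStatusLabel: "shredder.ethos.adobe.net/upgrade-status",
		ExpiresOnLabel:     "shredder.ethos.adobe.net/parked-node-expires-on",
		ParkedByLabel:      "shredder.ethos.adobe.net/parked-by",
		ParkedByValue:      "k8s-shredder",
		ParkingReasonLabel: "shredder.ethos.adobe.net/parked-reason",
		ParkedNodeTTL:      168 * time.Hour,
		ExtraParkingLabels: map[string]string{},
	}
}

// parkingLabels builds the label set for a parked node and its non-DaemonSet pods.
// Assumption: the expiry is stored as Unix seconds (now + ParkedNodeTTL).
func parkingLabels(cfg ParkingConfig, reason string, nowUnix int64) map[string]string {
	labels := make(map[string]string, len(cfg.ExtraParkingLabels)+4)
	// Copy extra labels first so the core parking labels win on a key collision.
	for k, v := range cfg.ExtraParkingLabels {
		labels[k] = v
	}
	expires := nowUnix + int64(cfg.ParkedNodeTTL/time.Second)
	labels[cfg.UpgradeStatusLabel] = "parked"
	labels[cfg.ExpiresOnLabel] = strconv.FormatInt(expires, 10)
	labels[cfg.ParkedByLabel] = cfg.ParkedByValue
	labels[cfg.ParkingReasonLabel] = reason
	return labels
}

func main() {
	labels := parkingLabels(defaultParkingConfig(), "karpenter-disrupted", time.Now().Unix())
	for k, v := range labels {
		fmt.Printf("%s=%s\n", k, v)
	}
}
```

Cordoning and tainting are left out because they require Kubernetes API calls; the sketch covers only how the label set could be assembled.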
@@ -193,6 +213,32 @@ EvictionSafetyCheck: false # Disable safety checks (force eviction always procee
 **Logging:**
 When safety checks fail, k8s-shredder logs detailed information about which pods are missing required labels, helping operators understand why the node was unparked instead of force evicted.
 
+#### Parking Reason Tracking
+
+k8s-shredder automatically tracks why nodes and pods were parked by applying a configurable parking reason label. This feature helps operators understand the source of parking actions and enables better monitoring and debugging.
+
+**Configuration:**
+```yaml
+ParkingReasonLabel: "shredder.ethos.adobe.net/parked-reason" # Default label name
+```
+
+**Parking Reason Values:**
+- `node-label`: Node was parked due to node label detection
+- `karpenter-drifted`: Node was parked due to Karpenter drift detection
+- `karpenter-disrupted`: Node was parked due to Karpenter disruption detection
+
+**Behavior:**
+- The parking reason label is applied to both nodes and their non-DaemonSet pods during parking
+- The label is automatically removed during the unparking process (e.g., when safety checks fail)
+- The label value corresponds to the detection method that triggered the parking action
+- This label works alongside other parking labels and doesn't interfere with existing functionality
+
+**Use cases:**
+- **Monitoring**: Track which detection method is most active in your cluster
+- **Debugging**: Understand why specific nodes were parked
+- **Automation**: Trigger different workflows based on parking reason
+- **Compliance**: Audit parking actions and their sources
+
 ## Metrics
 
 k8s-shredder exposes comprehensive metrics for monitoring its operation. You can find detailed information about all available metrics in the [metrics documentation](docs/metrics.md).
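
Reviewer note: the "Monitoring" use case from the Parking Reason Tracking text above could look roughly like the following Go sketch, which tallies nodes by the value of the parked-reason label. It operates on plain label maps rather than on Kubernetes objects, and `countByReason` is a hypothetical helper, not part of k8s-shredder. The label key is the default from this diff.

```go
package main

import (
	"fmt"
	"sort"
)

// Default label key taken from this commit's configuration.
const parkingReasonLabel = "shredder.ethos.adobe.net/parked-reason"

// countByReason tallies nodes per parked-reason value.
// Nodes without the label (that is, not parked) are skipped.
func countByReason(nodeLabels []map[string]string) map[string]int {
	counts := make(map[string]int)
	for _, labels := range nodeLabels {
		if reason, ok := labels[parkingReasonLabel]; ok {
			counts[reason]++
		}
	}
	return counts
}

func main() {
	// Illustrative data only; in a cluster these maps would come from node metadata.
	nodes := []map[string]string{
		{parkingReasonLabel: "karpenter-disrupted"},
		{parkingReasonLabel: "karpenter-drifted"},
		{parkingReasonLabel: "karpenter-disrupted"},
		{"kubernetes.io/os": "linux"}, // not parked
	}
	counts := countByReason(nodes)

	// Sort the reasons so the output order is stable.
	reasons := make([]string, 0, len(counts))
	for r := range counts {
		reasons = append(reasons, r)
	}
	sort.Strings(reasons)
	for _, r := range reasons {
		fmt.Printf("%s: %d\n", r, counts[r])
	}
}
```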

charts/k8s-shredder/Chart.yaml

Lines changed: 2 additions & 2 deletions
@@ -12,5 +12,5 @@ maintainers:
 - name: sfotony
 
 url: https://adobe.com
-version: 0.2.5
-appVersion: v0.3.5
+version: 0.2.6
+appVersion: v0.3.6

charts/k8s-shredder/README.md

Lines changed: 4 additions & 2 deletions
@@ -1,6 +1,6 @@
 # k8s-shredder
 
-![Version: 0.2.5](https://img.shields.io/badge/Version-0.2.5-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: v0.3.5](https://img.shields.io/badge/AppVersion-v0.3.5-informational?style=flat-square)
+![Version: 0.2.6](https://img.shields.io/badge/Version-0.2.6-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: v0.3.6](https://img.shields.io/badge/AppVersion-v0.3.6-informational?style=flat-square)
 
 a novel way of dealing with kubernetes nodes blocked from draining
 
@@ -64,9 +64,10 @@ a novel way of dealing with kubernetes nodes blocked from draining
 | serviceAccount.annotations | object | `{}` | Additional annotations for the service account (useful for IAM roles, etc.) |
 | serviceAccount.create | bool | `true` | Create a service account for k8s-shredder |
 | serviceAccount.name | string | `"k8s-shredder"` | Name of the service account |
-| shredder | object | `{"AllowEvictionLabel":"shredder.ethos.adobe.net/allow-eviction","ArgoRolloutsAPIVersion":"v1alpha1","EnableKarpenterDriftDetection":false,"EnableNodeLabelDetection":false,"EvictionLoopInterval":"1h","EvictionSafetyCheck":true,"ExpiresOnLabel":"shredder.ethos.adobe.net/parked-node-expires-on","ExtraParkingLabels":{},"MaxParkedNodes":0,"NamespacePrefixSkipInitialEviction":"ns-ethos-","NodeLabelsToDetect":[],"ParkedByLabel":"shredder.ethos.adobe.net/parked-by","ParkedByValue":"k8s-shredder","ParkedNodeTTL":"168h","ParkedNodeTaint":"shredder.ethos.adobe.net/upgrade-status=parked:NoSchedule","RestartedAtAnnotation":"shredder.ethos.adobe.net/restartedAt","RollingRestartThreshold":0.1,"ToBeDeletedTaint":"ToBeDeletedByClusterAutoscaler","UpgradeStatusLabel":"shredder.ethos.adobe.net/upgrade-status"}` | Core k8s-shredder configuration |
+| shredder | object | `{"AllowEvictionLabel":"shredder.ethos.adobe.net/allow-eviction","ArgoRolloutsAPIVersion":"v1alpha1","EnableKarpenterDisruptionDetection":false,"EnableKarpenterDriftDetection":false,"EnableNodeLabelDetection":false,"EvictionLoopInterval":"1h","EvictionSafetyCheck":true,"ExpiresOnLabel":"shredder.ethos.adobe.net/parked-node-expires-on","ExtraParkingLabels":{},"MaxParkedNodes":0,"NamespacePrefixSkipInitialEviction":"ns-ethos-","NodeLabelsToDetect":[],"ParkedByLabel":"shredder.ethos.adobe.net/parked-by","ParkedByValue":"k8s-shredder","ParkedNodeTTL":"168h","ParkedNodeTaint":"shredder.ethos.adobe.net/upgrade-status=parked:NoSchedule","ParkingReasonLabel":"shredder.ethos.adobe.net/parked-reason","RestartedAtAnnotation":"shredder.ethos.adobe.net/restartedAt","RollingRestartThreshold":0.1,"ToBeDeletedTaint":"ToBeDeletedByClusterAutoscaler","UpgradeStatusLabel":"shredder.ethos.adobe.net/upgrade-status"}` | Core k8s-shredder configuration |
 | shredder.AllowEvictionLabel | string | `"shredder.ethos.adobe.net/allow-eviction"` | Label to explicitly allow eviction on specific resources |
 | shredder.ArgoRolloutsAPIVersion | string | `"v1alpha1"` | API version for Argo Rollouts integration |
+| shredder.EnableKarpenterDisruptionDetection | bool | `false` | Enable Karpenter disruption detection for node lifecycle management |
 | shredder.EnableKarpenterDriftDetection | bool | `false` | Enable Karpenter drift detection for node lifecycle management |
 | shredder.EnableNodeLabelDetection | bool | `false` | Enable detection of nodes based on specific labels |
 | shredder.EvictionLoopInterval | string | `"1h"` | How often to run the main eviction loop |
@@ -80,6 +81,7 @@ a novel way of dealing with kubernetes nodes blocked from draining
 | shredder.ParkedByValue | string | `"k8s-shredder"` | Value set in the ParkedByLabel to identify k8s-shredder as the parking agent |
 | shredder.ParkedNodeTTL | string | `"168h"` | How long parked nodes should remain before being eligible for deletion (7 days default) |
 | shredder.ParkedNodeTaint | string | `"shredder.ethos.adobe.net/upgrade-status=parked:NoSchedule"` | Taint applied to parked nodes to prevent new pod scheduling |
+| shredder.ParkingReasonLabel | string | `"shredder.ethos.adobe.net/parked-reason"` | Label used to track why a node or pod was parked |
 | shredder.RestartedAtAnnotation | string | `"shredder.ethos.adobe.net/restartedAt"` | Annotation to track when a workload was last restarted |
 | shredder.RollingRestartThreshold | float | `0.1` | Maximum percentage of nodes that can be restarted simultaneously during rolling restarts |
 | shredder.ToBeDeletedTaint | string | `"ToBeDeletedByClusterAutoscaler"` | Taint indicating nodes scheduled for deletion by cluster autoscaler |

charts/k8s-shredder/templates/configmap.yaml

Lines changed: 2 additions & 0 deletions
@@ -18,11 +18,13 @@ data:
 ToBeDeletedTaint: "{{.Values.shredder.ToBeDeletedTaint}}"
 ArgoRolloutsAPIVersion: "{{.Values.shredder.ArgoRolloutsAPIVersion}}"
 EnableKarpenterDriftDetection: {{.Values.shredder.EnableKarpenterDriftDetection}}
+EnableKarpenterDisruptionDetection: {{.Values.shredder.EnableKarpenterDisruptionDetection}}
 ParkedByLabel: "{{.Values.shredder.ParkedByLabel}}"
 ParkedByValue: "{{.Values.shredder.ParkedByValue}}"
 ParkedNodeTaint: "{{.Values.shredder.ParkedNodeTaint}}"
 EnableNodeLabelDetection: {{.Values.shredder.EnableNodeLabelDetection}}
 NodeLabelsToDetect: {{.Values.shredder.NodeLabelsToDetect | toJson}}
 MaxParkedNodes: {{.Values.shredder.MaxParkedNodes}}
 EvictionSafetyCheck: {{.Values.shredder.EvictionSafetyCheck}}
+ParkingReasonLabel: "{{.Values.shredder.ParkingReasonLabel}}"
 ExtraParkingLabels: {{.Values.shredder.ExtraParkingLabels | toJson}}

charts/k8s-shredder/values.yaml

Lines changed: 4 additions & 0 deletions
@@ -50,6 +50,8 @@ shredder:
 ArgoRolloutsAPIVersion: v1alpha1
 # -- Enable Karpenter drift detection for node lifecycle management
 EnableKarpenterDriftDetection: false
+# -- Enable Karpenter disruption detection for node lifecycle management
+EnableKarpenterDisruptionDetection: false
 # -- Label to track which component parked a node
 ParkedByLabel: shredder.ethos.adobe.net/parked-by
 # -- Value set in the ParkedByLabel to identify k8s-shredder as the parking agent
@@ -64,6 +66,8 @@ shredder:
 MaxParkedNodes: 0
 # -- Controls whether to perform safety checks before force eviction
 EvictionSafetyCheck: true
+# -- Label used to track why a node or pod was parked
+ParkingReasonLabel: shredder.ethos.adobe.net/parked-reason
 # -- Additional labels to apply to nodes and pods during parking
 ExtraParkingLabels: {}
 # Example configuration:

cmd/root.go

Lines changed: 4 additions & 0 deletions
@@ -117,6 +117,7 @@ func discoverConfig() {
 viper.SetDefault("ToBeDeletedTaint", "ToBeDeletedByClusterAutoscaler")
 viper.SetDefault("ArgoRolloutsAPIVersion", "v1alpha1")
 viper.SetDefault("EnableKarpenterDriftDetection", false)
+viper.SetDefault("EnableKarpenterDisruptionDetection", false)
 viper.SetDefault("ParkedByLabel", "shredder.ethos.adobe.net/parked-by")
 viper.SetDefault("ParkedByValue", "k8s-shredder")
 viper.SetDefault("ParkedNodeTaint", "shredder.ethos.adobe.net/upgrade-status=parked:NoSchedule")
@@ -125,6 +126,7 @@ func discoverConfig() {
 viper.SetDefault("MaxParkedNodes", 0)
 viper.SetDefault("ExtraParkingLabels", map[string]string{})
 viper.SetDefault("EvictionSafetyCheck", true)
+viper.SetDefault("ParkingReasonLabel", "shredder.ethos.adobe.net/parked-reason")
 
 err := viper.ReadInConfig()
 if err != nil {
@@ -164,6 +166,7 @@ func parseConfig() {
 "ToBeDeletedTaint": cfg.ToBeDeletedTaint,
 "ArgoRolloutsAPIVersion": cfg.ArgoRolloutsAPIVersion,
 "EnableKarpenterDriftDetection": cfg.EnableKarpenterDriftDetection,
+"EnableKarpenterDisruptionDetection": cfg.EnableKarpenterDisruptionDetection,
 "ParkedByLabel": cfg.ParkedByLabel,
 "ParkedByValue": cfg.ParkedByValue,
 "ParkedNodeTaint": cfg.ParkedNodeTaint,
@@ -172,6 +175,7 @@ func parseConfig() {
 "MaxParkedNodes": cfg.MaxParkedNodes,
 "ExtraParkingLabels": cfg.ExtraParkingLabels,
 "EvictionSafetyCheck": cfg.EvictionSafetyCheck,
+"ParkingReasonLabel": cfg.ParkingReasonLabel,
 }).Info("Loaded configuration")
 }
 
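
Reviewer note: the three detection switches that `cmd/root.go` now registers each correspond to one parked-reason value listed in the README. The Go sketch below shows that mapping in isolation. `DetectionConfig` and `enabledReasons` are hypothetical names, the struct is not the project's config type, and the order in which reasons are returned is an assumption rather than the controller's actual scan order.

```go
package main

import "fmt"

// DetectionConfig carries only the three detection flags from this diff.
// Hypothetical type for illustration.
type DetectionConfig struct {
	EnableKarpenterDriftDetection      bool
	EnableKarpenterDisruptionDetection bool
	EnableNodeLabelDetection           bool
}

// enabledReasons lists the parked-reason values that can be produced
// under the given configuration. All flags default to false, so the
// zero value yields no reasons.
func enabledReasons(cfg DetectionConfig) []string {
	var reasons []string
	if cfg.EnableKarpenterDriftDetection {
		reasons = append(reasons, "karpenter-drifted")
	}
	if cfg.EnableKarpenterDisruptionDetection {
		reasons = append(reasons, "karpenter-disrupted")
	}
	if cfg.EnableNodeLabelDetection {
		reasons = append(reasons, "node-label")
	}
	return reasons
}

func main() {
	cfg := DetectionConfig{EnableKarpenterDisruptionDetection: true}
	fmt.Println(enabledReasons(cfg))
}
```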

config.yaml

Lines changed: 4 additions & 0 deletions
@@ -22,6 +22,7 @@ ParkedNodeTaint: shredder.ethos.adobe.net/upgrade-status=parked:NoSchedule # Ta
 ArgoRolloutsAPIVersion: v1alpha1 # API version from argoproj.io API group to be used while handling Argo Rollouts objects
 # Karpenter integration
 EnableKarpenterDriftDetection: false # Controls whether to scan for drifted Karpenter NodeClaims and automatically label their nodes
+EnableKarpenterDisruptionDetection: false # Controls whether to scan for disrupted Karpenter NodeClaims and automatically label their nodes
 # Node label detection
 EnableNodeLabelDetection: false # Controls whether to scan for nodes with specific labels and automatically park them
 NodeLabelsToDetect: [] # List of node labels to detect. Supports both key-only and key=value formats
@@ -40,3 +41,6 @@ MaxParkedNodes: 0 # Maximum number of nodes that can be parked simultaneously.
 
 # Safety settings
 EvictionSafetyCheck: true # Controls whether to perform safety checks before force eviction. If true, nodes will be unparked if pods don't have required parking labels.
+
+# Parking reason tracking
+ParkingReasonLabel: shredder.ethos.adobe.net/parked-reason # Label used to track why a node or pod was parked

docs/metrics.md

Lines changed: 6 additions & 1 deletion
@@ -86,6 +86,11 @@ k8s-shredder exposes metrics in Prometheus format to help operators monitor the
 - **Description**: Total number of drifted Karpenter nodes detected
 - **Use Case**: Monitor the volume of Karpenter drift detection activity
 
+### `shredder_karpenter_disrupted_nodes_total`
+- **Type**: Counter
+- **Description**: Total number of disrupted Karpenter nodes detected
+- **Use Case**: Monitor the volume of Karpenter disruption detection activity
+
 ### `shredder_karpenter_nodes_parked_total`
 - **Type**: Counter
 - **Description**: Total number of Karpenter nodes successfully parked
@@ -220,4 +225,4 @@ Metrics are exposed on the configured port (default: 8080) at the `/metrics` end
 - **Health Endpoint**: Available at `/healthz` for health checks
 - **OpenMetrics Format**: Enabled by default for better compatibility
 
-For more information about configuring k8s-shredder, see the [main README](../README.md).
+For more information about configuring k8s-shredder, see the [main README](../README.md).
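
Reviewer note: to show what the new `shredder_karpenter_disrupted_nodes_total` counter might look like when scraped, here is a standard-library-only Go sketch that increments a counter and renders it in the Prometheus text exposition format. It is not how k8s-shredder registers its metrics; the `counter` type and its methods are hypothetical, and only the metric name and description come from this diff.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// counter is a minimal monotonically increasing metric.
// Hypothetical type; a real exporter would use a metrics client library.
type counter struct {
	name  string
	help  string
	value atomic.Uint64
}

// Inc adds one to the counter; safe for concurrent use.
func (c *counter) Inc() { c.value.Add(1) }

// Expose renders the counter in the Prometheus text exposition format.
func (c *counter) Expose() string {
	return fmt.Sprintf("# HELP %s %s\n# TYPE %s counter\n%s %d\n",
		c.name, c.help, c.name, c.name, c.value.Load())
}

func main() {
	disrupted := &counter{
		name: "shredder_karpenter_disrupted_nodes_total",
		help: "Total number of disrupted Karpenter nodes detected",
	}
	// Pretend three disrupted nodes were detected in one eviction loop.
	for i := 0; i < 3; i++ {
		disrupted.Inc()
	}
	fmt.Print(disrupted.Expose())
}
```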
