Implement MaxParkedNodes, arbitrary node labels, and parking SafetyCheck (#384)
* Implement MaxParkedNodes feature to limit number of nodes parked at one time
* Add way to add arbitrary labels to parked nodes and pods
* Add SafetyCheck feature to make sure we don't force evict unlabeled pods
* update helm chart to default to latest app version
@@ -52,12 +52,14 @@ The following options can be used to customize the k8s-shredder controller:
| AllowEvictionLabel | "shredder.ethos.adobe.net/allow-eviction" | Label used for skipping evicting pods that have explicitly set this label on false |
| ToBeDeletedTaint | "ToBeDeletedByClusterAutoscaler" | Node taint used for skipping a subset of parked nodes that are already handled by cluster-autoscaler |
| ArgoRolloutsAPIVersion | "v1alpha1" | API version from `argoproj.io` API group to be used while handling Argo Rollouts objects |
| EnableKarpenterDriftDetection | false | Controls whether to scan for drifted Karpenter NodeClaims and automatically label their nodes |
| ParkedByLabel | "shredder.ethos.adobe.net/parked-by" | Label used to identify which component parked the node |
| ParkedNodeTaint | "shredder.ethos.adobe.net/upgrade-status=parked:NoSchedule" | Taint to apply to parked nodes in format key=value:effect |
| EnableNodeLabelDetection | false | Controls whether to scan for nodes with specific labels and automatically park them |
| NodeLabelsToDetect | [] | List of node labels to detect. Supports both key-only and key=value formats |
| MaxParkedNodes | 0 | Maximum number of nodes that can be parked simultaneously. Set to 0 (default) for no limit. |
| ExtraParkingLabels | {} | (Optional) Map of extra labels to apply to nodes and pods during parking. Example: `{ "example.com/owner": "infrastructure" }` |
| EvictionSafetyCheck | true | Controls whether to perform safety checks before force eviction. If true, nodes will be unparked if pods don't have required parking labels. |
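
The `NodeLabelsToDetect` row mentions two selector formats, key-only and key=value. As a rough illustration of how such selectors could be matched against a node's labels, here is a minimal Python sketch. It is not k8s-shredder's Go implementation: the function name `node_matches` is hypothetical, and the choices that a node qualifies when *any* selector matches and that key=value requires an exact value match are assumptions.

```python
from typing import Dict, List


def node_matches(node_labels: Dict[str, str], selectors: List[str]) -> bool:
    """Return True if the node carries at least one of the configured labels.

    Assumption: selectors are OR-ed together; the README text does not say.
    """
    for selector in selectors:
        # "key" -> ("key", "", ""); "key=value" -> ("key", "=", "value")
        key, sep, value = selector.partition("=")
        if key not in node_labels:
            continue
        # Key-only selector: the label being present is enough.
        # key=value selector: the label value must match exactly.
        if not sep or node_labels[key] == value:
            return True
    return False
```

For example, `node_matches({"example.com/state": "drain"}, ["example.com/state=drain"])` evaluates to `True`, while the same selector against a node labeled `example.com/state=ready` evaluates to `False`.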

### How it works

@@ -81,6 +83,7 @@ k8s-shredder includes an optional feature for automatic detection of drifted Kar
- `UpgradeStatusLabel` (set to "parked")
- `ExpiresOnLabel` (set to current time + `ParkedNodeTTL`)
- `ParkedByLabel` (set to "k8s-shredder")
- Any labels specified in `ExtraParkingLabels`
- **Cordoning** the nodes to prevent new pod scheduling
- **Tainting** the nodes with the configured `ParkedNodeTaint`
@@ -98,15 +101,102 @@ k8s-shredder includes optional automatic detection of nodes with specific labels
- `UpgradeStatusLabel` (set to "parked")
- `ExpiresOnLabel` (set to current time + `ParkedNodeTTL`)
- `ParkedByLabel` (set to "k8s-shredder")
- Any labels specified in `ExtraParkingLabels`
- **Cordoning** the nodes to prevent new pod scheduling
- **Tainting** the nodes with the configured `ParkedNodeTaint`

This integration allows k8s-shredder to automatically handle node lifecycle management based on custom labeling strategies, enabling teams to mark nodes for parking using their own operational workflows and labels. For example, this can be used in conjunction with [AKS cluster upgrades](https://learn.microsoft.com/en-us/azure/aks/upgrade-cluster#set-new-cordon-behavior).

#### Parking Limits with MaxParkedNodes

k8s-shredder supports limiting the maximum number of nodes that can be parked simultaneously using the `MaxParkedNodes` configuration option. This feature helps prevent overwhelming the cluster with too many parked nodes at once, which could impact application availability.

When `MaxParkedNodes` is set to a positive integer:

1. **Before parking nodes**: k8s-shredder counts the number of currently parked nodes
2. **Calculate available slots**: `availableSlots = MaxParkedNodes - currentlyParked`
3. **Limit parking**: If the number of eligible nodes exceeds available slots, only the first `availableSlots` nodes are parked
4. **Skip if full**: If no slots are available (`currentlyParked >= MaxParkedNodes`), parking is skipped for that eviction interval

**Examples:**
- `MaxParkedNodes: 0` (default): No limit, all eligible nodes are parked
- `MaxParkedNodes: 5`: Maximum 5 nodes can be parked at any time
- `MaxParkedNodes: -1`: Invalid value, treated as 0 (no limit) with a warning logged

This limit applies to both Karpenter drift detection and node label detection features. When multiple nodes are eligible for parking but the limit would be exceeded, k8s-shredder will park the nodes in the order they are discovered and skip the remaining nodes until the next eviction interval.
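
The slot calculation described above can be sketched in a few lines of Python. This is an illustration of the documented rules only, not the controller's actual Go code; the function name `select_nodes_to_park` and the logger name are hypothetical.

```python
import logging
from typing import List

logger = logging.getLogger("parking-sketch")  # hypothetical logger name


def select_nodes_to_park(eligible: List[str], currently_parked: int,
                         max_parked_nodes: int) -> List[str]:
    """Return the eligible nodes that fit under the MaxParkedNodes limit."""
    if max_parked_nodes < 0:
        # Negative values are invalid and are treated as 0 (no limit), with a warning.
        logger.warning("invalid MaxParkedNodes %d, treating as 0 (no limit)",
                       max_parked_nodes)
        max_parked_nodes = 0
    if max_parked_nodes == 0:
        # 0 means no limit: every eligible node is parked.
        return list(eligible)
    available_slots = max_parked_nodes - currently_parked
    if available_slots <= 0:
        # Limit already reached: skip parking for this eviction interval.
        return []
    # Park nodes in discovery order, up to the number of free slots.
    return eligible[:available_slots]
```

With `MaxParkedNodes: 5` and three nodes already parked, `select_nodes_to_park(["a", "b", "c"], 3, 5)` returns `["a", "b"]`; node `c` is left for a later interval.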

**Use cases:**
- **Gradual node replacement**: Control the pace of node cycling during cluster upgrades
- **Resource management**: Prevent excessive resource pressure from too many parked nodes
- **Application stability**: Ensure applications have sufficient capacity during node transitions
- **Cost optimization**: Balance between node replacement speed and cluster stability

#### ExtraParkingLabels

The `ExtraParkingLabels` option allows you to specify a map of additional Kubernetes labels that will be applied to all nodes and pods during the parking process. This is useful for custom automation, monitoring, or compliance workflows.

**Configuration:**
```yaml
ExtraParkingLabels:
  example.com/owner: "infrastructure"
  example.com/maintenance: "true"
  example.com/upgrade-batch: "batch-1"
```

**Use cases:**
- **Team ownership**: Mark parked nodes with team ownership labels for accountability
- **Maintenance tracking**: Add labels to track maintenance windows or upgrade batches
- **Compliance**: Apply labels required by compliance or governance policies
- **Monitoring**: Enable custom alerting or monitoring based on parking labels
- **Automation**: Trigger external automation workflows based on parking labels

**Behavior:**
- Labels are applied to both nodes and their non-DaemonSet pods during parking
- Labels are removed during the unparking process (if `EvictionSafetyCheck` triggers unparking)
- If not set or empty, no extra labels are applied
- Labels are applied in addition to the standard parking labels (`UpgradeStatusLabel`, `ExpiresOnLabel`, `ParkedByLabel`)
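
To make the "in addition to the standard parking labels" behavior concrete, the following Python sketch builds the full label set for a parked object. It is illustrative only and rests on several assumptions: the key used for `UpgradeStatusLabel` is assumed, the expiry is encoded as a Unix timestamp string, and standard labels win if an extra label reuses one of their keys. None of these details are confirmed by this README. The function name `build_parking_labels` is hypothetical.

```python
from typing import Dict, Optional

# The ExpiresOnLabel and ParkedByLabel keys are the documented defaults.
# The UpgradeStatusLabel key is an assumption made for this sketch.
UPGRADE_STATUS_LABEL = "shredder.ethos.adobe.net/upgrade-status"
EXPIRES_ON_LABEL = "shredder.ethos.adobe.net/parked-node-expires-on"
PARKED_BY_LABEL = "shredder.ethos.adobe.net/parked-by"


def build_parking_labels(now_unix: int, ttl_seconds: int,
                         extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Merge ExtraParkingLabels with the standard parking labels."""
    # Start from the extra labels; an unset or empty map adds nothing.
    labels = dict(extra or {})
    # Standard labels are written last, so they take precedence on key
    # collisions (a design choice of this sketch, not documented behavior).
    labels.update({
        UPGRADE_STATUS_LABEL: "parked",
        # Assumed encoding: current time + ParkedNodeTTL as Unix seconds.
        EXPIRES_ON_LABEL: str(now_unix + ttl_seconds),
        PARKED_BY_LABEL: "k8s-shredder",
    })
    return labels
```

Calling `build_parking_labels(1000, 60, {"example.com/owner": "infrastructure"})` yields the three standard labels plus `example.com/owner`.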

#### EvictionSafetyCheck

The `EvictionSafetyCheck` feature provides an additional safety mechanism to prevent force eviction of pods that weren't properly prepared for parking. When enabled (default: `true`), k8s-shredder performs a safety check before force evicting pods from expired parked nodes.

**How it works:**

1. **Before force eviction**: When a node's TTL expires and force eviction is about to begin, k8s-shredder checks all non-DaemonSet and non-static pods on the node
2. **Required labels check**: Each pod must have:
   - `UpgradeStatusLabel` set to "parked"
   - `ExpiresOnLabel` present with any value
3. **Safety decision**:
   - If **all** pods have the required labels → proceed with force eviction
   - If **any** pod is missing required labels → unpark the node instead of force evicting
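
The decision rule above can be expressed as a small Python sketch. It models pods as plain data rather than Kubernetes API objects and is not the controller's Go implementation. The `Pod` class, the `safety_check` function, and the `UpgradeStatusLabel` key are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# The UpgradeStatusLabel key is assumed; the ExpiresOnLabel key is the documented default.
UPGRADE_STATUS_LABEL = "shredder.ethos.adobe.net/upgrade-status"
EXPIRES_ON_LABEL = "shredder.ethos.adobe.net/parked-node-expires-on"


@dataclass
class Pod:
    """Hypothetical, simplified stand-in for a Kubernetes pod."""
    name: str
    labels: Dict[str, str] = field(default_factory=dict)
    is_daemonset: bool = False
    is_static: bool = False


def safety_check(pods: List[Pod]) -> Tuple[bool, List[str]]:
    """Return (safe_to_force_evict, names of pods missing required labels)."""
    missing = [
        pod.name
        for pod in pods
        # DaemonSet and static pods are excluded from the check.
        if not (pod.is_daemonset or pod.is_static)
        # Required: status label equals "parked" and expiry label is present.
        and not (pod.labels.get(UPGRADE_STATUS_LABEL) == "parked"
                 and EXPIRES_ON_LABEL in pod.labels)
    ]
    # Force eviction proceeds only when no checked pod is missing labels.
    return (not missing, missing)
```

Returning the list of offending pod names alongside the boolean lets a caller report which pods caused the node to be unparked.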

**Unparking process:**
When safety check fails, k8s-shredder automatically unparks the node by:
- Removing `ExpiresOnLabel` and `ExtraParkingLabels` from nodes and pods
- Removing the `ParkedNodeTaint`
- Uncordoning the node (making it schedulable again)
- Setting `UpgradeStatusLabel` to "unparked" on nodes and pods
- Setting `ParkedByLabel` to the configured `ParkedByValue`
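
As a rough model of the unparking steps listed above, the Python sketch below applies them to an in-memory node. It covers the node only (pods would receive the same label changes), represents taints as `key=value:effect` strings, and uses a hypothetical `Node` class and `unpark` function. The `UpgradeStatusLabel` key is assumed. It is a sketch, not the controller's Go code.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List

# The UpgradeStatusLabel key is assumed; the other two keys are documented defaults.
UPGRADE_STATUS_LABEL = "shredder.ethos.adobe.net/upgrade-status"
EXPIRES_ON_LABEL = "shredder.ethos.adobe.net/parked-node-expires-on"
PARKED_BY_LABEL = "shredder.ethos.adobe.net/parked-by"


@dataclass
class Node:
    """Hypothetical, simplified stand-in for a Kubernetes node."""
    labels: Dict[str, str] = field(default_factory=dict)
    taints: List[str] = field(default_factory=list)
    unschedulable: bool = True  # True means cordoned


def unpark(node: Node, extra_label_keys: Iterable[str],
           parked_node_taint: str, parked_by_value: str) -> Node:
    """Apply the documented unparking steps to the node in place."""
    # Remove ExpiresOnLabel and every key from ExtraParkingLabels.
    for key in (EXPIRES_ON_LABEL, *extra_label_keys):
        node.labels.pop(key, None)
    # Remove the parking taint, leaving unrelated taints untouched.
    node.taints = [t for t in node.taints if t != parked_node_taint]
    # Uncordon the node so new pods can be scheduled again.
    node.unschedulable = False
    # Record the new state and who performed it.
    node.labels[UPGRADE_STATUS_LABEL] = "unparked"
    node.labels[PARKED_BY_LABEL] = parked_by_value
    return node
```

Keeping `UpgradeStatusLabel` with the value "unparked", rather than deleting it, leaves a visible trace that the node was parked and then released.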

**Use cases:**
- **Safety during manual parking**: If nodes are manually parked but pods weren't properly labeled
- **Partial parking failures**: When parking automation fails to label all pods
- **Emergency recovery**: Provides a safe way to recover from parking mistakes
- **Compliance**: Ensures only properly prepared workloads are force evicted

When safety checks fail, k8s-shredder logs detailed information about which pods are missing required labels, helping operators understand why the node was unparked instead of force evicted.

## Metrics

k8s-shredder exposes comprehensive metrics for monitoring its operation. You can find detailed information about all available metrics in the [metrics documentation](docs/metrics.md).
| shredder.EnableNodeLabelDetection | bool | `false` | Enable detection of nodes based on specific labels |
| shredder.EvictionLoopInterval | string | `"1h"` | How often to run the main eviction loop |
| shredder.EvictionSafetyCheck | bool | `true` | Controls whether to perform safety checks before force eviction |
| shredder.ExpiresOnLabel | string | `"shredder.ethos.adobe.net/parked-node-expires-on"` | Label used to track when a parked node expires |
| shredder.ExtraParkingLabels | object | `{}` | Additional labels to apply to nodes and pods during parking |
| shredder.MaxParkedNodes | int | `0` | Maximum number of nodes that can be parked simultaneously (0 = no limit) |
| shredder.NamespacePrefixSkipInitialEviction | string | `"ns-ethos-"` | Namespace prefix to skip during initial eviction (useful for system namespaces) |
| shredder.NodeLabelsToDetect | list | `[]` | List of node labels to monitor for triggering shredder actions |
| shredder.ParkedByLabel | string | `"shredder.ethos.adobe.net/parked-by"` | Label to track which component parked a node |