Commit 178dfe5

Add node parking metrics and documentation (#341)
* Add node parking metrics and documentation
* Update local-test-* to test parking metrics and share fewer config items
* Update CI to also test karpenter and node-label parking
1 parent 542da0a commit 178dfe5

19 files changed: +1042 −59 lines

.github/workflows/ci.yaml

Lines changed: 61 additions & 1 deletion
```diff
@@ -1,4 +1,4 @@
-name: CI tests
+name: CI Tests
 
 on: pull_request
 
@@ -32,3 +32,63 @@ jobs:
         run: make local-test
       - name: Run e2e tests
         run: make e2e-tests
+
+  ci-karpenter:
+    runs-on: ubuntu-latest
+    name: ci-karpenter
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+      - name: Setup Go
+        uses: actions/setup-go@v5
+        with:
+          go-version: '1.24'
+      - name: Run Gosec Security Scanner
+        uses: securego/gosec@master
+        with:
+          args: -quiet -exclude=G107 ./...
+      - name: Run golangci-lint
+        uses: golangci/golangci-lint-action@v8
+        with:
+          # Optional: version of golangci-lint to use in form of v1.2 or v1.2.3 or `latest` to use the latest version
+          # version: v1.46
+          args: -v --timeout 5m --no-config ./...
+      - name: Install k8s Kind Cluster
+        uses: helm/[email protected]
+        with:
+          install_only: true
+          version: v0.29.0
+      - name: Prepare test environment
+        run: make local-test-karpenter
+      - name: Run e2e tests
+        run: make e2e-tests
+
+  ci-node-labels:
+    runs-on: ubuntu-latest
+    name: ci-node-labels
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+      - name: Setup Go
+        uses: actions/setup-go@v5
+        with:
+          go-version: '1.24'
+      - name: Run Gosec Security Scanner
+        uses: securego/gosec@master
+        with:
+          args: -quiet -exclude=G107 ./...
+      - name: Run golangci-lint
+        uses: golangci/golangci-lint-action@v8
+        with:
+          # Optional: version of golangci-lint to use in form of v1.2 or v1.2.3 or `latest` to use the latest version
+          # version: v1.46
+          args: -v --timeout 5m --no-config ./...
+      - name: Install k8s Kind Cluster
+        uses: helm/[email protected]
+        with:
+          install_only: true
+          version: v0.29.0
+      - name: Prepare test environment
+        run: make local-test-node-labels
+      - name: Run e2e tests
+        run: make e2e-tests
```

LICENSE

Lines changed: 1 addition & 1 deletion
```diff
@@ -186,7 +186,7 @@
       same "printed page" as the copyright notice for easier
       identification within third-party archives.
 
-   Copyright 2022 Adobe
+   Copyright 2025 Adobe
 
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
```

Makefile

Lines changed: 10 additions & 1 deletion
```diff
@@ -101,7 +101,16 @@ unit-test: ## Run unit tests
 
 e2e-tests: ## Run e2e tests for k8s-shredder deployed in a local kind cluster
 	@echo "Run e2e tests for k8s-shredder..."
-	@KUBECONFIG=${PWD}/${KUBECONFIG_LOCALTEST} go test internal/testing/e2e_test.go -v
+	@if [ -f "${PWD}/${KUBECONFIG_KARPENTER}" ]; then \
+		echo "Using Karpenter test cluster configuration..."; \
+		KUBECONFIG=${PWD}/${KUBECONFIG_KARPENTER} go test internal/testing/e2e_test.go -v; \
+	elif [ -f "${PWD}/${KUBECONFIG_NODE_LABELS}" ]; then \
+		echo "Using node labels test cluster configuration..."; \
+		KUBECONFIG=${PWD}/${KUBECONFIG_NODE_LABELS} go test internal/testing/e2e_test.go -v; \
+	else \
+		echo "Using default test cluster configuration..."; \
+		KUBECONFIG=${PWD}/${KUBECONFIG_LOCALTEST} go test internal/testing/e2e_test.go -v; \
+	fi
 
 # DEMO targets
 # -----------
```

README.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -103,6 +103,10 @@ k8s-shredder includes optional automatic detection of nodes with specific labels
 
 This integration allows k8s-shredder to automatically handle node lifecycle management based on custom labeling strategies, enabling teams to mark nodes for parking using their own operational workflows and labels. For example, this can be used in conjunction with [AKS cluster upgrades](https://learn.microsoft.com/en-us/azure/aks/upgrade-cluster#set-new-cordon-behavior).
 
+## Metrics
+
+k8s-shredder exposes comprehensive metrics for monitoring its operation. You can find detailed information about all available metrics in the [metrics documentation](docs/metrics.md).
+
 #### Creating a new release
 
 See [RELEASE.md](RELEASE.md).
```

docs/metrics.md

Lines changed: 223 additions & 0 deletions
New file:

# k8s-shredder Metrics

This document describes all the metrics exposed by k8s-shredder. These metrics are available at the `/metrics` endpoint and can be scraped by Prometheus or other monitoring systems.

## Overview

k8s-shredder exposes metrics in Prometheus format to help operators monitor the health and performance of the node parking and eviction processes. The metrics are organized into several categories:

- **Core Operation Metrics**: General operation counters and timing
- **API Server Metrics**: Kubernetes API interaction metrics
- **Node Processing Metrics**: Node parking and processing statistics
- **Pod Processing Metrics**: Pod eviction and processing statistics
- **Karpenter Integration Metrics**: Karpenter drift detection metrics
- **Node Label Detection Metrics**: Node label-based detection metrics
- **Shared Metrics**: Aggregated metrics across all detection methods

## Core Operation Metrics

### `shredder_loops_total`
- **Type**: Counter
- **Description**: Total number of eviction loops completed
- **Use Case**: Monitor the frequency of eviction loop execution and overall system activity

### `shredder_loops_duration_seconds`
- **Type**: Summary
- **Description**: Duration of eviction loops in seconds
- **Objectives**: 0.5: 1200, 0.9: 900, 0.99: 600
- **Use Case**: Monitor the performance of eviction loops and identify slow operations

### `shredder_errors_total`
- **Type**: Counter
- **Description**: Total number of errors encountered during operation
- **Use Case**: Monitor system health and identify operational issues
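For readers less familiar with these metric types, here is a minimal `prometheus/client_golang` sketch of how a counter and a summary with the objectives listed above might be declared and updated. It is illustrative only, not the k8s-shredder implementation; the `loopsTotal` and `loopsDuration` variable names are hypothetical.

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical declarations mirroring the metric names documented above.
var (
	loopsTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "shredder_loops_total",
		Help: "Total number of eviction loops completed",
	})
	loopsDuration = promauto.NewSummary(prometheus.SummaryOpts{
		Name: "shredder_loops_duration_seconds",
		Help: "Duration of eviction loops in seconds",
		// Objectives as documented above (in client_golang this map is quantile -> allowed error).
		Objectives: map[float64]float64{0.5: 1200, 0.9: 900, 0.99: 600},
	})
)

func main() {
	start := time.Now()
	// ... an eviction loop would run here ...
	loopsTotal.Inc()
	loopsDuration.Observe(time.Since(start).Seconds())
}
```

Because these are summaries, the quantiles are exported as precomputed `quantile`-labeled series alongside `_sum` and `_count`, rather than as histogram `_bucket` series.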
## API Server Metrics

### `shredder_apiserver_requests_total`
- **Type**: Counter Vector
- **Labels**: `verb`, `resource`, `status`
- **Description**: Total requests made to the Kubernetes API
- **Use Case**: Monitor API usage patterns and identify potential rate limiting issues

### `shredder_apiserver_requests_duration_seconds`
- **Type**: Summary Vector
- **Labels**: `verb`, `resource`, `status`
- **Description**: Duration of Kubernetes API requests in seconds
- **Objectives**: 0.5: 0.05, 0.9: 0.01, 0.99: 0.001
- **Use Case**: Monitor API performance and identify slow API calls
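The labeled vectors above can be pictured with a similar hedged sketch, again assuming `prometheus/client_golang`; the `observeRequest` helper and the label values in `main` are hypothetical, not part of k8s-shredder.

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical vectors using the label names documented above.
var (
	apiRequests = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "shredder_apiserver_requests_total",
		Help: "Total requests made to the Kubernetes API",
	}, []string{"verb", "resource", "status"})

	apiRequestDuration = promauto.NewSummaryVec(prometheus.SummaryOpts{
		Name:       "shredder_apiserver_requests_duration_seconds",
		Help:       "Duration of Kubernetes API requests in seconds",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	}, []string{"verb", "resource", "status"})
)

// observeRequest records one API call with its labels and duration (hypothetical helper).
func observeRequest(verb, resource, status string, start time.Time) {
	apiRequests.WithLabelValues(verb, resource, status).Inc()
	apiRequestDuration.WithLabelValues(verb, resource, status).Observe(time.Since(start).Seconds())
}

func main() {
	start := time.Now()
	// ... e.g. a LIST of nodes would happen here ...
	observeRequest("list", "nodes", "200", start)
}
```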
## Node Processing Metrics

### `shredder_processed_nodes_total`
- **Type**: Counter
- **Description**: Total number of nodes processed during eviction loops
- **Use Case**: Monitor the volume of node processing activity

### `shredder_node_force_to_evict_time`
- **Type**: Gauge Vector
- **Labels**: `node_name`
- **Description**: Unix timestamp when a node will be forcibly evicted
- **Use Case**: Monitor when nodes are scheduled for forced eviction

## Pod Processing Metrics

### `shredder_processed_pods_total`
- **Type**: Counter
- **Description**: Total number of pods processed during eviction loops
- **Use Case**: Monitor the volume of pod processing activity

### `shredder_pod_errors_total`
- **Type**: Gauge Vector
- **Labels**: `pod_name`, `namespace`, `reason`, `action`
- **Description**: Total pod errors per eviction loop
- **Use Case**: Monitor pod eviction failures and their reasons

### `shredder_pod_force_to_evict_time`
- **Type**: Gauge Vector
- **Labels**: `pod_name`, `namespace`
- **Description**: Unix timestamp when a pod will be forcibly evicted
- **Use Case**: Monitor when pods are scheduled for forced eviction

## Karpenter Integration Metrics

### `shredder_karpenter_drifted_nodes_total`
- **Type**: Counter
- **Description**: Total number of drifted Karpenter nodes detected
- **Use Case**: Monitor the volume of Karpenter drift detection activity

### `shredder_karpenter_nodes_parked_total`
- **Type**: Counter
- **Description**: Total number of Karpenter nodes successfully parked
- **Use Case**: Monitor successful Karpenter node parking operations

### `shredder_karpenter_nodes_parking_failed_total`
- **Type**: Counter
- **Description**: Total number of Karpenter nodes that failed to be parked
- **Use Case**: Monitor Karpenter node parking failures

### `shredder_karpenter_processing_duration_seconds`
- **Type**: Summary
- **Description**: Duration of Karpenter node processing in seconds
- **Objectives**: 0.5: 0.05, 0.9: 0.01, 0.99: 0.001
- **Use Case**: Monitor the performance of Karpenter drift detection and parking operations

## Node Label Detection Metrics

### `shredder_node_label_nodes_parked_total`
- **Type**: Counter
- **Description**: Total number of nodes successfully parked via node label detection
- **Use Case**: Monitor successful node label-based parking operations

### `shredder_node_label_nodes_parking_failed_total`
- **Type**: Counter
- **Description**: Total number of nodes that failed to be parked via node label detection
- **Use Case**: Monitor node label-based parking failures

### `shredder_node_label_processing_duration_seconds`
- **Type**: Summary
- **Description**: Duration of node label detection and parking process in seconds
- **Objectives**: 0.5: 0.05, 0.9: 0.01, 0.99: 0.001
- **Use Case**: Monitor the performance of node label detection and parking operations

### `shredder_node_label_matching_nodes_total`
- **Type**: Gauge
- **Description**: Total number of nodes matching the label criteria
- **Use Case**: Monitor the current number of nodes that match the configured label selectors

## Shared Metrics

These metrics aggregate data across all detection methods (Karpenter and node label detection) to provide a unified view of node parking activity.

### `shredder_nodes_parked_total`
- **Type**: Counter
- **Description**: Total number of nodes successfully parked (shared across all detection methods)
- **Use Case**: Monitor total node parking activity regardless of detection method

### `shredder_nodes_parking_failed_total`
- **Type**: Counter
- **Description**: Total number of nodes that failed to be parked (shared across all detection methods)
- **Use Case**: Monitor total node parking failures regardless of detection method

### `shredder_processing_duration_seconds`
- **Type**: Summary
- **Description**: Duration of node processing in seconds (shared across all detection methods)
- **Objectives**: 0.5: 0.05, 0.9: 0.01, 0.99: 0.001
- **Use Case**: Monitor total node processing performance regardless of detection method

## Metric Relationships

### Detection Method Metrics
- **Karpenter metrics** are incremented when `EnableKarpenterDriftDetection=true`
- **Node label metrics** are incremented when `EnableNodeLabelDetection=true`
- **Shared metrics** are incremented whenever either detection method processes nodes (see the sketch below)
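A minimal sketch of that relationship, assuming `prometheus/client_golang` counters with the documented names; `recordKarpenterParked` is a hypothetical helper, not part of k8s-shredder.

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical counters following the naming in this document.
var (
	karpenterParked = promauto.NewCounter(prometheus.CounterOpts{
		Name: "shredder_karpenter_nodes_parked_total",
		Help: "Total number of Karpenter nodes successfully parked",
	})
	sharedParked = promauto.NewCounter(prometheus.CounterOpts{
		Name: "shredder_nodes_parked_total",
		Help: "Total number of nodes successfully parked across all detection methods",
	})
)

// recordKarpenterParked bumps both the detection-specific counter and the shared one,
// mirroring the relationship described in "Detection Method Metrics" above.
func recordKarpenterParked() {
	karpenterParked.Inc()
	sharedParked.Inc()
}

func main() {
	recordKarpenterParked()
}
```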
### Processing Flow
1. **Detection**: Nodes are identified via Karpenter drift or label matching
2. **Parking**: Nodes are labeled, cordoned, and tainted
3. **Eviction**: Pods are evicted from parked nodes over time
4. **Cleanup**: Nodes are eventually removed when all pods are evicted

## Alerting Recommendations

### High Error Rates
```promql
rate(shredder_errors_total[5m]) > 0.1
```

### Slow Processing
```promql
histogram_quantile(0.95, rate(shredder_processing_duration_seconds_bucket[5m])) > 30
```

### Failed Node Parking
```promql
rate(shredder_nodes_parking_failed_total[5m]) > 0
```

### High API Latency
```promql
histogram_quantile(0.95, rate(shredder_apiserver_requests_duration_seconds_bucket[5m])) > 5
```

### Parked Pods Alert
```promql
# Alert when pods are running on parked nodes
kube_ethos_upgrade:parked_pod > 0
```

## Example Queries

### Node Parking Success Rate
```promql
rate(shredder_nodes_parked_total[5m]) / (rate(shredder_nodes_parked_total[5m]) + rate(shredder_nodes_parking_failed_total[5m]))
```

### Average Processing Duration
```promql
histogram_quantile(0.5, rate(shredder_processing_duration_seconds_bucket[5m]))
```

### Nodes Parked by Detection Method
```promql
# Karpenter nodes
rate(shredder_karpenter_nodes_parked_total[5m])

# Label-based nodes
rate(shredder_node_label_nodes_parked_total[5m])
```

### Current Matching Nodes
```promql
shredder_node_label_matching_nodes_total
```

## Configuration

Metrics are exposed on the configured port (default: 8080) at the `/metrics` endpoint. The metrics server can be configured using the following options:

- **Metrics Port**: Configure the port for metrics exposure
- **Health Endpoint**: Available at `/healthz` for health checks
- **OpenMetrics Format**: Enabled by default for better compatibility (see the serving sketch below)
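As a rough illustration of such a setup (not the project's actual server code), a `promhttp`-based sketch that serves `/metrics` with OpenMetrics negotiation enabled and a simple `/healthz` handler on port 8080 might look like this:

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	mux := http.NewServeMux()

	// Metrics endpoint; EnableOpenMetrics lets the handler serve the OpenMetrics format when requested.
	mux.Handle("/metrics", promhttp.HandlerFor(
		prometheus.DefaultGatherer,
		promhttp.HandlerOpts{EnableOpenMetrics: true},
	))

	// Simple health endpoint.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("ok"))
	})

	srv := &http.Server{
		Addr:              ":8080", // default metrics port mentioned above
		Handler:           mux,
		ReadHeaderTimeout: 5 * time.Second,
	}
	log.Fatal(srv.ListenAndServe())
}
```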
For more information about configuring k8s-shredder, see the [main README](../README.md).

internal/check_license.sh

Lines changed: 50 additions & 0 deletions
```diff
@@ -1,6 +1,56 @@
 #!/usr/bin/env bash
 set -ueo pipefail
 
+CURRENT_YEAR=$(date +%Y)
+export CURRENT_YEAR
+
+# Function to update copyright year in Go files
+update_go_copyright_year() {
+    local file=$1
+    local temp_file=$(mktemp)
+
+    # Check if file has a copyright header
+    if head -n3 "$file" | grep -q "Copyright.*20[0-9]\{2\}"; then
+        # Update the year to current year
+        echo "Processing file: $file"
+        # Now do the replacement
+        sed "s/202[0-9]/$CURRENT_YEAR/g" "$file" > "$temp_file"
+    else
+        # Add copyright header if missing
+        echo "// Copyright $CURRENT_YEAR Adobe. All rights reserved." > "$temp_file"
+        cat "$file" >> "$temp_file"
+    fi
+
+    # Replace original file with modified content
+    mv "$temp_file" "$file"
+}
+
+# Function to update copyright year in LICENSE file
+update_license_copyright_year() {
+    local file=$1
+    local temp_file=$(mktemp)
+
+    echo "Processing LICENSE file"
+
+    # Update only the line containing "Copyright 2022 Adobe"
+    sed "s/Copyright 202[0-9] Adobe/Copyright $CURRENT_YEAR Adobe/g" "$file" > "$temp_file"
+
+    # Replace original file with modified content
+    mv "$temp_file" "$file"
+}
+
+export -f update_go_copyright_year
+export -f update_license_copyright_year
+
+# Update LICENSE file if it exists
+if [ -f "LICENSE" ]; then
+    update_license_copyright_year "LICENSE"
+fi
+
+# Find all Go files and update their copyright headers
+find . -type f -iname '*.go' ! -path '*/vendor/*' -exec bash -c 'update_go_copyright_year "$1"' _ {} \;
+
+# Check if any files are missing the license header
 licRes=$(
 	find . -type f -iname '*.go' ! -path '*/vendor/*' -exec \
 	sh -c 'head -n3 $1 | grep -Eq "(Copyright|generated|GENERATED)" || echo "$1"' {} {} \;
```
