# k8s-shredder Metrics

This document describes all the metrics exposed by k8s-shredder. These metrics are available at the `/metrics` endpoint and can be scraped by Prometheus or other monitoring systems.

## Overview

k8s-shredder exposes metrics in Prometheus format to help operators monitor the health and performance of the node parking and eviction processes. The metrics are organized into several categories:

- **Core Operation Metrics**: General operation counters and timing
- **API Server Metrics**: Kubernetes API interaction metrics
- **Node Processing Metrics**: Node parking and processing statistics
- **Pod Processing Metrics**: Pod eviction and processing statistics
- **Karpenter Integration Metrics**: Karpenter drift detection metrics
- **Node Label Detection Metrics**: Node label-based detection metrics
- **Shared Metrics**: Aggregated metrics across all detection methods

## Core Operation Metrics

### `shredder_loops_total`
- **Type**: Counter
- **Description**: Total number of eviction loops completed
- **Use Case**: Monitor the frequency of eviction loop execution and overall system activity

### `shredder_loops_duration_seconds`
- **Type**: Summary
- **Description**: Duration of eviction loops in seconds
- **Objectives**: 0.5: 1200, 0.9: 900, 0.99: 600
- **Use Case**: Monitor the performance of eviction loops and identify slow operations
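
Because this metric is a Summary, its quantiles are exported directly and do not require `histogram_quantile`. For example, to read the p90 loop duration and a mean duration over the last 15 minutes:

```promql
# 90th percentile loop duration, as exported by the summary
shredder_loops_duration_seconds{quantile="0.9"}

# mean loop duration over the last 15 minutes
rate(shredder_loops_duration_seconds_sum[15m]) / rate(shredder_loops_duration_seconds_count[15m])
```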

### `shredder_errors_total`
- **Type**: Counter
- **Description**: Total number of errors encountered during operation
- **Use Case**: Monitor system health and identify operational issues
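
A useful derived signal (not a metric exposed by k8s-shredder itself) is the number of errors per completed eviction loop, which normalizes errors against overall activity:

```promql
# errors per completed eviction loop over the last hour
increase(shredder_errors_total[1h]) / increase(shredder_loops_total[1h])
```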

## API Server Metrics

### `shredder_apiserver_requests_total`
- **Type**: Counter Vector
- **Labels**: `verb`, `resource`, `status`
- **Description**: Total requests made to the Kubernetes API
- **Use Case**: Monitor API usage patterns and identify potential rate limiting issues
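
For example, request throughput broken down by the labels above, which also surfaces non-2xx responses:

```promql
# API request rate by verb, resource, and status over the last 5 minutes
sum by (verb, resource, status) (rate(shredder_apiserver_requests_total[5m]))
```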

### `shredder_apiserver_requests_duration_seconds`
- **Type**: Summary Vector
- **Labels**: `verb`, `resource`, `status`
- **Description**: Duration of Kubernetes API requests in seconds
- **Objectives**: 0.5: 0.05, 0.9: 0.01, 0.99: 0.001
- **Use Case**: Monitor API performance and identify slow API calls
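
Since this is a Summary Vector, per-quantile series are exported directly; for example, the slowest p99 latency per verb and resource:

```promql
# worst-case p99 API call latency per verb/resource
max by (verb, resource) (shredder_apiserver_requests_duration_seconds{quantile="0.99"})
```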

## Node Processing Metrics

### `shredder_processed_nodes_total`
- **Type**: Counter
- **Description**: Total number of nodes processed during eviction loops
- **Use Case**: Monitor the volume of node processing activity

### `shredder_node_force_to_evict_time`
- **Type**: Gauge Vector
- **Labels**: `node_name`
- **Description**: Unix timestamp when a node will be forcibly evicted
- **Use Case**: Monitor when nodes are scheduled for forced eviction
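
Because the value is a Unix timestamp, subtracting `time()` yields the time remaining before forced eviction; for example, nodes that will be force-evicted within the next hour:

```promql
# seconds left until forced eviction, filtered to nodes within 1 hour
(shredder_node_force_to_evict_time - time()) < 3600
```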

## Pod Processing Metrics

### `shredder_processed_pods_total`
- **Type**: Counter
- **Description**: Total number of pods processed during eviction loops
- **Use Case**: Monitor the volume of pod processing activity

### `shredder_pod_errors_total`
- **Type**: Gauge Vector
- **Labels**: `pod_name`, `namespace`, `reason`, `action`
- **Description**: Total pod errors per eviction loop
- **Use Case**: Monitor pod eviction failures and their reasons
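
For example, to see which failure reasons and actions dominate across the cluster:

```promql
# pod eviction errors grouped by reason and attempted action
sum by (reason, action) (shredder_pod_errors_total)
```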

### `shredder_pod_force_to_evict_time`
- **Type**: Gauge Vector
- **Labels**: `pod_name`, `namespace`
- **Description**: Unix timestamp when a pod will be forcibly evicted
- **Use Case**: Monitor when pods are scheduled for forced eviction
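
Analogous to the node-level metric, subtracting `time()` shows how long each pod has before it is forcibly evicted:

```promql
# pods that will be forcibly evicted within the next 30 minutes
(shredder_pod_force_to_evict_time - time()) < 1800
```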

## Karpenter Integration Metrics

### `shredder_karpenter_drifted_nodes_total`
- **Type**: Counter
- **Description**: Total number of drifted Karpenter nodes detected
- **Use Case**: Monitor the volume of Karpenter drift detection activity

### `shredder_karpenter_nodes_parked_total`
- **Type**: Counter
- **Description**: Total number of Karpenter nodes successfully parked
- **Use Case**: Monitor successful Karpenter node parking operations

### `shredder_karpenter_nodes_parking_failed_total`
- **Type**: Counter
- **Description**: Total number of Karpenter nodes that failed to be parked
- **Use Case**: Monitor Karpenter node parking failures

### `shredder_karpenter_processing_duration_seconds`
- **Type**: Summary
- **Description**: Duration of Karpenter node processing in seconds
- **Objectives**: 0.5: 0.05, 0.9: 0.01, 0.99: 0.001
- **Use Case**: Monitor the performance of Karpenter drift detection and parking operations
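
Taken together, the counters above can express a Karpenter parking failure ratio; this is a sketch, so pick a window that matches your eviction interval:

```promql
# fraction of Karpenter parking attempts that failed over the last hour
rate(shredder_karpenter_nodes_parking_failed_total[1h])
  / (rate(shredder_karpenter_nodes_parked_total[1h]) + rate(shredder_karpenter_nodes_parking_failed_total[1h]))
```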

## Node Label Detection Metrics

### `shredder_node_label_nodes_parked_total`
- **Type**: Counter
- **Description**: Total number of nodes successfully parked via node label detection
- **Use Case**: Monitor successful node label-based parking operations

### `shredder_node_label_nodes_parking_failed_total`
- **Type**: Counter
- **Description**: Total number of nodes that failed to be parked via node label detection
- **Use Case**: Monitor node label-based parking failures

### `shredder_node_label_processing_duration_seconds`
- **Type**: Summary
- **Description**: Duration of the node label detection and parking process in seconds
- **Objectives**: 0.5: 0.05, 0.9: 0.01, 0.99: 0.001
- **Use Case**: Monitor the performance of node label detection and parking operations

### `shredder_node_label_matching_nodes_total`
- **Type**: Gauge
- **Description**: Current number of nodes matching the label criteria
- **Use Case**: Monitor the current number of nodes that match the configured label selectors
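
The same failure-ratio pattern applies to label-based parking:

```promql
# fraction of label-based parking attempts that failed over the last hour
rate(shredder_node_label_nodes_parking_failed_total[1h])
  / (rate(shredder_node_label_nodes_parked_total[1h]) + rate(shredder_node_label_nodes_parking_failed_total[1h]))
```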

## Shared Metrics

These metrics aggregate data across all detection methods (Karpenter and node label detection) to provide a unified view of node parking activity.

### `shredder_nodes_parked_total`
- **Type**: Counter
- **Description**: Total number of nodes successfully parked (shared across all detection methods)
- **Use Case**: Monitor total node parking activity regardless of detection method

### `shredder_nodes_parking_failed_total`
- **Type**: Counter
- **Description**: Total number of nodes that failed to be parked (shared across all detection methods)
- **Use Case**: Monitor total node parking failures regardless of detection method

### `shredder_processing_duration_seconds`
- **Type**: Summary
- **Description**: Duration of node processing in seconds (shared across all detection methods)
- **Objectives**: 0.5: 0.05, 0.9: 0.01, 0.99: 0.001
- **Use Case**: Monitor total node processing performance regardless of detection method
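
When both detection methods are enabled, the shared counters should roughly track the sum of the per-method counters; the difference below is a consistency check (not an official alert) and should stay near zero:

```promql
# shared parking counter minus the sum of per-method parking counters
rate(shredder_nodes_parked_total[1h])
  - (rate(shredder_karpenter_nodes_parked_total[1h]) + rate(shredder_node_label_nodes_parked_total[1h]))
```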

## Metric Relationships

### Detection Method Metrics
- **Karpenter metrics** are incremented when `EnableKarpenterDriftDetection=true`
- **Node label metrics** are incremented when `EnableNodeLabelDetection=true`
- **Shared metrics** are incremented whenever either detection method processes nodes

### Processing Flow
1. **Detection**: Nodes are identified via Karpenter drift or label matching
2. **Parking**: Nodes are labeled, cordoned, and tainted
3. **Eviction**: Pods are evicted from parked nodes over time
4. **Cleanup**: Nodes are eventually removed when all pods are evicted

## Alerting Recommendations

### High Error Rates
```promql
rate(shredder_errors_total[5m]) > 0.1
```

### Slow Processing
```promql
# shredder_processing_duration_seconds is a Summary, so read its exported quantile directly
shredder_processing_duration_seconds{quantile="0.99"} > 30
```

### Failed Node Parking
```promql
rate(shredder_nodes_parking_failed_total[5m]) > 0
```

### High API Latency
```promql
# shredder_apiserver_requests_duration_seconds is a Summary, so read its exported quantile directly
shredder_apiserver_requests_duration_seconds{quantile="0.99"} > 5
```

### Parked Pods Alert
```promql
# Alert when pods are running on parked nodes
kube_ethos_upgrade:parked_pod > 0
```

## Example Queries

### Node Parking Success Rate
```promql
rate(shredder_nodes_parked_total[5m]) / (rate(shredder_nodes_parked_total[5m]) + rate(shredder_nodes_parking_failed_total[5m]))
```

### Average Processing Duration
```promql
# mean processing duration derived from the summary's _sum and _count series
rate(shredder_processing_duration_seconds_sum[5m]) / rate(shredder_processing_duration_seconds_count[5m])
```

### Nodes Parked by Detection Method
```promql
# Karpenter nodes
rate(shredder_karpenter_nodes_parked_total[5m])

# Label-based nodes
rate(shredder_node_label_nodes_parked_total[5m])
```

### Current Matching Nodes
```promql
shredder_node_label_matching_nodes_total
```

## Configuration

Metrics are exposed on the configured port (default: 8080) at the `/metrics` endpoint. The metrics server can be configured using the following options:

- **Metrics Port**: Configure the port for metrics exposure
- **Health Endpoint**: Available at `/healthz` for health checks
- **OpenMetrics Format**: Enabled by default for better compatibility
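
To confirm the endpoint is actually being scraped, a simple target-health query can help; the `job` label below is an assumption and depends on your scrape configuration:

```promql
# 1 = scrape target up, 0 = down; adjust the job label to match your setup
up{job="k8s-shredder"}
```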

For more information about configuring k8s-shredder, see the [main README](../README.md).