| 1 | +# VPA Resource Optimization Guide |
| 2 | + |
| 3 | +How to use VPA, Goldilocks, and Kyverno to right-size Kubernetes resource requests based on actual workload behavior. |
| 4 | + |
| 5 | +## The Toolchain |
| 6 | + |
| 7 | +| Tool | What It Does | Location | |
| 8 | +|------|-------------|----------| |
| 9 | +| **metrics-server** | Provides `metrics.k8s.io` API (CPU/memory data from kubelet) | `infrastructure/controllers/metrics-server/` | |
| 10 | +| **VPA** (Vertical Pod Autoscaler) | Analyzes metrics, generates resource recommendations | `infrastructure/controllers/vertical-pod-autoscaler/` | |
| 11 | +| **Kyverno Policy** (`vpa-auto-create`) | Auto-generates a VPA resource for every Deployment and StatefulSet | `infrastructure/controllers/kyverno/policies/vpa-auto-create.yaml` | |
| 12 | +| **Goldilocks** | Web dashboard to visualize VPA recommendations per namespace | `infrastructure/controllers/goldilocks/` | |
| 13 | + |
| 14 | +### How They Fit Together |
| 15 | + |
| 16 | +``` |
| 17 | +kubelet /metrics/resource |
| 18 | + | |
| 19 | + v |
| 20 | +metrics-server (provides metrics.k8s.io API) |
| 21 | + | |
| 22 | + v |
| 23 | +VPA Recommender (reads metrics, writes recommendations to VPA status) |
| 24 | + ^ |
| 25 | + | |
| 26 | +Kyverno generate policy (auto-creates VPA for every Deployment/StatefulSet) |
| 27 | + | |
| 28 | + v |
| 29 | +VPA resources (one per workload, updateMode: "Off") |
| 30 | + | |
| 31 | + v |
| 32 | +Goldilocks Dashboard (reads VPA recommendations, shows per-namespace view) |
| 33 | + | |
| 34 | + v |
| 35 | +Human reviews → updates values.yaml → Git push → ArgoCD applies |
| 36 | +``` |
| 37 | + |
| 38 | +**Key point**: Kyverno automatically creates a VPA for every Deployment and StatefulSet outside the excluded namespaces (see Excluded Namespaces). Goldilocks also creates VPAs for the namespaces it scans, and because `on-by-default: "true"` is set, it scans all of them. The resulting duplicates are harmless: they share the same name, and Kyverno's `synchronize: true` keeps them in sync. |
| 39 | + |
| 40 | +## Accessing the Dashboard |
| 41 | + |
| 42 | +**Goldilocks Dashboard**: https://goldilocks.vanillax.me |
| 43 | + |
| 44 | +This is routed via the internal gateway (`gateway-internal`). No port-forward needed if you're on the LAN. |
| 45 | + |
| 46 | +Fallback (if gateway is down): |
| 47 | +```bash |
| 48 | +kubectl port-forward -n goldilocks svc/goldilocks-dashboard 8080:80 |
| 49 | +# Open http://localhost:8080 |
| 50 | +``` |
| 51 | + |
| 52 | +The dashboard shows every namespace with VPA-enabled workloads. For each container it displays: |
| 53 | +- Current resource requests/limits |
| 54 | +- VPA lower bound, target, and upper bound |
| 55 | +- Suggested `requests` and `limits` YAML you can copy-paste |
| 56 | + |
| 57 | +## Reading VPA Recommendations |
| 58 | + |
| 59 | +### Via kubectl |
| 60 | + |
| 61 | +```bash |
| 62 | +# Quick overview: all VPA targets across the cluster |
| 63 | +kubectl get vpa -A -o custom-columns=\ |
| 64 | +NAMESPACE:.metadata.namespace,\ |
| 65 | +NAME:.metadata.name,\ |
| 66 | +CPU:.status.recommendation.containerRecommendations[0].target.cpu,\ |
| 67 | +MEM:.status.recommendation.containerRecommendations[0].target.memory |
| 68 | + |
| 69 | +# Detailed view for a specific namespace |
| 70 | +kubectl get vpa -n argocd -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .status.recommendation.containerRecommendations[*]}{" "}{.containerName}{": cpu="}{.target.cpu}{" mem="}{.target.memory}{"\n"}{end}{end}' |
| 71 | + |
| 72 | +# Full detail for a specific VPA |
| 73 | +kubectl describe vpa <name> -n <namespace> |
| 74 | +``` |
| 75 | + |
| 76 | +### Understanding the Four Values |
| 77 | + |
| 78 | +VPA recommendations include four values per container: |
| 79 | + |
| 80 | +| Value | Meaning | Use For | |
| 81 | +|-------|---------|---------| |
| 82 | +| **lowerBound** | Minimum to avoid throttling/OOM | Red flag if current request is below this | |
| 83 | +| **target** | Optimal request based on observed usage | Set `requests:` to this value | |
| 84 | +| **upperBound** | Upper end of the recommended range (an estimate, not a measured peak); requests above it are likely wasted | Informs `limits:` setting | |
| 85 | +| **uncappedTarget** | Ideal ignoring any VPA min/max constraints | Same as target when no constraints are set | |
| 86 | + |
| 87 | +**Memory values** are in bytes. Quick conversions: |
| 88 | +- `104857600` = 100Mi |
| 89 | +- `268435456` = 256Mi |
| 90 | +- `536870912` = 512Mi |
| 91 | +- `1073741824` = 1Gi |
| 92 | +- `1610612736` = 1.5Gi |
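For byte values that are not in the list above, the same conversion can be sketched in shell. `bytes_to_mi` is a hypothetical helper name, not part of any tool mentioned in this guide. It rounds up to the next whole Mi so that a request is never set below the reported value.

```shell
# Hypothetical helper: convert a byte count to whole Mi, rounding up.
bytes_to_mi() {
  echo "$(( ($1 + 1048575) / 1048576 ))Mi"
}

bytes_to_mi 104857600    # prints 100Mi
bytes_to_mi 1610612736   # prints 1536Mi (1.5Gi)
```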
| 93 | + |
| 94 | +## When to Change Resources |
| 95 | + |
| 96 | +### Decision Matrix |
| 97 | + |
| 98 | +| Situation | Action | Why | |
| 99 | +|-----------|--------|-----| |
| 100 | +| Current request < **lowerBound** | **INCREASE NOW** | Pod is likely being throttled or at risk of OOM kills | |
| 101 | +| Current request more than 20% below **target** | **INCREASE** | Under-provisioned, degraded performance | |
| 102 | +| Current request within 20% of **target** | **KEEP** | Already well-tuned | |
| 103 | +| Current request > 1.5x **target** | **DECREASE** | Over-provisioned, wasting resources | |
| 104 | +| Current request > 5x **target** | **DECREASE** | Heavily over-provisioned | |
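The matrix can be sketched as a small shell function. This is an illustration under stated assumptions, not part of the toolchain: `classify_request` is a hypothetical name, "within 20%" is read as 0.8x to 1.2x of target, and the band between 1.2x and 1.5x (which the table does not cover) is reported as `REVIEW`. All three arguments are integers in the same unit, for example millicores.

```shell
# classify_request CURRENT LOWER_BOUND TARGET
# Hypothetical helper; thresholds follow the decision matrix, with the
# 0.8x-1.2x KEEP band and the REVIEW fallback being assumptions.
classify_request() {
  current=$1; lower=$2; target=$3
  if [ "$current" -lt "$lower" ]; then
    echo "INCREASE NOW"
  elif [ $(( current * 10 )) -lt $(( target * 8 )) ]; then
    echo "INCREASE"
  elif [ $(( current * 10 )) -le $(( target * 12 )) ]; then
    echo "KEEP"
  elif [ $(( current * 10 )) -gt $(( target * 15 )) ]; then
    echo "DECREASE"
  else
    echo "REVIEW"
  fi
}

# ArgoCD controller CPU from the example later in this guide:
classify_request 1000 1021 2048   # prints INCREASE NOW
```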
| 105 | + |
| 106 | +### Timing |
| 107 | + |
| 108 | +- **Wait at least 7 days** before trusting VPA numbers. Initial recommendations are noisy. |
| 109 | +- **Review weekly**, not daily. Over-correcting defeats the purpose. |
| 110 | +- **Re-check after major changes** (new features, traffic spikes, version upgrades). VPA is backward-looking. |
| 111 | +- **Upper bounds stabilize over ~14 days**. They'll be very wide initially. |
| 112 | + |
| 113 | +### How to Apply Changes |
| 114 | + |
| 115 | +1. Read the VPA recommendation (Goldilocks dashboard or kubectl) |
| 116 | +2. Update the app's `values.yaml` with new resource requests |
| 117 | +3. Add a comment documenting the VPA data and reasoning: |
| 118 | + |
| 119 | +```yaml |
| 120 | +# VPA-optimized (YYYY-MM-DD) |
| 121 | +# VPA target: cpu Xm, memory Y |
| 122 | +# Previous: cpu Am (reason for change) |
| 123 | +resources: |
| 124 | + requests: |
| 125 | + cpu: Xm # Match VPA target |
| 126 | + memory: Y # Match VPA target + buffer |
| 127 | + limits: |
| 128 | + cpu: 2Xm # 2x request for burst |
| 129 | + memory: 2Y # 2x request for spikes |
| 130 | +``` |
| 131 | + |
| 132 | +4. Git commit and push — ArgoCD applies via GitOps |
| 133 | + |
| 134 | +### Setting Requests vs Limits |
| 135 | + |
| 136 | +| Field | Rule of Thumb | |
| 137 | +|-------|--------------| |
| 138 | +| `requests.cpu` | VPA `target` (or 1.1-1.2x for buffer) | |
| 139 | +| `requests.memory` | VPA `target` (or 1.2-1.5x — memory OOM is fatal, CPU throttling is not) | |
| 140 | +| `limits.cpu` | 2-4x request (allows burst). Or omit entirely to let pods burst freely. | |
| 141 | +| `limits.memory` | 2-4x request (or match VPA `upperBound` if spikes are expected) | |
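As a rough illustration of these rules of thumb, the sketch below picks one point from each range: 1.2x the CPU target, 1.5x the memory target, and limits at 2x the resulting requests. `suggest_resources` is a hypothetical helper name, and the chosen multipliers are assumptions within the ranges in the table. Inputs are the VPA target in millicores and Mi.

```shell
# suggest_resources CPU_TARGET_MILLICORES MEM_TARGET_MI
# Hypothetical helper; multipliers are one choice within the table's ranges.
suggest_resources() {
  cpu_req=$(( ($1 * 12 + 9) / 10 ))   # 1.2x CPU target, rounded up
  mem_req=$(( ($2 * 15 + 9) / 10 ))   # 1.5x memory target, rounded up
  printf 'requests: cpu=%sm memory=%sMi\n' "$cpu_req" "$mem_req"
  printf 'limits:   cpu=%sm memory=%sMi\n' "$(( cpu_req * 2 ))" "$(( mem_req * 2 ))"
}

suggest_resources 23 175
# requests: cpu=28m memory=263Mi
# limits:   cpu=56m memory=526Mi
```

In practice the results are usually rounded to convenient values (for example 50m and 256Mi) before they go into `values.yaml`.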
| 142 | + |
| 143 | +## Common Workload Patterns |
| 144 | + |
| 145 | +### CPU-Bound (Helm rendering, image processing) |
| 146 | +High CPU target, low memory target. Increase CPU generously, keep memory modest. |
| 147 | +``` |
| 148 | +Example: argocd-repo-server |
| 149 | + VPA target: cpu 2975m, memory 523Mi |
| 150 | + Action: cpu 3000m request, memory 768Mi request |
| 151 | +``` |
| 152 | + |
| 153 | +### Memory-Bound (Databases, caches) |
| 154 | +Low CPU target, high memory target. Increase memory, keep CPU low. |
| 155 | +``` |
| 156 | +Example: Redis |
| 157 | + VPA target: cpu 23m, memory 100Mi |
| 158 | + Action: cpu 50m request, memory 128Mi request |
| 159 | +``` |
| 160 | + |
| 161 | +### Idle/Lightweight (UI servers, webhooks) |
| 162 | +Both CPU and memory very low. Set modest requests with generous limits for occasional spikes. |
| 163 | +``` |
| 164 | +Example: argocd-server |
| 165 | + VPA target: cpu 23m, memory 175Mi |
| 166 | + Action: cpu 50m request, memory 256Mi request |
| 167 | +``` |
| 168 | + |
| 169 | +### GPU Workloads |
| 170 | +VPA only tracks CPU/memory, not GPU. Recommendations will show low CPU/memory because the compute runs on the GPU and model data sits in VRAM, neither of which VPA sees. Set CPU/memory based on data loading needs, not inference. |
| 171 | + |
| 172 | +## Real-World Example: ArgoCD Optimization |
| 173 | + |
| 174 | +### Before (manual guesswork) |
| 175 | +``` |
| 176 | +controller: cpu: 1000m, memory: 1Gi # UNDER-PROVISIONED (below lowerBound!) |
| 177 | +repo-server: cpu: 1000m, memory: 1Gi # UNDER-PROVISIONED 3x |
| 178 | +server: cpu: 500m, memory: 512Mi # OVER-PROVISIONED 20x |
| 179 | +applicationSet: cpu: 250m, memory: 256Mi # OVER-PROVISIONED 5x |
| 180 | +redis: cpu: 100m, memory: 128Mi # OVER-PROVISIONED 4x |
| 181 | +Total: 2.85 CPU, 2.9Gi memory |
| 182 | +``` |
| 183 | + |
| 184 | +### VPA Said |
| 185 | +``` |
| 186 | +controller: target: 2048m CPU, 1.25Gi memory (lowerBound: 1021m > current 1000m!) |
| 187 | +repo-server: target: 2975m CPU, 523Mi memory |
| 188 | +server: target: 23m CPU, 175Mi memory |
| 189 | +applicationSet: target: 49m CPU, 100Mi memory |
| 190 | +redis: target: 23m CPU, 100Mi memory |
| 191 | +``` |
| 192 | + |
| 193 | +### After (VPA-optimized) |
| 194 | +``` |
| 195 | +controller: cpu: 2000m, memory: 1536Mi # DOUBLED (was throttled) |
| 196 | +repo-server: cpu: 3000m, memory: 768Mi # TRIPLED CPU, memory reduced 25% |
| 197 | +server: cpu: 50m, memory: 256Mi # REDUCED 10x |
| 198 | +applicationSet: cpu: 100m, memory: 128Mi # REDUCED 2.5x |
| 199 | +redis: cpu: 50m, memory: 128Mi # REDUCED 2x |
| 200 | +Total: 5.2 CPU, 2.8Gi memory |
| 201 | +``` |
| 202 | + |
| 203 | +**Result**: +2.35 CPU where it was needed (controller/repo-server), -0.1Gi memory overall, no more CPU throttling on the controller. |
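The totals and deltas can be checked with plain shell arithmetic. Every number below comes from the Before and After listings, with 1Gi written as 1024Mi. The variable names are only for illustration.

```shell
# CPU in millicores: controller, repo-server, server, applicationSet, redis
before_cpu=$(( 1000 + 1000 + 500 + 250 + 100 ))
after_cpu=$(( 2000 + 3000 + 50 + 100 + 50 ))

# Memory in Mi, same order
before_mem=$(( 1024 + 1024 + 512 + 256 + 128 ))
after_mem=$(( 1536 + 768 + 256 + 128 + 128 ))

echo "CPU: ${before_cpu}m -> ${after_cpu}m (delta $(( after_cpu - before_cpu ))m)"
echo "Memory: ${before_mem}Mi -> ${after_mem}Mi (delta $(( after_mem - before_mem ))Mi)"
# CPU: 2850m -> 5200m (delta 2350m)
# Memory: 2944Mi -> 2816Mi (delta -128Mi)
```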
| 204 | + |
| 205 | +See `infrastructure/controllers/argocd/values.yaml` for the actual implementation with inline VPA documentation. |
| 206 | + |
| 207 | +## Excluded Namespaces |
| 208 | + |
| 209 | +The Kyverno `vpa-auto-create` policy excludes: |
| 210 | +- `kube-system` — critical system components, don't touch |
| 211 | +- `kyverno` — policy engine, restart = cluster-wide impact |
| 212 | +- `vertical-pod-autoscaler` — VPA managing itself creates feedback loops |
| 213 | + |
| 214 | +## K8s 1.35: In-Place Pod Resize (Future) |
| 215 | + |
| 216 | +This cluster runs K8s v1.35.1 where In-Place Pod Resize is GA. VPA supports `updateMode: "InPlaceOrRecreate"` which resizes pods **without restarting them** when possible. |
| 217 | + |
| 218 | +Currently we use `updateMode: "Off"` (manual review). When confident in VPA accuracy after 2-4 weeks of observation, you can switch individual workloads to `InPlaceOrRecreate`: |
| 219 | + |
| 220 | +```yaml |
| 221 | +apiVersion: autoscaling.k8s.io/v1 |
| 222 | +kind: VerticalPodAutoscaler |
| 223 | +spec: |
| 224 | + updatePolicy: |
| 225 | + updateMode: "InPlaceOrRecreate" # Live resize when possible |
| 226 | +``` |
| 227 | + |
| 228 | +**Start with non-critical workloads** (dev tools, media apps) before enabling on infrastructure. |
| 229 | + |
| 230 | +## Troubleshooting |
| 231 | + |
| 232 | +### No recommendations showing |
| 233 | +- VPA needs ~5-10 minutes to produce initial data and 24+ hours before the numbers are meaningful (wait at least 7 days before acting on them, per Timing above) |
| 234 | +- Check metrics-server: `kubectl top nodes` (should return data) |
| 235 | +- Check VPA recommender: `kubectl logs -n vertical-pod-autoscaler -l app.kubernetes.io/component=recommender` |
| 236 | + |
| 237 | +### Goldilocks dashboard is empty |
| 238 | +- Check if Goldilocks controller is running: `kubectl get pods -n goldilocks` |
| 239 | +- Goldilocks is set to `on-by-default: "true"` — all namespaces should appear |
| 240 | +- VPA resources must exist (Kyverno creates them on Deployment/StatefulSet CREATE/UPDATE) |
| 241 | + |
| 242 | +### VPA recommendations seem too high/low |
| 243 | +- Not enough data — wait 7-14 days |
| 244 | +- Workload changed recently — VPA is backward-looking |
| 245 | +- Check `upperBound` for peak usage context |
| 246 | +- Batch/cron workloads have spiky usage — use `upperBound` for limits |
| 247 | + |
| 248 | +### Pods OOMKilled after applying VPA |
| 249 | +- VPA target reflects steady-state, not initialization spikes |
| 250 | +- Set `limits.memory` well above `requests.memory` (2-4x) |
| 251 | +- Check startup memory with `kubectl top pod` during pod init |
| 252 | + |
| 253 | +## Quick Reference |
| 254 | + |
| 255 | +```bash |
| 256 | +# Goldilocks dashboard (LAN), open in a browser: |
| 257 | +# https://goldilocks.vanillax.me |
| 258 | + |
| 259 | +# All VPA recommendations (cluster-wide) |
| 260 | +kubectl get vpa -A -o custom-columns=\ |
| 261 | +NS:.metadata.namespace,\ |
| 262 | +NAME:.metadata.name,\ |
| 263 | +CPU:.status.recommendation.containerRecommendations[0].target.cpu,\ |
| 264 | +MEM:.status.recommendation.containerRecommendations[0].target.memory |
| 265 | + |
| 266 | +# Current resource usage (compare against the requests shown below) |
| 267 | +kubectl top pods -n <namespace> |
| 268 | + |
| 269 | +# Compare current requests vs VPA target |
| 270 | +kubectl get deploy <name> -n <ns> -o jsonpath='{.spec.template.spec.containers[0].resources}' |
| 271 | +kubectl get vpa <name> -n <ns> -o jsonpath='{.status.recommendation.containerRecommendations[0].target}' |
| 272 | +``` |
| 273 | + |
| 274 | +## Related Docs |
| 275 | + |
| 276 | +- [Monitoring README](../monitoring/README.md) — metrics-server vs Prometheus pipelines |
| 277 | +- [VPA component README](../infrastructure/controllers/vertical-pod-autoscaler/README.md) |
| 278 | +- [Kyverno VPA policy](../infrastructure/controllers/kyverno/policies/vpa-auto-create.yaml) |
| 279 | +- [Goldilocks config](../infrastructure/controllers/goldilocks/) |
| 280 | + |
| 281 | +--- |
| 282 | + |
| 283 | +**Last Updated**: 2026-02-24 |
| 284 | +**Cluster**: talos-prod-cluster (K8s v1.35.1, Talos v1.12.4) |