Commit df1d83c

docs
1 parent 1ff94c8 commit df1d83c

3 files changed

Lines changed: 286 additions & 0 deletions

docs/index.md

Lines changed: 1 addition & 0 deletions
@@ -9,5 +9,6 @@ See the [README](https://github.com/mitchross/talos-argocd-proxmox) for setup in
 - [ArgoCD & GitOps Architecture](argocd.md) - Sync waves, app-of-apps pattern, health checks
 - [Backup & Restore](backup-restore.md) - Kyverno + VolSync + PVC Plumber automated backups
 - [Full Backup Flow](pvc-plumber-full-flow.md) - Complete bare-metal to disaster recovery walkthrough
+- [VPA Resource Optimization](vpa-resource-optimization.md) - Using VPA/Goldilocks to right-size pod resources
 - [Network Topology](network-topology.md) - Cluster networking and 10G infrastructure
 - [Network Security](network-policy.md) - Cilium network policies and LAN isolation

docs/vpa-resource-optimization.md

Lines changed: 284 additions & 0 deletions
@@ -0,0 +1,284 @@
# VPA Resource Optimization Guide

How to use VPA, Goldilocks, and Kyverno to right-size Kubernetes resource requests based on actual workload behavior.

## The Toolchain

| Tool | What It Does | Location |
|------|-------------|----------|
| **metrics-server** | Provides `metrics.k8s.io` API (CPU/memory data from kubelet) | `infrastructure/controllers/metrics-server/` |
| **VPA** (Vertical Pod Autoscaler) | Analyzes metrics, generates resource recommendations | `infrastructure/controllers/vertical-pod-autoscaler/` |
| **Kyverno Policy** (`vpa-auto-create`) | Auto-generates a VPA resource for every Deployment and StatefulSet | `infrastructure/controllers/kyverno/policies/vpa-auto-create.yaml` |
| **Goldilocks** | Web dashboard to visualize VPA recommendations per namespace | `infrastructure/controllers/goldilocks/` |

### How They Fit Together

```
kubelet /metrics/resource
        |
        v
metrics-server (provides metrics.k8s.io API)
        |
        v
VPA Recommender (reads metrics, writes recommendations to VPA status)
        ^
        |
Kyverno generate policy (auto-creates VPA for every Deployment/StatefulSet)
        |
        v
VPA resources (one per workload, updateMode: "Off")
        |
        v
Goldilocks Dashboard (reads VPA recommendations, shows per-namespace view)
        |
        v
Human reviews → updates values.yaml → Git push → ArgoCD applies
```

**Key point**: Kyverno automatically creates a VPA for every Deployment and StatefulSet outside the excluded namespaces (see Excluded Namespaces). Goldilocks also creates VPAs for the namespaces it scans, and because `on-by-default: "true"` is set, it scans every namespace, so the two overlap. Duplicate VPAs are harmless: they share the same name, and Kyverno's `synchronize: true` keeps them in sync.

## Accessing the Dashboard

**Goldilocks Dashboard**: https://goldilocks.vanillax.me

This is routed via the internal gateway (`gateway-internal`). No port-forward needed if you're on the LAN.

Fallback (if gateway is down):
```bash
kubectl port-forward -n goldilocks svc/goldilocks-dashboard 8080:80
# Open http://localhost:8080
```

The dashboard shows every namespace with VPA-enabled workloads. For each container it displays:
- Current resource requests/limits
- VPA lower bound, target, and upper bound
- Suggested `requests` and `limits` YAML you can copy-paste

## Reading VPA Recommendations

### Via kubectl

```bash
# Quick overview: all VPA targets across the cluster
# (index [0] shows only the first container of each workload)
kubectl get vpa -A -o custom-columns=\
NAMESPACE:.metadata.namespace,\
NAME:.metadata.name,\
CPU:.status.recommendation.containerRecommendations[0].target.cpu,\
MEM:.status.recommendation.containerRecommendations[0].target.memory

# Detailed view for a specific namespace (every container of every VPA)
kubectl get vpa -n argocd -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .status.recommendation.containerRecommendations[*]}{" "}{.containerName}{": cpu="}{.target.cpu}{" mem="}{.target.memory}{"\n"}{end}{end}'

# Full detail for a specific VPA
kubectl describe vpa <name> -n <namespace>
```

### Understanding the Four Values

VPA recommendations include four values per container:

| Value | Meaning | Use For |
|-------|---------|---------|
| **lowerBound** | Lowest request the recommender considers safe; below it, throttling or OOM kills become likely | Red flag if current request is below this |
| **target** | Recommended request based on observed usage | Set `requests:` to this value |
| **upperBound** | Highest request the recommender considers useful; it is an estimate, not a literal observed peak | Informs `limits:` setting |
| **uncappedTarget** | Target ignoring any VPA min/max constraints | Same as target when no constraints are set |

**Memory values** are in bytes. Quick conversions:
- `104857600` = 100Mi
- `268435456` = 256Mi
- `536870912` = 512Mi
- `1073741824` = 1Gi
- `1610612736` = 1.5Gi

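The byte conversions above are plain division by 1024 × 1024. As a minimal sketch, the hypothetical helper `bytes_to_mi` (an illustration, not something shipped in this repo) does the conversion with shell integer arithmetic and rounds down to whole Mi:

```bash
# Hypothetical helper (an assumption, not part of this repo):
# convert a raw byte count from a VPA recommendation to whole Mi, rounding down.
bytes_to_mi() {
  local bytes=$1
  echo "$(( bytes / 1048576 ))Mi"   # 1Mi = 1024 * 1024 bytes
}

bytes_to_mi 268435456    # prints 256Mi
bytes_to_mi 1610612736   # prints 1536Mi (1.5Gi)
```

Rounding down is a simplification; when turning a target into a request it may be safer to round up to the next convenient size.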
## When to Change Resources

### Decision Matrix

| Situation | Action | Why |
|-----------|--------|-----|
| Current request < **lowerBound** | **INCREASE NOW** | Pod is likely being throttled or OOM-killed |
| Current request more than 20% below **target** | **INCREASE** | Under-provisioned, degraded performance |
| Current request within 20% of **target** | **KEEP** | Already well-tuned |
| Current request 1.5x to 5x **target** | **DECREASE** | Over-provisioned, wasting resources |
| Current request > 5x **target** | **DECREASE** | Heavily over-provisioned |

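The matrix can be expressed as a small shell function. This is a sketch under stated assumptions, not tooling from the repo: `classify_request` is a hypothetical name, all three arguments are integers in the same unit (for example CPU millicores), a request within 20% below the target counts as KEEP, and the 1.2x to 1.5x range that the matrix leaves open is also treated as KEEP.

```bash
# Hypothetical helper: classify one resource value per the decision matrix.
# Usage: classify_request <current> <lowerBound> <target>   (same integer unit)
# Assumption: anything from 0.8x to 1.5x of target is reported as KEEP.
classify_request() {
  local current=$1 lower=$2 target=$3
  if [ "$current" -lt "$lower" ]; then
    echo "INCREASE NOW"
  elif [ "$(( current * 10 ))" -lt "$(( target * 8 ))" ]; then
    echo "INCREASE"
  elif [ "$current" -gt "$(( target * 5 ))" ]; then
    echo "DECREASE (heavily over-provisioned)"
  elif [ "$(( current * 10 ))" -gt "$(( target * 15 ))" ]; then
    echo "DECREASE"
  else
    echo "KEEP"
  fi
}

# ArgoCD controller CPU from the example later in this guide:
# current 1000m, lowerBound 1021m, target 2048m
classify_request 1000 1021 2048   # prints INCREASE NOW
```

The thresholds are multiplied by 10 so the comparison stays in integer arithmetic, which is all the shell offers natively.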
### Timing

- **Wait at least 7 days** before trusting VPA numbers. Initial recommendations are noisy.
- **Review weekly**, not daily. Over-correcting defeats the purpose.
- **Re-check after major changes** (new features, traffic spikes, version upgrades). VPA is backward-looking.
- **Upper bounds stabilize over ~14 days**. They'll be very wide initially.

### How to Apply Changes

1. Read the VPA recommendation (Goldilocks dashboard or kubectl)
2. Update the app's `values.yaml` with new resource requests
3. Add a comment documenting the VPA data and reasoning:

```yaml
# VPA-optimized (YYYY-MM-DD)
# VPA target: cpu Xm, memory Y
# Previous: cpu Am (reason for change)
resources:
  requests:
    cpu: Xm       # Match VPA target
    memory: Y     # Match VPA target + buffer
  limits:
    cpu: 2Xm      # 2x request for burst
    memory: 2Y    # 2x request for spikes
```

4. Git commit and push; ArgoCD applies the change via GitOps

### Setting Requests vs Limits

| Field | Rule of Thumb |
|-------|--------------|
| `requests.cpu` | VPA `target` (or 1.1-1.2x for buffer) |
| `requests.memory` | VPA `target` (or 1.2-1.5x, because an OOM kill is fatal while CPU throttling is not) |
| `limits.cpu` | 2-4x request (allows burst). Or omit entirely to let pods burst freely. |
| `limits.memory` | 2-4x request (or match VPA `upperBound` if spikes are expected) |

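To make the rule of thumb concrete, the sketch below turns a VPA target into starting values. Everything here is illustrative: `suggest_resources` is a hypothetical helper, and the multipliers (1.2x CPU request, 1.5x memory request, 2x limits) are just one choice from the ranges in the table, not values the repo prescribes.

```bash
# Hypothetical helper: derive starting requests/limits from a VPA target.
# Assumed multipliers, picked from the ranges in the table above:
#   CPU request 1.2x target, memory request 1.5x target, limits 2x request.
# Usage: suggest_resources <target_cpu_millicores> <target_memory_Mi>
suggest_resources() {
  local cpu_target=$1 mem_target=$2
  local cpu_req=$(( cpu_target * 12 / 10 ))
  local mem_req=$(( mem_target * 15 / 10 ))
  echo "requests: cpu=${cpu_req}m memory=${mem_req}Mi"
  echo "limits:   cpu=$(( cpu_req * 2 ))m memory=$(( mem_req * 2 ))Mi"
}

# argocd-server target quoted in this guide: 23m CPU, 175Mi memory
suggest_resources 23 175
```

Integer division rounds down, so in practice the output is a floor to round up from (for example to 50m and 256Mi, as the guide does for `argocd-server`).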
## Common Workload Patterns

### CPU-Bound (Helm rendering, image processing)
High CPU target, low memory target. Increase CPU generously, keep memory modest.
```
Example: argocd-repo-server
VPA target: cpu 2975m, memory 523Mi
Action: cpu 3000m request, memory 768Mi request
```

### Memory-Bound (Databases, caches)
Low CPU target, high memory target. Increase memory, keep CPU low.
```
Example: Redis
VPA target: cpu 23m, memory 100Mi
Action: cpu 50m request, memory 128Mi request
```

### Idle/Lightweight (UI servers, webhooks)
Both CPU and memory very low. Set modest requests with generous limits for occasional spikes.
```
Example: argocd-server
VPA target: cpu 23m, memory 175Mi
Action: cpu 50m request, memory 256Mi request
```

### GPU Workloads
VPA only tracks CPU/memory, not GPU. Recommendations will show low CPU/memory because the heavy compute runs on the GPU and the working data sits in VRAM. Set CPU/memory based on data loading needs, not inference.

## Real-World Example: ArgoCD Optimization

### Before (manual guesswork)
```
controller:     cpu: 1000m, memory: 1Gi     # UNDER-PROVISIONED (below lowerBound!)
repo-server:    cpu: 1000m, memory: 1Gi     # UNDER-PROVISIONED 3x
server:         cpu: 500m,  memory: 512Mi   # OVER-PROVISIONED 20x
applicationSet: cpu: 250m,  memory: 256Mi   # OVER-PROVISIONED 5x
redis:          cpu: 100m,  memory: 128Mi   # OVER-PROVISIONED 4x
Total: 2.85 CPU, 2.9Gi memory
```

### VPA Said
```
controller:     target: 2048m CPU, 1.25Gi memory (lowerBound: 1021m > current 1000m!)
repo-server:    target: 2975m CPU, 523Mi memory
server:         target: 23m CPU, 175Mi memory
applicationSet: target: 49m CPU, 100Mi memory
redis:          target: 23m CPU, 100Mi memory
```

### After (VPA-optimized)
```
controller:     cpu: 2000m, memory: 1536Mi  # DOUBLED CPU (was throttled)
repo-server:    cpu: 3000m, memory: 768Mi   # TRIPLED CPU, memory cut 25%
server:         cpu: 50m,   memory: 256Mi   # REDUCED 10x
applicationSet: cpu: 100m,  memory: 128Mi   # REDUCED 2.5x
redis:          cpu: 50m,   memory: 128Mi   # REDUCED 2x
Total: 5.2 CPU, 2.8Gi memory
```

**Result**: +2.35 CPU where it was needed (controller/repo-server), -0.1Gi memory overall, no more CPU throttling on the controller.

See `infrastructure/controllers/argocd/values.yaml` for the actual implementation with inline VPA documentation.

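The CPU totals in this example can be re-derived from the per-component requests. The snippet below is only an arithmetic check; `sum_millicores` is a hypothetical helper name, and the numbers are the CPU requests listed in the Before and After blocks.

```bash
# Sum per-component CPU requests (millicores) in this order:
# controller, repo-server, server, applicationSet, redis.
sum_millicores() {
  local total=0 value
  for value in "$@"; do
    total=$(( total + value ))
  done
  echo "$total"
}

before=$(sum_millicores 1000 1000 500 250 100)
after=$(sum_millicores 2000 3000 50 100 50)
echo "before=${before}m after=${after}m delta=$(( after - before ))m"
# prints: before=2850m after=5200m delta=2350m
```

That matches the 2.85 CPU and 5.2 CPU totals and the +2.35 CPU difference quoted in the result.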
## Excluded Namespaces

The Kyverno `vpa-auto-create` policy excludes:
- `kube-system`: critical system components, don't touch
- `kyverno`: policy engine, restart = cluster-wide impact
- `vertical-pod-autoscaler`: VPA managing itself creates feedback loops

## K8s 1.35: In-Place Pod Resize (Future)

This cluster runs K8s v1.35.1 where In-Place Pod Resize is GA. VPA supports `updateMode: "InPlaceOrRecreate"` which resizes pods **without restarting them** when possible.

Currently we use `updateMode: "Off"` (manual review). When confident in VPA accuracy after 2-4 weeks of observation, you can switch individual workloads to `InPlaceOrRecreate`:

```yaml
# Fragment: metadata and spec.targetRef are omitted
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
spec:
  updatePolicy:
    updateMode: "InPlaceOrRecreate" # Live resize when possible
```

**Start with non-critical workloads** (dev tools, media apps) before enabling on infrastructure.

## Troubleshooting

### No recommendations showing
- VPA needs ~5-10 minutes for initial data and 24+ hours for a usable first estimate (see Timing for the 7-day guidance)
- Check metrics-server: `kubectl top nodes` (should return data)
- Check VPA recommender: `kubectl logs -n vertical-pod-autoscaler -l app.kubernetes.io/component=recommender`

### Goldilocks dashboard is empty
- Check if Goldilocks controller is running: `kubectl get pods -n goldilocks`
- Goldilocks is set to `on-by-default: "true"`, so all namespaces should appear
- VPA resources must exist (Kyverno creates them on Deployment/StatefulSet CREATE/UPDATE)

### VPA recommendations seem too high/low
- Not enough data: wait 7-14 days
- Workload changed recently: VPA is backward-looking
- Check `upperBound` for peak usage context
- Batch/cron workloads have spiky usage: use `upperBound` for limits

### Pods OOMKilled after applying VPA
- VPA target reflects steady-state, not initialization spikes
- Set `limits.memory` well above `requests.memory` (2-4x)
- Check startup memory with `kubectl top pod` during pod init

## Quick Reference

```bash
# Goldilocks dashboard (LAN): https://goldilocks.vanillax.me

# All VPA recommendations (cluster-wide, first container of each workload)
kubectl get vpa -A -o custom-columns=\
NS:.metadata.namespace,\
NAME:.metadata.name,\
CPU:.status.recommendation.containerRecommendations[0].target.cpu,\
MEM:.status.recommendation.containerRecommendations[0].target.memory

# Current resource usage (compare against the requests shown below)
kubectl top pods -n <namespace>

# Compare current requests vs VPA target
kubectl get deploy <name> -n <ns> -o jsonpath='{.spec.template.spec.containers[0].resources}'
kubectl get vpa <name> -n <ns> -o jsonpath='{.status.recommendation.containerRecommendations[0].target}'
```

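When comparing a request against a VPA target by hand, the two CPU values may be written differently (for example `2` versus `2975m`). The following is a minimal sketch of a normalizer; `to_millicores` is a hypothetical helper, and it assumes the quantity is either plain cores or carries the `m` suffix, which covers the CPU values shown in this guide but not every Kubernetes quantity format.

```bash
# Hypothetical helper: normalize a Kubernetes CPU quantity to millicores so
# values like "2975m", "2", and "0.5" can be compared numerically.
# Assumption: input is plain cores or uses the "m" suffix; nothing else.
to_millicores() {
  awk -v q="$1" 'BEGIN {
    if (q ~ /m$/) { sub(/m$/, "", q); printf "%.0f\n", q }
    else          { printf "%.0f\n", q * 1000 }
  }'
}

to_millicores 2975m   # prints 2975
to_millicores 2       # prints 2000
to_millicores 0.5     # prints 500
```

The CPU strings returned by the two `jsonpath` queries above could be passed through this function before comparing them.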
## Related Docs

- [Monitoring README](../monitoring/README.md): metrics-server vs Prometheus pipelines
- [VPA component README](../infrastructure/controllers/vertical-pod-autoscaler/README.md)
- [Kyverno VPA policy](../infrastructure/controllers/kyverno/policies/vpa-auto-create.yaml)
- [Goldilocks config](../infrastructure/controllers/goldilocks/)

---

**Last Updated**: 2026-02-24
**Cluster**: talos-prod-cluster (K8s v1.35.1, Talos v1.12.4)

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -31,6 +31,7 @@ nav:
 - Home: index.md
 - Architecture:
 - ArgoCD & GitOps: argocd.md
+- VPA Resource Optimization: vpa-resource-optimization.md
 - Network Topology: network-topology.md
 - Network Security: network-policy.md
 - Backup & Restore:
