Summary
The WVA controller currently emits only 4 Prometheus metrics (wva_desired_replicas, wva_current_replicas, wva_desired_ratio, wva_replica_scaling_total). Critical internal state — saturation analysis results, optimizer decisions, limiter constraints, collection health, and error counts — is only visible through logs. This makes it difficult for operators to monitor, alert on, and troubleshoot autoscaling behavior without log parsing.
This epic defines a phased improvement to WVA observability across metrics, Kubernetes events, alerting rules, and dashboards, broken into independently deliverable sub-issues.
Current State
| Area | Status | Notes |
|---|---|---|
| Prometheus metrics | 4 metrics (replica counts + scaling counter) | No saturation, capacity, error, or timing metrics |
| Kubernetes events | 1 event type (ServiceMonitor deletion) | No events for scaling decisions or errors |
| Status conditions | 3 types (TargetResolved, MetricsAvailable, OptimizationReady) | Good coverage |
| Structured logging | Consistent ctrl.Log with key-value pairs | Good, but many values only at DEBUG level |
| Health endpoints | /healthz (ping), /readyz (ConfigMap bootstrap) | No Prometheus connectivity check |
| Alerting rules | None | No PrometheusRule CRDs |
| Dashboards | Benchmark-only (replica counts + vLLM metrics) | No operational dashboard |
Components with No Observability Today
These files emit no metrics and log little or nothing above DEBUG level:
- internal/engines/analyzers/saturation_v2/analyzer.go: zero metrics or logging
- internal/engines/pipeline/greedy_score_optimizer.go: DEBUG logs only
- internal/engines/pipeline/enforcer.go: DEBUG logs only
- internal/engines/pipeline/cost_aware_optimizer.go: DEBUG logs only
- internal/collector/replica_metrics.go: INFO/DEBUG logs only, no metrics
- internal/engines/scalefromzero/engine.go: minimal logging, no metrics
Sub-Issues
- Sub-issue 1: Saturation and Capacity Metrics
- Sub-issue 2: Error Tracking Metrics
- Sub-issue 3: Optimization Loop Performance Metrics
- Sub-issue 4: Pipeline Stage Visibility Metrics (Limiter, Enforcer, Optimizer)
- Sub-issue 5: Collector Health Metrics
- Sub-issue 6: Configuration State Metrics
- Sub-issue 7: Kubernetes Events for Scaling Decisions
- Sub-issue 8: PrometheusRule Alerting Rules
- Sub-issue 9: Operational Grafana Dashboard
- Sub-issue 10: Observability Documentation
Implementation Order
- Phase 1 (P0): Sub-issues 1, 2, 3 (core metrics; can be done in parallel)
- Phase 2 (P1): Sub-issues 4, 5, 7 (pipeline metrics, collector metrics, events)
- Phase 3 (P2): Sub-issues 6, 8, 9, 10 (config metrics, alerts, dashboard, docs)
Implementation Notes
- All new metrics must use the existing internal/metrics/ package pattern (register via prometheus.Registerer, support optional controller_instance label)
- Histogram buckets for duration metrics: {0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10} seconds
- Each sub-issue should be a separate PR for reviewability
- Sub-issues within the same phase can be developed in parallel