
Epic: Improve WVA Observability — Metrics, Events, Alerting, and Dashboards #911

@ev-shindin

Description


Summary

The WVA controller currently emits only 4 Prometheus metrics (wva_desired_replicas, wva_current_replicas, wva_desired_ratio, wva_replica_scaling_total). Critical internal state — saturation analysis results, optimizer decisions, limiter constraints, collection health, and error counts — is visible only in logs. This makes it difficult for operators to monitor, alert on, and troubleshoot autoscaling behavior without parsing logs.

This epic defines a phased improvement to WVA observability across metrics, Kubernetes events, alerting rules, and dashboards, broken into independently deliverable sub-issues.


Current State

| Area | Status | Notes |
| --- | --- | --- |
| Prometheus metrics | 4 metrics (replica counts + scaling counter) | No saturation, capacity, error, or timing metrics |
| Kubernetes events | 1 event type (ServiceMonitor deletion) | No events for scaling decisions or errors |
| Status conditions | 3 types (TargetResolved, MetricsAvailable, OptimizationReady) | Good coverage |
| Structured logging | Consistent ctrl.Log with key-value pairs | Good, but many values only at DEBUG level |
| Health endpoints | /healthz (ping), /readyz (ConfigMap bootstrap) | No Prometheus connectivity check |
| Alerting rules | None | No PrometheusRule CRDs |
| Dashboards | Benchmark-only (replica counts + vLLM metrics) | No operational dashboard |

Components with No Observability Today

These files operate effectively without observability — no metrics, and logging at DEBUG level at best:

  • internal/engines/analyzers/saturation_v2/analyzer.go — zero metrics or logging
  • internal/engines/pipeline/greedy_score_optimizer.go — DEBUG logs only
  • internal/engines/pipeline/enforcer.go — DEBUG logs only
  • internal/engines/pipeline/cost_aware_optimizer.go — DEBUG logs only
  • internal/collector/replica_metrics.go — INFO/DEBUG logs only, no metrics
  • internal/engines/scalefromzero/engine.go — minimal logging, no metrics

Sub-Issues

  • Sub-issue 1: Saturation and Capacity Metrics
  • Sub-issue 2: Error Tracking Metrics
  • Sub-issue 3: Optimization Loop Performance Metrics
  • Sub-issue 4: Pipeline Stage Visibility Metrics (Limiter, Enforcer, Optimizer)
  • Sub-issue 5: Collector Health Metrics
  • Sub-issue 6: Configuration State Metrics
  • Sub-issue 7: Kubernetes Events for Scaling Decisions
  • Sub-issue 8: PrometheusRule Alerting Rules
  • Sub-issue 9: Operational Grafana Dashboard
  • Sub-issue 10: Observability Documentation
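For sub-issue 8, a minimal PrometheusRule sketch built on the metrics that already exist today (the rule name, alert name, threshold, and severity below are all illustrative, not decided):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: wva-alerts        # hypothetical name
spec:
  groups:
    - name: wva
      rules:
        - alert: WVAReplicaMismatch      # illustrative alert
          # Fires when WVA's desired replica count has diverged from the
          # observed replica count for a sustained period.
          expr: wva_desired_replicas != wva_current_replicas
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "WVA desired and current replicas have diverged for 15 minutes"
```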

Implementation Order

Phase 1 (P0):  Sub-issues 1, 2, 3    — Core metrics (can be done in parallel)
Phase 2 (P1):  Sub-issues 4, 5, 7    — Pipeline + collector metrics + events
Phase 3 (P2):  Sub-issues 6, 8, 9, 10 — Config metrics, alerts, dashboard, docs

Implementation Notes

  • All new metrics must use the existing internal/metrics/ package pattern (register via prometheus.Registerer, support optional controller_instance label)
  • Histogram buckets for duration metrics: {0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10} seconds
  • Each sub-issue should be a separate PR for reviewability
  • Sub-issues within the same phase can be developed in parallel

Labels: enhancement (new feature or request), epic, needs-triage
