WVA integrates with Prometheus to collect metrics from vLLM inference servers and expose custom autoscaling metrics. This guide covers Prometheus configuration, metric collection, and security best practices.
WVA supports two methods for configuring Prometheus connectivity: environment variables and a ConfigMap.
To use environment variables, set them on the manager container of the WVA deployment:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: workload-variant-autoscaler-controller-manager
    spec:
      template:
        spec:
          containers:
            - name: manager
              env:
                # Required: Prometheus server URL
                - name: PROMETHEUS_BASE_URL
                  value: "https://prometheus-k8s.monitoring.svc.cluster.local:9091"
                # Optional: TLS configuration
                - name: PROMETHEUS_TLS_INSECURE_SKIP_VERIFY
                  value: "false" # Set to "true" only for testing/development
                - name: PROMETHEUS_CA_CERT_PATH
                  value: "/etc/prometheus-certs/ca.crt"
                - name: PROMETHEUS_CLIENT_CERT_PATH
                  value: "/etc/prometheus-certs/client.crt"
                - name: PROMETHEUS_CLIENT_KEY_PATH
                  value: "/etc/prometheus-certs/client.key"
                - name: PROMETHEUS_SERVER_NAME
                  value: "prometheus-k8s.monitoring.svc.cluster.local"
                # Optional: Bearer token authentication
                - name: PROMETHEUS_BEARER_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: prometheus-token
                      key: token

Environment Variable Reference:

| Variable | Required | Description | Default |
|---|---|---|---|
| PROMETHEUS_BASE_URL | Yes | Prometheus server URL (HTTPS only in production) | - |
| PROMETHEUS_TLS_INSECURE_SKIP_VERIFY | No | Skip TLS certificate verification (dev/test only) | false |
| PROMETHEUS_CA_CERT_PATH | No | Path to CA certificate for TLS verification | - |
| PROMETHEUS_CLIENT_CERT_PATH | No | Path to client certificate for mutual TLS | - |
| PROMETHEUS_CLIENT_KEY_PATH | No | Path to client private key for mutual TLS | - |
| PROMETHEUS_SERVER_NAME | No | Expected server name in TLS certificate | - |
| PROMETHEUS_BEARER_TOKEN | No | Bearer token for Prometheus authentication | - |
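
To illustrate how the TLS variables in the table fit together, here is a minimal Go sketch that turns them into a `crypto/tls` configuration. The helper name `tlsConfigFromEnv` and its error messages are hypothetical; this is not WVA's actual implementation, only one plausible way to consume these settings.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"os"
	"strconv"
)

// tlsConfigFromEnv is a hypothetical helper (not WVA's code). It reads the
// PROMETHEUS_* TLS variables through getenv and builds a tls.Config.
func tlsConfigFromEnv(getenv func(string) string) (*tls.Config, error) {
	cfg := &tls.Config{
		MinVersion: tls.VersionTLS12,
		ServerName: getenv("PROMETHEUS_SERVER_NAME"),
	}

	// Unset means "false": certificate verification stays enabled.
	if v := getenv("PROMETHEUS_TLS_INSECURE_SKIP_VERIFY"); v != "" {
		skip, err := strconv.ParseBool(v)
		if err != nil {
			return nil, fmt.Errorf("PROMETHEUS_TLS_INSECURE_SKIP_VERIFY: %w", err)
		}
		cfg.InsecureSkipVerify = skip
	}

	// Trust a custom CA when a path is provided.
	if path := getenv("PROMETHEUS_CA_CERT_PATH"); path != "" {
		caPEM, err := os.ReadFile(path)
		if err != nil {
			return nil, fmt.Errorf("read CA certificate: %w", err)
		}
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM(caPEM) {
			return nil, fmt.Errorf("no certificates found in %s", path)
		}
		cfg.RootCAs = pool
	}

	// Mutual TLS needs both the client certificate and its private key.
	certPath := getenv("PROMETHEUS_CLIENT_CERT_PATH")
	keyPath := getenv("PROMETHEUS_CLIENT_KEY_PATH")
	if (certPath == "") != (keyPath == "") {
		return nil, errors.New("client certificate and key must be set together")
	}
	if certPath != "" {
		cert, err := tls.LoadX509KeyPair(certPath, keyPath)
		if err != nil {
			return nil, fmt.Errorf("load client key pair: %w", err)
		}
		cfg.Certificates = []tls.Certificate{cert}
	}
	return cfg, nil
}

func main() {
	cfg, err := tlsConfigFromEnv(os.Getenv)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("skip verify:", cfg.InsecureSkipVerify)
}
```

Passing `getenv` as a parameter keeps the sketch testable without touching the process environment.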
Alternatively, configure Prometheus via the controller's ConfigMap:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: wva-variantautoscaling-config
      namespace: workload-variant-autoscaler-system
    data:
      PROMETHEUS_BASE_URL: "https://prometheus-k8s.monitoring.svc.cluster.local:9091"
      PROMETHEUS_TLS_INSECURE_SKIP_VERIFY: "false"
      PROMETHEUS_CA_CERT_PATH: "/etc/prometheus-certs/ca.crt"
      PROMETHEUS_CLIENT_CERT_PATH: "/etc/prometheus-certs/client.crt"
      PROMETHEUS_CLIENT_KEY_PATH: "/etc/prometheus-certs/client.key"
      PROMETHEUS_SERVER_NAME: "prometheus-k8s.monitoring.svc.cluster.local"
      PROMETHEUS_BEARER_TOKEN: "your-bearer-token" # Not recommended - use Secret instead

Configuration Priority:
- Environment variables (checked first)
- ConfigMap values (fallback)
- Error if neither provides PROMETHEUS_BASE_URL
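
The lookup order above can be sketched in Go as follows. The function names are hypothetical, and treating an empty value as "not set" is an assumption of this sketch rather than documented WVA behavior.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// resolveSetting is a hypothetical helper showing the lookup order:
// environment variable first, ConfigMap data second.
// Assumption: an empty string counts as "not set" in both sources.
func resolveSetting(key string, getenv func(string) string, configMap map[string]string) (string, bool) {
	if v := getenv(key); v != "" {
		return v, true
	}
	if v := configMap[key]; v != "" {
		return v, true
	}
	return "", false
}

// resolveBaseURL fails when neither source provides PROMETHEUS_BASE_URL.
func resolveBaseURL(getenv func(string) string, configMap map[string]string) (string, error) {
	v, ok := resolveSetting("PROMETHEUS_BASE_URL", getenv, configMap)
	if !ok {
		return "", errors.New("PROMETHEUS_BASE_URL is set neither in the environment nor in the ConfigMap")
	}
	return v, nil
}

func main() {
	configMap := map[string]string{
		"PROMETHEUS_BASE_URL": "https://prometheus-k8s.monitoring.svc.cluster.local:9091",
	}
	baseURL, err := resolveBaseURL(os.Getenv, configMap)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("using", baseURL)
}
```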
Production Deployments:
- Always use HTTPS endpoints (https://)
- Provide CA certificate via PROMETHEUS_CA_CERT_PATH
- Never set PROMETHEUS_TLS_INSECURE_SKIP_VERIFY=true in production
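
A pre-flight check along these lines could catch violations of the production rules before the controller starts. This is a sketch only: `validateProductionEndpoint` is a hypothetical name, and the guide does not say whether WVA performs such a check itself.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// validateProductionEndpoint is a hypothetical check (not part of WVA) that
// rejects non-HTTPS URLs and disabled certificate verification.
func validateProductionEndpoint(baseURL string, insecureSkipVerify bool) error {
	u, err := url.Parse(baseURL)
	if err != nil {
		return fmt.Errorf("parse PROMETHEUS_BASE_URL: %w", err)
	}
	if u.Scheme != "https" {
		return fmt.Errorf("PROMETHEUS_BASE_URL must use https, got scheme %q", u.Scheme)
	}
	if u.Host == "" {
		return errors.New("PROMETHEUS_BASE_URL has no host")
	}
	if insecureSkipVerify {
		return errors.New("PROMETHEUS_TLS_INSECURE_SKIP_VERIFY must not be true in production")
	}
	return nil
}

func main() {
	err := validateProductionEndpoint("http://prometheus-k8s.monitoring.svc.cluster.local:9091", false)
	fmt.Println("validation result:", err)
}
```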
Development/Testing:
- You may set PROMETHEUS_TLS_INSECURE_SKIP_VERIFY=true for local clusters
- Example (port-forwarding to Prometheus):

      # Terminal 1: Port forward Prometheus
      kubectl port-forward -n monitoring svc/prometheus-k8s 9091:9091

      # Terminal 2: Set environment for local development
      export PROMETHEUS_BASE_URL=https://127.0.0.1:9091
      export PROMETHEUS_TLS_INSECURE_SKIP_VERIFY=true

WVA implements security measures to prevent PromQL injection attacks:
- Parameter Escaping: All query parameters (namespace, model ID, variant name) are automatically escaped:
  - Backslashes are escaped: \ → \\
  - Double quotes are escaped: " → \"
- Namespace Validation: Namespace values are validated before use in PromQL queries to prevent malicious label matchers
Example - Safe Query Construction:

    // User input (potentially malicious)
    namespace := `prod",malicious="value`

    // WVA automatically escapes the value
    escapedNamespace := EscapePromQLValue(namespace)
    // Result: `prod\",malicious=\"value`

    // Safe PromQL query
    query := fmt.Sprintf(`vllm_kv_cache_usage{namespace="%s"}`, escapedNamespace)
    // Result: vllm_kv_cache_usage{namespace="prod\",malicious=\"value"}
    // Prometheus treats this as a literal string, preventing injection

Why This Matters:
- Prevents unauthorized access to metrics from other namespaces
- Blocks label injection attacks that could manipulate query results
- Ensures multi-tenant deployments remain isolated
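
For readers who want to see the two defenses in runnable form, the sketch below reimplements them under stated assumptions. `escapePromQLValue` mirrors the escaping rules described above but is not WVA's `EscapePromQLValue` source. `isValidNamespace` assumes validation follows the Kubernetes rule that namespace names are RFC 1123 DNS labels; the guide does not specify which rule WVA applies.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// promQLEscaper rewrites backslashes and double quotes in a single pass, so a
// backslash added while escaping a quote is never escaped a second time.
var promQLEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`)

// escapePromQLValue is an illustrative reimplementation of the escaping rules.
func escapePromQLValue(value string) string {
	return promQLEscaper.Replace(value)
}

// dns1123Label matches lowercase alphanumerics and '-', starting and ending
// with an alphanumeric character.
var dns1123Label = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// isValidNamespace is a hypothetical validator (assumed rule: RFC 1123 label,
// at most 63 characters).
func isValidNamespace(namespace string) bool {
	return len(namespace) <= 63 && dns1123Label.MatchString(namespace)
}

func main() {
	for _, ns := range []string{"prod", `prod",malicious="value`} {
		fmt.Printf("namespace=%q valid=%v escaped=%s\n", ns, isValidNamespace(ns), escapePromQLValue(ns))
	}
}
```

Validation rejects the malicious input outright, while escaping is the second line of defense for parameters, such as model IDs, that legitimately contain arbitrary characters.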
WVA exposes custom metrics that provide insight into autoscaling behavior and optimization performance. These metrics are served in Prometheus format at the /metrics endpoint.
All custom metrics are prefixed with wva_ (the prefix used in the example queries in this guide) and include labels for variant_name, namespace, and other relevant dimensions.
No optimization metrics are currently exposed. Optimization timing is logged at DEBUG level.
Current replicas (wva_current_replicas):
- Type: Gauge
- Description: Current number of replicas for each variant
- Labels:
  - variant_name: Name of the variant
  - namespace: Kubernetes namespace
  - accelerator_type: Type of accelerator being used
- Use Case: Monitor current number of replicas per variant

Desired replicas (wva_desired_replicas):
- Type: Gauge
- Description: Desired number of replicas for each variant
- Labels:
  - variant_name: Name of the variant
  - namespace: Kubernetes namespace
  - accelerator_type: Type of accelerator being used
- Use Case: Expose the desired optimized number of replicas per variant

Desired-to-current replica ratio:
- Type: Gauge
- Description: Ratio of the desired number of replicas to the current number of replicas for each variant
- Labels:
  - variant_name: Name of the variant
  - namespace: Kubernetes namespace
  - accelerator_type: Type of accelerator being used
- Use Case: Compare the desired and current number of replicas per variant, for scaling purposes
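
As a worked illustration of the ratio, the Go sketch below computes desired divided by current. The function name is hypothetical, and the handling of zero current replicas (returning the desired count) is an assumption made here to avoid division by zero; this guide does not state what WVA reports in that case.

```go
package main

import "fmt"

// desiredRatio returns desired/current.
// Assumption: when current is 0 the desired count itself is returned, since
// the guide does not document WVA's behavior for that case.
func desiredRatio(desired, current int) float64 {
	if current == 0 {
		return float64(desired)
	}
	return float64(desired) / float64(current)
}

func main() {
	fmt.Println(desiredRatio(6, 4)) // 1.5: scale up
	fmt.Println(desiredRatio(2, 4)) // 0.5: scale down
	fmt.Println(desiredRatio(3, 0)) // 3: assumed convention for zero current replicas
}
```

A value above 1 indicates that more replicas are wanted than are running, and a value below 1 indicates the opposite.
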
Scaling operations (wva_replica_scaling_total):
- Type: Counter
- Description: Total number of replica scaling operations
- Labels:
  - variant_name: Name of the variant
  - namespace: Kubernetes namespace
  - direction: Direction of scaling (up, down)
  - reason: Reason for scaling
- Use Case: Track scaling frequency and reasons

WVA serves its metrics at the /metrics endpoint on port 8080 (HTTP). A ServiceMonitor such as the following scrapes them:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: workload-variant-autoscaler
      namespace: workload-variant-autoscaler-system
      labels:
        release: kube-prometheus-stack
    spec:
      selector:
        matchLabels:
          control-plane: controller-manager
      endpoints:
        - port: http
          scheme: http
          interval: 30s
          path: /metrics

Example queries:

    # Current replicas by variant
    wva_current_replicas

    # Scaling frequency
    rate(wva_replica_scaling_total[5m])

    # Desired replicas by variant
    wva_desired_replicas

    # Scaling frequency by direction
    rate(wva_replica_scaling_total{direction="scale_up"}[5m])

    # Replica count mismatch
    abs(wva_desired_replicas - wva_current_replicas)

    # Scaling frequency by reason
    sum by (reason) (rate(wva_replica_scaling_total[5m]))
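
To run queries like these programmatically, a client can call Prometheus's instant-query endpoint, `/api/v1/query`. The Go sketch below only builds the request; the helper name `newInstantQueryRequest` is hypothetical, and the base URL is the local port-forward address used earlier in this guide.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// newInstantQueryRequest is a hypothetical helper that builds a GET request
// for the Prometheus HTTP API instant-query endpoint. The PromQL expression is
// URL-encoded, and a bearer token is attached only when one is supplied.
func newInstantQueryRequest(baseURL, promQL, bearerToken string) (*http.Request, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return nil, fmt.Errorf("parse base URL: %w", err)
	}
	u.Path = strings.TrimRight(u.Path, "/") + "/api/v1/query"

	params := url.Values{}
	params.Set("query", promQL)
	u.RawQuery = params.Encode()

	req, err := http.NewRequest(http.MethodGet, u.String(), nil)
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	if bearerToken != "" {
		req.Header.Set("Authorization", "Bearer "+bearerToken)
	}
	return req, nil
}

func main() {
	req, err := newInstantQueryRequest(
		"https://127.0.0.1:9091",
		"abs(wva_desired_replicas - wva_current_replicas)",
		"",
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```

The request can then be sent with an `http.Client` whose transport carries the TLS settings configured for Prometheus.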