Add a PrometheusRule CRD with baseline alerting rules.
Alerts:
groups:
  - name: wva.rules
    rules:
      - alert: WVAHighErrorRate
        expr: rate(wva_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "WVA error rate elevated"
      - alert: WVAOptimizationLoopStalled
        expr: rate(wva_models_processed_total[10m]) == 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "WVA optimization loop has stopped processing models"
      - alert: WVAMetricsCollectionFailing
        expr: rate(wva_metrics_collection_errors_total[5m]) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "WVA metrics collection failing frequently"
      - alert: WVAGPUResourceExhausted
        expr: wva_available_gpus == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No GPUs available for WVA scaling"
      - alert: WVAReplicaScalingThrashing
        expr: rate(wva_replica_scaling_total[10m]) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "WVA scaling decisions changing rapidly (possible thrashing)"
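For reference, a rule group like the one listed here is normally wrapped in a PrometheusRule resource from the Prometheus Operator API (`monitoring.coreos.com/v1`). The sketch below shows that wrapper around the first alert only. The resource name, namespace, and `release` label are assumptions for illustration and are not defined anywhere in this issue; the label has to match whatever `ruleSelector` the target Prometheus instance uses.

```yaml
# Sketch only: metadata values below are hypothetical placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: wva-alerts            # hypothetical name
  namespace: monitoring       # hypothetical namespace
  labels:
    release: prometheus       # assumed to match the Prometheus ruleSelector
spec:
  groups:
    - name: wva.rules
      rules:
        - alert: WVAHighErrorRate
          expr: rate(wva_errors_total[5m]) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "WVA error rate elevated"
```

The remaining alerts would follow as further entries under `spec.groups[0].rules`.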
Implementation:
- Add PrometheusRule CRD in config/prometheus/
- Include in Helm chart as optional (enabled via values.yaml flag)
- Thresholds should be configurable in Helm values
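
One possible shape for the Helm side of this, as a rough sketch rather than a settled design: an opt-in flag plus a thresholds map in values.yaml. All key names under `prometheusRule` are hypothetical; the defaults mirror the expressions listed under Alerts.

```yaml
# values.yaml (hypothetical keys; defaults taken from the alert list in this issue)
prometheusRule:
  enabled: false                   # opt-in, rules are not installed by default
  thresholds:
    errorRate: 0.1                 # WVAHighErrorRate
    metricsCollectionErrorRate: 0.5  # WVAMetricsCollectionFailing
    replicaScalingRate: 2          # WVAReplicaScalingThrashing
```

A template could then gate the resource on the flag and substitute the thresholds into each expression. The file path and resource name are again assumptions, and only one alert is shown:

```yaml
# templates/prometheusrule.yaml (hypothetical path)
{{- if .Values.prometheusRule.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {{ .Release.Name }}-wva-alerts   # hypothetical naming scheme
spec:
  groups:
    - name: wva.rules
      rules:
        - alert: WVAHighErrorRate
          # threshold comes from values.yaml instead of being hard-coded
          expr: rate(wva_errors_total[5m]) > {{ .Values.prometheusRule.thresholds.errorRate }}
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "WVA error rate elevated"
{{- end }}
```

With a layout like this, `helm install --set prometheusRule.enabled=true` would turn the rules on, and individual thresholds could be overridden the same way.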
Acceptance Criteria:
make deploy and Helm chart