PrometheusRule Alerting Rules #919

@ev-shindin

Add a PrometheusRule resource (the Prometheus Operator custom resource for rule definitions) with baseline alerting rules.

Alerts:

groups:
- name: wva.rules
  rules:
  - alert: WVAHighErrorRate
    expr: rate(wva_errors_total[5m]) > 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "WVA error rate elevated"

  - alert: WVAOptimizationLoopStalled
    expr: rate(wva_models_processed_total[10m]) == 0
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: "WVA optimization loop has stopped processing models"

  - alert: WVAMetricsCollectionFailing
    expr: rate(wva_metrics_collection_errors_total[5m]) > 0.5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "WVA metrics collection failing frequently"

  - alert: WVAGPUResourceExhausted
    expr: wva_available_gpus == 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "No GPUs available for WVA scaling"

  - alert: WVAReplicaScalingThrashing
    expr: rate(wva_replica_scaling_total[10m]) > 2
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "WVA scaling decisions changing rapidly (possible thrashing)"
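
For context, the groups above would sit under spec in a PrometheusRule object. The following is a minimal sketch, not the final manifest: the metadata name, namespace, and the release label are assumptions chosen for illustration, and the label that the Prometheus Operator's ruleSelector actually matches depends on the cluster's Prometheus configuration. Only the first rule is repeated here; the others nest the same way.

```yaml
# Sketch of config/prometheus/ manifest; name, namespace, and labels are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: wva-alerting-rules        # hypothetical name
  namespace: wva-system           # hypothetical namespace
  labels:
    release: prometheus           # assumption: must match the Prometheus ruleSelector in use
spec:
  groups:
  - name: wva.rules
    rules:
    - alert: WVAHighErrorRate
      expr: rate(wva_errors_total[5m]) > 0.1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "WVA error rate elevated"
    # ...remaining rules from the list above go here unchanged
```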

Implementation:

  • Add a PrometheusRule manifest in config/prometheus/
  • Include it in the Helm chart as an optional resource, enabled via a values.yaml flag
  • Make alert thresholds configurable through Helm values
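
One possible shape for the Helm side is sketched below. The values keys (prometheusRule.enabled, prometheusRule.thresholds.*) and the template's resource name are hypothetical and only illustrate the enable flag and threshold overrides; the defaults mirror the thresholds listed in the rules above.

```yaml
# values.yaml (hypothetical keys; defaults copied from the rules in this issue)
prometheusRule:
  enabled: false
  thresholds:
    errorRate: 0.1
    metricsCollectionErrorRate: 0.5
    replicaScalingRate: 2
```

```yaml
# templates/prometheusrule.yaml (sketch; rendered only when the flag is set)
{{- if .Values.prometheusRule.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {{ .Release.Name }}-wva-rules   # hypothetical naming scheme
spec:
  groups:
  - name: wva.rules
    rules:
    - alert: WVAHighErrorRate
      expr: rate(wva_errors_total[5m]) > {{ .Values.prometheusRule.thresholds.errorRate }}
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "WVA error rate elevated"
    # ...remaining rules follow the same pattern
{{- end }}
```

With this layout, an override such as --set prometheusRule.thresholds.errorRate=0.2 would change the rendered expression without editing the template.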

Acceptance Criteria:

  • The PrometheusRule resource deploys with both make deploy and the Helm chart
  • Alerts fire correctly when conditions are met (manual verification)
  • The Helm chart supports enabling/disabling the rules and overriding thresholds
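
Manual verification could be complemented by a promtool unit test. The sketch below assumes the groups have been extracted into a plain rules file named wva-rules.yaml (a hypothetical filename), since promtool test rules reads plain rule files rather than PrometheusRule objects. It feeds a constant zero for wva_available_gpus and checks that WVAGPUResourceExhausted is not yet firing at 2m but is firing at 10m, after the 5m "for" window has elapsed.

```yaml
# wva-rules-test.yaml (sketch; run with: promtool test rules wva-rules-test.yaml)
rule_files:
  - wva-rules.yaml                # hypothetical file holding the "groups:" block above
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'wva_available_gpus'
        values: '0x20'            # 0 repeated: one sample per minute from 0m to 20m
    alert_rule_test:
      - eval_time: 2m             # still pending, so no firing alert is expected
        alertname: WVAGPUResourceExhausted
        exp_alerts: []
      - eval_time: 10m            # past the 5m "for" duration
        alertname: WVAGPUResourceExhausted
        exp_alerts:
          - exp_labels:
              severity: warning
            exp_annotations:
              summary: "No GPUs available for WVA scaling"
```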

Labels: needs-triage
