Commit 3d71574

fix: alerting rule
1 parent bca9711 commit 3d71574

1 file changed: +7 -2 lines changed

components/operators/gpu-operator-certified/instance/components/gpu-monitoring/gpu-monitoring.yaml

Lines changed: 7 additions & 2 deletions
@@ -14,8 +14,13 @@ spec:
     runbook_url: https://github.com/redhat-na-ssa/demo-ai-gitops-catalog/tree/main/components/operators/gpu-operator-certified/instance/components/monitoring-dashboard
     summary: Cloud costs may increase by requesting specialized resources.
   expr: |
-    sum (kube_pod_resource_request{resource="nvidia.com/gpu"} >= 1 ) > 0
+    # sum (kube_pod_resource_request{resource="nvidia.com/gpu"} >= 1 ) > 0
+    sum by (namespace, pod) (
+      kube_pod_resource_request{resource="nvidia.com/gpu"}
+      or
+      kube_pod_resource_limit{resource="nvidia.com/gpu"}
+    ) > 0
     # sum by (namespace, pod,resource) (kube_pod_resource_request{resource="nvidia.com/gpu"} >= 1) > 0
-  # for: 1m
+  # for: 1m
   labels:
     severity: info
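
Note: one way to sanity-check the new expression is a promtool unit test. The sketch below is an illustration under stated assumptions, not part of this commit. The namespace, pod names, and sample values are hypothetical, and the expression is evaluated directly through `promql_expr_test` instead of loading the rule manifest. It relies on the documented behavior of PromQL's `or`: a series from the right-hand side is kept only when the left-hand side has no series with the same label set, so a pod that exposes both a request and a limit with identical labels should be counted once.

```yaml
# Hypothetical promtool unit test for the expression introduced in this commit.
# All series below are made-up sample data.
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Pod that only sets a GPU request.
      - series: 'kube_pod_resource_request{namespace="demo", pod="train-0", resource="nvidia.com/gpu"}'
        values: '1+0x5'
      # Pod that only exposes a GPU limit series.
      - series: 'kube_pod_resource_limit{namespace="demo", pod="infer-0", resource="nvidia.com/gpu"}'
        values: '2+0x5'
      # Pod with both series and identical labels: `or` keeps the request only.
      - series: 'kube_pod_resource_request{namespace="demo", pod="both-0", resource="nvidia.com/gpu"}'
        values: '1+0x5'
      - series: 'kube_pod_resource_limit{namespace="demo", pod="both-0", resource="nvidia.com/gpu"}'
        values: '1+0x5'

    promql_expr_test:
      - expr: |
          sum by (namespace, pod) (
            kube_pod_resource_request{resource="nvidia.com/gpu"}
            or
            kube_pod_resource_limit{resource="nvidia.com/gpu"}
          ) > 0
        eval_time: 1m
        exp_samples:
          - labels: '{namespace="demo", pod="train-0"}'
            value: 1
          - labels: '{namespace="demo", pod="infer-0"}'
            value: 2
          - labels: '{namespace="demo", pod="both-0"}'
            value: 1
```

Saved under a name such as `gpu-alert-test.yaml` (hypothetical), this could be run with `promtool test rules gpu-alert-test.yaml`. If the request and limit series for a pod differ in any label, `or` would keep both and the per-pod sum would include both values.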
