Commit e636617

CARRY: Adding runbooks for alerts

1 parent bbcb742

5 files changed: +192 −0 lines changed

config/rhoai/prometheus_rule.yaml
+4

```diff
@@ -15,6 +15,7 @@ spec:
       annotations:
         summary: "Kueue pod is down ({{ $labels.pod }})"
         description: "The Kueue pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is not ready."
+        triage: "https://github.com/opendatahub-io/kueue/tree/dev/docs/alerts/runbooks/kueue-pod-down.md"
   - name: kueue-info-alerts
     rules:
     - alert: LowClusterQueueResourceUsage
@@ -25,6 +26,7 @@ spec:
       annotations:
         summary: Low {{ $labels.resource }} resource usage in cluster queue {{ $labels.cluster_queue }}
         description: The {{ $labels.resource }} resource usage in cluster queue {{ $labels.cluster_queue }} is below 20% of its nominal quota for more than 1 day.
+        triage: "https://github.com/opendatahub-io/kueue/tree/dev/docs/alerts/runbooks/low-cluster-queue-resource-usage.md"
     - alert: ResourceReservationExceedsQuota
       expr: (sum(kueue_cluster_queue_resource_reservation) by (resource, cluster_queue)) / 10 > (sum(kueue_cluster_queue_nominal_quota) by (resource, cluster_queue))
       for: 10m
@@ -33,6 +35,7 @@ spec:
       annotations:
         summary: Resource {{ $labels.resource }} reservation far exceeds the available quota in cluster queue {{ $labels.cluster_queue }}
         description: Resource {{ $labels.resource }} reservation is 10 times the available quota in cluster queue {{ $labels.cluster_queue }}
+        triage: "https://github.com/opendatahub-io/kueue/tree/dev/docs/alerts/runbooks/resource-reservation-exceeds-quota.md"
     - alert: PendingWorkloadPods
       expr: (sum by (namespace, pod) (sum_over_time(kube_pod_status_phase{phase="Pending"}[3d])) >= 3 * 24 * 60) > 0
       for: 1m
@@ -41,4 +44,5 @@ spec:
       annotations:
         summary: Pod {{ $labels.pod }} in the {{ $labels.namespace }} namespace has been pending for more than 3 days
         description: A pod {{ $labels.pod }} in the {{ $labels.namespace }} namespace has been in the pending state for more than 3 days.
+        triage: "https://github.com/opendatahub-io/kueue/tree/dev/docs/alerts/runbooks/pending-workload-pods.md"
```
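The `PendingWorkloadPods` expression compares a `sum_over_time` of `kube_pod_status_phase{phase="Pending"}` samples against `3 * 24 * 60`. Assuming a 1-minute scrape interval (our reading of the constant, not stated in the rule), a pod that has been Pending for the entire 3-day window contributes one sample per minute:

```shell
# Threshold used by the PendingWorkloadPods alert: one sample per
# scrape over a 3-day window, assuming a 1-minute scrape interval
# (an assumption about the Prometheus config, not stated in the rule).
threshold=$((3 * 24 * 60))
echo "$threshold"  # prints 4320
```

If the scrape interval differs from one minute, the constant no longer corresponds to exactly 3 days of Pending samples.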

docs/alerts/runbooks/kueue-pod-down.md
+42

# Kueue Pod Down

## Severity: Critical

## Impact

Any workloads running on the cluster will not be able to use the Kueue component.

## Summary

This alert is triggered when the `kube_pod_status_ready` query shows that the Kueue controller pod is not ready.

## Steps

1. Check whether the `kueue-controller` pod is running in the `redhat-ods-applications` namespace:

```bash
$ oc -n redhat-ods-applications get pods -l app.kubernetes.io/name=kueue
```

2. If the pod is not running, inspect the pod's logs and events to identify what may be causing the issue. Make sure to capture the logs and events so they can be shared with the engineering team later:

```bash
# Check pod logs
$ oc -n redhat-ods-applications logs -l app.kubernetes.io/name=kueue --prefix=true

# Check events
$ oc -n redhat-ods-applications get events | grep pod/kueue-controller

# Check pod status fields
$ oc -n redhat-ods-applications get pods -l app.kubernetes.io/name=kueue -o jsonpath="{range .items[*]}{.status}{\"\n\n\"}{end}"
```
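The `grep pod/kueue-controller` filter above simply narrows the events listing to the controller pod. A quick local illustration on fabricated sample lines (the event text below is made up; real output comes from `oc get events`):

```shell
# Fabricated sample of `oc get events` output, for illustration only.
events='2m  Normal   Scheduled  pod/kueue-controller-abc  Successfully assigned
5m  Warning  BackOff    pod/other-app-xyz         Back-off restarting container
9m  Warning  Failed     pod/kueue-controller-abc  Error: ImagePullBackOff'

# Keep only the events that mention the Kueue controller pod.
printf '%s\n' "$events" | grep pod/kueue-controller
```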
3. Redeploy the Kueue operator by restarting the deployment:

```bash
$ oc -n redhat-ods-applications rollout restart deployments/kueue-controller-manager
```

This should result in a new pod being deployed; repeat step 1 and check whether the pod reaches the Running state.

4. If the problem persists, capture the logs and escalate to the RHOAI engineering team.
docs/alerts/runbooks/low-cluster-queue-resource-usage.md
+53

# Low Cluster Queue Resource Usage

## Severity: Info

## Impact

Resources that are consistently unused can be redistributed.

## Summary

This alert is triggered when the resource usage in a cluster queue is below 20% of its nominal quota for more than 1 day.

## Steps

1. Check the current resource usage for the cluster queue and ensure that the nominal quota for the resource in question is correctly configured. Replace `cluster-queue-name` below with the name of the cluster queue to describe.

```bash
cluster_queue=<cluster-queue-name>
oc describe clusterqueue $cluster_queue
```

- If you would like to view just the Flavors and Nominal Quota, you can use the following command:

```bash
oc describe clusterqueue $cluster_queue | awk '/Flavors:/,/^$/'
```
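The `awk '/Flavors:/,/^$/'` range pattern prints everything from the `Flavors:` line through the next blank line. A quick local check on a made-up fragment of `oc describe clusterqueue` output (field names mirror real output, values are illustrative):

```shell
# Fabricated fragment of `oc describe clusterqueue` output.
describe_output='Spec:
  Flavors:
    Name:  default-flavor
    Resources:
      Name:           cpu
      Nominal Quota:  8

Status:
  Admitted Workloads:  0'

# The range pattern selects from "Flavors:" up to the first blank line,
# so the Status section is dropped.
printf '%s\n' "$describe_output" | awk '/Flavors:/,/^$/'
```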
2. Review the workloads linked to the cluster queue to see if the assigned resources are required.

```bash
# Find local queues linked to the cluster queue
local_queues=$(oc get localqueues --all-namespaces -o json | jq -r --arg clusterQueue "$cluster_queue" '.items[] | select(.spec.clusterQueue == $clusterQueue) | "\(.metadata.namespace)/\(.metadata.name)"')

# Find workloads linked to the local queues
for local_queue in $local_queues; do
  namespace=$(echo $local_queue | cut -d '/' -f 1)
  queue_name=$(echo $local_queue | cut -d '/' -f 2)

  echo "Checking workloads linked to local queue $queue_name in namespace $namespace..."

  oc get workloads --namespace $namespace -o json | jq -r --arg queueName "$queue_name" '.items[] | select(.spec.queueName == $queueName) | "\(.metadata.namespace)/\(.metadata.name)"'
done
```

3. Review individual workloads. Replace `namespace` and `workload-name` below to view details of a workload.

```bash
namespace=<namespace>
workload_name=<workload-name>
oc describe workload -n $namespace $workload_name
```

4. Consider reducing the cluster queue's nominal quota if resource usage is consistently low.
You can patch the ClusterQueue using the following command. Note that you must change the values to refer to the exact resource you want to change.
This example sets the nominal quota to 10 for the first resource (assumed here to be `cpu`) of the first flavor in the first resource group of the named cluster queue:

```bash
oc patch clusterqueue $cluster_queue --type='json' -p='[{"op": "replace", "path": "/spec/resourceGroups/0/flavors/0/resources/0/nominalQuota", "value": "10"}]'
```
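To see what the JSON patch path `/spec/resourceGroups/0/flavors/0/resources/0/nominalQuota` points at, here is a minimal sketch of a ClusterQueue spec (flavor names and values are illustrative; inspect the real object with `oc get clusterqueue <name> -o yaml`):

```yaml
spec:
  resourceGroups:             # /spec/resourceGroups
  - coveredResources: ["cpu", "memory"]
    flavors:                  # .../resourceGroups/0/flavors
    - name: default-flavor    # .../flavors/0
      resources:              # .../flavors/0/resources
      - name: cpu             # .../resources/0
        nominalQuota: "10"    # <- field replaced by the patch
      - name: memory
        nominalQuota: 36Gi
```

Adjust the `0` indices to match the resource group, flavor, and resource you actually want to change.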
docs/alerts/runbooks/pending-workload-pods.md
+37

# Pending Workload Pods

## Severity: Info

## Impact

Knowing that pods are in a prolonged pending state allows users to troubleshoot and fix any issues so that their workloads run successfully.

## Summary

This alert is triggered when a pod has been in the pending state for more than 3 days.

## Steps

1. Identify the pending pod in your project namespace. Replace `project-namespace` below with the name of your project namespace.

```bash
namespace=<project-namespace>
oc get pods -A --field-selector=status.phase=Pending # Show all pods in the cluster with Pending status
oc get pods -n $namespace --field-selector=status.phase=Pending # Show all pods in the specified namespace with Pending status
```

2. Get further details on the pod.

```bash
pod=<pod-name>
oc describe pod $pod -n $namespace
```

3. Review the pod logs to determine why it is in a pending state.

```bash
oc logs $pod -n $namespace
```

4. Review the pod events to determine why it is in a pending state.

```bash
oc get events --field-selector involvedObject.name=$pod --namespace=$namespace
```

5. Review the results of the steps above to determine the best course of action for successfully running the workload.
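Steps 2–4 above can be wrapped in one small helper so all three views are captured in a single pass. A minimal sketch; `triage_pending_pod` is a hypothetical name, not part of any existing tooling:

```shell
# Hypothetical helper that runs the describe/logs/events checks for one
# pending pod, so the output can be captured in a single transcript.
triage_pending_pod() {
  local namespace=$1 pod=$2
  echo "--- describe ---"
  oc describe pod "$pod" -n "$namespace"
  echo "--- logs ---"
  oc logs "$pod" -n "$namespace"
  echo "--- events ---"
  oc get events --field-selector "involvedObject.name=$pod" --namespace="$namespace"
}

# Usage: triage_pending_pod <project-namespace> <pod-name>
```

Note that a pod which never started its containers may have no logs; the events output is usually the most informative for Pending pods.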
docs/alerts/runbooks/resource-reservation-exceeds-quota.md
+56

# Resource Reservation Exceeds Quota

## Severity: Info

## Impact

Knowing that resources are over-requested allows the user to adjust the nominal quota or the resources requested by a workload.

## Summary

This alert is triggered when the resource reservation is 10 times the available nominal quota in a cluster queue.

## Steps

1. Check the current resource reservation for the cluster queue and ensure that the nominal quota for the resource in question is correctly configured. Replace `cluster-queue-name` below with the name of the cluster queue to describe.

```bash
cluster_queue=<cluster-queue-name>
oc describe clusterqueue $cluster_queue
```

- If you would like to view just the Flavors Reservation and Flavors Usage, you can use the following command:

```bash
oc describe clusterqueue $cluster_queue | awk '/Flavors Reservation:/,/^$/'
```

2. Review the workloads linked to the cluster queue to see if the requested resources are required.

```bash
# Find local queues linked to the cluster queue
local_queues=$(oc get localqueues --all-namespaces -o json | jq -r --arg clusterQueue "$cluster_queue" '.items[] | select(.spec.clusterQueue == $clusterQueue) | "\(.metadata.namespace)/\(.metadata.name)"')

# Find workloads linked to the local queues
for local_queue in $local_queues; do
  namespace=$(echo $local_queue | cut -d '/' -f 1)
  queue_name=$(echo $local_queue | cut -d '/' -f 2)

  echo "Checking workloads linked to local queue $queue_name in namespace $namespace..."

  oc get workloads --namespace $namespace -o json | jq -r --arg queueName "$queue_name" '.items[] | select(.spec.queueName == $queueName) | "\(.metadata.namespace)/\(.metadata.name)"'
done
```
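The loop above splits each `namespace/name` pair with `cut`; the splitting logic can be checked locally without a cluster (the queue names below are illustrative):

```shell
# Sample namespace/name pairs in the shape produced by the jq query above.
local_queues='team-a/training-queue
team-b/batch-queue'

# Split each pair into its namespace and queue-name parts.
for local_queue in $local_queues; do
  namespace=$(echo $local_queue | cut -d '/' -f 1)
  queue_name=$(echo $local_queue | cut -d '/' -f 2)
  echo "namespace=$namespace queue=$queue_name"
done
```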
3. Review individual workloads. Replace `namespace` and `workload-name` below to view details of a workload.

```bash
namespace=<namespace>
workload_name=<workload-name>
oc describe workload -n $namespace $workload_name
```

4. Consider increasing the cluster queue's nominal quota.
You can patch the ClusterQueue using the following command. Note that you must change the values to refer to the exact resource you want to change.
This example sets the nominal quota to 10 for the first resource (assumed here to be `cpu`) of the first flavor in the first resource group of the named cluster queue:

```bash
oc patch clusterqueue $cluster_queue --type='json' -p='[{"op": "replace", "path": "/spec/resourceGroups/0/flavors/0/resources/0/nominalQuota", "value": "10"}]'
```

5. Alternatively, consider altering the resources requested by the pending workloads, if possible.
