Commit e636617

CARRY: Adding runbooks for alerts

1 parent bbcb742

5 files changed: +192 −0 lines changed

config/rhoai/prometheus_rule.yaml
+4

```diff
@@ -15,6 +15,7 @@ spec:
       annotations:
         summary: "Kueue pod is down ({{ $labels.pod }})"
         description: "The Kueue pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is not ready."
+        triage: "https://github.com/opendatahub-io/kueue/tree/dev/docs/alerts/runbooks/kueue-pod-down.md"
   - name: kueue-info-alerts
     rules:
     - alert: LowClusterQueueResourceUsage
@@ -25,6 +26,7 @@ spec:
       annotations:
         summary: Low {{ $labels.resource }} resource usage in cluster queue {{ $labels.cluster_queue }}
         description: The {{ $labels.resource }} resource usage in cluster queue {{ $labels.cluster_queue }} is below 20% of its nominal quota for more than 1 day.
+        triage: "https://github.com/opendatahub-io/kueue/tree/dev/docs/alerts/runbooks/low-cluster-queue-resource-usage.md"
     - alert: ResourceReservationExceedsQuota
       expr: (sum(kueue_cluster_queue_resource_reservation) by (resource, cluster_queue)) / 10 > (sum(kueue_cluster_queue_nominal_quota) by (resource, cluster_queue))
       for: 10m
@@ -33,6 +35,7 @@ spec:
       annotations:
         summary: Resource {{ $labels.resource }} reservation far exceeds the available quota in cluster queue {{ $labels.cluster_queue }}
         description: Resource {{ $labels.resource }} reservation is 10 times the available quota in cluster queue {{ $labels.cluster_queue }}
+        triage: "https://github.com/opendatahub-io/kueue/tree/dev/docs/alerts/runbooks/resource-reservation-exceeds-quota.md"
     - alert: PendingWorkloadPods
       expr: (sum by (namespace, pod) (sum_over_time(kube_pod_status_phase{phase="Pending"}[3d])) >= 3 * 24 * 60) > 0
       for: 1m
@@ -41,4 +44,5 @@ spec:
       annotations:
         summary: Pod {{ $labels.pod }} in the {{ $labels.namespace }} namespace has been pending for more than 3 days
         description: A pod {{ $labels.pod }} in the {{ $labels.namespace }} namespace has been in the pending state for more than 3 days.
+        triage: "https://github.com/opendatahub-io/kueue/tree/dev/docs/alerts/runbooks/pending-workload-pods.md"
```
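The `PendingWorkloadPods` expression compares a `sum_over_time` of `kube_pod_status_phase{phase="Pending"}` samples against `3 * 24 * 60`. Assuming a 1-minute scrape interval (our reading of the constant, not stated in the rule), a pod that has been Pending for the entire 3-day window contributes one sample per minute:

```shell
# Threshold used by the PendingWorkloadPods alert: one sample per
# scrape over a 3-day window, assuming a 1-minute scrape interval
# (an assumption about the Prometheus config, not stated in the rule).
threshold=$((3 * 24 * 60))
echo "$threshold"  # prints 4320
```

If the scrape interval differs from one minute, the constant no longer corresponds to exactly 3 days of Pending samples.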

docs/alerts/runbooks/kueue-pod-down.md
+42

# Kueue Pod Down

## Severity: Critical

## Impact

Any workloads running on the cluster will not be able to use the Kueue component.

## Summary

This alert is triggered when the `kube_pod_status_ready` query shows that the Kueue controller pod is not ready.

## Steps

1. Check whether the `kueue-controller` pod is running in the `redhat-ods-applications` namespace:

```bash
$ oc -n redhat-ods-applications get pods -l app.kubernetes.io/name=kueue
```

2. If the pod is not running, inspect the pod's logs and events to identify what may be causing the issue. Make sure to capture the logs and events so they can be shared with the engineering team later:

```bash
# Check pod logs
$ oc -n redhat-ods-applications logs -l app.kubernetes.io/name=kueue --prefix=true

# Check events
$ oc -n redhat-ods-applications get events | grep pod/kueue-controller

# Check pod status fields
$ oc -n redhat-ods-applications get pods -l app.kubernetes.io/name=kueue -o jsonpath="{range .items[*]}{.status}{\"\n\n\"}{end}"
```
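The `grep pod/kueue-controller` filter above simply narrows the events listing to the controller pod. A quick local illustration on fabricated sample lines (the event text below is made up; real output comes from `oc get events`):

```shell
# Fabricated sample of `oc get events` output, for illustration only.
events='2m  Normal   Scheduled  pod/kueue-controller-abc  Successfully assigned
5m  Warning  BackOff    pod/other-app-xyz         Back-off restarting container
9m  Warning  Failed     pod/kueue-controller-abc  Error: ImagePullBackOff'

# Keep only the events that mention the Kueue controller pod.
printf '%s\n' "$events" | grep pod/kueue-controller
```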
3. Redeploy the Kueue operator by restarting the deployment:

```bash
$ oc -n redhat-ods-applications rollout restart deployments/kueue-controller-manager
```

This should result in a new pod being deployed; repeat step 1 and check whether the pod reaches the Running state.

4. If the problem persists, capture the logs and escalate to the RHOAI engineering team.
docs/alerts/runbooks/low-cluster-queue-resource-usage.md
+53

# Low Cluster Queue Resource Usage

## Severity: Info

## Impact

Resources that are consistently unused can be redistributed.

## Summary

This alert is triggered when the resource usage in a cluster queue is below 20% of its nominal quota for more than 1 day.

## Steps

1. Check the current resource usage for the cluster queue and ensure that the nominal quota for the resource in question is correctly configured. Replace `cluster-queue-name` below with the name of the cluster queue to describe.

```bash
cluster_queue=<cluster-queue-name>
oc describe clusterqueue $cluster_queue
```

- If you would like to view just the Flavors and Nominal Quota, you can use the following command:

```bash
oc describe clusterqueue $cluster_queue | awk '/Flavors:/,/^$/'
```
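The `awk '/Flavors:/,/^$/'` range pattern prints everything from the `Flavors:` line through the next blank line. A quick local check on a made-up fragment of `oc describe clusterqueue` output (field names mirror real output, values are illustrative):

```shell
# Fabricated fragment of `oc describe clusterqueue` output.
describe_output='Spec:
  Flavors:
    Name:  default-flavor
    Resources:
      Name:           cpu
      Nominal Quota:  8

Status:
  Admitted Workloads:  0'

# The range pattern selects from "Flavors:" up to the first blank line,
# so the Status section is dropped.
printf '%s\n' "$describe_output" | awk '/Flavors:/,/^$/'
```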
2. Review the workloads linked to the cluster queue to see if the assigned resources are required.

```bash
# Find local queues linked to the cluster queue
local_queues=$(oc get localqueues --all-namespaces -o json | jq -r --arg clusterQueue "$cluster_queue" '.items[] | select(.spec.clusterQueue == $clusterQueue) | "\(.metadata.namespace)/\(.metadata.name)"')

# Find workloads linked to the local queues
for local_queue in $local_queues; do
  namespace=$(echo $local_queue | cut -d '/' -f 1)
  queue_name=$(echo $local_queue | cut -d '/' -f 2)

  echo "Checking workloads linked to local queue $queue_name in namespace $namespace..."

  oc get workloads --namespace $namespace -o json | jq -r --arg queueName "$queue_name" '.items[] | select(.spec.queueName == $queueName) | "\(.metadata.namespace)/\(.metadata.name)"'
done
```

3. Review individual workloads. Replace `namespace` and `workload-name` below to view details of a workload.

```bash
namespace=<namespace>
workload_name=<workload-name>
oc describe workload -n $namespace $workload_name
```

4. Consider reducing the cluster queue's nominal quota if resource usage is consistently low.
You can patch the ClusterQueue using the following command. Note that you must change the values to refer to the exact resource you want to change.
This example sets the nominal quota to 10 for the first resource (assumed here to be `cpu`) of the first flavor in the first resource group of the named cluster queue:

```bash
oc patch clusterqueue $cluster_queue --type='json' -p='[{"op": "replace", "path": "/spec/resourceGroups/0/flavors/0/resources/0/nominalQuota", "value": "10"}]'
```
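To see what the JSON patch path `/spec/resourceGroups/0/flavors/0/resources/0/nominalQuota` points at, here is a minimal sketch of a ClusterQueue spec (flavor names and values are illustrative; inspect the real object with `oc get clusterqueue <name> -o yaml`):

```yaml
spec:
  resourceGroups:             # /spec/resourceGroups
  - coveredResources: ["cpu", "memory"]
    flavors:                  # .../resourceGroups/0/flavors
    - name: default-flavor    # .../flavors/0
      resources:              # .../flavors/0/resources
      - name: cpu             # .../resources/0
        nominalQuota: "10"    # <- field replaced by the patch
      - name: memory
        nominalQuota: 36Gi
```

Adjust the `0` indices to match the resource group, flavor, and resource you actually want to change.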
docs/alerts/runbooks/pending-workload-pods.md
+37

# Pending Workload Pods

## Severity: Info

## Impact

Knowing that pods are in a prolonged pending state allows users to troubleshoot and fix any issues so that their workloads run successfully.

## Summary

This alert is triggered when a pod has been in the pending state for more than 3 days.

## Steps

1. Identify the pending pod in your project namespace. Replace `project-namespace` below with the name of your project namespace.

```bash
namespace=<project-namespace>
oc get pods -A --field-selector=status.phase=Pending # Show all pods in the cluster with Pending status
oc get pods -n $namespace --field-selector=status.phase=Pending # Show all pods in the specified namespace with Pending status
```

2. Get further details on the pod.

```bash
pod=<pod-name>
oc describe pod $pod -n $namespace
```

3. Review the pod logs to determine why it is in a pending state.

```bash
oc logs $pod -n $namespace
```

4. Review the pod events to determine why it is in a pending state.

```bash
oc get events --field-selector involvedObject.name=$pod --namespace=$namespace
```

5. Review the results of the steps above to determine the best course of action for successfully running the workload.
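Steps 2–4 above can be wrapped in one small helper so all three views are captured in a single pass. A minimal sketch; `triage_pending_pod` is a hypothetical name, not part of any existing tooling:

```shell
# Hypothetical helper that runs the describe/logs/events checks for one
# pending pod, so the output can be captured in a single transcript.
triage_pending_pod() {
  local namespace=$1 pod=$2
  echo "--- describe ---"
  oc describe pod "$pod" -n "$namespace"
  echo "--- logs ---"
  oc logs "$pod" -n "$namespace"
  echo "--- events ---"
  oc get events --field-selector "involvedObject.name=$pod" --namespace="$namespace"
}

# Usage: triage_pending_pod <project-namespace> <pod-name>
```

Note that a pod which never started its containers may have no logs; the events output is usually the most informative for Pending pods.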
docs/alerts/runbooks/resource-reservation-exceeds-quota.md
+56

# Resource Reservation Exceeds Quota

## Severity: Info

## Impact

Knowing that resources are over-requested allows the user to adjust the nominal quota or the resources requested by a workload.

## Summary

This alert is triggered when the resource reservation is 10 times the available nominal quota in a cluster queue.

## Steps

1. Check the current resource reservation for the cluster queue and ensure that the nominal quota for the resource in question is correctly configured. Replace `cluster-queue-name` below with the name of the cluster queue to describe.

```bash
cluster_queue=<cluster-queue-name>
oc describe clusterqueue $cluster_queue
```

- If you would like to view just the Flavors Reservation and Flavors Usage, you can use the following command:

```bash
oc describe clusterqueue $cluster_queue | awk '/Flavors Reservation:/,/^$/'
```

2. Review the workloads linked to the cluster queue to see if the requested resources are required.

```bash
# Find local queues linked to the cluster queue
local_queues=$(oc get localqueues --all-namespaces -o json | jq -r --arg clusterQueue "$cluster_queue" '.items[] | select(.spec.clusterQueue == $clusterQueue) | "\(.metadata.namespace)/\(.metadata.name)"')

# Find workloads linked to the local queues
for local_queue in $local_queues; do
  namespace=$(echo $local_queue | cut -d '/' -f 1)
  queue_name=$(echo $local_queue | cut -d '/' -f 2)

  echo "Checking workloads linked to local queue $queue_name in namespace $namespace..."

  oc get workloads --namespace $namespace -o json | jq -r --arg queueName "$queue_name" '.items[] | select(.spec.queueName == $queueName) | "\(.metadata.namespace)/\(.metadata.name)"'
done
```
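The loop above splits each `namespace/name` pair with `cut`; the splitting logic can be checked locally without a cluster (the queue names below are illustrative):

```shell
# Sample namespace/name pairs in the shape produced by the jq query above.
local_queues='team-a/training-queue
team-b/batch-queue'

# Split each pair into its namespace and queue-name parts.
for local_queue in $local_queues; do
  namespace=$(echo $local_queue | cut -d '/' -f 1)
  queue_name=$(echo $local_queue | cut -d '/' -f 2)
  echo "namespace=$namespace queue=$queue_name"
done
```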
3. Review individual workloads. Replace `namespace` and `workload-name` below to view details of a workload.

```bash
namespace=<namespace>
workload_name=<workload-name>
oc describe workload -n $namespace $workload_name
```

4. Consider increasing the cluster queue's nominal quota.
You can patch the ClusterQueue using the following command. Note that you must change the values to refer to the exact resource you want to change.
This example sets the nominal quota to 10 for the first resource (assumed here to be `cpu`) of the first flavor in the first resource group of the named cluster queue:

```bash
oc patch clusterqueue $cluster_queue --type='json' -p='[{"op": "replace", "path": "/spec/resourceGroups/0/flavors/0/resources/0/nominalQuota", "value": "10"}]'
```

5. Alternatively, consider altering the resources requested by the pending workloads, if possible.
