Onwards

jabellard · jabellard · commit a97d9acfc8b9 · 2026-05-03T22:27:39.000-04:00
Signed-off-by: Joe Nathan Abellard <contact@jabellard.com>
diff --git a/docs/administrator/reliability/guide.md b/docs/administrator/reliability/guide.md
@@ -232,13 +232,21 @@ The multi-burn-rate framework solves this by:
 
 2. **Multiple time windows**: Each alert condition uses two windows — a long window and a short window (sized at 1/12th of the long window). The long window detects that error budget is being consumed at an unsustainable rate. The short window confirms the issue is still actively occurring, preventing alerts from firing on problems that have already resolved.
 
-3. **Severity-based thresholds**: The Google SRE Workbook defines three alert tiers based on the rate and duration of budget consumption:
+3. **Severity-based thresholds**: The framework defines four alert tiers across two severity levels, based on the rate and duration of budget consumption:
 
-   | Severity | Burn Rate | Long Window | Short Window | Budget Consumed | Time to Exhaustion |
-   |----------|-----------|-------------|--------------|-----------------|-------------------|
-   | Page | 14.4x | 1 hour | 5 minutes | 2% | ~50 hours |
-   | Page | 6x | 6 hours | 30 minutes | 5% | ~5 days |
-   | Ticket | 1x | 3 days | 6 hours | 10% | 30 days |
+   **Page (critical):**
+
+   | Burn Rate | Long Window | Short Window | Budget Consumed | Time to Exhaustion | Action |
+   |-----------|-------------|--------------|-----------------|-------------------|--------|
+   | 14.4x | 1 hour | 5 minutes | 2% | ~2 days | Immediate action required |
+   | 6x | 6 hours | 30 minutes | 5% | ~5 days | Investigate and fix promptly |
+
+   **Ticket (warning):**
+
+   | Burn Rate | Long Window | Short Window | Budget Consumed | Time to Exhaustion | Action |
+   |-----------|-------------|--------------|-----------------|-------------------|--------|
+   | 3x | 1 day | 2 hours | 10% | ~10 days | Address during business hours |
+   | 1x | 3 days | 6 hours | 10% | 30 days | Plan remediation before window closes |
 
    An alert fires only when **both** the long window and the short window exceed their burn-rate threshold simultaneously. This dual-window requirement eliminates false positives from brief spikes that have already resolved.
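
The "Budget Consumed" and "Time to Exhaustion" columns above can be derived from the burn rate and the long window. The following is a minimal Python sketch of that arithmetic and of the dual-window firing rule, assuming a 30-day SLO period (implied by the 1x row exhausting the budget in 30 days). The names `Tier`, `budget_consumed`, `days_to_exhaustion`, and `should_fire` are illustrative only and are not part of Karmada or the guide.

```python
from dataclasses import dataclass

# Assumption: the SLO is evaluated over a 30-day period.
SLO_PERIOD_HOURS = 30 * 24


@dataclass(frozen=True)
class Tier:
    severity: str
    burn_rate: float           # multiple of the sustainable consumption rate
    long_window_hours: float
    short_window_hours: float  # sized at 1/12th of the long window


# The four tiers from the tables above.
TIERS = [
    Tier("page", 14.4, 1, 5 / 60),
    Tier("page", 6, 6, 0.5),
    Tier("ticket", 3, 24, 2),
    Tier("ticket", 1, 72, 6),
]


def budget_consumed(tier: Tier) -> float:
    """Fraction of the error budget spent if the burn rate holds for the long window."""
    return tier.burn_rate * tier.long_window_hours / SLO_PERIOD_HOURS


def days_to_exhaustion(tier: Tier) -> float:
    """Days until the whole budget is gone if the burn rate is sustained."""
    return SLO_PERIOD_HOURS / tier.burn_rate / 24


def should_fire(tier: Tier, long_burn: float, short_burn: float) -> bool:
    """Fire only when both windows exceed the tier's burn-rate threshold."""
    return long_burn > tier.burn_rate and short_burn > tier.burn_rate


if __name__ == "__main__":
    for tier in TIERS:
        print(
            f"{tier.severity:6} {tier.burn_rate:>5}x "
            f"budget={budget_consumed(tier):.0%} "
            f"exhaustion={days_to_exhaustion(tier):.1f} days"
        )
```

Under the 30-day assumption this reproduces the table: 2%, 5%, 10%, and 10% of budget consumed, and roughly 2.1, 5, 10, and 30 days to exhaustion. A spike that has already resolved leaves the short-window burn rate below the threshold, so `should_fire` returns `False` even when the long window is still elevated.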
 
diff --git a/docs/runbooks/SLO/binding-sync-work-availability.md b/docs/runbooks/SLO/binding-sync-work-availability.md
@@ -94,7 +94,7 @@ done
 **5. Check RBAC permissions:**
 
 ```bash
-kubectl logs -n karmada-system -l app=karmada-controller-manager | grep -i "forbidden\|unauthorized\|rbac"
+kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "forbidden\|unauthorized\|rbac"
 ```
 
 **6. Check override policy application.** Override policies are applied during Work creation; failures cause binding sync errors.
@@ -104,17 +104,23 @@ kubectl get overridepolicies -A
 kubectl get clusteroverridepolicies
 ```
 
-**7. Check the binding controller workqueue:**
+**7. Check the binding sync error rate:**
 
 ```promql
-workqueue_depth{name=~"binding.*|clusterbinding.*"}
+sum by (result) (rate(binding_sync_work_duration_seconds_count[5m]))
+```
+
+**8. Check the binding controller workqueue:**
+
+```promql
+workqueue_depth{name=~"binding-controller|cluster-resource-binding-controller"}
 
-rate(workqueue_retries_total{name=~"binding.*"}[5m])
+rate(workqueue_retries_total{name=~"binding-controller|cluster-resource-binding-controller"}[5m])
 /
-rate(workqueue_adds_total{name=~"binding.*"}[5m])
+rate(workqueue_adds_total{name=~"binding-controller|cluster-resource-binding-controller"}[5m])
 ```
 
-**8. Check for recent changes.** Were override policies modified? Were new member clusters registered? Were execution namespaces deleted?
+**9. Check for recent changes.** Were override policies modified? Were new member clusters registered? Were execution namespaces deleted?
 
 ### Mitigation
 
diff --git a/docs/runbooks/SLO/binding-sync-work-latency.md b/docs/runbooks/SLO/binding-sync-work-latency.md
@@ -54,46 +54,54 @@ Binding sync latency determines how quickly scheduling decisions translate into
 - **[SLO Overview Dashboard](https://grafana.com/grafana/dashboards/14643)** — Check the fleet-wide view to see if other SLOs are also burning budget, which may indicate a broader systemic issue.
 
 
-**2. Check for complex override policies.** Override policy evaluation happens during binding sync; complex policies with many rules are a common cause of latency.
+**2. Check binding sync duration breakdown:**
+
+```promql
+histogram_quantile(0.95,
+  sum by (le, result) (rate(binding_sync_work_duration_seconds_bucket[5m])))
+```
+
+**3. Check for complex override policies.** Override policy evaluation happens during binding sync; complex policies with many rules are a common cause of latency.
 
 ```bash
 kubectl get clusteroverridepolicies -o yaml | grep -c "overrideRules"
 kubectl get overridepolicies -A -o yaml | grep -c "overrideRules"
 ```
 
-**3. Check the binding controller workqueue:**
+**4. Check the binding controller workqueue:**
 
 ```promql
-workqueue_depth{name=~"binding.*|clusterbinding.*"}
+workqueue_depth{name=~"binding-controller|cluster-resource-binding-controller"}
 
 histogram_quantile(0.95,
-  rate(workqueue_queue_duration_seconds_bucket{name=~"binding.*"}[5m]))
+  sum by (le) (rate(workqueue_queue_duration_seconds_bucket{name=~"binding-controller|cluster-resource-binding-controller"}[5m])))
 
 histogram_quantile(0.95,
-  rate(workqueue_work_duration_seconds_bucket{name=~"binding.*"}[5m]))
+  sum by (le) (rate(workqueue_work_duration_seconds_bucket{name=~"binding-controller|cluster-resource-binding-controller"}[5m])))
 ```
 
-**4. Check for large resource manifests.** Large resource manifests take longer to process and create as Work objects.
+**5. Check binding volume.** A high number of bindings being processed simultaneously increases controller load.
 
 ```bash
-kubectl get resourcebindings -A -o json | jq '.items | map({name: .metadata.name, namespace: .metadata.namespace}) | .[0:10]'
+kubectl get resourcebindings -A --no-headers | wc -l
+kubectl get clusterresourcebindings --no-headers | wc -l
 ```
 
-**5. Check controller-manager resource usage:**
+**6. Check controller-manager resource usage:**
 
 ```bash
 kubectl top pod -n karmada-system -l app=karmada-controller-manager
 ```
 
-**6. Check controller-manager logs:**
+**7. Check controller-manager logs:**
 
 Check the logs using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):
 
 ```bash
-kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "SyncWork\|slow\|retry"
+kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "Sync work\|failed"
 ```
 
-**7. Check for recent changes.** Were override policies added or modified? Did the number of propagated resources increase significantly?
+**8. Check for recent changes.** Were override policies added or modified? Did the number of propagated resources increase significantly?
 
 ### Mitigation
 
diff --git a/docs/runbooks/SLO/cluster-sync-latency.md b/docs/runbooks/SLO/cluster-sync-latency.md
@@ -73,13 +73,13 @@ kubectl run conn-test --rm -it --image=curlimages/curl --namespace=karmada-syste
 
 **4. Check network latency to member clusters.** Cross-region member clusters inherently have higher sync latency. If you have recently added cross-region clusters, consider adjusting the threshold.
 
-**5. Check the cluster controller workqueue:**
+**5. Check the cluster status controller workqueue:**
 
 ```promql
-workqueue_depth{name="cluster"}
+workqueue_depth{name="cluster-status-controller"}
 
 histogram_quantile(0.95,
-  rate(workqueue_queue_duration_seconds_bucket{name="cluster"}[5m]))
+  sum by (le) (rate(workqueue_queue_duration_seconds_bucket{name="cluster-status-controller"}[5m])))
 ```
 
 **6. Check controller-manager resource usage:**
@@ -99,7 +99,7 @@ kubectl get clusters --no-headers | wc -l
 Check the logs using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):
 
 ```bash
-kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "cluster status\|slow\|sync"
+kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "cluster\|failed\|sync"
 ```
 
 **9. Check for recent changes.** Were new member clusters added (especially cross-region)? Were there network path changes?
diff --git a/docs/runbooks/SLO/karmada-apiserver-availability.md b/docs/runbooks/SLO/karmada-apiserver-availability.md
@@ -108,7 +108,7 @@ kubectl logs -n karmada-system <etcd-pod> --tail=50
 Check etcd request latency:
 
 ```promql
-histogram_quantile(0.99, rate(etcd_request_duration_seconds_bucket[5m]))
+histogram_quantile(0.99, sum by (le) (rate(etcd_request_duration_seconds_bucket[5m])))
 ```
 
 **8. If 429s are dominant, identify top request sources:**
diff --git a/docs/runbooks/SLO/karmada-apiserver-latency.md b/docs/runbooks/SLO/karmada-apiserver-latency.md
@@ -27,7 +27,7 @@ Page-level alerts are not enabled for this SLO. Only ticket-level alerts will fi
 
 ## What This Alert Means
 
-Karmada API Server requests (excluding WATCH and APPLY operations) are taking longer than the configured latency threshold at an elevated rate. This indicates performance degradation in the control plane.
+Karmada API Server requests are taking longer than the configured latency threshold at an elevated rate. This indicates performance degradation in the control plane.
 
 ## Impact
 
@@ -98,7 +98,7 @@ apiserver_current_inflight_requests
 
 **7. Check API Server logs:**
 
-Check the Karmada API Server logs for slow requests using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):
+Check the Karmada API Server logs for errors using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):
 
 ```bash
 kubectl logs -n karmada-system -l app=karmada-apiserver --tail=100
diff --git a/docs/runbooks/SLO/karmada-scheduler-availability.md b/docs/runbooks/SLO/karmada-scheduler-availability.md
@@ -37,7 +37,7 @@ This alert uses the **multi-burn rate, multi-window** alerting framework ([descr
 
 ## What This Alert Means
 
-The Karmada scheduler is failing to successfully schedule workloads to member clusters (Stage 2: Scheduling). The scheduler decides which clusters should run your workloads based on your propagation policies. This SLO tracks `result="error"` (actual scheduler failures), not `result="unschedulable"` (constraint mismatches).
+The Karmada scheduler is failing to successfully schedule workloads to member clusters (Stage 2: Scheduling). The scheduler decides which clusters should run your workloads based on your propagation policies. This SLO tracks `result="error"` on the `karmada_scheduler_schedule_attempts_total` metric — note that this includes both system-level scheduler failures and unschedulable outcomes (e.g., no clusters matching placement constraints).
 
 ## Impact
 
@@ -73,13 +73,13 @@ kubectl get events -n karmada-system --field-selector reason=ScheduleBindingFail
 
 Review the event messages to understand why the scheduling failed and determine the fix.
 
-**3. Distinguish errors from unschedulable results:**
+**3. Check the scheduling error rate and breakdown by type:**
 
 ```promql
-sum by (result) (rate(karmada_scheduler_schedule_attempts_total[5m]))
+sum by (result, schedule_type) (rate(karmada_scheduler_schedule_attempts_total[5m]))
 ```
 
-If `result="unschedulable"` is high, that is a different issue (cluster capacity or policy constraints), not a scheduler error.
+This metric reports `result="scheduled"` for successes and `result="error"` for failures. The `schedule_type` label helps distinguish first-time scheduling from rescheduling. Note that `result="error"` includes both system-level failures (e.g., estimator unreachable) and constraint mismatches (e.g., no eligible clusters) — check events and logs to distinguish them.
 
 **4. Check the scheduler pod health:**
 
@@ -88,7 +88,7 @@ kubectl get pods -n karmada-system -l app=karmada-scheduler
 kubectl logs -n karmada-system -l app=karmada-scheduler --tail=200 | grep -i "error\|failed\|panic"
 ```
 
-**5. Check for plugin execution failures:**
+**5. Check for slow plugins.** Slow plugins can cause scheduling timeouts that surface as errors.
 
 ```promql
 histogram_quantile(0.99,
@@ -134,7 +134,7 @@ rate(karmada_scheduler_queue_incoming_bindings_total[5m]) by (event)
 | Plugin failures in logs | Identify failing plugin; check plugin configuration |
 | Estimator unreachable | Check estimator pod health and network connectivity |
 | All clusters NotReady | Address cluster connectivity issues first |
-| High unschedulable rate | Review PropagationPolicy placement rules and cluster capacity |
+| No matching clusters for placement rules | Review PropagationPolicy `clusterAffinity` and `spreadConstraints`; check cluster labels |
 
 ## Related Resources
 
diff --git a/docs/runbooks/SLO/karmada-scheduler-estimator-availability.md b/docs/runbooks/SLO/karmada-scheduler-estimator-availability.md
@@ -88,7 +88,7 @@ kubectl describe cluster <cluster-name>
 cluster_ready_state == 0
 ```
 
-**5. Check network connectivity between scheduler and estimators:**
+**5. Check estimator service existence:**
 
 ```bash
 kubectl get svc -n karmada-system | grep estimator
@@ -99,7 +99,7 @@ kubectl get svc -n karmada-system | grep estimator
 Estimator failures surface in the scheduler as `ScheduleBindingFailed` events. Check the scheduler logs for error messages referencing the estimator:
 
 ```bash
-kubectl logs -n karmada-system -l app=karmada-scheduler --tail=200 | grep -i "estimator\|error"
+kubectl logs -n karmada-system -l app=karmada-scheduler --tail=200 | grep -i "estimator"
 ```
 
 **7. Check Kubernetes events for scheduling failures related to estimator issues:**
diff --git a/docs/runbooks/SLO/karmada-scheduler-estimator-latency.md b/docs/runbooks/SLO/karmada-scheduler-estimator-latency.md
@@ -93,7 +93,7 @@ kubectl --context=<member-cluster-context> get pods -A --no-headers | wc -l
 Check the logs using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):
 
 ```bash
-kubectl logs -n karmada-system <estimator-pod> --tail=200 | grep -i "slow\|latency\|error"
+kubectl logs -n karmada-system <estimator-pod> --tail=200 | grep -i "error\|failed"
 ```
 
 **8. Check for recent changes.** Were new member clusters added? Did cluster sizes grow significantly? Were network paths changed?
@@ -105,7 +105,7 @@ kubectl logs -n karmada-system <estimator-pod> --tail=200 | grep -i "slow\|laten
 | Member cluster API server slow | Address member cluster API server performance |
 | High network latency (cross-region) | Adjust the latency threshold to match expected network latency |
 | Estimator CPU-constrained | Increase CPU limits for the estimator pod |
-| Very large member cluster (many pods) | Consider capacity caching in the estimator (if supported) |
+| Very large member cluster (many pods) | Increase estimator `--parallelism` to speed up computation; increase `--kube-api-qps` and `--kube-api-burst` if API throttling is occurring |
 
 ## Related Resources
 
diff --git a/docs/runbooks/SLO/karmada-scheduler-latency.md b/docs/runbooks/SLO/karmada-scheduler-latency.md
@@ -27,7 +27,7 @@ Page-level alerts are not enabled for this SLO. Only ticket-level alerts will fi
 
 ## What This Alert Means
 
-End-to-end scheduling operations (Stage 2: Scheduling) are exceeding the configured latency threshold at an elevated rate. This measures the time from when a ResourceBinding enters the scheduling queue to when a placement decision is made.
+End-to-end scheduling operations (Stage 2: Scheduling) are exceeding the configured latency threshold at an elevated rate.
 
 ## Impact
 
@@ -97,7 +97,7 @@ kubectl top pod -n karmada-system -l app=karmada-scheduler
 Check the logs using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):
 
 ```bash
-kubectl logs -n karmada-system -l app=karmada-scheduler --tail=200 | grep -i "slow\|latency\|error"
+kubectl logs -n karmada-system -l app=karmada-scheduler --tail=200 | grep -i "error"
 ```
 
 **8. Check for recent changes.** Were new clusters registered, new PropagationPolicies added, or scheduler configuration modified?
@@ -110,7 +110,7 @@ kubectl logs -n karmada-system -l app=karmada-scheduler --tail=200 | grep -i "sl
 | Slow plugins | Review plugin configuration; disable non-essential plugins if possible |
 | Many clusters to evaluate | Use `clusterAffinity` in PropagationPolicies to pre-filter clusters |
 | Scheduler CPU-constrained | Increase CPU limits |
-| High scheduling throughput | Consider adding scheduler replicas (check if scheduler supports it) |
+| High scheduling throughput | Add scheduler replicas (leader election is supported and enabled by default) |
 
 ## Related Resources
 
diff --git a/docs/runbooks/SLO/policy-apply-availability.md b/docs/runbooks/SLO/policy-apply-availability.md
@@ -53,9 +53,6 @@ This is a **critical** issue that prevents workload distribution from starting.
 - Missing or mismatched propagation policies
 - Invalid resource specifications
 - Misconfigured `resourceSelectors` causing policy failures
-- Conflicting policies (preemption issues)
-- Controller-manager disconnected from API server
-- RBAC errors preventing controller operations
 
 ## Remediation
 
@@ -81,43 +78,47 @@ Check the logs using your logging solution (e.g., kubectl logs, Loki, Elasticsea
 kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "policy\|error\|failed"
 ```
 
-**4. Look for specific policy errors:**
+**4. Check recent events and existing policies:**
 
 ```bash
 kubectl get events -n karmada-system --sort-by='.lastTimestamp' | tail -30
 kubectl get propagationpolicies -A
 kubectl get clusterpropagationpolicies
 ```
 
-**5. Check for conflicting policies:**
+**5. Check the policy apply error rate:**
+
+```promql
+sum by (result) (rate(policy_apply_attempts_total[5m]))
+```
+
+**6. Check for conflicting policies:**
 
 ```promql
 sum(rate(policy_preemption_total[5m])) by (result)
 ```
 
-**6. Check controller workqueue health:**
+**7. Check controller workqueue health:**
 
 ```promql
-workqueue_depth{name="propagationpolicy"}
-rate(workqueue_retries_total{name="propagationpolicy"}[5m])
+workqueue_depth{name=~"propagationPolicy reconciler|clusterPropagationPolicy reconciler"}
+rate(workqueue_retries_total{name=~"propagationPolicy reconciler|clusterPropagationPolicy reconciler"}[5m])
 ```
 
-**7. Check for resource template selector mismatches.** Misconfigured `resourceSelectors` can cause policy failures.
+**8. Check for resource template selector mismatches.** Misconfigured `resourceSelectors` can cause policy failures.
 
 ```bash
 kubectl get propagationpolicies -A -o yaml | grep -A 5 "resourceSelectors"
 ```
 
-**8. Check for recent changes.** Were any policies recently created or modified? Were new resource types introduced?
+**9. Check for recent changes.** Were any policies recently created or modified? Were new resource types introduced?
 
 ### Mitigation
 
 | Symptom | Action |
 |---------|--------|
 | Invalid policy configurations | Review and fix `resourceSelectors`, `placement`, `overrideRules` |
 | Policy conflicts (preemption spikes) | Review overlapping policies; use explicit `priority` to resolve conflicts |
-| Controller disconnected from API server | Check controller-manager pod health and API server connectivity |
-| RBAC errors in logs | Ensure controller-manager service account has required permissions |
 | Workqueue stuck / high depth | Restart the controller-manager pod after fixing the root cause |
 
 ## Related Resources
diff --git a/docs/runbooks/SLO/policy-apply-latency.md b/docs/runbooks/SLO/policy-apply-latency.md
diff --git a/docs/runbooks/SLO/work-sync-workload-availability.md b/docs/runbooks/SLO/work-sync-workload-availability.md
diff --git a/docs/runbooks/SLO/work-sync-workload-latency.md b/docs/runbooks/SLO/work-sync-workload-latency.md