docs/administrator/reliability/guide.md (14 additions & 6 deletions)
@@ -232,13 +232,21 @@ The multi-burn-rate framework solves this by:
2. **Multiple time windows**: Each alert condition uses two windows: a long window and a short window (sized at 1/12th of the long window). The long window detects that error budget is being consumed at an unsustainable rate. The short window confirms the issue is still actively occurring, preventing alerts from firing on problems that have already resolved.
-3. **Severity-based thresholds**: The Google SRE Workbook defines three alert tiers based on the rate and duration of budget consumption:
+3. **Severity-based thresholds**: The framework defines four alert tiers across two severity levels, based on the rate and duration of budget consumption:
-| Severity | Burn Rate | Long Window | Short Window | Budget Consumed | Time to Exhaustion |
| 3x | 1 day | 2 hours | 10% | ~10 days | Address during business hours |
+| 1x | 3 days | 6 hours | 10% | 30 days | Plan remediation before window closes |
An alert fires only when **both** the long window and the short window exceed their burn-rate threshold simultaneously. This dual-window requirement eliminates false positives from brief spikes that have already resolved.
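
The figures in the two table rows above follow from simple arithmetic, sketched here in Python. This is an illustration only: it assumes a 30-day SLO period (inferred from the 1x row's 30-day time to exhaustion), and the function names are hypothetical rather than part of Karmada or any alerting library.

```python
from datetime import timedelta

# Assumption: a 30-day SLO period, inferred from the 1x row's exhaustion time.
SLO_PERIOD = timedelta(days=30)


def short_window(long_window: timedelta) -> timedelta:
    """The short window is sized at 1/12th of the long window."""
    return long_window / 12


def budget_consumed(burn_rate: float, long_window: timedelta) -> float:
    """Fraction of the error budget spent if burn_rate holds for the whole long window."""
    return burn_rate * (long_window / SLO_PERIOD)


def time_to_exhaustion(burn_rate: float) -> timedelta:
    """Time until the entire budget is gone at a constant burn rate."""
    return SLO_PERIOD / burn_rate


def should_fire(long_burn: float, short_burn: float, threshold: float) -> bool:
    """Dual-window rule: fire only when both windows exceed the threshold."""
    return long_burn > threshold and short_burn > threshold
```

For the 3x row, `short_window(timedelta(days=1))` gives 2 hours and `time_to_exhaustion(3)` gives 10 days; `should_fire(3.5, 0.4, 3)` is `False` because the short window shows the spike has already subsided.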
docs/runbooks/SLO/binding-sync-work-latency.md (19 additions & 11 deletions)
@@ -54,46 +54,54 @@ Binding sync latency determines how quickly scheduling decisions translate into
- **[SLO Overview Dashboard](https://grafana.com/grafana/dashboards/14643)**: Check the fleet-wide view to see if other SLOs are also burning budget, which may indicate a broader systemic issue.
-**2. Check for complex override policies.** Override policy evaluation happens during binding sync; complex policies with many rules are a common cause of latency.
+**2. Check binding sync duration breakdown:**
+
+```promql
+histogram_quantile(0.95,
+  sum by (le, result) (rate(binding_sync_work_duration_seconds_bucket[5m])))
+```
+
+**3. Check for complex override policies.** Override policy evaluation happens during binding sync; complex policies with many rules are a common cause of latency.
```bash
kubectl get clusteroverridepolicies -o yaml | grep -c "overrideRules"
kubectl get overridepolicies -A -o yaml | grep -c "overrideRules"
```
**4. Check network latency to member clusters.** Cross-region member clusters inherently have higher sync latency. If you have recently added cross-region clusters, consider adjusting the threshold.
-**5. Check the cluster controller workqueue:**
+**5. Check the cluster status controller workqueue:**
docs/runbooks/SLO/karmada-apiserver-latency.md (2 additions & 2 deletions)
@@ -27,7 +27,7 @@ Page-level alerts are not enabled for this SLO. Only ticket-level alerts will fi
## What This Alert Means
-Karmada API Server requests (excluding WATCH and APPLY operations) are taking longer than the configured latency threshold at an elevated rate. This indicates performance degradation in the control plane.
+Karmada API Server requests are taking longer than the configured latency threshold at an elevated rate. This indicates performance degradation in the control plane.
docs/runbooks/SLO/karmada-scheduler-availability.md (6 additions & 6 deletions)
@@ -37,7 +37,7 @@ This alert uses the **multi-burn rate, multi-window** alerting framework ([descr
## What This Alert Means
-The Karmada scheduler is failing to successfully schedule workloads to member clusters (Stage 2: Scheduling). The scheduler decides which clusters should run your workloads based on your propagation policies. This SLO tracks `result="error"` (actual scheduler failures), not `result="unschedulable"` (constraint mismatches).
+The Karmada scheduler is failing to successfully schedule workloads to member clusters (Stage 2: Scheduling). The scheduler decides which clusters should run your workloads based on your propagation policies. This SLO tracks `result="error"` on the `karmada_scheduler_schedule_attempts_total` metric. Note that this includes both system-level scheduler failures and unschedulable outcomes (e.g., no clusters matching placement constraints).
Review the event messages to understand why the scheduling failed and determine the fix.
-**3. Distinguish errors from unschedulable results:**
+**3. Check the scheduling error rate and breakdown by type:**
```promql
-sum by (result) (rate(karmada_scheduler_schedule_attempts_total[5m]))
+sum by (result, schedule_type) (rate(karmada_scheduler_schedule_attempts_total[5m]))
```
-If `result="unschedulable"` is high, that is a different issue (cluster capacity or policy constraints), not a scheduler error.
+This metric reports `result="scheduled"` for successes and `result="error"` for failures. The `schedule_type` label helps distinguish first-time scheduling from rescheduling. Note that `result="error"` includes both system-level failures (e.g., estimator unreachable) and constraint mismatches (e.g., no eligible clusters), so check events and logs to distinguish them.
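
As a rough illustration of reading the breakdown query's output, the Python sketch below computes an error ratio per `schedule_type`. The sample rates and the `type_a`/`type_b` label values are hypothetical placeholders, not real Karmada output, and the helper name is an assumption made for this example.

```python
from collections import defaultdict

# Hypothetical per-second rates keyed by (result, schedule_type), shaped like
# the output of the breakdown query. Label values here are placeholders.
sample_rates = {
    ("scheduled", "type_a"): 9.5,
    ("error", "type_a"): 0.5,
    ("scheduled", "type_b"): 4.0,
    ("error", "type_b"): 1.0,
}


def error_ratio_by_type(samples):
    """Return error rate divided by total attempt rate for each schedule_type."""
    totals = defaultdict(float)
    errors = defaultdict(float)
    for (result, schedule_type), rate in samples.items():
        totals[schedule_type] += rate
        if result == "error":
            errors[schedule_type] += rate
    # Skip types with no traffic to avoid dividing by zero.
    return {t: errors[t] / totals[t] for t in totals if totals[t] > 0}


ratios = error_ratio_by_type(sample_rates)
```

A ratio concentrated in one `schedule_type` narrows where to look, but it cannot separate system-level failures from constraint mismatches; events and logs are still needed for that.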
| Member cluster API server slow | Address member cluster API server performance |
| High network latency (cross-region) | Adjust the latency threshold to match expected network latency |
| Estimator CPU-constrained | Increase CPU limits for the estimator pod |
-| Very large member cluster (many pods) | Consider capacity caching in the estimator (if supported) |
+| Very large member cluster (many pods) | Increase estimator `--parallelism` to speed up computation; increase `--kube-api-qps` and `--kube-api-burst` if API throttling is occurring |
docs/runbooks/SLO/karmada-scheduler-latency.md (3 additions & 3 deletions)
@@ -27,7 +27,7 @@ Page-level alerts are not enabled for this SLO. Only ticket-level alerts will fi
## What This Alert Means
-End-to-end scheduling operations (Stage 2: Scheduling) are exceeding the configured latency threshold at an elevated rate. This measures the time from when a ResourceBinding enters the scheduling queue to when a placement decision is made.
+End-to-end scheduling operations (Stage 2: Scheduling) are exceeding the configured latency threshold at an elevated rate.
## Impact
@@ -97,7 +97,7 @@ kubectl top pod -n karmada-system -l app=karmada-scheduler
Check the logs using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):