Commit a97d9ac

Onwards
Signed-off-by: Joe Nathan Abellard <contact@jabellard.com>
1 parent 66c14b8 commit a97d9ac

14 files changed

Lines changed: 111 additions & 72 deletions

docs/administrator/reliability/guide.md

Lines changed: 14 additions & 6 deletions
@@ -232,13 +232,21 @@ The multi-burn-rate framework solves this by:
232232

233233
2. **Multiple time windows**: Each alert condition uses two windows — a long window and a short window (sized at 1/12th of the long window). The long window detects that error budget is being consumed at an unsustainable rate. The short window confirms the issue is still actively occurring, preventing alerts from firing on problems that have already resolved.
234234

235-
3. **Severity-based thresholds**: The Google SRE Workbook defines three alert tiers based on the rate and duration of budget consumption:
235+
3. **Severity-based thresholds**: The framework defines four alert tiers across two severity levels, based on the rate and duration of budget consumption:
236236

237-
| Severity | Burn Rate | Long Window | Short Window | Budget Consumed | Time to Exhaustion |
238-
|----------|-----------|-------------|--------------|-----------------|-------------------|
239-
| Page | 14.4x | 1 hour | 5 minutes | 2% | ~50 hours |
240-
| Page | 6x | 6 hours | 30 minutes | 5% | ~5 days |
241-
| Ticket | 1x | 3 days | 6 hours | 10% | 30 days |
237+
**Page (critical):**
238+
239+
| Burn Rate | Long Window | Short Window | Budget Consumed | Time to Exhaustion | Action |
240+
|-----------|-------------|--------------|-----------------|-------------------|--------|
241+
| 14.4x | 1 hour | 5 minutes | 2% | ~2 days | Immediate action required |
242+
| 6x | 6 hours | 30 minutes | 5% | ~5 days | Investigate and fix promptly |
243+
244+
**Ticket (warning):**
245+
246+
| Burn Rate | Long Window | Short Window | Budget Consumed | Time to Exhaustion | Action |
247+
|-----------|-------------|--------------|-----------------|-------------------|--------|
248+
| 3x | 1 day | 2 hours | 10% | ~10 days | Address during business hours |
249+
| 1x | 3 days | 6 hours | 10% | 30 days | Plan remediation before window closes |
242250

243251
An alert fires only when **both** the long window and the short window exceed their burn-rate threshold simultaneously. This dual-window requirement eliminates false positives from brief spikes that have already resolved.
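
As an illustration of the dual-window rule, the 14.4x page-tier row could be expressed roughly as follows. This is a sketch under stated assumptions: the recording-rule names `slo:error_ratio:rate1h` and `slo:error_ratio:rate5m` are hypothetical, and the error budget of 0.001 assumes a 99.9% objective; neither is taken from this repository.

```promql
# Hypothetical recording rules holding the error ratio over the
# long (1h) and short (5m) windows. 0.001 is the error budget for
# an assumed 99.9% objective; 14.4 is the burn-rate threshold.
(slo:error_ratio:rate1h > (14.4 * 0.001))
and
(slo:error_ratio:rate5m > (14.4 * 0.001))
```

The `and` operator returns a result only when both sides do, which is what keeps a spike that has already subsided in the short window from firing the alert. The other rows would follow the same shape with their own burn rate and window pair.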
244252

docs/runbooks/SLO/binding-sync-work-availability.md

Lines changed: 12 additions & 6 deletions
@@ -94,7 +94,7 @@ done
9494
**5. Check RBAC permissions:**
9595

9696
```bash
97-
kubectl logs -n karmada-system -l app=karmada-controller-manager | grep -i "forbidden\|unauthorized\|rbac"
97+
kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "forbidden\|unauthorized\|rbac"
9898
```
9999

100100
**6. Check override policy application.** Override policies are applied during Work creation; failures cause binding sync errors.
@@ -104,17 +104,23 @@ kubectl get overridepolicies -A
104104
kubectl get clusteroverridepolicies
105105
```
106106

107-
**7. Check the binding controller workqueue:**
107+
**7. Check the binding sync error rate:**
108108

109109
```promql
110-
workqueue_depth{name=~"binding.*|clusterbinding.*"}
110+
sum by (result) (rate(binding_sync_work_duration_seconds_count[5m]))
111+
```
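
If a single ratio is easier to read than the per-result rates, the query above can be divided by the total. This is a minimal sketch that assumes failures are reported with `result="error"`; that label value is an assumption here and should be confirmed against the output of the per-result query first.

```promql
# Fraction of binding sync operations that failed over the last 5 minutes.
# The label value result="error" is assumed, not confirmed by this runbook.
sum(rate(binding_sync_work_duration_seconds_count{result="error"}[5m]))
/
sum(rate(binding_sync_work_duration_seconds_count[5m]))
```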
112+
113+
**8. Check the binding controller workqueue:**
114+
115+
```promql
116+
workqueue_depth{name=~"binding-controller|cluster-resource-binding-controller"}
111117
112-
rate(workqueue_retries_total{name=~"binding.*"}[5m])
118+
rate(workqueue_retries_total{name=~"binding-controller|cluster-resource-binding-controller"}[5m])
113119
/
114-
rate(workqueue_adds_total{name=~"binding.*"}[5m])
120+
rate(workqueue_adds_total{name=~"binding-controller|cluster-resource-binding-controller"}[5m])
115121
```
116122

117-
**8. Check for recent changes.** Were override policies modified? Were new member clusters registered? Were execution namespaces deleted?
123+
**9. Check for recent changes.** Were override policies modified? Were new member clusters registered? Were execution namespaces deleted?
118124

119125
### Mitigation
120126

docs/runbooks/SLO/binding-sync-work-latency.md

Lines changed: 19 additions & 11 deletions
@@ -54,46 +54,54 @@ Binding sync latency determines how quickly scheduling decisions translate into
5454
- **[SLO Overview Dashboard](https://grafana.com/grafana/dashboards/14643)** — Check the fleet-wide view to see if other SLOs are also burning budget, which may indicate a broader systemic issue.
5555

5656

57-
**2. Check for complex override policies.** Override policy evaluation happens during binding sync; complex policies with many rules are a common cause of latency.
57+
**2. Check binding sync duration breakdown:**
58+
59+
```promql
60+
histogram_quantile(0.95,
61+
sum by (le, result) (rate(binding_sync_work_duration_seconds_bucket[5m])))
62+
```
63+
64+
**3. Check for complex override policies.** Override policy evaluation happens during binding sync; complex policies with many rules are a common cause of latency.
5865

5966
```bash
6067
kubectl get clusteroverridepolicies -o yaml | grep -c "overrideRules"
6168
kubectl get overridepolicies -A -o yaml | grep -c "overrideRules"
6269
```
6370

64-
**3. Check the binding controller workqueue:**
71+
**4. Check the binding controller workqueue:**
6572

6673
```promql
67-
workqueue_depth{name=~"binding.*|clusterbinding.*"}
74+
workqueue_depth{name=~"binding-controller|cluster-resource-binding-controller"}
6875
6976
histogram_quantile(0.95,
70-
rate(workqueue_queue_duration_seconds_bucket{name=~"binding.*"}[5m]))
77+
sum by (le) (rate(workqueue_queue_duration_seconds_bucket{name=~"binding-controller|cluster-resource-binding-controller"}[5m])))
7178
7279
histogram_quantile(0.95,
73-
rate(workqueue_work_duration_seconds_bucket{name=~"binding.*"}[5m]))
80+
sum by (le) (rate(workqueue_work_duration_seconds_bucket{name=~"binding-controller|cluster-resource-binding-controller"}[5m])))
7481
```
7582

76-
**4. Check for large resource manifests.** Large resource manifests take longer to process and create as Work objects.
83+
**5. Check binding volume.** A high number of bindings being processed simultaneously increases controller load.
7784

7885
```bash
79-
kubectl get resourcebindings -A -o json | jq '.items | map({name: .metadata.name, namespace: .metadata.namespace}) | .[0:10]'
86+
kubectl get resourcebindings -A --no-headers | wc -l
87+
kubectl get clusterresourcebindings --no-headers | wc -l
8088
```
8189

82-
**5. Check controller-manager resource usage:**
90+
**6. Check controller-manager resource usage:**
8391

8492
```bash
8593
kubectl top pod -n karmada-system -l app=karmada-controller-manager
8694
```
8795

88-
**6. Check controller-manager logs:**
96+
**7. Check controller-manager logs:**
8997

9098
Check the logs using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):
9199

92100
```bash
93-
kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "SyncWork\|slow\|retry"
101+
kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "Sync work\|failed"
94102
```
95103

96-
**7. Check for recent changes.** Were override policies added or modified? Did the number of propagated resources increase significantly?
104+
**8. Check for recent changes.** Were override policies added or modified? Did the number of propagated resources increase significantly?
97105

98106
### Mitigation
99107

docs/runbooks/SLO/cluster-sync-latency.md

Lines changed: 4 additions & 4 deletions
@@ -73,13 +73,13 @@ kubectl run conn-test --rm -it --image=curlimages/curl --namespace=karmada-syste
7373

7474
**4. Check network latency to member clusters.** Cross-region member clusters inherently have higher sync latency. If you have recently added cross-region clusters, consider adjusting the threshold.
7575

76-
**5. Check the cluster controller workqueue:**
76+
**5. Check the cluster status controller workqueue:**
7777

7878
```promql
79-
workqueue_depth{name="cluster"}
79+
workqueue_depth{name="cluster-status-controller"}
8080
8181
histogram_quantile(0.95,
82-
rate(workqueue_queue_duration_seconds_bucket{name="cluster"}[5m]))
82+
sum by (le) (rate(workqueue_queue_duration_seconds_bucket{name="cluster-status-controller"}[5m])))
8383
```
8484

8585
**6. Check controller-manager resource usage:**
@@ -99,7 +99,7 @@ kubectl get clusters --no-headers | wc -l
9999
Check the logs using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):
100100

101101
```bash
102-
kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "cluster status\|slow\|sync"
102+
kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "cluster\|failed\|sync"
103103
```
104104

105105
**9. Check for recent changes.** Were new member clusters added (especially cross-region)? Were there network path changes?

docs/runbooks/SLO/karmada-apiserver-availability.md

Lines changed: 1 addition & 1 deletion
@@ -108,7 +108,7 @@ kubectl logs -n karmada-system <etcd-pod> --tail=50
108108
Check etcd request latency:
109109

110110
```promql
111-
histogram_quantile(0.99, rate(etcd_request_duration_seconds_bucket[5m]))
111+
histogram_quantile(0.99, sum by (le) (rate(etcd_request_duration_seconds_bucket[5m])))
112112
```
113113

114114
**8. If 429s are dominant, identify top request sources:**

docs/runbooks/SLO/karmada-apiserver-latency.md

Lines changed: 2 additions & 2 deletions
@@ -27,7 +27,7 @@ Page-level alerts are not enabled for this SLO. Only ticket-level alerts will fi
2727
2828
## What This Alert Means
2929

30-
Karmada API Server requests (excluding WATCH and APPLY operations) are taking longer than the configured latency threshold at an elevated rate. This indicates performance degradation in the control plane.
30+
Karmada API Server requests are taking longer than the configured latency threshold at an elevated rate. This indicates performance degradation in the control plane.
3131

3232
## Impact
3333

@@ -98,7 +98,7 @@ apiserver_current_inflight_requests
9898

9999
**7. Check API Server logs:**
100100

101-
Check the Karmada API Server logs for slow requests using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):
101+
Check the Karmada API Server logs for errors using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):
102102

103103
```bash
104104
kubectl logs -n karmada-system -l app=karmada-apiserver --tail=100

docs/runbooks/SLO/karmada-scheduler-availability.md

Lines changed: 6 additions & 6 deletions
@@ -37,7 +37,7 @@ This alert uses the **multi-burn rate, multi-window** alerting framework ([descr
3737
3838
## What This Alert Means
3939

40-
The Karmada scheduler is failing to successfully schedule workloads to member clusters (Stage 2: Scheduling). The scheduler decides which clusters should run your workloads based on your propagation policies. This SLO tracks `result="error"` (actual scheduler failures), not `result="unschedulable"` (constraint mismatches).
40+
The Karmada scheduler is failing to successfully schedule workloads to member clusters (Stage 2: Scheduling). The scheduler decides which clusters should run your workloads based on your propagation policies. This SLO tracks `result="error"` on the `karmada_scheduler_schedule_attempts_total` metric — note that this includes both system-level scheduler failures and unschedulable outcomes (e.g., no clusters matching placement constraints).
4141

4242
## Impact
4343

@@ -73,13 +73,13 @@ kubectl get events -n karmada-system --field-selector reason=ScheduleBindingFail
7373

7474
Review the event messages to understand why the scheduling failed and determine the fix.
7575

76-
**3. Distinguish errors from unschedulable results:**
76+
**3. Check the scheduling error rate and breakdown by type:**
7777

7878
```promql
79-
sum by (result) (rate(karmada_scheduler_schedule_attempts_total[5m]))
79+
sum by (result, schedule_type) (rate(karmada_scheduler_schedule_attempts_total[5m]))
8080
```
8181

82-
If `result="unschedulable"` is high, that is a different issue (cluster capacity or policy constraints), not a scheduler error.
82+
This metric reports `result="scheduled"` for successes and `result="error"` for failures. The `schedule_type` label helps distinguish first-time scheduling from rescheduling. Note that `result="error"` includes both system-level failures (e.g., estimator unreachable) and constraint mismatches (e.g., no eligible clusters) — check events and logs to distinguish them.
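
Since failures are identified by `result="error"`, the overall error ratio can be derived from this metric alone. A minimal sketch using only the metric and label value named above; the 5-minute range is an arbitrary choice for the example:

```promql
# Fraction of scheduling attempts that ended in result="error"
# over the last 5 minutes, across all schedule types.
sum(rate(karmada_scheduler_schedule_attempts_total{result="error"}[5m]))
/
sum(rate(karmada_scheduler_schedule_attempts_total[5m]))
```

Adding `by (schedule_type)` to both `sum` aggregations would give the same ratio per scheduling type.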
8383

8484
**4. Check the scheduler pod health:**
8585

@@ -88,7 +88,7 @@ kubectl get pods -n karmada-system -l app=karmada-scheduler
8888
kubectl logs -n karmada-system -l app=karmada-scheduler --tail=200 | grep -i "error\|failed\|panic"
8989
```
9090

91-
**5. Check for plugin execution failures:**
91+
**5. Check for slow plugins.** Slow plugins can cause scheduling timeouts that surface as errors.
9292

9393
```promql
9494
histogram_quantile(0.99,
@@ -134,7 +134,7 @@ rate(karmada_scheduler_queue_incoming_bindings_total[5m]) by (event)
134134
| Plugin failures in logs | Identify failing plugin; check plugin configuration |
135135
| Estimator unreachable | Check estimator pod health and network connectivity |
136136
| All clusters NotReady | Address cluster connectivity issues first |
137-
| High unschedulable rate | Review PropagationPolicy placement rules and cluster capacity |
137+
| No matching clusters for placement rules | Review PropagationPolicy `clusterAffinity` and `spreadConstraints`; check cluster labels |
138138

139139
## Related Resources
140140

docs/runbooks/SLO/karmada-scheduler-estimator-availability.md

Lines changed: 2 additions & 2 deletions
@@ -88,7 +88,7 @@ kubectl describe cluster <cluster-name>
8888
cluster_ready_state == 0
8989
```
9090

91-
**5. Check network connectivity between scheduler and estimators:**
91+
**5. Check estimator service existence:**
9292

9393
```bash
9494
kubectl get svc -n karmada-system | grep estimator
@@ -99,7 +99,7 @@ kubectl get svc -n karmada-system | grep estimator
9999
Estimator failures surface in the scheduler as `ScheduleBindingFailed` events. Check the scheduler logs for error messages referencing the estimator:
100100

101101
```bash
102-
kubectl logs -n karmada-system -l app=karmada-scheduler --tail=200 | grep -i "estimator\|error"
102+
kubectl logs -n karmada-system -l app=karmada-scheduler --tail=200 | grep -i "estimator"
103103
```
104104

105105
**7. Check Kubernetes events for scheduling failures related to estimator issues:**

docs/runbooks/SLO/karmada-scheduler-estimator-latency.md

Lines changed: 2 additions & 2 deletions
@@ -93,7 +93,7 @@ kubectl --context=<member-cluster-context> get pods -A --no-headers | wc -l
9393
Check the logs using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):
9494

9595
```bash
96-
kubectl logs -n karmada-system <estimator-pod> --tail=200 | grep -i "slow\|latency\|error"
96+
kubectl logs -n karmada-system <estimator-pod> --tail=200 | grep -i "error\|failed"
9797
```
9898

9999
**8. Check for recent changes.** Were new member clusters added? Did cluster sizes grow significantly? Were network paths changed?
@@ -105,7 +105,7 @@ kubectl logs -n karmada-system <estimator-pod> --tail=200 | grep -i "slow\|laten
105105
| Member cluster API server slow | Address member cluster API server performance |
106106
| High network latency (cross-region) | Adjust the latency threshold to match expected network latency |
107107
| Estimator CPU-constrained | Increase CPU limits for the estimator pod |
108-
| Very large member cluster (many pods) | Consider capacity caching in the estimator (if supported) |
108+
| Very large member cluster (many pods) | Increase estimator `--parallelism` to speed up computation; increase `--kube-api-qps` and `--kube-api-burst` if API throttling is occurring |
109109

110110
## Related Resources
111111

docs/runbooks/SLO/karmada-scheduler-latency.md

Lines changed: 3 additions & 3 deletions
@@ -27,7 +27,7 @@ Page-level alerts are not enabled for this SLO. Only ticket-level alerts will fi
2727
2828
## What This Alert Means
2929

30-
End-to-end scheduling operations (Stage 2: Scheduling) are exceeding the configured latency threshold at an elevated rate. This measures the time from when a ResourceBinding enters the scheduling queue to when a placement decision is made.
30+
End-to-end scheduling operations (Stage 2: Scheduling) are exceeding the configured latency threshold at an elevated rate.
3131

3232
## Impact
3333

@@ -97,7 +97,7 @@ kubectl top pod -n karmada-system -l app=karmada-scheduler
9797
Check the logs using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):
9898

9999
```bash
100-
kubectl logs -n karmada-system -l app=karmada-scheduler --tail=200 | grep -i "slow\|latency\|error"
100+
kubectl logs -n karmada-system -l app=karmada-scheduler --tail=200 | grep -i "error"
101101
```
102102

103103
**8. Check for recent changes.** Were new clusters registered, new PropagationPolicies added, or scheduler configuration modified?
@@ -110,7 +110,7 @@ kubectl logs -n karmada-system -l app=karmada-scheduler --tail=200 | grep -i "sl
110110
| Slow plugins | Review plugin configuration; disable non-essential plugins if possible |
111111
| Many clusters to evaluate | Use `clusterAffinity` in PropagationPolicies to pre-filter clusters |
112112
| Scheduler CPU-constrained | Increase CPU limits |
113-
| High scheduling throughput | Consider adding scheduler replicas (check if scheduler supports it) |
113+
| High scheduling throughput | Add scheduler replicas (leader election is supported and enabled by default) |
114114

115115
## Related Resources
116116
