Commit 278dc60

Add reliability guide
Signed-off-by: Joe Nathan Abellard <contact@jabellard.com>
1 parent b567d31 commit 278dc60

17 files changed

Lines changed: 3051 additions & 0 deletions

docs/administrator/reliability/guide.md

Lines changed: 830 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 113 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,113 @@
1+
---
2+
title: Binding Sync Work Availability
3+
sidebar_label: Binding Sync Work Availability
4+
description: Binding Sync Work Availability
5+
---
6+
7+
# Binding to Work Sync Availability SLO Error Budget Burn Rate Exceeded
8+
9+
## Understanding This Alert
10+
11+
This alert fires when the SLO's error budget is being consumed faster than sustainable — **page** alerts indicate urgent issues requiring immediate action, while **ticket** alerts should be addressed during business hours. See the [Reliability Engineering Guide](../../administrator/reliability/guide#slo-alerting-framework) for details on burn rates, time windows, and severity thresholds.
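
As a rough illustration of what "faster than sustainable" means: the burn rate is the observed error ratio divided by the error budget, which is 1 minus the SLO target. The sketch below assumes a hypothetical 99.9% objective; the function name `burn_rate` and the sample numbers are illustrative and are not taken from the Karmada SLO definitions.

```bash
#!/usr/bin/env bash
# burn_rate ERROR_RATIO SLO_TARGET_PERCENT
# Prints how many times faster than "exactly on budget" the error budget is burning.
# Both arguments are illustrative inputs; read the real objective from your SLO spec.
burn_rate() {
  awk -v err="$1" -v target="$2" 'BEGIN {
    budget = 1 - target / 100      # e.g. 99.9 -> 0.001 allowed error ratio
    printf "%.1f\n", err / budget
  }'
}
```

For example, `burn_rate 0.0144 99.9` prints `14.4`. Sustained at that rate, a 30-day error budget would be exhausted in roughly two days.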
12+
13+
## What This Alert Means
14+
15+
Karmada is failing to create Work resources from your ResourceBindings or ClusterResourceBindings (Stage 4: Work Creation & Override Application). After the scheduler assigns workloads to clusters, the binding controller converts those scheduling decisions into Work objects that drive the actual deployment. This step is failing.
16+
17+
## Impact
18+
19+
- **Workloads scheduled but not propagated** - The scheduler picked clusters, but resources can't move to the next pipeline stage
20+
- **Deployments are stuck** - New applications won't reach member clusters
21+
- **Updates blocked** - Changes to existing workloads won't propagate
22+
23+
This is a **critical** issue that blocks workload propagation after scheduling. User intent (resources created in Karmada) never materializes in member clusters.
24+
25+
## Possible Causes
26+
27+
- Conflicts with existing Work resources
28+
- Invalid resource templates in the binding
29+
- API server issues preventing Work creation
30+
- Missing execution namespaces (one per member cluster)
31+
- RBAC permission issues
32+
- Override policy application failures
33+
34+
## Remediation
35+
36+
**1. Review the Sloth SLO dashboards in Grafana.** These dashboards are your primary tool for understanding the scope and timeline of the issue:
37+
38+
- **[SLO Details Dashboard](https://grafana.com/grafana/dashboards/14348)** — Drill into this specific SLO to see the current burn rate, error budget remaining, monthly burndown chart, and alert state. Use this to confirm the alert and understand when the issue began.
39+
- **[SLO Overview Dashboard](https://grafana.com/grafana/dashboards/14643)** — Check the fleet-wide view to see if other SLOs are also burning budget, which may indicate a broader systemic issue.
40+
41+
42+
**2. Check Kubernetes events for binding sync failures:**
43+
44+
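```bash
# Events are recorded on the binding objects, which live in the workloads' own namespaces,
# so search all namespaces rather than only karmada-system.
kubectl get events -A --field-selector reason=SyncWorkFailed --sort-by='.lastTimestamp'
```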
47+
48+
Review the event messages to understand why the sync failed and determine the fix.
49+
50+
**3. Check the controller-manager logs.** The binding controller runs in the karmada-controller-manager.
51+
52+
Check the logs using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):
53+
54+
```bash
kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "binding\|work\|error\|failed"
```
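
When the grep output is long, it can help to see which source files emit the most error lines. The helper below is a sketch that assumes the default klog text format (`E<mmdd> <time> <pid> <file>:<line>] <message>`); `summarize_klog_errors` is a hypothetical name, not a Karmada tool.

```bash
#!/usr/bin/env bash
# summarize_klog_errors: reads klog-formatted lines on stdin and prints
# "<count> <source file>" for error-level lines, most frequent first.
summarize_klog_errors() {
  awk '$1 ~ /^E[0-9]+$/ {
         src = $4
         sub(/:[0-9]+\]$/, "", src)   # drop ":<line>]" so counts group per file
         count[src]++
       }
       END { for (s in count) printf "%d %s\n", count[s], s }' | sort -k1,1nr -k2,2
}
```

Pipe the `kubectl logs` command above into `summarize_klog_errors` in place of `grep` to get a per-file error count.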
57+
58+
**4. Check execution namespace existence.** Work objects are created in execution namespaces (one per member cluster). Missing namespaces cause Work creation failures.
59+
60+
```bash
kubectl get namespaces | grep karmada-es-

# Emit one newline-terminated cluster name per line; without the trailing
# newline, `read` would silently skip the last cluster.
kubectl get clusters -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | while read -r cluster; do
  ns="karmada-es-${cluster}"
  kubectl get namespace "$ns" &>/dev/null || echo "MISSING: $ns"
done
```
68+
69+
**5. Check RBAC permissions:**
70+
71+
```bash
kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "forbidden\|unauthorized\|rbac"
```
74+
75+
**6. Check override policy application.** Override policies are applied during Work creation; failures cause binding sync errors.
76+
77+
```bash
kubectl get overridepolicies -A
kubectl get clusteroverridepolicies
```
81+
82+
**7. Check the binding sync rate, broken down by result, to see what share of syncs are failing:**
83+
84+
```promql
sum by (result) (rate(binding_sync_work_duration_seconds_count[5m]))
```
87+
88+
**8. Check the binding controller workqueue:**
89+
90+
```promql
workqueue_depth{name=~"binding-controller|cluster-resource-binding-controller"}

rate(workqueue_retries_total{name=~"binding-controller|cluster-resource-binding-controller"}[5m])
/
rate(workqueue_adds_total{name=~"binding-controller|cluster-resource-binding-controller"}[5m])
```
97+
98+
**9. Check for recent changes.** Were override policies modified? Were new member clusters registered? Were execution namespaces deleted?
99+
100+
### Mitigation
101+
102+
| Symptom | Action |
103+
|---------|--------|
104+
| Missing execution namespaces | Recreate namespaces or trigger cluster re-registration |
105+
| RBAC errors | Review and fix controller-manager RBAC roles |
106+
| Override policy failures | Review and fix OverridePolicy configurations |
107+
| API server errors | Address API server issues first (see [Karmada API Server Availability](./karmada-apiserver-availability)) |
108+
| Controller workqueue stuck | Restart the controller-manager after fixing the root cause |
109+
110+
## Related Resources
111+
112+
- [Reliability Engineering Guide](../../administrator/reliability/guide)
113+
- [Alerting on SLOs - Google SRE Workbook](https://sre.google/workbook/alerting-on-slos/)
Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
1+
---
2+
title: Binding Sync Work Latency
3+
sidebar_label: Binding Sync Work Latency
4+
description: Binding Sync Work Latency
5+
---
6+
7+
# Binding to Work Sync Latency SLO Error Budget Burn Rate Exceeded
8+
9+
## Understanding This Alert
10+
11+
This alert fires when the SLO's error budget is being consumed faster than sustainable. This SLO uses **ticket-level alerts only** — investigate and address during normal business hours. See the [Reliability Engineering Guide](../../administrator/reliability/guide#slo-alerting-framework) for details on burn rates, time windows, and severity thresholds.
12+
13+
## What This Alert Means
14+
15+
The process of converting ResourceBindings and ClusterResourceBindings into Work resources (Stage 4: Work Creation & Override Application) is taking longer than expected. Operations are succeeding, but too many of them exceed the configured latency threshold, which delays the overall resource propagation pipeline.
16+
17+
## Impact
18+
19+
- **Slower deployments** - Resources take longer to move from scheduling decisions to propagation tasks
20+
- **Delayed updates** - Changes to workloads propagate more slowly to member clusters
21+
- **End-to-end pipeline slowdown** - Downstream steps (actual workload deployment) are delayed
22+
23+
Binding sync latency determines how quickly scheduling decisions translate into deployable Work objects.
24+
25+
## Possible Causes
26+
27+
- High volume of resources being propagated simultaneously
28+
- API server slowness affecting Work object creation
29+
- Complex resource templates requiring more processing time
30+
- Complex override policies with many rules
31+
- Large resource manifests taking longer to process
32+
- Controller-manager under resource pressure
33+
34+
## Remediation
35+
36+
**1. Review the Sloth SLO dashboards in Grafana.** These dashboards are your primary tool for understanding the scope and timeline of the issue:
37+
38+
- **[SLO Details Dashboard](https://grafana.com/grafana/dashboards/14348)** — Drill into this specific SLO to see the current burn rate, error budget remaining, monthly burndown chart, and alert state. Use this to confirm the alert and understand when the issue began.
39+
- **[SLO Overview Dashboard](https://grafana.com/grafana/dashboards/14643)** — Check the fleet-wide view to see if other SLOs are also burning budget, which may indicate a broader systemic issue.
40+
41+
42+
**2. Check binding sync duration breakdown:**
43+
44+
```promql
histogram_quantile(0.95,
  sum by (le, result) (rate(binding_sync_work_duration_seconds_bucket[5m])))
```
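
When reading the p95 above, keep in mind that `histogram_quantile` estimates the value by linear interpolation inside the bucket that contains the target rank, so the result is only as precise as the bucket boundaries. The sketch below reproduces that calculation for classic histograms; the function name `estimate_quantile` is hypothetical, and the bucket bounds and counts in the example are made up rather than taken from the real metric.

```bash
#!/usr/bin/env bash
# estimate_quantile Q: reads "upper_bound cumulative_count" pairs on stdin,
# sorted by upper bound, with "+Inf" as the last bound. Q must be in (0, 1].
estimate_quantile() {
  awk -v q="$1" '
    { bound[NR] = $1; cum[NR] = $2 }
    END {
      if (NR < 2 || cum[NR] == 0) { print "NaN"; exit }
      rank = q * cum[NR]                       # target rank among all observations
      for (i = 1; i <= NR; i++) if (cum[i] >= rank) break
      if (bound[i] ~ /Inf/) {                  # rank falls in the +Inf bucket:
        printf "%.3f\n", bound[NR-1]; exit     # clamp to the highest finite bound
      }
      lower = (i == 1) ? 0 : bound[i-1]        # lowest bucket is assumed to start at 0
      prev  = (i == 1) ? 0 : cum[i-1]
      printf "%.3f\n", lower + (bound[i] - lower) * (rank - prev) / (cum[i] - prev)
    }'
}
```

With buckets `0.1 50`, `0.5 90`, `1 100`, `+Inf 100`, calling `estimate_quantile 0.95` prints `0.750`: the 95th observation sits halfway through the 0.5s to 1s bucket. If the reported p95 always lands on a bucket boundary, the true latency may be higher than the histogram can express.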
48+
49+
**3. Check for complex override policies.** Override policy evaluation happens during binding sync; complex policies with many rules are a common cause of latency.
50+
51+
```bash
kubectl get clusteroverridepolicies -o yaml | grep -c "overrideRules"
kubectl get overridepolicies -A -o yaml | grep -c "overrideRules"
```
55+
56+
**4. Check the binding controller workqueue:**
57+
58+
```promql
workqueue_depth{name=~"binding-controller|cluster-resource-binding-controller"}

histogram_quantile(0.95,
  sum by (le) (rate(workqueue_queue_duration_seconds_bucket{name=~"binding-controller|cluster-resource-binding-controller"}[5m])))

histogram_quantile(0.95,
  sum by (le) (rate(workqueue_work_duration_seconds_bucket{name=~"binding-controller|cluster-resource-binding-controller"}[5m])))
```
67+
68+
**5. Check binding volume.** A high number of bindings being processed simultaneously increases controller load.
69+
70+
```bash
kubectl get resourcebindings -A --no-headers | wc -l
kubectl get clusterresourcebindings --no-headers | wc -l
```
74+
75+
**6. Check controller-manager resource usage:**
76+
77+
```bash
kubectl top pod -n karmada-system -l app=karmada-controller-manager
```
80+
81+
**7. Check controller-manager logs:**
82+
83+
Check the logs using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):
84+
85+
```bash
kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "Sync work\|failed"
```
88+
89+
**8. Check for recent changes.** Were override policies added or modified? Did the number of propagated resources increase significantly?
90+
91+
### Mitigation
92+
93+
| Root Cause | Action |
94+
|------------|--------|
95+
| Complex override policies | Simplify override rules; reduce number of per-resource overrides |
96+
| Large manifests | Review and reduce ConfigMap/Secret sizes; split large resources |
97+
| Controller CPU-constrained | Increase CPU limits for controller-manager |
98+
| API server slow writes | Address API server/etcd performance |
99+
| Backlog in queue | Check for and resolve any errors causing retries |
100+
101+
## Related Resources
102+
103+
- [Reliability Engineering Guide](../../administrator/reliability/guide)
104+
- [Alerting on SLOs - Google SRE Workbook](https://sre.google/workbook/alerting-on-slos/)
Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
1+
---
2+
title: Cluster Sync Latency
3+
sidebar_label: Cluster Sync Latency
4+
description: Cluster Sync Latency
5+
---
6+
7+
# Cluster Status Sync Latency SLO Error Budget Burn Rate Exceeded
8+
9+
## Understanding This Alert
10+
11+
This alert fires when the SLO's error budget is being consumed faster than sustainable. This SLO uses **ticket-level alerts only** — investigate and address during normal business hours. See the [Reliability Engineering Guide](../../administrator/reliability/guide#slo-alerting-framework) for details on burn rates, time windows, and severity thresholds.
12+
13+
## What This Alert Means
14+
15+
The process of syncing status information from ready member clusters is taking longer than expected. Karmada periodically pulls cluster status (node count, resource capacity, conditions) from each member cluster. This SLO only tracks syncs for clusters that are in a Ready state.
16+
17+
## Impact
18+
19+
- **Stale cluster information** - The scheduler may use outdated capacity data when making placement decisions
20+
- **Suboptimal scheduling** - Workloads may be placed on clusters that appear to have capacity but don't
21+
- **Delayed health detection** - Changes in cluster health take longer to be reflected in the control plane
22+
- **Delayed failover** - Failed clusters may keep receiving new workload placements until their status is refreshed
23+
24+
## Possible Causes
25+
26+
- Member cluster API server slowness
27+
- Network latency between the Karmada control plane and member clusters (especially cross-region)
28+
- Large clusters with many nodes requiring more status data to collect
29+
- Resource pressure on the controller manager
30+
- High number of registered member clusters
31+
32+
## Remediation
33+
34+
**1. Review the Sloth SLO dashboards in Grafana.** These dashboards are your primary tool for understanding the scope and timeline of the issue:
35+
36+
- **[SLO Details Dashboard](https://grafana.com/grafana/dashboards/14348)** — Drill into this specific SLO to see the current burn rate, error budget remaining, monthly burndown chart, and alert state. Use this to confirm the alert and understand when the issue began.
37+
- **[SLO Overview Dashboard](https://grafana.com/grafana/dashboards/14643)** — Check the fleet-wide view to see if other SLOs are also burning budget, which may indicate a broader systemic issue.
38+
39+
40+
**2. Identify which specific clusters are slow:**
41+
42+
```promql
histogram_quantile(0.95,
  sum by (le, member_cluster) (rate(cluster_sync_status_duration_seconds_bucket[5m])))
> 1
```
47+
48+
**3. Check member cluster API server latency.** The cluster sync controller fetches status from member cluster API servers; their latency is the primary driver.
49+
50+
```bash
kubectl describe cluster <slow-cluster-name>
```
53+
54+
```bash
kubectl run conn-test --rm -it --image=curlimages/curl --namespace=karmada-system -- \
  curl -w "%{time_total}\n" -s -o /dev/null -k https://<cluster-api-server>:6443/healthz
```
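
A single request can be misleading, so it may help to collect several `time_total` samples and look at a high percentile instead. The helper below is a sketch using the nearest-rank method; `percentile` is a hypothetical name, and it only post-processes numbers you have already collected.

```bash
#!/usr/bin/env bash
# percentile P: reads one number per line on stdin and prints the
# P-th percentile (nearest-rank method) of those samples.
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1                  # no samples, nothing to report
      idx = int((p / 100) * NR)
      if (idx < (p / 100) * NR) idx++      # round up to the next whole rank
      if (idx < 1) idx = 1
      print v[idx]
    }'
}
```

For instance, feeding the five samples `0.12`, `0.10`, `0.95`, `0.11`, `0.13` into `percentile 95` prints `0.95`, which exposes an outlier that the median (`0.12`) would hide.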
58+
59+
**4. Check network latency to member clusters.** Cross-region member clusters inherently have higher sync latency. If you have recently added cross-region clusters, consider adjusting the SLO latency threshold to match.
60+
61+
**5. Check the cluster status controller workqueue:**
62+
63+
```promql
workqueue_depth{name="cluster-status-controller"}

histogram_quantile(0.95,
  sum by (le) (rate(workqueue_queue_duration_seconds_bucket{name="cluster-status-controller"}[5m])))
```
69+
70+
**6. Check controller-manager resource usage:**
71+
72+
```bash
kubectl top pod -n karmada-system -l app=karmada-controller-manager
```
75+
76+
**7. Check total cluster count.** The more member clusters are registered, the more work the cluster status controller must do.
77+
78+
```bash
kubectl get clusters --no-headers | wc -l
```
81+
82+
**8. Check controller-manager logs:**
83+
84+
Check the logs using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):
85+
86+
```bash
kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "cluster\|failed\|sync"
```
89+
90+
**9. Check for recent changes.** Were new member clusters added (especially cross-region)? Were there network path changes?
91+
92+
### Mitigation
93+
94+
| Root Cause | Action |
95+
|------------|--------|
96+
| Member cluster API server slow | Address member cluster health |
97+
| High network latency (cross-region) | Adjust the latency threshold to match actual latency; consider cluster-local agents |
98+
| Controller CPU-constrained | Increase CPU limits for controller-manager |
99+
| Many clusters causing contention | Consider increasing controller-manager worker thread count |
100+
| Network partition to specific clusters | Restore network connectivity; investigate network path |
101+
102+
## Related Resources
103+
104+
- [Reliability Engineering Guide](../../administrator/reliability/guide)
105+
- [Alerting on SLOs - Google SRE Workbook](https://sre.google/workbook/alerting-on-slos/)

docs/runbooks/SLO/index.md

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
1+
---
2+
title: SLO Runbooks
3+
---
4+
5+
# SLO Runbooks
6+
7+
Detailed troubleshooting guides for Service Level Objective (SLO) violations. Each runbook explains the impact, potential causes, and remediation steps for a specific SLO.
8+
9+
## API Server
10+
11+
- [API Server Availability](karmada-apiserver-availability.md)
12+
- [API Server Latency](karmada-apiserver-latency.md)
13+
14+
## Policy Application
15+
16+
- [Policy Apply Availability](policy-apply-availability.md)
17+
- [Policy Apply Latency](policy-apply-latency.md)
18+
19+
## Scheduler
20+
21+
- [Scheduler Availability](karmada-scheduler-availability.md)
22+
- [Scheduler Latency](karmada-scheduler-latency.md)
23+
- [Scheduler Estimator Availability](karmada-scheduler-estimator-availability.md)
24+
- [Scheduler Estimator Latency](karmada-scheduler-estimator-latency.md)
25+
26+
## Resource Propagation
27+
28+
- [Binding Sync Work Availability](binding-sync-work-availability.md)
29+
- [Binding Sync Work Latency](binding-sync-work-latency.md)
30+
- [Work Sync Workload Availability](work-sync-workload-availability.md)
31+
- [Work Sync Workload Latency](work-sync-workload-latency.md)
32+
33+
## Cluster Health
34+
35+
- [Cluster Sync Latency](cluster-sync-latency.md)
