karmada-io
diff --git a/docs/administrator/reliability/guide.md b/docs/administrator/reliability/guide.md
Lines changed: 830 additions & 0 deletions
diff --git a/docs/runbooks/SLO/binding-sync-work-availability.md b/docs/runbooks/SLO/binding-sync-work-availability.md
Lines changed: 113 additions & 0 deletions
diff --git a/docs/runbooks/SLO/binding-sync-work-latency.md b/docs/runbooks/SLO/binding-sync-work-latency.md
Lines changed: 104 additions & 0 deletions
diff --git a/docs/runbooks/SLO/cluster-sync-latency.md b/docs/runbooks/SLO/cluster-sync-latency.md
Lines changed: 105 additions & 0 deletions
diff --git a/docs/runbooks/SLO/index.md b/docs/runbooks/SLO/index.md
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,113 @@
+---
+title: Binding Sync Work Availability
+sidebar_label: Binding Sync Work Availability
+description: Binding Sync Work Availability
+---
+
+# Binding to Work Sync Availability SLO Error Budget Burn Rate Exceeded
+
+## Understanding This Alert
+
+This alert fires when the SLO's error budget is being consumed faster than sustainable — **page** alerts indicate urgent issues requiring immediate action, while **ticket** alerts should be addressed during business hours. See the [Reliability Engineering Guide](../../administrator/reliability/guide#slo-alerting-framework) for details on burn rates, time windows, and severity thresholds.
+
+## What This Alert Means
+
+Karmada is failing to create Work resources from your ResourceBindings or ClusterResourceBindings (Stage 4: Work Creation & Override Application). After the scheduler assigns workloads to clusters, the binding controller converts those scheduling decisions into Work objects that drive the actual deployment. This step is failing.
+
+## Impact
+
+- **Workloads scheduled but not propagated** - The scheduler picked clusters, but resources can't move to the next pipeline stage
+- **Deployments are stuck** - New applications won't reach member clusters
+- **Updates blocked** - Changes to existing workloads won't propagate
+
+This is a **critical** issue that blocks workload propagation after scheduling. User intent (resources created in Karmada) never materializes in member clusters.
+
+## Possible Causes
+
+- Conflicts with existing Work resources
+- Invalid resource templates in the binding
+- API server issues preventing Work creation
+- Missing execution namespaces (one per member cluster)
+- RBAC permission issues
+- Override policy application failures
+
+## Remediation
+
+**1. Review the Sloth SLO dashboards in Grafana.** These dashboards are your primary tool for understanding the scope and timeline of the issue:
+
+- **[SLO Details Dashboard](https://grafana.com/grafana/dashboards/14348)** — Drill into this specific SLO to see the current burn rate, error budget remaining, monthly burndown chart, and alert state. Use this to confirm the alert and understand when the issue began.
+- **[SLO Overview Dashboard](https://grafana.com/grafana/dashboards/14643)** — Check the fleet-wide view to see if other SLOs are also burning budget, which may indicate a broader systemic issue.
+
+
+**2. Check Kubernetes events for binding sync failures:**
+
+```bash
+kubectl get events -A --field-selector reason=SyncWorkFailed --sort-by='.lastTimestamp'
+```
+
+Review the event messages to understand why the sync failed and determine the fix.
+
+**3. Check the controller-manager logs.** The binding controller runs in the karmada-controller-manager.
+
+Check the logs using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):
+
+```bash
+kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "binding\|work\|error\|failed"
+```
+
+**4. Check execution namespace existence.** Work objects are created in execution namespaces (one per member cluster). Missing namespaces cause Work creation failures.
+
+```bash
+kubectl get namespaces | grep karmada-es-
+
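+# Emit one newline-terminated cluster name per line so the loop does not skip the last cluster.
+kubectl get clusters -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | while read -r cluster; do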
+  ns="karmada-es-${cluster}"
+  kubectl get namespace "$ns" &>/dev/null || echo "MISSING: $ns"
+done
+```
+
+**5. Check RBAC permissions:**
+
+```bash
+kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "forbidden\|unauthorized\|rbac"
+```
+
+**6. Check override policy application.** Override policies are applied during Work creation; failures cause binding sync errors.
+
+```bash
+kubectl get overridepolicies -A
+kubectl get clusteroverridepolicies
+```
+
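+**7. Check the binding sync error rate.** The query breaks the sync rate down by the `result` label; compare the failing series against the total: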
+
+```promql
+sum by (result) (rate(binding_sync_work_duration_seconds_count[5m]))
+```
+
+**8. Check the binding controller workqueue:**
+
+```promql
+workqueue_depth{name=~"binding-controller|cluster-resource-binding-controller"}
+
+rate(workqueue_retries_total{name=~"binding-controller|cluster-resource-binding-controller"}[5m])
+/
+rate(workqueue_adds_total{name=~"binding-controller|cluster-resource-binding-controller"}[5m])
+```
+
+**9. Check for recent changes.** Were override policies modified? Were new member clusters registered? Were execution namespaces deleted?
+
+### Mitigation
+
+| Symptom | Action |
+|---------|--------|
+| Missing execution namespaces | Recreate namespaces or trigger cluster re-registration |
+| RBAC errors | Review and fix controller-manager RBAC roles |
+| Override policy failures | Review and fix OverridePolicy configurations |
+| API server errors | Address API server issues first (see [Karmada API Server Availability](./karmada-apiserver-availability)) |
+| Controller workqueue stuck | Restart the controller-manager after fixing the root cause |
+
+## Related Resources
+
+- [Reliability Engineering Guide](../../administrator/reliability/guide)
+- [Alerting on SLOs - Google SRE Workbook](https://sre.google/workbook/alerting-on-slos/)
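The execution-namespace check in step 4 of this runbook can also be written as a small reusable function. The sketch below is illustrative only: `missing_execution_namespaces` is a hypothetical helper name, and it assumes the `karmada-es-<cluster>` naming described above. It takes the existing namespaces as an argument and the cluster names on stdin, so it can be tried without a live control plane.

```bash
# Hypothetical helper (not part of Karmada): report execution namespaces
# that should exist but do not.
#   $1    - existing namespace names, one per line
#   stdin - member cluster names, one per line
missing_execution_namespaces() {
  local existing="$1" cluster ns
  # "|| [ -n ... ]" keeps the final name even if the input lacks a trailing newline.
  while IFS= read -r cluster || [ -n "$cluster" ]; do
    [ -z "$cluster" ] && continue
    ns="karmada-es-${cluster}"
    # -x matches whole lines only, -F treats the name as a literal string.
    if ! printf '%s\n' "$existing" | grep -qxF -- "$ns"; then
      echo "MISSING: $ns"
    fi
  done
}

# Usage (assumes kubectl points at the Karmada API server):
#   kubectl get clusters -o name | sed 's|.*/||' |
#     missing_execution_namespaces "$(kubectl get namespaces -o name | sed 's|^namespace/||')"
```

Passing the namespace list in once avoids one `kubectl get namespace` call per cluster, which may matter when many member clusters are registered.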
@@ -0,0 +1,104 @@
+---
+title: Binding Sync Work Latency
+sidebar_label: Binding Sync Work Latency
+description: Binding Sync Work Latency
+---
+
+# Binding to Work Sync Latency SLO Error Budget Burn Rate Exceeded
+
+## Understanding This Alert
+
+This alert fires when the SLO's error budget is being consumed faster than sustainable. This SLO uses **ticket-level alerts only** — investigate and address during normal business hours. See the [Reliability Engineering Guide](../../administrator/reliability/guide#slo-alerting-framework) for details on burn rates, time windows, and severity thresholds.
+
+## What This Alert Means
+
+The process of converting ResourceBindings and ClusterResourceBindings into Work resources (Stage 4: Work Creation & Override Application) is taking longer than expected. Operations still succeed, but an elevated share of them exceed the configured latency threshold, which delays the overall resource propagation pipeline.
+
+## Impact
+
+- **Slower deployments** - Resources take longer to move from scheduling decisions to propagation tasks
+- **Delayed updates** - Changes to workloads propagate more slowly to member clusters
+- **End-to-end pipeline slowdown** - Downstream steps (actual workload deployment) are delayed
+
+Binding sync latency determines how quickly scheduling decisions translate into deployable Work objects.
+
+## Possible Causes
+
+- High volume of resources being propagated simultaneously
+- API server slowness affecting Work object creation
+- Complex resource templates requiring more processing time
+- Complex override policies with many rules
+- Large resource manifests taking longer to process
+- Controller-manager under resource pressure
+
+## Remediation
+
+**1. Review the Sloth SLO dashboards in Grafana.** These dashboards are your primary tool for understanding the scope and timeline of the issue:
+
+- **[SLO Details Dashboard](https://grafana.com/grafana/dashboards/14348)** — Drill into this specific SLO to see the current burn rate, error budget remaining, monthly burndown chart, and alert state. Use this to confirm the alert and understand when the issue began.
+- **[SLO Overview Dashboard](https://grafana.com/grafana/dashboards/14643)** — Check the fleet-wide view to see if other SLOs are also burning budget, which may indicate a broader systemic issue.
+
+
+**2. Check binding sync duration breakdown:**
+
+```promql
+histogram_quantile(0.95,
+  sum by (le, result) (rate(binding_sync_work_duration_seconds_bucket[5m])))
+```
+
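+**3. Check for complex override policies.** Override policies are evaluated during binding sync, and policies with many rules are a common cause of latency. The counts below are only a rough indicator, because `grep -c` counts matching lines rather than individual rules.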
+
+```bash
+kubectl get clusteroverridepolicies -o yaml | grep -c "overrideRules"
+kubectl get overridepolicies -A -o yaml | grep -c "overrideRules"
+```
+
+**4. Check the binding controller workqueue:**
+
+```promql
+workqueue_depth{name=~"binding-controller|cluster-resource-binding-controller"}
+
+histogram_quantile(0.95,
+  sum by (le) (rate(workqueue_queue_duration_seconds_bucket{name=~"binding-controller|cluster-resource-binding-controller"}[5m])))
+
+histogram_quantile(0.95,
+  sum by (le) (rate(workqueue_work_duration_seconds_bucket{name=~"binding-controller|cluster-resource-binding-controller"}[5m])))
+```
+
+**5. Check binding volume.** A high number of bindings being processed simultaneously increases controller load.
+
+```bash
+kubectl get resourcebindings -A --no-headers | wc -l
+kubectl get clusterresourcebindings --no-headers | wc -l
+```
+
+**6. Check controller-manager resource usage:**
+
+```bash
+kubectl top pod -n karmada-system -l app=karmada-controller-manager
+```
+
+**7. Check controller-manager logs:**
+
+Check the logs using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):
+
+```bash
+kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "Sync work\|failed"
+```
+
+**8. Check for recent changes.** Were override policies added or modified? Did the number of propagated resources increase significantly?
+
+### Mitigation
+
+| Root Cause | Action |
+|------------|--------|
+| Complex override policies | Simplify override rules; reduce number of per-resource overrides |
+| Large manifests | Review and reduce ConfigMap/Secret sizes; split large resources |
+| Controller CPU-constrained | Increase CPU limits |
+| API server slow writes | Address API server/etcd performance |
+| Backlog in queue | Check for and resolve any errors causing retries |
+
+## Related Resources
+
+- [Reliability Engineering Guide](../../administrator/reliability/guide)
+- [Alerting on SLOs - Google SRE Workbook](https://sre.google/workbook/alerting-on-slos/)
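The binding volume check in step 5 of this runbook gives only a total. As a hedged sketch, the function below groups the same `kubectl get resourcebindings -A --no-headers` output by namespace to show where the volume comes from. `bindings_per_namespace` is a hypothetical name, and the sketch assumes the namespace is the first column, as it is in `-A` output.

```bash
# Hypothetical helper (not part of Karmada): read the output of
# `kubectl get resourcebindings -A --no-headers` on stdin and print the
# namespaces holding the most bindings, as "<count> <namespace>".
#   $1 - how many namespaces to show (default 10)
bindings_per_namespace() {
  awk 'NF { count[$1]++ } END { for (ns in count) print count[ns], ns }' \
    | sort -k1,1nr -k2,2 \
    | head -n "${1:-10}"
}

# Usage (assumes kubectl points at the Karmada API server):
#   kubectl get resourcebindings -A --no-headers | bindings_per_namespace 5
```

A namespace that dominates the list is a reasonable place to start when looking for a recent bulk rollout.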
@@ -0,0 +1,105 @@
+---
+title: Cluster Sync Latency
+sidebar_label: Cluster Sync Latency
+description: Cluster Sync Latency
+---
+
+# Cluster Status Sync Latency SLO Error Budget Burn Rate Exceeded
+
+## Understanding This Alert
+
+This alert fires when the SLO's error budget is being consumed faster than sustainable. This SLO uses **ticket-level alerts only** — investigate and address during normal business hours. See the [Reliability Engineering Guide](../../administrator/reliability/guide#slo-alerting-framework) for details on burn rates, time windows, and severity thresholds.
+
+## What This Alert Means
+
+The process of syncing status information from ready member clusters is taking longer than expected. Karmada periodically pulls cluster status (node count, resource capacity, conditions) from each member cluster. This SLO only tracks syncs for clusters that are in a Ready state.
+
+## Impact
+
+- **Stale cluster information** - The scheduler may use outdated capacity data when making placement decisions
+- **Suboptimal scheduling** - Workloads may be placed on clusters that appear to have capacity but don't
+- **Delayed health detection** - Changes in cluster health take longer to be reflected in the control plane
+- **Delayed failover** - Failed clusters continue receiving new workload placements
+
+## Possible Causes
+
+- Member cluster API server slowness
+- Network latency between the Karmada control plane and member clusters (especially cross-region)
+- Large clusters with many nodes requiring more status data to collect
+- Resource pressure on the controller manager
+- High number of registered member clusters
+
+## Remediation
+
+**1. Review the Sloth SLO dashboards in Grafana.** These dashboards are your primary tool for understanding the scope and timeline of the issue:
+
+- **[SLO Details Dashboard](https://grafana.com/grafana/dashboards/14348)** — Drill into this specific SLO to see the current burn rate, error budget remaining, monthly burndown chart, and alert state. Use this to confirm the alert and understand when the issue began.
+- **[SLO Overview Dashboard](https://grafana.com/grafana/dashboards/14643)** — Check the fleet-wide view to see if other SLOs are also burning budget, which may indicate a broader systemic issue.
+
+
+**2. Identify which specific clusters are slow:**
+
+```promql
+histogram_quantile(0.95,
+  sum by (le, member_cluster) (rate(cluster_sync_status_duration_seconds_bucket[5m])))
+  > 1
+```
+
+**3. Check member cluster API server latency.** The cluster sync controller fetches status from member cluster API servers; their latency is the primary driver.
+
+```bash
+kubectl describe cluster <slow-cluster-name>
+```
+
+```bash
+kubectl run conn-test --rm -it --image=curlimages/curl --namespace=karmada-system -- \
+  curl -w "%{time_total}\n" -s -o /dev/null -k https://<cluster-api-server>:6443/healthz
+```
+
+**4. Check network latency to member clusters.** Cross-region member clusters inherently have higher sync latency. If you have recently added cross-region clusters, consider adjusting the threshold.
+
+**5. Check the cluster status controller workqueue:**
+
+```promql
+workqueue_depth{name="cluster-status-controller"}
+
+histogram_quantile(0.95,
+  sum by (le) (rate(workqueue_queue_duration_seconds_bucket{name="cluster-status-controller"}[5m])))
+```
+
+**6. Check controller-manager resource usage:**
+
+```bash
+kubectl top pod -n karmada-system -l app=karmada-controller-manager
+```
+
+**7. Check total cluster count.** The more member clusters are registered, the more work the cluster status controller must do.
+
+```bash
+kubectl get clusters --no-headers | wc -l
+```
+
+**8. Check controller-manager logs:**
+
+Check the logs using your logging solution (e.g., kubectl logs, Loki, Elasticsearch):
+
+```bash
+kubectl logs -n karmada-system -l app=karmada-controller-manager --tail=200 | grep -i "cluster\|failed\|sync"
+```
+
+**9. Check for recent changes.** Were new member clusters added (especially cross-region)? Were there network path changes?
+
+### Mitigation
+
+| Root Cause | Action |
+|------------|--------|
+| Member cluster API server slow | Address member cluster health |
+| High network latency (cross-region) | Adjust the latency threshold to match actual latency; consider cluster-local agents |
+| Controller CPU-constrained | Increase CPU limits for controller-manager |
+| Many clusters causing contention | Consider increasing controller-manager worker thread count |
+| Network partition to specific clusters | Restore network connectivity; investigate network path |
+
+## Related Resources
+
+- [Reliability Engineering Guide](../../administrator/reliability/guide)
+- [Alerting on SLOs - Google SRE Workbook](https://sre.google/workbook/alerting-on-slos/)
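Step 4 of this runbook (network latency to member clusters) has no command of its own. One possible way to compare clusters is to time a `/healthz` request against each API endpoint, reusing the `curl -w "%{time_total}"` approach from step 3. This is a sketch under assumptions: `probe_seconds` and `slow_clusters` are hypothetical helper names, the endpoints are assumed to be reachable from wherever the script runs, and the timings only approximate what the controller-manager itself observes.

```bash
# Time a single request to an API endpoint's /healthz, in seconds.
# -k skips TLS verification, as in the curl example in step 3.
probe_seconds() {
  curl -k -s -o /dev/null -m 10 -w '%{time_total}' "$1/healthz"
}

# Hypothetical helper (not part of Karmada): read "<cluster> <endpoint>" pairs
# on stdin and print the clusters whose probe exceeds the threshold.
#   $1 - threshold in seconds
slow_clusters() {
  local threshold="$1" name endpoint t
  while read -r name endpoint || [ -n "$name" ]; do
    [ -z "$name" ] && continue
    # A failed probe (timeout, refused connection) is reported as unreachable.
    t="$(probe_seconds "$endpoint" </dev/null)" || t=""
    if [ -z "$t" ]; then
      echo "$name unreachable"
    elif awk -v t="$t" -v max="$threshold" 'BEGIN { exit !(t + 0 > max + 0) }' </dev/null; then
      echo "$name ${t}s"
    fi
  done
}

# Usage (assumes kubectl points at the Karmada API server and that each
# Cluster sets spec.apiEndpoint; a cluster without one shows as unreachable):
#   kubectl get clusters -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.apiEndpoint}{"\n"}{end}' |
#     slow_clusters 1
```

Keeping the probe in its own function makes it easy to swap in a different measurement, for example one that runs inside the `karmada-system` namespace.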
@@ -0,0 +1,35 @@
+---
+title: SLO Runbooks
+---
+
+# SLO Runbooks
+
+Detailed troubleshooting guides for Service Level Objective (SLO) violations. Each runbook explains the impact, potential causes, and remediation steps for a specific SLO.
+
+## API Server
+
+- [API Server Availability](karmada-apiserver-availability.md)
+- [API Server Latency](karmada-apiserver-latency.md)
+
+## Policy Application
+
+- [Policy Apply Availability](policy-apply-availability.md)
+- [Policy Apply Latency](policy-apply-latency.md)
+
+## Scheduler
+
+- [Scheduler Availability](karmada-scheduler-availability.md)
+- [Scheduler Latency](karmada-scheduler-latency.md)
+- [Scheduler Estimator Availability](karmada-scheduler-estimator-availability.md)
+- [Scheduler Estimator Latency](karmada-scheduler-estimator-latency.md)
+
+## Resource Propagation
+
+- [Binding Sync Work Availability](binding-sync-work-availability.md)
+- [Binding Sync Work Latency](binding-sync-work-latency.md)
+- [Work Sync Workload Availability](work-sync-workload-availability.md)
+- [Work Sync Workload Latency](work-sync-workload-latency.md)
+
+## Cluster Health
+
+- [Cluster Sync Latency](cluster-sync-latency.md)