Reliability Engineering Guide #995
Code Review
This pull request introduces a comprehensive Reliability Engineering Guide for Karmada, detailing resource propagation stages, Service Level Objectives (SLOs), and a multi-window alerting framework. The guide is integrated into the documentation sidebar. Feedback suggests defining SLOs for the missing stages (Dependency Propagation and Status Aggregation) identified in the flow diagram and adopting human-centric decimal values for latency thresholds to improve operator intuition.
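For readers unfamiliar with multi-window alerting, the idea can be sketched roughly as follows: an alert fires only when the error-budget burn rate exceeds a threshold over both a long window and a short window. This is a minimal illustration, not the guide's implementation. The window pairs and thresholds below are assumptions taken from the commonly cited Google SRE Workbook defaults, and the names `BurnRateRule`, `burn_rate`, and `firing_alerts` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    """One long/short window pair with its burn-rate threshold (assumed values)."""
    severity: str
    long_window: str
    short_window: str
    threshold: float


# Assumed defaults in the style of the SRE Workbook; the PR's actual values may differ.
RULES = (
    BurnRateRule("page", "1h", "5m", 14.4),
    BurnRateRule("page", "6h", "30m", 6.0),
    BurnRateRule("ticket", "1d", "2h", 3.0),
    BurnRateRule("ticket", "3d", "6h", 1.0),
)


def burn_rate(error_ratio: float, objective_percent: float) -> float:
    """Return how many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - objective_percent / 100.0
    if budget <= 0.0:
        raise ValueError("objective must be below 100% to leave an error budget")
    return error_ratio / budget


def firing_alerts(error_ratios: dict[str, float], objective_percent: float):
    """Return (severity, long_window) for every rule whose two windows both exceed the threshold."""
    firing = []
    for rule in RULES:
        long_rate = burn_rate(error_ratios.get(rule.long_window, 0.0), objective_percent)
        short_rate = burn_rate(error_ratios.get(rule.short_window, 0.0), objective_percent)
        # Requiring both windows avoids paging on a spike that has already recovered.
        if long_rate > rule.threshold and short_rate > rule.threshold:
            firing.append((rule.severity, rule.long_window))
    return firing
```

In practice the per-window error ratios would come from the SLI queries in the monitoring system; here they are simply passed in as a dictionary keyed by window name.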
2a0a120 to 66c14b8
ec1cafa to 278dc60
Hey @RainbowMango, @zhzhuang-zju. This is finally ready for review. Please take a look.
Pull request overview
Adds a Reliability Engineering documentation set for Karmada, including an SLO alerting framework explanation, a downloadable Sloth SLO config, and a set of SLO-specific runbooks, then wires the new docs into the Docusaurus sidebar/navigation.
Changes:
- Add a new Reliability Engineering Guide documenting Karmada’s propagation stages, recommended SLOs, and Sloth-based alerting.
- Add a Sloth SLO configuration YAML for Karmada, intended for download and rule generation.
- Add an SLO Runbooks section (index + individual runbooks) and update `sidebars.js` to surface the new content.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| static/sloth/karmada-slo-config.yaml | Adds Sloth SLO definitions + alert metadata for Karmada reliability monitoring. |
| sidebars.js | Adds “Reliability” under Administrator Guide and a new “Runbooks” section with SLO runbooks. |
| docs/administrator/reliability/guide.md | New end-to-end guide covering reliability concepts, SLOs, and Sloth alerting framework. |
| docs/runbooks/SLO/index.md | New index page organizing SLO runbooks by component/stage. |
| docs/runbooks/SLO/karmada-apiserver-availability.md | Runbook for API server availability SLO alerts. |
| docs/runbooks/SLO/karmada-apiserver-latency.md | Runbook for API server latency SLO alerts. |
| docs/runbooks/SLO/policy-apply-availability.md | Runbook for policy-apply availability SLO alerts. |
| docs/runbooks/SLO/policy-apply-latency.md | Runbook for policy-apply latency SLO alerts. |
| docs/runbooks/SLO/karmada-scheduler-availability.md | Runbook for scheduler availability SLO alerts. |
| docs/runbooks/SLO/karmada-scheduler-latency.md | Runbook for scheduler latency SLO alerts. |
| docs/runbooks/SLO/karmada-scheduler-estimator-availability.md | Runbook for scheduler estimator availability SLO alerts. |
| docs/runbooks/SLO/karmada-scheduler-estimator-latency.md | Runbook for scheduler estimator latency SLO alerts. |
| docs/runbooks/SLO/binding-sync-work-availability.md | Runbook for binding→work availability SLO alerts. |
| docs/runbooks/SLO/binding-sync-work-latency.md | Runbook for binding→work latency SLO alerts. |
| docs/runbooks/SLO/work-sync-workload-availability.md | Runbook for work→workload availability SLO alerts. |
| docs/runbooks/SLO/work-sync-workload-latency.md | Runbook for work→workload latency SLO alerts. |
| docs/runbooks/SLO/cluster-sync-latency.md | Runbook for cluster status sync latency SLO alerts. |
/retitle Reliability Engineering Guide typo
Hey @RainbowMango. I just pushed changes to address your comments. Please take another look.
Signed-off-by: Joe Nathan Abellard <contact@jabellard.com>
RainbowMango left a comment
/lgtm
/approve
PS: force-pushed for tidying the commits.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: RainbowMango
The full list of commands accepted by this bot can be found here.
The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer: