Reliability Engineering Guide #995

Merged: karmada-bot merged 1 commit into karmada-io:main from jabellard:reliability on May 11, 2026

@jabellard

Member

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

@karmada-bot karmada-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 14, 2026
@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 14, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comprehensive Reliability Engineering Guide for Karmada, detailing resource propagation stages, Service Level Objectives (SLOs), and a multi-window alerting framework. The guide is integrated into the documentation sidebar. Feedback suggests defining SLOs for the missing stages (Dependency Propagation and Status Aggregation) identified in the flow diagram and adopting human-centric decimal values for latency thresholds to improve operator intuition.
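To make the multi-window alerting idea mentioned in the review summary concrete, here is a minimal sketch of how a multi-window, multi-burn-rate check is commonly evaluated. It is not taken from the guide or from the Sloth configuration in this PR: the window pairs, burn-rate factors, and severities below are assumptions borrowed from the widely used convention for a 30-day SLO period, and the names `Window`, `burn_rate`, and `firing_severity` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Window:
    """One long/short window pair with the burn rate that triggers it."""
    long_label: str
    short_label: str
    factor: float   # burn-rate threshold (assumed value, not from the PR)
    severity: str   # "page" or "ticket" (assumed naming)


# Assumed window pairs for a 30-day SLO period, ordered fastest burn first.
WINDOWS = (
    Window("1h", "5m", 14.4, "page"),
    Window("6h", "30m", 6.0, "page"),
    Window("1d", "2h", 3.0, "ticket"),
    Window("3d", "6h", 1.0, "ticket"),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than 'exactly on budget' errors occur."""
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def firing_severity(error_ratios: Dict[str, float],
                    slo_target: float) -> Optional[str]:
    """Return the severity of the first window pair that fires, else None.

    A pair fires only when BOTH its long and short windows exceed the
    factor; the short window stops alerts soon after an incident ends.
    """
    for w in WINDOWS:
        long_burn = burn_rate(error_ratios[w.long_label], slo_target)
        short_burn = burn_rate(error_ratios[w.short_label], slo_target)
        if long_burn >= w.factor and short_burn >= w.factor:
            return w.severity
    return None
```

With a 99.9% target, a sustained 2% error ratio across all windows burns budget about 20 times faster than allowed and would page under these assumed thresholds, while a 0.4% error ratio would only open a ticket.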

Comment thread docs/administrator/monitoring/reliability.md Outdated
Comment thread docs/administrator/monitoring/reliability.md Outdated
@karmada-bot karmada-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 26, 2026
@jabellard jabellard force-pushed the reliability branch 2 times, most recently from 2a0a120 to 66c14b8 Compare April 26, 2026 18:46
@jabellard jabellard force-pushed the reliability branch 2 times, most recently from ec1cafa to 278dc60 Compare May 4, 2026 20:24
@jabellard jabellard changed the title [DRAFT] Reliability Enginerring Guide Reliability Enginerring Guide May 4, 2026
@jabellard jabellard marked this pull request as ready for review May 5, 2026 21:35
@karmada-bot karmada-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2026
@karmada-bot karmada-bot requested review from Poor12 and samzong May 5, 2026 21:35
@jabellard

Member Author

Hey @RainbowMango, @zhzhuang-zju. This is finally ready for review. Please take a look.

@RainbowMango RainbowMango left a comment

Member

/assign

Copilot AI left a comment

Pull request overview

Adds a Reliability Engineering documentation set for Karmada, including an SLO alerting framework explanation, a downloadable Sloth SLO config, and a set of SLO-specific runbooks, then wires the new docs into the Docusaurus sidebar/navigation.

Changes:

  • Add a new Reliability Engineering Guide documenting Karmada’s propagation stages, recommended SLOs, and Sloth-based alerting.
  • Add a Sloth SLO configuration YAML for Karmada, intended for download and rule generation.
  • Add an SLO Runbooks section (index + individual runbooks) and update sidebars.js to surface the new content.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| static/sloth/karmada-slo-config.yaml | Adds Sloth SLO definitions + alert metadata for Karmada reliability monitoring. |
| sidebars.js | Adds “Reliability” under Administrator Guide and a new “Runbooks” section with SLO runbooks. |
| docs/administrator/reliability/guide.md | New end-to-end guide covering reliability concepts, SLOs, and Sloth alerting framework. |
| docs/runbooks/SLO/index.md | New index page organizing SLO runbooks by component/stage. |
| docs/runbooks/SLO/karmada-apiserver-availability.md | Runbook for API server availability SLO alerts. |
| docs/runbooks/SLO/karmada-apiserver-latency.md | Runbook for API server latency SLO alerts. |
| docs/runbooks/SLO/policy-apply-availability.md | Runbook for policy-apply availability SLO alerts. |
| docs/runbooks/SLO/policy-apply-latency.md | Runbook for policy-apply latency SLO alerts. |
| docs/runbooks/SLO/karmada-scheduler-availability.md | Runbook for scheduler availability SLO alerts. |
| docs/runbooks/SLO/karmada-scheduler-latency.md | Runbook for scheduler latency SLO alerts. |
| docs/runbooks/SLO/karmada-scheduler-estimator-availability.md | Runbook for scheduler estimator availability SLO alerts. |
| docs/runbooks/SLO/karmada-scheduler-estimator-latency.md | Runbook for scheduler estimator latency SLO alerts. |
| docs/runbooks/SLO/binding-sync-work-availability.md | Runbook for binding→work availability SLO alerts. |
| docs/runbooks/SLO/binding-sync-work-latency.md | Runbook for binding→work latency SLO alerts. |
| docs/runbooks/SLO/work-sync-workload-availability.md | Runbook for work→workload availability SLO alerts. |
| docs/runbooks/SLO/work-sync-workload-latency.md | Runbook for work→workload latency SLO alerts. |
| docs/runbooks/SLO/cluster-sync-latency.md | Runbook for cluster status sync latency SLO alerts. |

Comment thread static/sloth/karmada-slo-config.yaml Outdated
Comment thread static/sloth/karmada-slo-config.yaml
Comment thread docs/administrator/reliability/guide.md
Comment thread sidebars.js Outdated
@RainbowMango

Member

/retitle Reliability Engineering Guide

typo

@karmada-bot karmada-bot changed the title Reliability Enginerring Guide Reliability Engineering Guide May 9, 2026
@jabellard

Member Author

Hey @RainbowMango. I just pushed changes to address your comments. Please take another look.

Signed-off-by: Joe Nathan Abellard <contact@jabellard.com>

@RainbowMango RainbowMango left a comment

Member

/lgtm
/approve

PS: force-pushed for tidying the commits.

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label May 11, 2026
@karmada-bot

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 11, 2026
@karmada-bot karmada-bot merged commit 1f9e421 into karmada-io:main May 11, 2026
7 checks passed

Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.

4 participants