
feat: add 6 prometheus metrics #7616

Open
zhangsquared wants to merge 11 commits into karmada-io:master from zhangsquared:prometheus

Conversation

@zhangsquared commented Jun 9, 2026

Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds 6 new Prometheus metrics to ClusterStatusController that instrument the existing health probe code path.

New Metrics

cluster_health_probe_success

Raw health probe result (1/0) before threshold adjustment, enabling avg_over_time(cluster_health_probe_success[30d]).

Behavior

| Probe result | Gauge value |
| --- | --- |
| Online and healthy | 1 |
| Unreachable | 0 |
| Online but unhealthy | 0 |
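
As a rough illustration of the mapping in the table above, the sketch below uses only the Go standard library. The names `probeSuccessValue` and `availability` are hypothetical and are not taken from the PR; `availability` only mimics what `avg_over_time()` would compute over the gauge samples.

```go
package main

import "fmt"

// probeSuccessValue (hypothetical name) maps a raw probe result to the gauge
// value: 1 only when the cluster is both online and healthy.
func probeSuccessValue(online, healthy bool) float64 {
	if online && healthy {
		return 1
	}
	return 0
}

// availability averages 1/0 samples, which is what avg_over_time() does
// for the gauge over a query window.
func availability(samples []float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	var sum float64
	for _, s := range samples {
		sum += s
	}
	return sum / float64(len(samples))
}

func main() {
	samples := []float64{
		probeSuccessValue(true, true),   // online and healthy
		probeSuccessValue(false, false), // unreachable
		probeSuccessValue(true, false),  // online but unhealthy
		probeSuccessValue(true, true),
	}
	fmt.Println(availability(samples)) // 0.5
}
```

Because the gauge holds the raw result rather than the threshold-adjusted condition, a single failed probe lowers the average even if the Ready condition never flips.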

cluster_health_probe_duration_seconds

Histogram isolating the health check HTTP call latency from the full syncClusterStatus() cycle. Custom buckets from 10ms to 10s.

cluster_health_probe_total

Counter of probes categorized by result: success or error. Reveals error rate patterns via rate(cluster_health_probe_total{result="error"}[5m]).
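
A minimal sketch of the result labelling, assuming the label is derived from the same (online, healthy) pair. The constants match the ones shown in pkg/metrics/cluster.go; `probeResultLabel` and the map used as a counter stand-in are hypothetical.

```go
package main

import "fmt"

const (
	// HealthStateSuccess indicates the cluster is online and healthy.
	HealthStateSuccess = "success"
	// HealthStateError indicates the cluster is not healthy or not reachable.
	HealthStateError = "error"
)

// probeResultLabel (hypothetical name) picks the value of the result label.
func probeResultLabel(online, healthy bool) string {
	if online && healthy {
		return HealthStateSuccess
	}
	return HealthStateError
}

func main() {
	// counts stands in for a counter vector keyed by the result label.
	counts := map[string]int{}
	probes := [][2]bool{{true, true}, {true, false}, {false, false}, {true, true}}
	for _, p := range probes {
		counts[probeResultLabel(p[0], p[1])]++
	}
	fmt.Println(counts[HealthStateSuccess], counts[HealthStateError]) // 2 2
}
```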

cluster_health_transitions_total

Counter of health state transitions with from_state and to_state labels (True/False/Unknown).
Use cases:

  • Clusters with more than 3 transitions in 30 minutes: increase(cluster_health_transitions_total[30m]) > 3
  • Clusters that went down in the last 5 minutes: increase(cluster_health_transitions_total{from_state="True",to_state="False"}[5m]) > 0
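
The transition bookkeeping can be sketched with a map standing in for the counter. This assumes, as the review excerpt further down suggests, that the previous raw result is remembered per cluster and that nothing is counted on the first probe or when the state is unchanged. The types `key` and `tracker` are hypothetical.

```go
package main

import "fmt"

// key mirrors the counter's label set: cluster plus from_state and to_state.
type key struct{ cluster, from, to string }

// tracker is a stdlib stand-in for the counter and the remembered
// previous raw probe status of each cluster.
type tracker struct {
	prev   map[string]string
	counts map[key]int
}

func newTracker() *tracker {
	return &tracker{prev: map[string]string{}, counts: map[key]int{}}
}

// observe counts a transition only when a previous status exists and differs.
func (tr *tracker) observe(cluster, status string) {
	if prev, ok := tr.prev[cluster]; ok && prev != status {
		tr.counts[key{cluster, prev, status}]++
	}
	tr.prev[cluster] = status
}

func main() {
	tr := newTracker()
	for _, s := range []string{"True", "True", "False", "True"} {
		tr.observe("member1", s)
	}
	fmt.Println(tr.counts[key{"member1", "True", "False"}]) // 1
	fmt.Println(tr.counts[key{"member1", "False", "True"}]) // 1
}
```

Since no series exists until the first transition, a cluster that never changes state will not appear in query results for this counter at all.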

cluster_condition_last_transition_timestamp_seconds

Unix timestamp of the last condition state transition. Answers "when was the last state change?"

Behavior

| Scenario | prev | current | Action |
| --- | --- | --- | --- |
| Recovery | False | True | Sets gauge to transition timestamp |
| Failure | True | False | Sets gauge to transition timestamp |
| Stable healthy | True | True | No-op, metric unchanged |
| Stable unhealthy | False | False | No-op, metric unchanged |

cluster_ready_since_timestamp_seconds

Unix timestamp when the cluster last became Ready. Enables uptime calculation via time() - metric.

Behavior

| Scenario | prev | current | Action |
| --- | --- | --- | --- |
| Recovery | False | True | Sets gauge to transition timestamp (uptime begins) |
| Failure | True | False | Sets gauge to 0 (resets uptime) |
| Stable healthy | True | True | No-op (preserves existing timestamp) |
| Stable unhealthy | False | False | No-op (no transition to record) |

Tests

Unit tests

  • TestRecordClusterHealthProbeSuccess — online+healthy, unreachable, online+unhealthy
  • TestRecordClusterHealthProbeDuration — histogram records observation
  • TestProbeResultLabel — maps (online, healthy) to success/error
  • TestRecordClusterHealthProbeTotal — success and error counters
  • TestRecordClusterHealthTransition — True→False, False→True, no-op on same status
  • TestRecordClusterConditionLastTransition — transition records timestamp, no-op on same status
  • TestRecordClusterReadySince — transition away from Ready sets 0, no-op on same status

E2E test

Added 5 of 6 metrics to the e2e presence test for karmada-controller-manager. cluster_health_transitions_total is excluded because it only emits data on a state transition, and clusters stay healthy throughout the e2e test.
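
To illustrate why a presence check cannot see the transition counter, the sketch below scans a made-up scrape payload for series names. The sample text and the `member_cluster` label are assumptions for illustration, and `seriesNames` and `missing` are hypothetical helpers rather than the e2e suite's code.

```go
package main

import (
	"fmt"
	"strings"
)

// sample is a fabricated scrape payload. A labelled counter that was never
// incremented has no series, so the transition counter does not appear.
const sample = `# HELP cluster_health_probe_success Raw health probe result.
# TYPE cluster_health_probe_success gauge
cluster_health_probe_success{member_cluster="member1"} 1
cluster_health_probe_total{member_cluster="member1",result="success"} 12
cluster_health_probe_duration_seconds_bucket{member_cluster="member1",le="0.01"} 3
`

// seriesNames collects the metric names of all sample lines, skipping comments.
func seriesNames(exposition string) map[string]bool {
	names := map[string]bool{}
	for _, line := range strings.Split(exposition, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if i := strings.IndexAny(line, "{ "); i > 0 {
			names[line[:i]] = true
		}
	}
	return names
}

// missing returns the wanted metric names that have no series in the payload.
func missing(exposition string, want []string) []string {
	have := seriesNames(exposition)
	var out []string
	for _, n := range want {
		if !have[n] {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	want := []string{
		"cluster_health_probe_success",
		"cluster_health_probe_total",
		"cluster_health_transitions_total",
	}
	fmt.Println(missing(sample, want)) // [cluster_health_transitions_total]
}
```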

Which issue(s) this PR fixes:

Fixes #7553

Special notes for your reviewer:

Added as comment in the code

Does this PR introduce a user-facing change?:

Add 6 new Prometheus metrics for member cluster health probes: 
- cluster_health_probe_success
- cluster_health_probe_duration_seconds
- cluster_health_probe_total
- cluster_health_transitions_total
- cluster_condition_last_transition_timestamp_seconds
- cluster_ready_since_timestamp_seconds

@karmada-bot karmada-bot added kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Jun 9, 2026
@karmada-bot

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign seanlaii for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist


Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances observability in the cluster status management by exposing raw health probe results as a Prometheus metric. By tracking whether a cluster is online and healthy before any threshold adjustments are applied, it provides better visibility into the underlying health checks performed by the controller.

Highlights

  • New Prometheus Metric: Introduced a new gauge metric 'cluster_health_probe_success' to track the raw health probe results of member clusters.
  • Metric Recording: Updated the cluster status controller to record the health probe outcome during the synchronization process.
  • Testing and Validation: Added unit tests for the new metric recording logic and updated E2E metric tests to include the new probe success metric.
Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@karmada-bot karmada-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 9, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new Prometheus gauge metric, cluster_health_probe_success, to record the raw health probe results of member clusters before threshold adjustments. The changes include adding the metric definition, recording logic in the cluster status controller, cleanup handling, unit tests, and E2E test updates. Feedback on the PR highlights an issue where the metric is not recorded if client creation fails and the function returns early, potentially leaving the gauge at a stale value. It is recommended to refactor the status sync function to record this metric within a defer block to ensure it is always updated.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread pkg/controllers/status/cluster_status_controller.go Outdated
@codecov-commenter

codecov-commenter commented Jun 9, 2026



Codecov Report

❌ Patch coverage is 81.35593% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.11%. Comparing base (f605a6e) to head (f1253a6).
⚠️ Report is 25 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/metrics/cluster.go | 77.50% | 7 Missing and 2 partials ⚠️ |
| ...kg/controllers/status/cluster_status_controller.go | 89.47% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7616      +/-   ##
==========================================
- Coverage   42.16%   42.11%   -0.05%     
==========================================
  Files         879      879              
  Lines       54677    54882     +205     
==========================================
+ Hits        23055    23114      +59     
- Misses      29879    30019     +140     
- Partials     1743     1749       +6     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 42.11% <81.35%> (-0.05%) ⬇️ |


@zhangsquared zhangsquared changed the title [WIP] Add more prometheus datapoint [WIP] Add 6 prometheus datapoint Jun 9, 2026
@zhangsquared zhangsquared changed the title [WIP] Add 6 prometheus datapoint [WIP] Add 6 prometheus matrics Jun 9, 2026
@karmada-bot karmada-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 9, 2026
Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>
Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>
Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>
Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>
Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>
Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>
@zhangsquared zhangsquared changed the title [WIP] Add 6 prometheus matrics [WIP] Add 6 prometheus metrics Jun 10, 2026
@RainbowMango

Member

Thanks @zhangsquared for doing this. Please cc me once it is ready for review.

Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>
Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>

func (c *ClusterStatusController) syncClusterStatus(ctx context.Context, cluster *clusterv1alpha1.Cluster) error {
	start := time.Now()
	var online, healthy bool

@zhangsquared zhangsquared Jun 11, 2026

Contributor Author

Raw probe

  • cluster_health_probe_success — raw 1/0 from (online, healthy)
  • cluster_health_probe_duration_seconds — times getClusterHealthStatus() call
  • cluster_health_probe_total — counts by success/error from (online, healthy)
  • cluster_health_transitions_total — counts raw state transitions (from_state/to_state: True/False)

Using threshold adjustment

  • cluster_ready_since_timestamp_seconds — timestamp on Ready transition
  • cluster_condition_last_transition_timestamp_seconds — timestamp on any transition


func TestClusterCollectorsLint(t *testing.T) {
	for _, c := range ClusterCollectors() {
		problems, err := promtestutil.CollectAndLint(c)

Contributor Author

metrics linter... @@

Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>
"cluster_health_probe_duration_seconds_bucket", // health probe latency histogram
"cluster_health_probe_total", // probe count by result
"cluster_ready_since_timestamp_seconds", // uptime timestamp
"cluster_condition_last_transition_timestamp_seconds", // last transition timestamp

Contributor Author

cluster_health_transitions_total is not included in the e2e presence test because it only emits data on a state transition.
In the e2e environment, clusters are already healthy and stay healthy throughout the test, so no transition occurs and the counter is never incremented.

Comment thread pkg/metrics/cluster.go
// HealthStateSuccess indicates the cluster is online and healthy.
HealthStateSuccess = "success"
// HealthStateError indicates the cluster is not healthy or not reachable.
HealthStateError = "error"

@zhangsquared zhangsquared Jun 22, 2026

Contributor Author

These constants are the values of the result label on cluster_health_probe_total.

Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>
…adjusted one

Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>
@zhangsquared zhangsquared marked this pull request as ready for review June 22, 2026 13:40
Copilot AI review requested due to automatic review settings June 22, 2026 13:40
@zhangsquared zhangsquared changed the title [WIP] Add 6 prometheus metrics feat: add 6 prometheus metrics Jun 22, 2026
@karmada-bot karmada-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 22, 2026

Copilot AI left a comment

Contributor

Pull request overview

This PR adds new Prometheus metrics to improve observability of member cluster health probing performed by ClusterStatusController, enabling availability/error-rate analysis and transition/uptime-style dashboards from the existing /readyz probe path.

Changes:

  • Added new cluster health probe metrics (success gauge, duration histogram, probe result counter, transition counter, and transition/uptime timestamp gauges).
  • Instrumented ClusterStatusController.syncClusterStatus() to record probe results, duration, and transitions.
  • Added unit tests for the new metrics and updated the e2e metrics presence test to assert the new series exist.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| test/e2e/suites/base/metrics_test.go | Extends controller-manager metrics presence assertions to include the new health probe metrics. |
| pkg/metrics/cluster.go | Defines new Prometheus collectors and helper recording functions; wires new collectors into ClusterCollectors() and cleanup. |
| pkg/metrics/cluster_test.go | Adds unit tests covering new metric helpers and collector linting. |
| pkg/controllers/status/cluster_status_controller.go | Adds health probe instrumentation (success/total/duration), transition tracking state, and cleanup on cluster deletion. |


Comment on lines +200 to +203
if readyCondition != nil {
	metrics.RecordClusterReadySince(cluster.Name, prevReadyStatus, readyCondition.Status, start)
	metrics.RecordClusterConditionLastTransition(cluster.Name, prevReadyStatus, readyCondition.Status, start)
}
Comment on lines 223 to +227
observedReadyCondition := generateReadyCondition(online, healthy)
readyCondition := c.clusterConditionCache.thresholdAdjustedReadyCondition(cluster, &observedReadyCondition)
if prev, ok := c.prevProbeResults.Load(cluster.Name); ok {
	metrics.RecordClusterHealthTransition(cluster.Name, prev.(metav1.ConditionStatus), observedReadyCondition.Status)
}
c.prevProbeResults.Store(cluster.Name, observedReadyCondition.Status)

Contributor Author

I still want to use clusterConditionCache.

Comment thread pkg/metrics/cluster.go
Comment on lines +150 to +154
// clusterConditionLastTransition records the unix timestamp of the last condition state transition.
clusterConditionLastTransition = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: clusterConditionLastTransitionName,
	Help: "Unix timestamp of the last condition state transition for the member cluster.",
}, []string{memberClusterLabel})
Comment on lines +475 to +479
if err := testutil.CollectAndCompare(clusterHealthProbeDuration, strings.NewReader(want), clusterHealthProbeDurationName); err != nil {
	// Histogram bucket values are non-deterministic, so just verify the metric exists
	// by checking that the error is about values, not about missing metrics
	t.Logf("histogram comparison (expected non-exact match): %s", err)
}
Comment thread pkg/metrics/cluster.go
Comment on lines +245 to +249
func RecordClusterReadySince(clusterName string, prevStatus, currentStatus metav1.ConditionStatus, timestamp time.Time) {
	if prevStatus == currentStatus {
		return
	}
	if currentStatus == metav1.ConditionTrue {
Comment thread pkg/metrics/cluster.go
Comment on lines +259 to +263
	if prevStatus == currentStatus {
		return
	}
	clusterConditionLastTransition.WithLabelValues(clusterName).Set(float64(timestamp.Unix()))
}

Labels

kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Observability]: Emit Prometheus metrics for member cluster health probes

5 participants