feat: add 6 prometheus metrics by zhangsquared · Pull Request #7616 · karmada-io/karmada

zhangsquared · 2026-06-09T22:05:30Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds 6 new Prometheus metrics to ClusterStatusController that instrument the existing health probe code path.

New Metrics

`cluster_health_probe_success`

Raw health probe result (1/0) before threshold adjustment, enabling avg_over_time(cluster_health_probe_success[30d]).

Behavior

| Probe Result | Gauge Value |
| --- | --- |
| Online and healthy | `1` |
| Unreachable | `0` |
| Online but unhealthy | `0` |
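
As a rough illustration of the table above, the mapping from a raw probe result to the gauge value can be sketched in plain Go. The function name `probeGaugeValue` is hypothetical and is not taken from the PR; the real helper sets a Prometheus gauge rather than returning a number.

```go
package main

import "fmt"

// probeGaugeValue is a hypothetical helper that mirrors the behavior table:
// the gauge is 1 only when the cluster is both online and healthy.
func probeGaugeValue(online, healthy bool) float64 {
	if online && healthy {
		return 1
	}
	return 0
}

func main() {
	fmt.Println(probeGaugeValue(true, true))   // online and healthy
	fmt.Println(probeGaugeValue(false, false)) // unreachable
	fmt.Println(probeGaugeValue(true, false))  // online but unhealthy
}
```

Because the value is recorded before threshold adjustment, averaging it over a window gives the fraction of probes that succeeded in that window.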

`cluster_health_probe_duration_seconds`

Histogram isolating the health check HTTP call latency from the full syncClusterStatus() cycle. Custom buckets from 10ms to 10s.
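
A minimal sketch of the idea, assuming illustrative bucket bounds: the PR only states that the buckets span 10ms to 10s, so the exact values in `hypotheticalBuckets` are an assumption, as are the names `timeProbe` and `bucketUpperBound`. The timer wraps only the probe call, not the rest of the sync cycle.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// hypotheticalBuckets are illustrative upper bounds in seconds spanning
// 10ms to 10s. The actual bucket values in the PR may differ.
var hypotheticalBuckets = []float64{0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// bucketUpperBound returns the smallest bucket bound that is >= seconds,
// or +Inf when the observation exceeds every finite bound.
func bucketUpperBound(seconds float64) float64 {
	for _, le := range hypotheticalBuckets {
		if seconds <= le {
			return le
		}
	}
	return math.Inf(1)
}

// timeProbe times only the probe call itself, so the measurement excludes
// any other work done in the surrounding sync cycle.
func timeProbe(probe func() (online, healthy bool)) (online, healthy bool, elapsed time.Duration) {
	start := time.Now()
	online, healthy = probe()
	return online, healthy, time.Since(start)
}

func main() {
	_, _, elapsed := timeProbe(func() (bool, bool) {
		time.Sleep(20 * time.Millisecond) // simulated probe latency
		return true, true
	})
	fmt.Printf("probe took %v, first matching bucket le=%v\n", elapsed, bucketUpperBound(elapsed.Seconds()))
}
```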

`cluster_health_probe_total`

Counter of probes categorized by result: success or error. Reveals error rate patterns via rate(cluster_health_probe_total{result="error"}[5m]).
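
The labeling can be sketched as follows. The `probeCounter` map is a stand-in for a labeled counter and is hypothetical, as are the function and constant names; the `success` and `error` label values are the ones named in this PR.

```go
package main

import "fmt"

const (
	resultSuccess = "success"
	resultError   = "error"
)

// probeResultLabel maps a raw probe result to the result label value.
func probeResultLabel(online, healthy bool) string {
	if online && healthy {
		return resultSuccess
	}
	return resultError
}

// probeCounter is a hypothetical stand-in for a counter keyed by result label.
type probeCounter map[string]int

// record increments the counter for the label matching the probe result.
func (c probeCounter) record(online, healthy bool) {
	c[probeResultLabel(online, healthy)]++
}

// errorRatio returns the share of probes that ended in error.
func (c probeCounter) errorRatio() float64 {
	total := c[resultSuccess] + c[resultError]
	if total == 0 {
		return 0
	}
	return float64(c[resultError]) / float64(total)
}

func main() {
	c := probeCounter{}
	c.record(true, true)
	c.record(true, true)
	c.record(true, false)
	c.record(false, false)
	fmt.Println(c[resultSuccess], c[resultError], c.errorRatio())
}
```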

`cluster_health_transitions_total`

Counter of health state transitions with from_state and to_state labels (True/False/Unknown).
Use case 1:
Clusters with more than 3 transitions in 30 minutes: increase(cluster_health_transitions_total[30m]) > 3
Use case 2:
Clusters that went down in the last 5 minutes: increase(cluster_health_transitions_total{from_state="True",to_state="False"}[5m]) > 0

`cluster_condition_last_transition_timestamp_seconds`

Unix timestamp of the last condition state transition. Answers "when was the last state change?"

Behavior

| Scenario | prev | current | Action |
| --- | --- | --- | --- |
| Recovery | `False` | `True` | Sets gauge to transition timestamp |
| Failure | `True` | `False` | Sets gauge to transition timestamp |
| Stable healthy | `True` | `True` | No-op, metric unchanged |
| Stable unhealthy | `False` | `False` | No-op, metric unchanged |
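
A minimal sketch of this table, using a map as a hypothetical stand-in for the gauge. The names `timestampGauge`, `recordLastTransition`, and `at` are illustrative only.

```go
package main

import (
	"fmt"
	"time"
)

// timestampGauge is a hypothetical stand-in for a gauge keyed by cluster name.
type timestampGauge map[string]float64

// recordLastTransition sets the gauge to the transition time whenever the
// status changes, in either direction, and leaves it untouched otherwise.
func (g timestampGauge) recordLastTransition(cluster, prev, current string, ts time.Time) {
	if prev == current {
		return
	}
	g[cluster] = float64(ts.Unix())
}

// at is a small helper that builds a time from Unix seconds.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	g := timestampGauge{}
	g.recordLastTransition("member1", "False", "True", at(1000)) // recovery
	g.recordLastTransition("member1", "True", "True", at(2000))  // stable, no-op
	fmt.Println(g["member1"])
	g.recordLastTransition("member1", "True", "False", at(3000)) // failure
	fmt.Println(g["member1"])
}
```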

`cluster_ready_since_timestamp_seconds`

Unix timestamp when the cluster last became Ready. Enables uptime calculation via time() - metric.

Behavior

| Scenario | prev | current | Action |
| --- | --- | --- | --- |
| Recovery | `False` | `True` | Sets gauge to transition timestamp; uptime begins |
| Failure | `True` | `False` | Sets gauge to `0`, resetting uptime |
| Stable healthy | `True` | `True` | No-op; preserves existing timestamp |
| Stable unhealthy | `False` | `False` | No-op; no transition to record |
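
The reset-to-zero rule and the uptime calculation can be sketched as below. This is an assumption-laden model, not the PR's code: the map stands in for the gauge, and `uptimeSeconds` is a hypothetical helper showing why a consumer of `time() - metric` probably wants to treat `0` as "not ready" rather than as a real timestamp.

```go
package main

import (
	"fmt"
	"time"
)

// readySinceGauge is a hypothetical stand-in for a gauge keyed by cluster name.
type readySinceGauge map[string]float64

// recordReadySince stores the transition time when the cluster becomes
// Ready and resets the value to 0 when it transitions away from Ready.
func (g readySinceGauge) recordReadySince(cluster, prev, current string, ts time.Time) {
	if prev == current {
		return
	}
	if current == "True" {
		g[cluster] = float64(ts.Unix())
		return
	}
	g[cluster] = 0
}

// uptimeSeconds mirrors time() - metric, treating 0 as "not ready".
func uptimeSeconds(readySince float64, now time.Time) float64 {
	if readySince == 0 {
		return 0
	}
	return float64(now.Unix()) - readySince
}

// at is a small helper that builds a time from Unix seconds.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	g := readySinceGauge{}
	g.recordReadySince("member1", "False", "True", at(1000))
	fmt.Println(uptimeSeconds(g["member1"], at(1600)))
	g.recordReadySince("member1", "True", "False", at(2000))
	fmt.Println(uptimeSeconds(g["member1"], at(2600)))
}
```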

Tests

Unit tests

TestRecordClusterHealthProbeSuccess — online+healthy, unreachable, online+unhealthy
TestRecordClusterHealthProbeDuration — histogram records observation
TestProbeResultLabel — maps (online, healthy) to success/error
TestRecordClusterHealthProbeTotal — success and error counters
TestRecordClusterHealthTransition — True→False, False→True, no-op on same status
TestRecordClusterConditionLastTransition — transition records timestamp, no-op on same status
TestRecordClusterReadySince — transition away from Ready sets 0, no-op on same status

E2E test

Added 5 of 6 metrics to the e2e presence test for karmada-controller-manager. cluster_health_transitions_total is excluded because it only emits data on a state transition, and clusters stay healthy throughout the e2e test.
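
To illustrate why a presence check cannot see the transitions counter, here is a rough sketch of such a check over Prometheus text exposition output. It is not the e2e test itself: `missingMetrics`, the sample text, and the `member_cluster` label name are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// sampleExposition is made-up scrape output. A labeled counter that has
// never been incremented has no series, so it does not appear at all.
const sampleExposition = `# TYPE cluster_health_probe_success gauge
cluster_health_probe_success{member_cluster="member1"} 1
# TYPE cluster_health_probe_total counter
cluster_health_probe_total{member_cluster="member1",result="success"} 12
`

// missingMetrics returns the names in want that have no sample line.
func missingMetrics(exposition string, want []string) []string {
	var missing []string
	lines := strings.Split(exposition, "\n")
	for _, name := range want {
		found := false
		for _, line := range lines {
			if strings.HasPrefix(line, "#") {
				continue // skip HELP and TYPE comments
			}
			if strings.HasPrefix(line, name+"{") || strings.HasPrefix(line, name+" ") {
				found = true
				break
			}
		}
		if !found {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	want := []string{
		"cluster_health_probe_success",
		"cluster_health_probe_total",
		"cluster_health_transitions_total",
	}
	fmt.Println(missingMetrics(sampleExposition, want))
}
```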

Which issue(s) this PR fixes:

Fixes #7553

Special notes for your reviewer:

Added as comments in the code.

Does this PR introduce a user-facing change?:

Add 6 new Prometheus metrics for member cluster health probes: 
- cluster_health_probe_success
- cluster_health_probe_duration_seconds
- cluster_health_probe_total
- cluster_health_transitions_total
- cluster_condition_last_transition_timestamp_seconds
- cluster_ready_since_timestamp_seconds

karmada-bot · 2026-06-09T22:05:34Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign seanlaii for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gemini-code-assist · 2026-06-09T22:05:39Z

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances observability in the cluster status management by exposing raw health probe results as a Prometheus metric. By tracking whether a cluster is online and healthy before any threshold adjustments are applied, it provides better visibility into the underlying health checks performed by the controller.

Highlights

New Prometheus Metric: Introduced a new gauge metric 'cluster_health_probe_success' to track the raw health probe results of member clusters.
Metric Recording: Updated the cluster status controller to record the health probe outcome during the synchronization process.
Testing and Validation: Added unit tests for the new metric recording logic and updated E2E metric tests to include the new probe success metric.

gemini-code-assist

Code Review

This pull request introduces a new Prometheus gauge metric, cluster_health_probe_success, to record the raw health probe results of member clusters before threshold adjustments. The changes include adding the metric definition, recording logic in the cluster status controller, cleanup handling, unit tests, and E2E test updates. Feedback on the PR highlights an issue where the metric is not recorded if client creation fails and the function returns early, potentially leaving the gauge at a stale value. It is recommended to refactor the status sync function to record this metric within a defer block to ensure it is always updated.
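
The recommended pattern can be sketched in plain Go as follows. This is a simplified model under stated assumptions, not the controller code: `syncSketch`, `recordProbe`, the `lastProbeValue` map, and the client and probe helpers are all hypothetical stand-ins.

```go
package main

import (
	"errors"
	"fmt"
)

// lastProbeValue is a hypothetical stand-in for the gauge, keyed by cluster.
var lastProbeValue = map[string]float64{}

func recordProbe(cluster string, online, healthy bool) {
	if online && healthy {
		lastProbeValue[cluster] = 1
		return
	}
	lastProbeValue[cluster] = 0
}

var errNoClient = errors.New("client creation failed")

func okClient() error             { return nil }
func failingClient() error        { return errNoClient }
func healthyProbe() (bool, bool)  { return true, true }

// syncSketch declares the probe result up front and records it in a
// deferred closure, so every return path updates the gauge.
func syncSketch(cluster string, newClient func() error, probe func() (bool, bool)) error {
	var online, healthy bool
	// The closure reads online and healthy when the function returns.
	// Writing `defer recordProbe(cluster, online, healthy)` instead would
	// capture the zero values at the time of the defer statement.
	defer func() { recordProbe(cluster, online, healthy) }()

	if err := newClient(); err != nil {
		return err // early return still records 0
	}
	online, healthy = probe()
	return nil
}

func main() {
	_ = syncSketch("member1", okClient, healthyProbe)
	fmt.Println(lastProbeValue["member1"])
	_ = syncSketch("member1", failingClient, healthyProbe)
	fmt.Println(lastProbeValue["member1"])
}
```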

codecov-commenter · 2026-06-09T22:47:14Z

Codecov Report

❌ Patch coverage is 81.35593% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.11%. Comparing base (f605a6e) to head (f1253a6).
⚠️ Report is 25 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/metrics/cluster.go | 77.50% | 7 Missing and 2 partials ⚠️ |
| ...kg/controllers/status/cluster_status_controller.go | 89.47% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7616      +/-   ##
==========================================
- Coverage   42.16%   42.11%   -0.05%     
==========================================
  Files         879      879              
  Lines       54677    54882     +205     
==========================================
+ Hits        23055    23114      +59     
- Misses      29879    30019     +140     
- Partials     1743     1749       +6

| Flag | Coverage Δ | |
| --- | --- | --- |
| unittests | `42.11% <81.35%> (-0.05%)` | ⬇️ |

RainbowMango · 2026-06-10T03:30:55Z

Thanks @zhangsquared for doing this. Please cc me once it is ready for review.

zhangsquared · 2026-06-11T04:27:52Z


 func (c *ClusterStatusController) syncClusterStatus(ctx context.Context, cluster *clusterv1alpha1.Cluster) error {
 	start := time.Now()
+	var online, healthy bool


Raw probe

cluster_health_probe_success — raw 1/0 from (online, healthy)

cluster_health_probe_duration_seconds — times getClusterHealthStatus() call

cluster_health_probe_total — counts by success/error from (online, healthy)

cluster_health_transitions_total — counts raw state transitions (from_state/to_state: True/False)

Using threshold adjustment

cluster_ready_since_timestamp_seconds — timestamp on Ready transition

cluster_condition_last_transition_timestamp_seconds — timestamp on any transition

zhangsquared · 2026-06-11T04:36:31Z

+
+func TestClusterCollectorsLint(t *testing.T) {
+	for _, c := range ClusterCollectors() {
+		problems, err := promtestutil.CollectAndLint(c)


metrics linter...

zhangsquared · 2026-06-11T04:50:49Z

+			"cluster_health_probe_duration_seconds_bucket",        // health probe latency histogram
+			"cluster_health_probe_total",                          // probe count by result
+			"cluster_ready_since_timestamp_seconds",               // uptime timestamp
+			"cluster_condition_last_transition_timestamp_seconds", // last transition timestamp


cluster_health_transitions_total is not included in the e2e presence test
It only emits data on a state transition.
In the e2e environment, clusters are already healthy and stay healthy throughout the test, so no transition occurs and the counter is never incremented.

zhangsquared · 2026-06-22T02:39:29Z

+	// HealthStateSuccess indicates the cluster is online and healthy.
+	HealthStateSuccess = "success"
+	// HealthStateError indicates the cluster is not healthy or not reachable.
+	HealthStateError = "error"


Result of cluster_health_probe_total

Copilot

Pull request overview

This PR adds new Prometheus metrics to improve observability of member cluster health probing performed by ClusterStatusController, enabling availability/error-rate analysis and transition/uptime-style dashboards from the existing /readyz probe path.

Changes:

Added new cluster health probe metrics (success gauge, duration histogram, probe result counter, transition counter, and transition/uptime timestamp gauges).
Instrumented ClusterStatusController.syncClusterStatus() to record probe results, duration, and transitions.
Added unit tests for the new metrics and updated the e2e metrics presence test to assert the new series exist.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `test/e2e/suites/base/metrics_test.go` | Extends controller-manager metrics presence assertions to include the new health probe metrics. |
| `pkg/metrics/cluster.go` | Defines new Prometheus collectors and helper recording functions; wires new collectors into `ClusterCollectors()` and cleanup. |
| `pkg/metrics/cluster_test.go` | Adds unit tests covering new metric helpers and collector linting. |
| `pkg/controllers/status/cluster_status_controller.go` | Adds health probe instrumentation (success/total/duration), transition tracking state, and cleanup on cluster deletion. |

+		if readyCondition != nil {
+			metrics.RecordClusterReadySince(cluster.Name, prevReadyStatus, readyCondition.Status, start)
+			metrics.RecordClusterConditionLastTransition(cluster.Name, prevReadyStatus, readyCondition.Status, start)
+		}


zhangsquared · 2026-06-23T01:46:24Z

 	observedReadyCondition := generateReadyCondition(online, healthy)
-	readyCondition := c.clusterConditionCache.thresholdAdjustedReadyCondition(cluster, &observedReadyCondition)
+	if prev, ok := c.prevProbeResults.Load(cluster.Name); ok {
+		metrics.RecordClusterHealthTransition(cluster.Name, prev.(metav1.ConditionStatus), observedReadyCondition.Status)
+	}
+	c.prevProbeResults.Store(cluster.Name, observedReadyCondition.Status)


I still want to use clusterConditionCache.

+	// clusterConditionLastTransition records the unix timestamp of the last condition state transition.
+	clusterConditionLastTransition = prometheus.NewGaugeVec(prometheus.GaugeOpts{
+		Name: clusterConditionLastTransitionName,
+		Help: "Unix timestamp of the last condition state transition for the member cluster.",
+	}, []string{memberClusterLabel})


+	if err := testutil.CollectAndCompare(clusterHealthProbeDuration, strings.NewReader(want), clusterHealthProbeDurationName); err != nil {
+		// Histogram bucket values are non-deterministic, so just verify the metric exists
+		// by checking that the error is about values, not about missing metrics
+		t.Logf("histogram comparison (expected non-exact match): %s", err)
+	}


+func RecordClusterReadySince(clusterName string, prevStatus, currentStatus metav1.ConditionStatus, timestamp time.Time) {
+	if prevStatus == currentStatus {
+		return
+	}
+	if currentStatus == metav1.ConditionTrue {


+	if prevStatus == currentStatus {
+		return
+	}
+	clusterConditionLastTransition.WithLabelValues(clusterName).Set(float64(timestamp.Unix()))
+}


karmada-bot added kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Jun 9, 2026

karmada-bot requested review from CharlesQQ and chaunceyjiang June 9, 2026 22:05

karmada-bot requested a review from XiShanYongYe-Chang June 9, 2026 22:05

karmada-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 9, 2026

gemini-code-assist Bot reviewed Jun 9, 2026

Comment thread pkg/controllers/status/cluster_status_controller.go Outdated

zhangsquared changed the title ~~[WIP] Add more prometheus datapoint~~ [WIP] Add 6 prometheus datapoint Jun 9, 2026

zhangsquared changed the title ~~[WIP] Add 6 prometheus datapoint~~ [WIP] Add 6 prometheus matrics Jun 9, 2026

karmada-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 9, 2026

zhangsquared added 3 commits June 9, 2026 18:57

Add cluster_health_probe_success

ff8122b

Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>

Move to defer block

b58e122

Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>

Add cluster_ready_since_timestamp_seconds

30e77d3

Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>

zhangsquared force-pushed the prometheus branch from 7f0361e to 30e77d3 Compare June 9, 2026 22:57

zhangsquared added 3 commits June 9, 2026 20:03

Add cluster_condition_last_transition_timestamp_seconds

9f568e8

Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>

Add cluster_health_probe_duration_seconds

26a0d0d

Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>

cluster_health_probe_total

525e135

Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>

zhangsquared changed the title ~~[WIP] Add 6 prometheus matrics~~ [WIP] Add 6 prometheus metrics Jun 10, 2026

zhangsquared added 2 commits June 10, 2026 23:34

Add lint tool and fix lint error

da0adbb

Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>

Add cluster_health_transitions_total

b106c3e

Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>

zhangsquared commented Jun 11, 2026

Add to e2e test to let CI validate to test

f5b659a

Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>

zhangsquared commented Jun 11, 2026

zhangsquared commented Jun 22, 2026

zhangsquared added 2 commits June 21, 2026 23:22

state: success/error => true/false/unknown

000818b

Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>

cluster_health_transitions_total use raw probe instead of threadhold …

f1253a6

…adjusted one Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>

zhangsquared marked this pull request as ready for review June 22, 2026 13:40

Copilot AI review requested due to automatic review settings June 22, 2026 13:40

karmada-bot requested review from jabellard and mszacillo June 22, 2026 13:40

Copilot started reviewing on behalf of zhangsquared June 22, 2026 13:41 View session

zhangsquared changed the title ~~[WIP] Add 6 prometheus metrics~~ feat: add 6 prometheus metrics Jun 22, 2026

karmada-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 22, 2026

Copilot AI reviewed Jun 22, 2026


feat: add 6 prometheus metrics #7616
zhangsquared wants to merge 11 commits into karmada-io:master from zhangsquared:prometheus

5 participants
