feat: add 6 prometheus metrics#7616
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances observability in the cluster status management by exposing raw health probe results as a Prometheus metric. By tracking whether a cluster is online and healthy before any threshold adjustments are applied, it provides better visibility into the underlying health checks performed by the controller. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a new Prometheus gauge metric, cluster_health_probe_success, to record the raw health probe results of member clusters before threshold adjustments. The changes include adding the metric definition, recording logic in the cluster status controller, cleanup handling, unit tests, and E2E test updates. Feedback on the PR highlights an issue where the metric is not recorded if client creation fails and the function returns early, potentially leaving the gauge at a stale value. It is recommended to refactor the status sync function to record this metric within a defer block to ensure it is always updated.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #7616 +/- ##
==========================================
- Coverage 42.16% 42.11% -0.05%
==========================================
Files 879 879
Lines 54677 54882 +205
==========================================
+ Hits 23055 23114 +59
- Misses 29879 30019 +140
- Partials 1743 1749 +6
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>
Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>
Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>
Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>
Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>
Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>
|
Thanks @zhangsquared for doing this. Please cc me once it is ready for review. |
Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>
Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>
|
|
||
| func (c *ClusterStatusController) syncClusterStatus(ctx context.Context, cluster *clusterv1alpha1.Cluster) error { | ||
| start := time.Now() | ||
| var online, healthy bool |
There was a problem hiding this comment.
Raw probe
cluster_health_probe_success— raw 1/0 from(online, healthy)cluster_health_probe_duration_seconds— timesgetClusterHealthStatus()callcluster_health_probe_total— counts bysuccess/errorfrom(online, healthy)cluster_health_transitions_total— counts raw state transitions (from_state/to_state:True/False)
Using threshold adjustment
cluster_ready_since_timestamp_seconds— timestamp on Ready transitioncluster_condition_last_transition_timestamp_seconds— timestamp on any transition
|
|
||
| func TestClusterCollectorsLint(t *testing.T) { | ||
| for _, c := range ClusterCollectors() { | ||
| problems, err := promtestutil.CollectAndLint(c) |
There was a problem hiding this comment.
metrics linter... @@
Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>
| "cluster_health_probe_duration_seconds_bucket", // health probe latency histogram | ||
| "cluster_health_probe_total", // probe count by result | ||
| "cluster_ready_since_timestamp_seconds", // uptime timestamp | ||
| "cluster_condition_last_transition_timestamp_seconds", // last transition timestamp |
There was a problem hiding this comment.
cluster_health_transitions_total is not included in the e2e presence test
It only emits data on a state transition.
In the e2e environment, clusters are already healthy and stay healthy throughout the test, so no transition occurs and the counter is never incremented.
| // HealthStateSuccess indicates the cluster is online and healthy. | ||
| HealthStateSuccess = "success" | ||
| // HealthStateError indicates the cluster is not healthy or not reachable. | ||
| HealthStateError = "error" |
There was a problem hiding this comment.
Result of cluster_health_probe_total
Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>
…adjusted one Signed-off-by: zhangsquared <hi.zhangzhang@gmail.com>
There was a problem hiding this comment.
Pull request overview
This PR adds new Prometheus metrics to improve observability of member cluster health probing performed by ClusterStatusController, enabling availability/error-rate analysis and transition/uptime-style dashboards from the existing /readyz probe path.
Changes:
- Added new cluster health probe metrics (success gauge, duration histogram, probe result counter, transition counter, and transition/uptime timestamp gauges).
- Instrumented
ClusterStatusController.syncClusterStatus()to record probe results, duration, and transitions. - Added unit tests for the new metrics and updated the e2e metrics presence test to assert the new series exist.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
test/e2e/suites/base/metrics_test.go |
Extends controller-manager metrics presence assertions to include the new health probe metrics. |
pkg/metrics/cluster.go |
Defines new Prometheus collectors and helper recording functions; wires new collectors into ClusterCollectors() and cleanup. |
pkg/metrics/cluster_test.go |
Adds unit tests covering new metric helpers and collector linting. |
pkg/controllers/status/cluster_status_controller.go |
Adds health probe instrumentation (success/total/duration), transition tracking state, and cleanup on cluster deletion. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if readyCondition != nil { | ||
| metrics.RecordClusterReadySince(cluster.Name, prevReadyStatus, readyCondition.Status, start) | ||
| metrics.RecordClusterConditionLastTransition(cluster.Name, prevReadyStatus, readyCondition.Status, start) | ||
| } |
| observedReadyCondition := generateReadyCondition(online, healthy) | ||
| readyCondition := c.clusterConditionCache.thresholdAdjustedReadyCondition(cluster, &observedReadyCondition) | ||
| if prev, ok := c.prevProbeResults.Load(cluster.Name); ok { | ||
| metrics.RecordClusterHealthTransition(cluster.Name, prev.(metav1.ConditionStatus), observedReadyCondition.Status) | ||
| } | ||
| c.prevProbeResults.Store(cluster.Name, observedReadyCondition.Status) |
There was a problem hiding this comment.
i still want to use clusterConditionCache
| // clusterConditionLastTransition records the unix timestamp of the last condition state transition. | ||
| clusterConditionLastTransition = prometheus.NewGaugeVec(prometheus.GaugeOpts{ | ||
| Name: clusterConditionLastTransitionName, | ||
| Help: "Unix timestamp of the last condition state transition for the member cluster.", | ||
| }, []string{memberClusterLabel}) |
| if err := testutil.CollectAndCompare(clusterHealthProbeDuration, strings.NewReader(want), clusterHealthProbeDurationName); err != nil { | ||
| // Histogram bucket values are non-deterministic, so just verify the metric exists | ||
| // by checking that the error is about values, not about missing metrics | ||
| t.Logf("histogram comparison (expected non-exact match): %s", err) | ||
| } |
| func RecordClusterReadySince(clusterName string, prevStatus, currentStatus metav1.ConditionStatus, timestamp time.Time) { | ||
| if prevStatus == currentStatus { | ||
| return | ||
| } | ||
| if currentStatus == metav1.ConditionTrue { |
| if prevStatus == currentStatus { | ||
| return | ||
| } | ||
| clusterConditionLastTransition.WithLabelValues(clusterName).Set(float64(timestamp.Unix())) | ||
| } |
What type of PR is this?
/kind feature
What this PR does / why we need it:
Adds 6 new Prometheus metrics to
ClusterStatusControllerthat instrument the existing health probe code path.New Metrics
cluster_health_probe_successRaw health probe result (1/0) before threshold adjustment, enabling
avg_over_time(cluster_health_probe_success[30d]).Behavior
100cluster_health_probe_duration_secondsHistogram isolating the health check HTTP call latency from the full
syncClusterStatus()cycle. Custom buckets from 10ms to 10s.cluster_health_probe_totalCounter of probes categorized by result:
successorerror. Reveals error rate patterns viarate(cluster_health_probe_total{result="error"}[5m]).cluster_health_transitions_totalCounter of health state transitions with
from_stateandto_statelabels (True/False/Unknown).Use case 1:
Clusters with more than 3 transitions in 30 minutes:
increase(cluster_health_transitions_total[30m]) > 3Use case 2:
Clusters that went down in the last 5 minutes:
increase(cluster_health_transitions_total{from_state="True",to_state="False"}[5m]) > 0cluster_condition_last_transition_timestamp_secondsUnix timestamp of the last condition state transition. Answers "when was the last state change?"
Behavior
FalseTrueTrueFalseTrueTrueFalseFalsecluster_ready_since_timestamp_secondsUnix timestamp when the cluster last became Ready. Enables uptime calculation via
time() - metric.Behavior
FalseTrueTrueFalse0— resets uptimeTrueTrueFalseFalseTests
Unit tests
TestRecordClusterHealthProbeSuccess— online+healthy, unreachable, online+unhealthyTestRecordClusterHealthProbeDuration— histogram records observationTestProbeResultLabel— maps (online, healthy) to success/errorTestRecordClusterHealthProbeTotal— success and error countersTestRecordClusterHealthTransition— True→False, False→True, no-op on same statusTestRecordClusterConditionLastTransition— transition records timestamp, no-op on same statusTestRecordClusterReadySince— transition away from Ready sets 0, no-op on same statusE2E test
Added 5 of 6 metrics to the e2e presence test for karmada-controller-manager.
cluster_health_transitions_totalis excluded because it only emits data on a state transition, and clusters stay healthy throughout the e2e test.Which issue(s) this PR fixes:
Fixes #7553
Special notes for your reviewer:
Added as comment in the code
Does this PR introduce a user-facing change?: