feat: add operator-specific Prometheus metrics by daanvinken · Pull Request #159 · valkey-io/valkey-operator

daanvinken · 2026-05-04T14:37:44Z

Description

Adds custom Prometheus metrics for ValkeyCluster observability. The default controller-runtime metrics (reconcile counts, workqueue depth) don't expose operator-specific information.

New metrics:

valkey_operator_cluster_state_info - gauge with state label (Initializing, Ready, Reconciling, Degraded, Failed), value 1 for current state, 0 for all others
valkey_operator_cluster_shards - total shard count per cluster
valkey_operator_cluster_shards_ready - ready shard count per cluster
valkey_operator_failovers_total - counter for proactive failover events
valkey_operator_slot_migration_batches_total - counter for completed slot migration batches (scale-out and scale-in)

All metrics (gauges and counters) are cleaned up when a ValkeyCluster is deleted to prevent cardinality leaks.

Uses the valkey_cluster label instead of cluster to avoid conflicts in multi-cluster Prometheus setups. Uses the ClusterState constants from the API types to avoid string mismatches.
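The 1/0 state pattern described above can be sketched without any Prometheus dependency. The snippet below is only a stdlib model under stated assumptions: `ClusterState`, `allStates`, and `stateGaugeValues` are hypothetical stand-ins, not the operator's actual types or helpers, and the real code would write these values into a gauge vector rather than a map.

```go
package main

import "fmt"

// ClusterState is a hypothetical stand-in for the operator's API type.
type ClusterState string

// allStates mirrors the five states listed in the PR description.
var allStates = []ClusterState{"Initializing", "Ready", "Reconciling", "Degraded", "Failed"}

// stateGaugeValues returns the value each state series should carry:
// 1 for the current state and 0 for every other known state.
func stateGaugeValues(current ClusterState) map[ClusterState]float64 {
	values := make(map[ClusterState]float64, len(allStates))
	for _, s := range allStates {
		if s == current {
			values[s] = 1
		} else {
			values[s] = 0
		}
	}
	return values
}

func main() {
	values := stateGaugeValues("Ready")
	for _, s := range allStates {
		fmt.Printf("state=%q value=%v\n", s, values[s])
	}
}
```

Writing an explicit 0 for the non-current states, rather than leaving those series absent, means a state transition flips two existing series instead of creating and abandoning label sets.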

Testing

Unit tests pass. Deployed to a local kind cluster with a 3-shard 1-replica cluster and scraped the metrics endpoint:

$ curl -s http://localhost:8443/metrics | grep valkey_operator_cluster

# HELP valkey_operator_cluster_state_info Information about a ValkeyCluster. Value is 1 for the current state, 0 for all others.
# TYPE valkey_operator_cluster_state_info gauge
valkey_operator_cluster_state_info{valkey_cluster="failover-test",namespace="default",state="Degraded"} 0
valkey_operator_cluster_state_info{valkey_cluster="failover-test",namespace="default",state="Failed"} 0
valkey_operator_cluster_state_info{valkey_cluster="failover-test",namespace="default",state="Initializing"} 0
valkey_operator_cluster_state_info{valkey_cluster="failover-test",namespace="default",state="Ready"} 1
valkey_operator_cluster_state_info{valkey_cluster="failover-test",namespace="default",state="Reconciling"} 0
# HELP valkey_operator_cluster_shards Total number of shards in a ValkeyCluster.
# TYPE valkey_operator_cluster_shards gauge
valkey_operator_cluster_shards{valkey_cluster="failover-test",namespace="default"} 3
# HELP valkey_operator_cluster_shards_ready Number of ready shards in a ValkeyCluster.
# TYPE valkey_operator_cluster_shards_ready gauge
valkey_operator_cluster_shards_ready{valkey_cluster="failover-test",namespace="default"} 3
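A scrape like the one above could also be checked programmatically. The following is a rough sketch, assuming a single ValkeyCluster in the output and integer sample values formatted exactly as shown; `currentState` and `stateLine` are hypothetical names, and a real check would more likely use a proper exposition-format parser.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// stateLine matches a state_info sample whose value is exactly 1 and
// captures the value of its state label.
var stateLine = regexp.MustCompile(`^valkey_operator_cluster_state_info\{.*state="([^"]+)".*\} 1$`)

// currentState scans scrape output line by line and returns the first
// state whose series is set to 1.
func currentState(scrape string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(scrape))
	for sc.Scan() {
		if m := stateLine.FindStringSubmatch(strings.TrimSpace(sc.Text())); m != nil {
			return m[1], true
		}
	}
	return "", false
}

func main() {
	scrape := `# TYPE valkey_operator_cluster_state_info gauge
valkey_operator_cluster_state_info{valkey_cluster="failover-test",namespace="default",state="Degraded"} 0
valkey_operator_cluster_state_info{valkey_cluster="failover-test",namespace="default",state="Ready"} 1
`
	state, ok := currentState(scrape)
	fmt.Println(state, ok)
}
```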

Preisschild · 2026-05-04T15:33:09Z

It might make sense to avoid cluster as a label name (and use something like valkey_cluster instead), since the cluster label is often used, and hardcoded in various open-source dashboards, to refer to a Kubernetes cluster. That could lead to conflicts when the valkey-operator runs in a setup with multiple Kubernetes clusters and a centralized Prometheus.

There is a similar issue in CloudNativePG:
cloudnative-pg/cloudnative-pg#2501
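To illustrate the kind of conflict being described: with Prometheus's default `honor_labels: false`, a scraped label that collides with a label already attached to the target is renamed to `exported_<name>`. The sketch below is a simplified stdlib model of that rule, not Prometheus code; `mergeLabels` is a hypothetical helper, and it assumes the `cluster` target label was attached by relabeling.

```go
package main

import "fmt"

// mergeLabels models the default (honor_labels: false) conflict handling:
// target labels win, and a conflicting scraped label is kept under an
// "exported_" prefix instead.
func mergeLabels(target, scraped map[string]string) map[string]string {
	out := make(map[string]string, len(target)+len(scraped))
	for k, v := range target {
		out[k] = v
	}
	for k, v := range scraped {
		if _, conflict := target[k]; conflict {
			out["exported_"+k] = v
		} else {
			out[k] = v
		}
	}
	return out
}

func main() {
	// Assumed target label identifying the Kubernetes cluster.
	target := map[string]string{"cluster": "k8s-prod-1"}

	// A metric that uses "cluster" for the ValkeyCluster name collides.
	fmt.Println(mergeLabels(target, map[string]string{"cluster": "failover-test"}))

	// A metric that uses "valkey_cluster" passes through unchanged.
	fmt.Println(mergeLabels(target, map[string]string{"valkey_cluster": "failover-test"}))
}
```

In the colliding case, dashboards that filter on `cluster` would see the Kubernetes cluster name, and the ValkeyCluster name would only be reachable through `exported_cluster`.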

daanvinken · 2026-05-10T03:09:40Z

Ah yes, good call. We are relabelling for that exact use case as well. Returning from PTO 05/26, will have a look.

Add custom metrics for ValkeyCluster observability:
- valkey_operator_cluster_info: gauge with state label (Ready, Reconciling, Degraded, Failed)
- valkey_operator_cluster_shards: total shard count per cluster
- valkey_operator_cluster_shards_ready: ready shard count per cluster
- valkey_operator_failovers_total: counter for proactive failover events
- valkey_operator_slot_migrations_total: counter for slot migration batches
Signed-off-by: Daan Vinken <daanvinken@tythus.com>

- Rename "cluster" label to "valkey_cluster" to avoid conflicts in multi-cluster Prometheus setups
- Use ClusterState constants from API types instead of hardcoded strings, and include missing Initializing state
Signed-off-by: Daan Vinken <daanvinken@tythus.com>

greptile-apps · 2026-05-28T20:01:02Z

Greptile Summary

This PR introduces five operator-specific Prometheus metrics for ValkeyCluster observability: three gauges tracking cluster state and shard counts, and two counters for failover and slot migration events. Metrics are registered on startup via promauto, updated in updateStatus, and cleaned up (including counters) on cluster deletion via the IsNotFound reconcile path.

internal/controller/metrics.go: new file defining all five metrics, updateClusterMetrics, and deleteClusterMetrics; all use target_namespace as the label key for the Kubernetes namespace, which differs from namespace="default" shown in the PR description's verified scrape output.
valkeycluster_controller.go: wires metric updates into updateStatus (before the Kubernetes status patch is confirmed) and metric cleanup into the IsNotFound early-return path.
api/v1alpha1/valkeycluster_types.go: adds exported ClusterStates slice consumed by both updateClusterMetrics and deleteClusterMetrics.

Confidence Score: 5/5

Safe to merge; all changes are additive metric instrumentation with no impact on reconciliation logic or cluster management behaviour.

The reconciliation logic itself is untouched — all new code paths are pure metric writes that cannot affect cluster state. Gauge cleanup on deletion is correctly placed in the IsNotFound branch, counter and gauge label values match their registration order at every call site, and the promauto registration pattern is idiomatic for controller-runtime.

internal/controller/metrics.go is worth a second look to confirm target_namespace is the intended label name, given the discrepancy with the PR description scrape output.

Important Files Changed

Filename	Overview
internal/controller/metrics.go	New file introducing five Prometheus metrics (3 gauges, 2 counters) with update/delete helpers; the namespace label key is `target_namespace` which contradicts the verified scrape output in the PR description.
api/v1alpha1/valkeycluster_types.go	Adds exported `ClusterStates` slice of all possible cluster states; safe addition but the exported mutable var could be inadvertently modified by consumers of the API package.
internal/controller/valkeycluster_controller.go	Wires `updateClusterMetrics` into `updateStatus` and `deleteClusterMetrics` into the `IsNotFound` path; metrics are updated slightly before the Kubernetes status patch is confirmed.
internal/controller/failover.go	Increments `failoversTotal` counter correctly after a successful proactive failover is confirmed; straightforward and correct.

Sequence Diagram

sequenceDiagram
    participant K8s as Kubernetes API
    participant R as Reconciler
    participant M as metrics.go
    participant P as Prometheus Registry

    K8s->>R: Reconcile(req)
    alt Cluster not found (deleted)
        R->>M: deleteClusterMetrics(name, ns)
        M->>P: clusterStateInfo.DeleteLabelValues(...)
        M->>P: clusterShards.DeleteLabelValues(...)
        M->>P: clusterShardsReady.DeleteLabelValues(...)
        M->>P: failoversTotal.DeletePartialMatch(...)
        M->>P: slotMigrationBatchesTotal.DeleteLabelValues(...)
    else Cluster exists
        R->>K8s: Get cluster
        R->>R: reconcile steps...
        R->>R: updateStatus(cluster)
        R->>M: updateClusterMetrics(current)
        M->>P: clusterStateInfo.Set(1 or 0 per state)
        M->>P: clusterShards.Set(shards)
        M->>P: clusterShardsReady.Set(readyShards)
        R->>K8s: Status().Patch(current)
    end

    note over R,M: failoversTotal.Inc() called in failover.go on successful proactive failover
    note over R,M: slotMigrationBatchesTotal.Inc() called in rebalanceSlots / drainExcessShards
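The deletion branch of the diagram amounts to dropping every series that belongs to the deleted cluster. The snippet below is a stdlib-only model of that idea, not the PR's code: `gaugeVec`, `seriesKey`, and `deleteCluster` are hypothetical stand-ins for a labelled metric and its cleanup helper.

```go
package main

import "fmt"

// seriesKey identifies one label set of the state gauge in this model.
type seriesKey struct {
	cluster, namespace, state string
}

// gaugeVec is a simplified stand-in for a labelled gauge.
type gaugeVec map[seriesKey]float64

// deleteCluster removes every series belonging to the given cluster,
// whatever its state label, and reports how many were dropped.
func (g gaugeVec) deleteCluster(cluster, namespace string) int {
	removed := 0
	for k := range g {
		if k.cluster == cluster && k.namespace == namespace {
			delete(g, k) // deleting during range is safe in Go
			removed++
		}
	}
	return removed
}

func main() {
	g := gaugeVec{}
	for _, s := range []string{"Ready", "Degraded"} {
		g[seriesKey{"failover-test", "default", s}] = 0
		g[seriesKey{"other", "default", s}] = 0
	}
	removed := g.deleteCluster("failover-test", "default")
	fmt.Println("removed:", removed, "remaining:", len(g))
}
```

Without a step like this, label sets for clusters that no longer exist would keep being exported until the operator restarts, which is the cardinality leak the PR description refers to.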


- Add deleteClusterMetrics to remove stale gauge label sets when a ValkeyCluster is deleted, preventing phantom entries in dashboards
- Fix clusterInfo help text to accurately describe the 0/1 state pattern
Signed-off-by: Daan Vinken <daanvinken@tythus.com>

bjosv

Thanks for putting time into this!
I added some comments. The main thing is to include scale-in migrations in slotMigrationsTotal, unless I missed a reason for not including them.

- Rename cluster_info to cluster_state_info for clarity
- Rename slot_migrations_total to slot_migration_batches_total to accurately reflect that it counts batches, not individual slots
- Count scale-in drain batches in addition to scale-out rebalance batches
Signed-off-by: Daan Vinken <daanvinken@tythus.com>

daanvinken · 2026-06-02T14:45:06Z

Thanks @bjosv ! All should be addressed

bjosv

LGTM!

SuperQ · 2026-06-07T10:29:07Z

+	"github.com/prometheus/client_golang/prometheus"
+	"sigs.k8s.io/controller-runtime/pkg/metrics"
+
+	valkeyiov1alpha1 "valkey.io/valkey-operator/api/v1alpha1"
+)
+
+var (
+	clusterInfo = prometheus.NewGaugeVec(
+		prometheus.GaugeOpts{
+			Name: "valkey_operator_cluster_state_info",
+			Help: "Information about a ValkeyCluster. Value is 1 for the current state, 0 for all others.",
+		},
+		[]string{"valkey_cluster", "namespace", "state"},
+	)
+
+	clusterShards = prometheus.NewGaugeVec(


You can eliminate the init() function with promauto.

Suggested change

-	"github.com/prometheus/client_golang/prometheus"
-	"sigs.k8s.io/controller-runtime/pkg/metrics"
-
-	valkeyiov1alpha1 "valkey.io/valkey-operator/api/v1alpha1"
-)
-
-var (
-	clusterInfo = prometheus.NewGaugeVec(
-		prometheus.GaugeOpts{
-			Name: "valkey_operator_cluster_state_info",
-			Help: "Information about a ValkeyCluster. Value is 1 for the current state, 0 for all others.",
-		},
-		[]string{"valkey_cluster", "namespace", "state"},
-	)
-
-	clusterShards = prometheus.NewGaugeVec(
+	"github.com/prometheus/client_golang/prometheus"
+	"github.com/prometheus/client_golang/prometheus/promauto"
+	"sigs.k8s.io/controller-runtime/pkg/metrics"
+
+	valkeyiov1alpha1 "valkey.io/valkey-operator/api/v1alpha1"
+)
+
+var (
+	clusterInfo = promauto.NewGaugeVec(
+		prometheus.GaugeOpts{
+			Name: "valkey_operator_cluster_state_info",
+			Help: "Information about a ValkeyCluster. Value is 1 for the current state, 0 for all others.",
+		},
+		[]string{"valkey_cluster", "namespace", "state"},
+	)
+
+	clusterShards = promauto.NewGaugeVec(

Ah, that's nice

SuperQ · 2026-06-07T10:31:36Z

+}
+
+// clusterStates lists all possible ValkeyCluster states for metric cleanup.
+var clusterStates = []valkeyiov1alpha1.ClusterState{


I wonder if we should define this list in valkey.io/valkey-operator/api/v1alpha1 so we don't have to keep it in sync here.

SuperQ

One thing I recommend is to initialize the label values early in the Reconcile() call.

Something like this:

func (r *ValkeyClusterReconciler) setupClusterMetrics(cluster *valkeyiov1alpha1.ValkeyCluster) {
   failoversTotal.WithLabelValues(cluster.Name, cluster.Namespace)
   slotMigrationBatchesTotal.WithLabelValues(cluster.Name, cluster.Namespace)
}

It's OK to call this function multiple times as it's thread safe and pretty cheap.

I don't actually remember if the controller reconcile object sticks around after each loop, otherwise we could cache the WithLabelValues() on ValkeyClusterReconciler struct with MustCurryWith.
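The effect of initializing label values early can be shown with a small stdlib model. This is only an illustration of the behaviour being relied on (a series exists, at 0, as soon as its label values are requested); `counterVec`, `labelSet`, and `withLabelValues` are hypothetical stand-ins, and unlike the real library this sketch is not thread-safe.

```go
package main

import "fmt"

// labelSet identifies one counter series in this model.
type labelSet struct{ cluster, namespace string }

// counterVec is a simplified stand-in for a labelled counter: a series
// only becomes visible once its label values have been requested.
type counterVec struct {
	series map[labelSet]*float64
}

func newCounterVec() *counterVec {
	return &counterVec{series: map[labelSet]*float64{}}
}

// withLabelValues returns the counter for the label set, creating it
// with value 0 on first use. Repeated calls return the same counter.
func (c *counterVec) withLabelValues(cluster, namespace string) *float64 {
	key := labelSet{cluster, namespace}
	if v, ok := c.series[key]; ok {
		return v
	}
	v := new(float64)
	c.series[key] = v
	return v
}

func main() {
	failovers := newCounterVec()
	fmt.Println("series before init:", len(failovers.series))

	// Touching the label set early makes the series visible at 0.
	failovers.withLabelValues("failover-test", "default")
	fmt.Println("series after init:", len(failovers.series))

	// A later failover increments the already-existing series.
	*failovers.withLabelValues("failover-test", "default")++
	fmt.Println("value:", *failovers.withLabelValues("failover-test", "default"))
}
```

Exposing the counter at 0 before the first event gives queries such as rate() or increase() an earlier sample to compare against, so the first failover is not lost as a series that simply appears at 1.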

SuperQ · 2026-06-09T20:03:51Z

@bjosv If you want to merge this as-is, I can follow up with my suggested improvements.

- Use promauto.With(metrics.Registry) instead of manual init()
- Move ClusterStates list to api/v1alpha1 to keep it in sync with type definitions
- Delete counters on cluster cleanup to prevent cardinality leaks
Signed-off-by: Daan Vinken <daanvinken@tythus.com>

daanvinken · 2026-06-10T09:23:22Z

Thanks folks, should be addressed.

SuperQ · 2026-06-10T10:20:30Z

+			Name: "valkey_operator_cluster_state_info",
+			Help: "Information about a ValkeyCluster. Value is 1 for the current state, 0 for all others.",
+		},
+		[]string{"valkey_cluster", "namespace", "state"},


Similar to what @Preisschild said about cluster, namespace is used for the namespace the controller is running in.

I would probably call this target_namespace or valkey_cluster_namespace.

Yeah good call

Avoids conflict with Prometheus service discovery's built-in namespace label.
Signed-off-by: Daan Vinken <daanvinken@tythus.com>

bjosv reviewed May 8, 2026
daanvinken added 2 commits May 28, 2026 21:51
daanvinken force-pushed the feat/operator-metrics branch from 4024890 to 9f41bf0 May 28, 2026 19:57
greptile-apps Bot reviewed May 28, 2026
daanvinken requested a review from bjosv June 1, 2026 07:23
bjosv reviewed Jun 1, 2026
bjosv previously approved these changes Jun 7, 2026
SuperQ reviewed Jun 7, 2026
daanvinken dismissed bjosv's stale review via 6f8b472 June 10, 2026 09:22
greptile-apps Bot reviewed Jun 10, 2026
SuperQ reviewed Jun 10, 2026
fix: rename namespace label to target_namespace (ed90a03)
