feat: add operator-specific Prometheus metrics #159

Open
daanvinken wants to merge 6 commits into valkey-io:main from daanvinken:feat/operator-metrics

Conversation

@daanvinken daanvinken commented May 4, 2026

Contributor

Description

Adds custom Prometheus metrics for ValkeyCluster observability. The default controller-runtime metrics (reconcile counts, workqueue depth) don't expose operator-specific information.

New metrics:

  • valkey_operator_cluster_state_info - gauge with state label (Initializing, Ready, Reconciling, Degraded, Failed), value 1 for current state, 0 for all others
  • valkey_operator_cluster_shards - total shard count per cluster
  • valkey_operator_cluster_shards_ready - ready shard count per cluster
  • valkey_operator_failovers_total - counter for proactive failover events
  • valkey_operator_slot_migration_batches_total - counter for completed slot migration batches (scale-out and scale-in)

All metrics (gauges and counters) are cleaned up when a ValkeyCluster is deleted to prevent cardinality leaks.

Uses valkey_cluster label instead of cluster to avoid conflicts in multi-cluster Prometheus setups (context). Uses ClusterState constants from API types to avoid string mismatches.
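As a rough illustration of the state gauge described above, the sketch below computes the value each state series would carry: 1 for the current state and 0 for every other state. It uses only the standard library. The ClusterState type, the constant names, and stateSamples are assumptions made for this example and are not taken from the operator's code; in the operator each value would be handed to a gauge's Set call.

```go
package main

import "fmt"

// ClusterState stands in for the state type from the operator's API package.
// The type and constant names here are assumptions for this sketch.
type ClusterState string

const (
	StateInitializing ClusterState = "Initializing"
	StateReady        ClusterState = "Ready"
	StateReconciling  ClusterState = "Reconciling"
	StateDegraded     ClusterState = "Degraded"
	StateFailed       ClusterState = "Failed"
)

// allStates lists every state that gets its own time series.
var allStates = []ClusterState{
	StateInitializing, StateReady, StateReconciling, StateDegraded, StateFailed,
}

// stateSamples returns the value each state series should carry:
// 1 for the current state and 0 for every other state.
func stateSamples(current ClusterState) map[ClusterState]float64 {
	samples := make(map[ClusterState]float64, len(allStates))
	for _, s := range allStates {
		if s == current {
			samples[s] = 1
		} else {
			samples[s] = 0
		}
	}
	return samples
}

func main() {
	samples := stateSamples(StateReady)
	for _, s := range allStates {
		fmt.Printf("state=%q value=%v\n", s, samples[s])
	}
}
```

Writing an explicit 0 for the inactive states, rather than only setting the active one, is what keeps a previous state from lingering at 1 after a transition.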

Testing

Unit tests pass. Deployed to a local kind cluster with a 3-shard 1-replica cluster and scraped the metrics endpoint:

$ curl -s http://localhost:8443/metrics | grep valkey_operator_cluster

# HELP valkey_operator_cluster_state_info Information about a ValkeyCluster. Value is 1 for the current state, 0 for all others.
# TYPE valkey_operator_cluster_state_info gauge
valkey_operator_cluster_state_info{valkey_cluster="failover-test",namespace="default",state="Degraded"} 0
valkey_operator_cluster_state_info{valkey_cluster="failover-test",namespace="default",state="Failed"} 0
valkey_operator_cluster_state_info{valkey_cluster="failover-test",namespace="default",state="Initializing"} 0
valkey_operator_cluster_state_info{valkey_cluster="failover-test",namespace="default",state="Ready"} 1
valkey_operator_cluster_state_info{valkey_cluster="failover-test",namespace="default",state="Reconciling"} 0
# HELP valkey_operator_cluster_shards Total number of shards in a ValkeyCluster.
# TYPE valkey_operator_cluster_shards gauge
valkey_operator_cluster_shards{valkey_cluster="failover-test",namespace="default"} 3
# HELP valkey_operator_cluster_shards_ready Number of ready shards in a ValkeyCluster.
# TYPE valkey_operator_cluster_shards_ready gauge
valkey_operator_cluster_shards_ready{valkey_cluster="failover-test",namespace="default"} 3

Preisschild commented May 4, 2026


It might make sense to avoid using cluster as a label and use something like valkey_cluster instead. The cluster label is often used, and hardcoded in various open source dashboards, to refer to a Kubernetes cluster, which might lead to conflicts when the valkey-operator runs in a setup with multiple Kubernetes clusters and a centralized Prometheus.

There is a similar issue in CloudNativePG:
cloudnative-pg/cloudnative-pg#2501

Comment thread internal/controller/metrics.go Outdated
@daanvinken

Contributor Author

Ah yes good call, we are relabelling for that exact use case as well. Returning from PTO 05/26 - will have a look

Add custom metrics for ValkeyCluster observability:
- valkey_operator_cluster_info: gauge with state label (Ready, Reconciling, Degraded, Failed)
- valkey_operator_cluster_shards: total shard count per cluster
- valkey_operator_cluster_shards_ready: ready shard count per cluster
- valkey_operator_failovers_total: counter for proactive failover events
- valkey_operator_slot_migrations_total: counter for slot migration batches

Signed-off-by: Daan Vinken <daanvinken@tythus.com>
- Rename "cluster" label to "valkey_cluster" to avoid conflicts
  in multi-cluster Prometheus setups
- Use ClusterState constants from API types instead of hardcoded
  strings, and include missing Initializing state

Signed-off-by: Daan Vinken <daanvinken@tythus.com>
@daanvinken daanvinken force-pushed the feat/operator-metrics branch from 4024890 to 9f41bf0 Compare May 28, 2026 19:57
greptile-apps Bot commented May 28, 2026


Greptile Summary

This PR introduces five operator-specific Prometheus metrics for ValkeyCluster observability: three gauges tracking cluster state and shard counts, and two counters for failover and slot migration events. Metrics are registered on startup via promauto, updated in updateStatus, and cleaned up (including counters) on cluster deletion via the IsNotFound reconcile path.

  • internal/controller/metrics.go: new file defining all five metrics, updateClusterMetrics, and deleteClusterMetrics; all use target_namespace as the label key for the Kubernetes namespace, which differs from namespace="default" shown in the PR description's verified scrape output.
  • valkeycluster_controller.go: wires metric updates into updateStatus (before the Kubernetes status patch is confirmed) and metric cleanup into the IsNotFound early-return path.
  • api/v1alpha1/valkeycluster_types.go: adds exported ClusterStates slice consumed by both updateClusterMetrics and deleteClusterMetrics.

Confidence Score: 5/5

Safe to merge; all changes are additive metric instrumentation with no impact on reconciliation logic or cluster management behaviour.

The reconciliation logic itself is untouched — all new code paths are pure metric writes that cannot affect cluster state. Gauge cleanup on deletion is correctly placed in the IsNotFound branch, counter and gauge label values match their registration order at every call site, and the promauto registration pattern is idiomatic for controller-runtime.

internal/controller/metrics.go is worth a second look to confirm target_namespace is the intended label name, given the discrepancy with the PR description scrape output.

Important Files Changed

Filename Overview
internal/controller/metrics.go New file introducing five Prometheus metrics (3 gauges, 2 counters) with update/delete helpers; the namespace label key is target_namespace which contradicts the verified scrape output in the PR description.
api/v1alpha1/valkeycluster_types.go Adds exported ClusterStates slice of all possible cluster states; safe addition but the exported mutable var could be inadvertently modified by consumers of the API package.
internal/controller/valkeycluster_controller.go Wires updateClusterMetrics into updateStatus and deleteClusterMetrics into the IsNotFound path; metrics are updated slightly before the Kubernetes status patch is confirmed.
internal/controller/failover.go Increments failoversTotal counter correctly after a successful proactive failover is confirmed; straightforward and correct.

Sequence Diagram

sequenceDiagram
    participant K8s as Kubernetes API
    participant R as Reconciler
    participant M as metrics.go
    participant P as Prometheus Registry

    K8s->>R: Reconcile(req)
    alt Cluster not found (deleted)
        R->>M: deleteClusterMetrics(name, ns)
        M->>P: clusterStateInfo.DeleteLabelValues(...)
        M->>P: clusterShards.DeleteLabelValues(...)
        M->>P: clusterShardsReady.DeleteLabelValues(...)
        M->>P: failoversTotal.DeletePartialMatch(...)
        M->>P: slotMigrationBatchesTotal.DeleteLabelValues(...)
    else Cluster exists
        R->>K8s: Get cluster
        R->>R: reconcile steps...
        R->>R: updateStatus(cluster)
        R->>M: updateClusterMetrics(current)
        M->>P: clusterStateInfo.Set(1 or 0 per state)
        M->>P: clusterShards.Set(shards)
        M->>P: clusterShardsReady.Set(readyShards)
        R->>K8s: Status().Patch(current)
    end

    note over R,M: failoversTotal.Inc() called in failover.go on successful proactive failover
    note over R,M: slotMigrationBatchesTotal.Inc() called in rebalanceSlots / drainExcessShards

Reviews (5): Last reviewed commit: "fix: rename namespace label to target_na..."

Comment thread internal/controller/metrics.go
Comment thread internal/controller/metrics.go Outdated
Comment thread internal/controller/metrics.go Outdated
- Add deleteClusterMetrics to remove stale gauge label sets when a
  ValkeyCluster is deleted, preventing phantom entries in dashboards
- Fix clusterInfo help text to accurately describe the 0/1 state pattern

Signed-off-by: Daan Vinken <daanvinken@tythus.com>
@daanvinken daanvinken requested a review from bjosv June 1, 2026 07:23

@bjosv bjosv left a comment

Collaborator

Thanks for putting time into this!
I added some comments. The main thing is to include scale-in migrations in slotMigrationsTotal, unless I missed a reason for not including them.

Comment thread internal/controller/metrics.go Outdated
Comment thread internal/controller/valkeycluster_controller.go Outdated
Comment thread internal/controller/metrics.go
Comment thread internal/controller/metrics.go Outdated
- Rename cluster_info to cluster_state_info for clarity
- Rename slot_migrations_total to slot_migration_batches_total to
  accurately reflect that it counts batches, not individual slots
- Count scale-in drain batches in addition to scale-out rebalance batches

Signed-off-by: Daan Vinken <daanvinken@tythus.com>
@daanvinken

Contributor Author

Thanks @bjosv ! All should be addressed

bjosv previously approved these changes Jun 7, 2026

@bjosv bjosv left a comment

Collaborator

LGTM!

Comment thread internal/controller/metrics.go Outdated
Comment on lines +20 to +35
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"

	valkeyiov1alpha1 "valkey.io/valkey-operator/api/v1alpha1"
)

var (
	clusterInfo = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "valkey_operator_cluster_state_info",
			Help: "Information about a ValkeyCluster. Value is 1 for the current state, 0 for all others.",
		},
		[]string{"valkey_cluster", "namespace", "state"},
	)

	clusterShards = prometheus.NewGaugeVec(

Contributor

You can eliminate the init() function with promauto.

Suggested change
-	"github.com/prometheus/client_golang/prometheus"
-	"sigs.k8s.io/controller-runtime/pkg/metrics"
-	valkeyiov1alpha1 "valkey.io/valkey-operator/api/v1alpha1"
-)
-var (
-	clusterInfo = prometheus.NewGaugeVec(
-		prometheus.GaugeOpts{
-			Name: "valkey_operator_cluster_state_info",
-			Help: "Information about a ValkeyCluster. Value is 1 for the current state, 0 for all others.",
-		},
-		[]string{"valkey_cluster", "namespace", "state"},
-	)
-	clusterShards = prometheus.NewGaugeVec(
+	"github.com/prometheus/client_golang/prometheus"
+	"github.com/prometheus/client_golang/prometheus/promauto"
+	"sigs.k8s.io/controller-runtime/pkg/metrics"
+	valkeyiov1alpha1 "valkey.io/valkey-operator/api/v1alpha1"
+)
+var (
+	clusterInfo = promauto.NewGaugeVec(
+		prometheus.GaugeOpts{
+			Name: "valkey_operator_cluster_state_info",
+			Help: "Information about a ValkeyCluster. Value is 1 for the current state, 0 for all others.",
+		},
+		[]string{"valkey_cluster", "namespace", "state"},
+	)
+	clusterShards = promauto.NewGaugeVec(

Collaborator

Ah, that's nice

Comment thread internal/controller/metrics.go Outdated
}

// clusterStates lists all possible ValkeyCluster states for metric cleanup.
var clusterStates = []valkeyiov1alpha1.ClusterState{

Contributor

I wonder if we should define this list in valkey.io/valkey-operator/api/v1alpha1 so we don't have to keep it in sync here.
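If the list does move into the API package, one option is to keep the slice unexported and hand out copies, so that importers cannot change the canonical list by accident. The following is a standard-library sketch under assumptions: the ClusterState type, the constant names, and AllClusterStates are illustrative and are not the operator's actual identifiers.

```go
package main

import "fmt"

// ClusterState stands in for the state type from the operator's API package.
type ClusterState string

const (
	StateInitializing ClusterState = "Initializing"
	StateReady        ClusterState = "Ready"
	StateReconciling  ClusterState = "Reconciling"
	StateDegraded     ClusterState = "Degraded"
	StateFailed       ClusterState = "Failed"
)

// clusterStates is unexported so other packages cannot modify the canonical list.
var clusterStates = []ClusterState{
	StateInitializing, StateReady, StateReconciling, StateDegraded, StateFailed,
}

// AllClusterStates returns a copy of the canonical list, so callers may
// change the result without affecting later callers.
func AllClusterStates() []ClusterState {
	return append([]ClusterState(nil), clusterStates...)
}

func main() {
	states := AllClusterStates()
	states[0] = "Tampered" // only changes the caller's copy
	fmt.Println(AllClusterStates()[0])
}
```

The copy costs one small allocation per call, which is negligible next to the rest of a reconcile.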

@SuperQ SuperQ left a comment

Contributor

One thing I recommend is to initialize the label values early in the Reconcile() call.

Something like this:

func (r *ValkeyClusterReconciler) setupClusterMetrics(cluster *valkeyiov1alpha1.ValkeyCluster) {
   failoversTotal.WithLabelValues(cluster.Name, cluster.Namespace)
   slotMigrationBatchesTotal.WithLabelValues(cluster.Name, cluster.Namespace)
}

It's OK to call this function multiple times as it's thread safe and pretty cheap.

I don't actually remember if the controller reconcile object sticks around after each loop, otherwise we could cache the WithLabelValues() on ValkeyClusterReconciler struct with MustCurryWith.

SuperQ commented Jun 9, 2026

Contributor

@bjosv If you want to merge this as-is, I can follow up with my suggested improvements.

- Use promauto.With(metrics.Registry) instead of manual init()
- Move ClusterStates list to api/v1alpha1 to keep it in sync with type
  definitions
- Delete counters on cluster cleanup to prevent cardinality leaks

Signed-off-by: Daan Vinken <daanvinken@tythus.com>
@daanvinken

Contributor Author

Thanks folks, should be addressed.

Comment thread internal/controller/metrics.go
Comment thread internal/controller/metrics.go Outdated
Name: "valkey_operator_cluster_state_info",
Help: "Information about a ValkeyCluster. Value is 1 for the current state, 0 for all others.",
},
[]string{"valkey_cluster", "namespace", "state"},

@SuperQ SuperQ Jun 10, 2026

Contributor

Similar to what @Preisschild said about cluster, namespace is used for the namespace the controller is running in.

I would probably call this target_namespace or valkey_cluster_namespace.

Contributor Author

Yeah good call

Avoids conflict with Prometheus service discovery's built-in namespace
label.

Signed-off-by: Daan Vinken <daanvinken@tythus.com>
4 participants