feat: add operator-specific Prometheus metrics#159
Conversation
|
It might make sense to avoid using There is a similar issue in CloudNativePG: |
|
Ah yes good call, we are relabelling for that exact use case as well. Returning from PTO 05/26 - will have a look |
Add custom metrics for ValkeyCluster observability: - valkey_operator_cluster_info: gauge with state label (Ready, Reconciling, Degraded, Failed) - valkey_operator_cluster_shards: total shard count per cluster - valkey_operator_cluster_shards_ready: ready shard count per cluster - valkey_operator_failovers_total: counter for proactive failover events - valkey_operator_slot_migrations_total: counter for slot migration batches Signed-off-by: Daan Vinken <daanvinken@tythus.com>
- Rename "cluster" label to "valkey_cluster" to avoid conflicts in multi-cluster Prometheus setups - Use ClusterState constants from API types instead of hardcoded strings, and include missing Initializing state Signed-off-by: Daan Vinken <daanvinken@tythus.com>
4024890 to
9f41bf0
Compare
|
| Filename | Overview |
|---|---|
| internal/controller/metrics.go | New file introducing five Prometheus metrics (3 gauges, 2 counters) with update/delete helpers; the namespace label key is target_namespace which contradicts the verified scrape output in the PR description. |
| api/v1alpha1/valkeycluster_types.go | Adds exported ClusterStates slice of all possible cluster states; safe addition but the exported mutable var could be inadvertently modified by consumers of the API package. |
| internal/controller/valkeycluster_controller.go | Wires updateClusterMetrics into updateStatus and deleteClusterMetrics into the IsNotFound path; metrics are updated slightly before the Kubernetes status patch is confirmed. |
| internal/controller/failover.go | Increments failoversTotal counter correctly after a successful proactive failover is confirmed; straightforward and correct. |
Sequence Diagram
sequenceDiagram
participant K8s as Kubernetes API
participant R as Reconciler
participant M as metrics.go
participant P as Prometheus Registry
K8s->>R: Reconcile(req)
alt Cluster not found (deleted)
R->>M: deleteClusterMetrics(name, ns)
M->>P: clusterStateInfo.DeleteLabelValues(...)
M->>P: clusterShards.DeleteLabelValues(...)
M->>P: clusterShardsReady.DeleteLabelValues(...)
M->>P: failoversTotal.DeletePartialMatch(...)
M->>P: slotMigrationBatchesTotal.DeleteLabelValues(...)
else Cluster exists
R->>K8s: Get cluster
R->>R: reconcile steps...
R->>R: updateStatus(cluster)
R->>M: updateClusterMetrics(current)
M->>P: clusterStateInfo.Set(1 or 0 per state)
M->>P: clusterShards.Set(shards)
M->>P: clusterShardsReady.Set(readyShards)
R->>K8s: Status().Patch(current)
end
note over R,M: failoversTotal.Inc() called in failover.go on successful proactive failover
note over R,M: slotMigrationBatchesTotal.Inc() called in rebalanceSlots / drainExcessShards
Reviews (5): Last reviewed commit: "fix: rename namespace label to target_na..." | Re-trigger Greptile
- Add deleteClusterMetrics to remove stale gauge label sets when a ValkeyCluster is deleted, preventing phantom entries in dashboards - Fix clusterInfo help text to accurately describe the 0/1 state pattern Signed-off-by: Daan Vinken <daanvinken@tythus.com>
- Rename cluster_info to cluster_state_info for clarity - Rename slot_migrations_total to slot_migration_batches_total to accurately reflect that it counts batches, not individual slots - Count scale-in drain batches in addition to scale-out rebalance batches Signed-off-by: Daan Vinken <daanvinken@tythus.com>
|
Thanks @bjosv ! All should be addressed |
| "github.com/prometheus/client_golang/prometheus" | ||
| "sigs.k8s.io/controller-runtime/pkg/metrics" | ||
|
|
||
| valkeyiov1alpha1 "valkey.io/valkey-operator/api/v1alpha1" | ||
| ) | ||
|
|
||
| var ( | ||
| clusterInfo = prometheus.NewGaugeVec( | ||
| prometheus.GaugeOpts{ | ||
| Name: "valkey_operator_cluster_state_info", | ||
| Help: "Information about a ValkeyCluster. Value is 1 for the current state, 0 for all others.", | ||
| }, | ||
| []string{"valkey_cluster", "namespace", "state"}, | ||
| ) | ||
|
|
||
| clusterShards = prometheus.NewGaugeVec( |
There was a problem hiding this comment.
You can eliminate the init() function with promauto.
| "github.com/prometheus/client_golang/prometheus" | |
| "sigs.k8s.io/controller-runtime/pkg/metrics" | |
| valkeyiov1alpha1 "valkey.io/valkey-operator/api/v1alpha1" | |
| ) | |
| var ( | |
| clusterInfo = prometheus.NewGaugeVec( | |
| prometheus.GaugeOpts{ | |
| Name: "valkey_operator_cluster_state_info", | |
| Help: "Information about a ValkeyCluster. Value is 1 for the current state, 0 for all others.", | |
| }, | |
| []string{"valkey_cluster", "namespace", "state"}, | |
| ) | |
| clusterShards = prometheus.NewGaugeVec( | |
| "github.com/prometheus/client_golang/prometheus" | |
| "github.com/prometheus/client_golang/prometheus/promauto" | |
| "sigs.k8s.io/controller-runtime/pkg/metrics" | |
| valkeyiov1alpha1 "valkey.io/valkey-operator/api/v1alpha1" | |
| ) | |
| var ( | |
| clusterInfo = promauto.NewGaugeVec( | |
| prometheus.GaugeOpts{ | |
| Name: "valkey_operator_cluster_state_info", | |
| Help: "Information about a ValkeyCluster. Value is 1 for the current state, 0 for all others.", | |
| }, | |
| []string{"valkey_cluster", "namespace", "state"}, | |
| ) | |
| clusterShards = promauto.NewGaugeVec( |
| } | ||
|
|
||
| // clusterStates lists all possible ValkeyCluster states for metric cleanup. | ||
| var clusterStates = []valkeyiov1alpha1.ClusterState{ |
There was a problem hiding this comment.
I wonder if we should define this list in valkey.io/valkey-operator/api/v1alpha1 so we don't have to keep it in sync here.
SuperQ
left a comment
There was a problem hiding this comment.
One thing I recommend is to init the label values early in Reconcile() call
Something like this:
func (r *ValkeyClusterReconciler) setupClusterMetrics(cluster *valkeyiov1alpha1.ValkeyCluster) {
failoversTotal.WithLabelValues(cluster.Name, cluster.Namespace)
slotMigrationBatchesTotal.WithLabelValues(cluster.Name, cluster.Namespace)
}It's OK to call this function multiple times as it's thread safe and pretty cheap.
I don't actually remember if the controller reconcile object sticks around after each loop, otherwise we could cache the WithLabelValues() on ValkeyClusterReconciler struct with MustCurryWith.
|
@bjosv If you want to merge this as-is, I can followup with my suggested improvements. |
- Use promauto.With(metrics.Registry) instead of manual init() - Move ClusterStates list to api/v1alpha1 to keep it in sync with type definitions - Delete counters on cluster cleanup to prevent cardinality leaks Signed-off-by: Daan Vinken <daanvinken@tythus.com>
|
Thanks folks, should be addressed. |
| Name: "valkey_operator_cluster_state_info", | ||
| Help: "Information about a ValkeyCluster. Value is 1 for the current state, 0 for all others.", | ||
| }, | ||
| []string{"valkey_cluster", "namespace", "state"}, |
There was a problem hiding this comment.
Similar to what @Preisschild said about cluster, namespace is used for the namespace the controller is running in.
I would probably call this target_namespace or valkey_cluster_namespace.
Avoids conflict with Prometheus service discovery's built-in namespace label. Signed-off-by: Daan Vinken <daanvinken@tythus.com>
Description
Adds custom Prometheus metrics for ValkeyCluster observability. The default controller-runtime metrics (reconcile counts, workqueue depth) don't expose operator-specific information.
New metrics:
valkey_operator_cluster_state_info- gauge with state label (Initializing, Ready, Reconciling, Degraded, Failed), value 1 for current state, 0 for all othersvalkey_operator_cluster_shards- total shard count per clustervalkey_operator_cluster_shards_ready- ready shard count per clustervalkey_operator_failovers_total- counter for proactive failover eventsvalkey_operator_slot_migration_batches_total- counter for completed slot migration batches (scale-out and scale-in)Gauge metrics are cleaned up when a ValkeyCluster is deleted. All metrics (gauges and counters) are cleaned up when a ValkeyCluster is deleted to prevent cardinality leaks.
Uses
valkey_clusterlabel instead ofclusterto avoid conflicts in multi-cluster Prometheus setups (context). UsesClusterStateconstants from API types to avoid string mismatches.Testing
Unit tests pass. Deployed to a local kind cluster with a 3-shard 1-replica cluster and scraped the metrics endpoint: