package main

import (
	"database/sql"
	"fmt"
	"log"
	"net/http"
	"os"
	"strconv"
	"time"

	_ "github.com/lib/pq" // PostgreSQL driver
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

const (
	tableName = "data"
	// dbURLEnvVar is the environment variable name for the database connection string.
	dbURLEnvVar = "CONNECTION_STRING"
	// requiredRecentReadings is the number of recent KPI error readings considered for the kanary_up signal.
	requiredRecentReadings = 3
	// delayInSeconds is the amount of time in seconds allowed without a new entry in the database.
	delayInSeconds = 10800
)

var (
	kanaryUpMetric = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "kanary_up",
			// UP if at least one of the recent error readings is <= 0, DOWN if all are > 0.
			Help: fmt.Sprintf("Kanary signal: 1 if at least one of last %d error readings is 0 or less, 0 if all last %d error readings are greater than 0. Only updated when %d valid data points are available.", requiredRecentReadings, requiredRecentReadings, requiredRecentReadings),
		},
		[]string{"instance"},
	)

	kanaryErrorMetric = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "kanary_error",
			Help: "Binary indicator of an error in processing for the Kanary signal (1 if interrupted, 0 otherwise). An error prevents kanary_up from being updated. The 'reason' label provides details on the error type.",
		},
		[]string{"instance", "reason"},
	)

	// targetClusters is the list of member cluster names to monitor.
	targetClusters = []string{
		// Private clusters
		"stone-stage-p01",
		"stone-prod-p01",
		"stone-prod-p02",
		// Public clusters
		"stone-stg-rh01",
		"stone-prd-rh01",
	}

	// --- Database queries
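	// Note: label_values is assumed to be a PostgreSQL jsonb column here;
	// ->> extracts a field as text and ? tests whether a key exists.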

	// datapointsCountQuery counts the entries available in the database for a given cluster.
	datapointsCountQuery = fmt.Sprintf(`
	SELECT COUNT(*)
	FROM %s
	WHERE
		label_values->>'.metadata.env.MEMBER_CLUSTER' LIKE $1
		AND ( label_values->>'.repo_type' = 'nodejs-devfile-sample' OR NOT (label_values ? '.repo_type') );
	`, tableName)

	// delayCheckQuery checks that the latest entry (by start time) for a given cluster ended within the allowed delay.
	delayCheckQuery = fmt.Sprintf(`
	WITH LatestRowByStart AS (
		SELECT
			-- REPLACE normalizes the timestamp to the standard ISO 8601 format (comma to dot) before casting.
			EXTRACT(epoch FROM (REPLACE(label_values->>'.ended', ',', '.'))::timestamptz) AS ended_epoch,
			(EXTRACT(epoch FROM CURRENT_TIMESTAMP) - %d) AS earliest_allowed_ended_epoch
		FROM
			%s
		WHERE
			label_values->>'.metadata.env.MEMBER_CLUSTER' LIKE $1
			AND (label_values->>'.repo_type' = 'nodejs-devfile-sample' OR NOT (label_values ? '.repo_type'))
		ORDER BY
			EXTRACT(epoch FROM start) DESC
		LIMIT 1
	)
	SELECT COUNT(*)
	FROM LatestRowByStart
	WHERE ended_epoch >= earliest_allowed_ended_epoch
	`, delayInSeconds, tableName)

	// dataQuery fetches the most recent KPI error values, filtering by member cluster using LIKE and optionally by repo_type.
	dataQuery = fmt.Sprintf(`
	SELECT
		label_values->>'__results_measurements_KPI_errors' AS kpi_error_value
	FROM
		%s
	WHERE
		label_values->>'.metadata.env.MEMBER_CLUSTER' LIKE $1
		AND ( label_values->>'.repo_type' = 'nodejs-devfile-sample' OR NOT (label_values ? '.repo_type') )
	ORDER BY
		EXTRACT(epoch FROM start) DESC
	LIMIT %d;
	`, tableName, requiredRecentReadings)
)

// getKpiErrorReadings fetches and validates the last 'requiredRecentReadings' KPI error counts for a given cluster.
// It returns a slice of int64 values on success, an internal status string describing the error reason, and an error if any issue occurs.
func getKpiErrorReadings(db *sql.DB, clusterName string) ([]int64, string, error) {
	clusterSubStringPattern := "%" + clusterName + "%"
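	// The surrounding % wildcards make the LIKE filter match any MEMBER_CLUSTER value
	// that contains the cluster name as a substring.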

	// Check that the database has datapoints for this cluster.
	var datapointsCount int
	err := db.QueryRow(datapointsCountQuery, clusterSubStringPattern).Scan(&datapointsCount)
	if err != nil {
		if err == sql.ErrNoRows {
			datapointsCount = 0
		} else {
			return nil, "db_error", fmt.Errorf("database count query failed for cluster %s: %w", clusterName, err)
		}
	}

	if datapointsCount == 0 {
		return nil, "db_error", fmt.Errorf("no datapoints found in the database for cluster %s", clusterName)
	}

	// Check that the latest datapoint for the cluster is not delayed.
	var delayConditionMetCount int
	err = db.QueryRow(delayCheckQuery, clusterSubStringPattern).Scan(&delayConditionMetCount)
	if err != nil {
		return nil, "db_error", fmt.Errorf("delay condition check query failed for cluster %s: %w", clusterName, err)
	}

	if delayConditionMetCount == 0 {
		return nil, "no_test_results", fmt.Errorf("last datapoint for cluster %s is older than %d hours", clusterName, (delayInSeconds/60)/60)
	}

	// Fetch the KPI error readings used to assess the health of the given cluster.
	rows, err := db.Query(dataQuery, clusterSubStringPattern)
	if err != nil {
		return nil, "db_error", fmt.Errorf("database query failed for cluster %s: %w", clusterName, err)
	}
	defer rows.Close()

	var parsedErrorReadings []int64
	var rawValuesForLog []string

	for rows.Next() {
		var kpiErrorValueStr sql.NullString
		if err := rows.Scan(&kpiErrorValueStr); err != nil {
			// An error during row scan is considered a database error.
			return nil, "db_error", fmt.Errorf("failed to scan row for cluster %s: %w", clusterName, err)
		}

		if !kpiErrorValueStr.Valid || kpiErrorValueStr.String == "" {
			// NULL or empty values are data processing issues.
			return nil, "db_error", fmt.Errorf("found NULL or empty kpi_error_value in one of the last %d rows for cluster %s. Raw values so far: %v", requiredRecentReadings, clusterName, rawValuesForLog)
		}
		rawValuesForLog = append(rawValuesForLog, kpiErrorValueStr.String)

		kpiErrorCount, err := strconv.ParseInt(kpiErrorValueStr.String, 10, 64)
		if err != nil {
			// Failure to parse the error count is a data processing issue.
			return nil, "db_error", fmt.Errorf("failed to parse kpi_error_value '%s' as integer for cluster %s: %w", kpiErrorValueStr.String, clusterName, err)
		}
		parsedErrorReadings = append(parsedErrorReadings, kpiErrorCount)
	}

	if err := rows.Err(); err != nil {
		// Errors encountered during iteration (e.g., network issues while streaming results) are database errors.
		return nil, "db_error", fmt.Errorf("error during row iteration for cluster %s: %w", clusterName, err)
	}

	if len(parsedErrorReadings) < requiredRecentReadings {
		// Not enough data points is considered a database/query issue.
		return nil, "db_error", fmt.Errorf("expected %d data points for cluster %s, but query returned %d. Raw values: %v", requiredRecentReadings, clusterName, len(parsedErrorReadings), rawValuesForLog)
	}

	return parsedErrorReadings, "data_ok", nil
}

// fetchAndExportMetrics orchestrates fetching data and updating Prometheus metrics for all target clusters.
func fetchAndExportMetrics(db *sql.DB) {
	for _, clusterName := range targetClusters {
		reasonForError := ""
		kpiErrorReadings, internalStatusMsg, err := getKpiErrorReadings(db, clusterName)

		if err != nil {
			// An error occurred (DB error, parse error, insufficient data, etc.).
			reasonForError = internalStatusMsg
			log.Printf("Error for cluster '%s' (%s): %v. kanary_up will be kept at 1 because kanary_error is set.", clusterName, reasonForError, err)
			kanaryErrorMetric.WithLabelValues(clusterName, reasonForError).Set(1)
			// Keep kanary_up at 1 in case of kanary_error.
			kanaryUpMetric.WithLabelValues(clusterName).Set(1)
		} else {
			// Successfully retrieved and parsed data; clear any previous error indicators.
			kanaryErrorMetric.WithLabelValues(clusterName, "db_error").Set(0)
			kanaryErrorMetric.WithLabelValues(clusterName, "no_test_results").Set(0)

			// Determine signal status: UP if at least one error reading is <= 0, DOWN if all are > 0.
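			// For example, readings of [2, 0, 1] leave the signal UP (one reading is 0),
			// while [3, 1, 2] mark it DOWN (all readings are greater than 0).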
			down := true
			if len(kpiErrorReadings) == 0 {
				// This case should already be caught by getKpiErrorReadings returning an error.
				// If it somehow occurs, leave the signal UP rather than treating all readings as strictly positive.
				down = false
			} else {
				for _, errorCount := range kpiErrorReadings {
					if errorCount <= 0 {
						down = false
						break
					}
				}
			}

			if down {
				// All recent error readings are > 0, so the signal is DOWN.
				kanaryUpMetric.WithLabelValues(clusterName).Set(0)
				log.Printf("KO: Kanary signal for cluster '%s' is DOWN (all last %d error readings > 0): %v", clusterName, requiredRecentReadings, kpiErrorReadings)
			} else {
				// At least one recent error reading is <= 0, so the signal is UP.
				kanaryUpMetric.WithLabelValues(clusterName).Set(1)
				log.Printf("OK: Kanary signal for cluster '%s' is UP: %v.", clusterName, kpiErrorReadings)
			}
		}
	}
}

func main() {
	databaseURL := os.Getenv(dbURLEnvVar)
	if databaseURL == "" {
		log.Fatalf("FATAL: Environment variable %s is not set or is empty. Example: export %s=\"postgres://user:pass@host:port/db?sslmode=disable\"", dbURLEnvVar, dbURLEnvVar)
	}

	db, err := sql.Open("postgres", databaseURL)
	if err != nil {
		log.Fatalf("FATAL: Error opening database handle using DSN from %s: %v", dbURLEnvVar, err)
	}
	defer db.Close()

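	// Note: sql.Open does not necessarily establish a connection;
	// the Ping below verifies that the database is actually reachable.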
	if err = db.Ping(); err != nil {
		log.Fatalf("FATAL: Error pinging database: %v", err)
	}
	log.Println("Successfully connected to the database.")

	// Create a new PedanticRegistry.
	reg := prometheus.NewPedanticRegistry()

	// Register metrics with the new PedanticRegistry.
	reg.MustRegister(kanaryUpMetric)
	reg.MustRegister(kanaryErrorMetric)
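	// Because a dedicated registry is used instead of the default one, /metrics exposes
	// only these kanary metrics (no default Go runtime or process collectors).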

	// Expose the registered metrics via HTTP.
	// Use promhttp.HandlerFor to serve the PedanticRegistry.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))

	go func() {
		log.Println("Prometheus exporter starting on :8000/metrics ...")
		if err := http.ListenAndServe(":8000", nil); err != nil {
			log.Fatalf("FATAL: Error starting Prometheus HTTP server: %v", err)
		}
	}()

	log.Println("Performing initial metrics fetch...")
	fetchAndExportMetrics(db)
	log.Println("Initial metrics fetch complete.")

	// Periodically fetch metrics. The interval could be made configurable.
	scrapeInterval := 300 * time.Second
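	// One possible way to make the interval configurable (hypothetical SCRAPE_INTERVAL_SECONDS variable):
	//
	//	if s := os.Getenv("SCRAPE_INTERVAL_SECONDS"); s != "" {
	//		if secs, err := strconv.Atoi(s); err == nil {
	//			scrapeInterval = time.Duration(secs) * time.Second
	//		}
	//	}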
	log.Printf("Starting periodic metrics fetch every %v.", scrapeInterval)
	ticker := time.NewTicker(scrapeInterval)
	defer ticker.Stop()

	for range ticker.C {
		log.Println("Fetching and exporting metrics...")
		fetchAndExportMetrics(db)
		log.Println("Metrics fetch complete.")
	}
}