From 40aab90470efe02dfcec7d869bf5e633c861e496 Mon Sep 17 00:00:00 2001 From: pokom Date: Fri, 3 Nov 2023 09:39:42 -0400 Subject: [PATCH 1/5] feat(stackdriver_exporter): Add ErrorLogger for promhttp I had recently experienced #103 and #166 in production and it took quite some time to recognize there was a problem with `stackdriver_exporter` because nothing was logged out to indiciate problems gathering metrics. From my perspective, the pod was healthy and online and I could curl `/metrics` to get results. Grafana Agent however was getting errors when scraping, specifically errors like so: ``` [from Gatherer #2] collected metric "stackdriver_gce_instance_compute_googleapis_com_instance_disk_write_bytes_count" { label:{name:"device_name" value:"REDACTED_FOR_SECURITY"} label:{name:"device_type" value:"permanent"} label:{name:"instance_id" value:"2924941021702260446"} label:{name:"instance_name" value:"REDACTED_FOR_SECURITY"} label:{name:"project_id" value:"REDACTED_FOR_SECURITY"} label:{name:"storage_type" value:"pd-ssd"} label:{name:"unit" value:"By"} label:{name:"zone" value:"us-central1-a"} counter:{value:0} timestamp_ms:1698871080000} was collected before with the same name and label values ``` To help identify the root cause I've added the ability to opt into logging out errors that come from the handler. Specifically, I've created the struct `customPromErrorLogger` that implements the `promhttp.http.Logger` interface. There is a new flag: `monitoring.enable-promhttp-custom-logger` which if it is set to true, then we create an instance of `customPromErrorLogger` and use it as the value for ErrorLogger in `promhttp.Handler{}`. Otherwise, `stackdriver_exporter` works as it did before and does not log out errors collectoing metrics. - refs #103, #166 Signed-off-by: pokom --- README.md | 47 +++++++++++++++++++++-------------------- stackdriver_exporter.go | 27 ++++++++++++++++++++++- 2 files changed, 50 insertions(+), 24 deletions(-) diff --git a/README.md b/README.md index 2103dd7..080b780 100644 --- a/README.md +++ b/README.md @@ -76,29 +76,30 @@ If you are still using the legacy [Access scopes][access-scopes], the `https://w ### Flags -| Flag | Required | Default | Description | -| ----------------------------------- | -------- |---------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `google.project-id` | No | GCloud SDK auto-discovery | Comma seperated list of Google Project IDs | -| `google.projects.filter` | No | | GCloud projects filter expression. See more [here](https://cloud.google.com/sdk/gcloud/reference/projects/list). | -| `monitoring.metrics-ingest-delay` | No | | Offsets metric collection by a delay appropriate for each metric type, e.g. because bigquery metrics are slow to appear | -| `monitoring.drop-delegated-projects` | No | No | Drop metrics from attached projects and fetch `project_id` only. | -| `monitoring.metrics-type-prefixes` | Yes | | Comma separated Google Stackdriver Monitoring Metric Type prefixes (see [example][metrics-prefix-example] and [available metrics][metrics-list]) | -| `monitoring.metrics-interval` | No | `5m` | Metric's timestamp interval to request from the Google Stackdriver Monitoring Metrics API. Only the most recent data point is used | -| `monitoring.metrics-offset` | No | `0s` | Offset (into the past) for the metric's timestamp interval to request from the Google Stackdriver Monitoring Metrics API, to handle latency in published metrics | -| `monitoring.filters` | No | | Formatted string to allow filtering on certain metrics type | -| `monitoring.aggregate-deltas` | No | | If enabled will treat all DELTA metrics as an in-memory counter instead of a gauge. Be sure to read [what to know about aggregating DELTA metrics](#what-to-know-about-aggregating-delta-metrics) | -| `monitoring.aggregate-deltas-ttl` | No | `30m` | How long should a delta metric continue to be exported and stored after GCP stops producing it. Read [slow moving metrics](#slow-moving-metrics) to understand the problem this attempts to solve | -| `monitoring.descriptor-cache-ttl` | No | `0s` | How long should the metric descriptors for a prefixed be cached for | -| `stackdriver.max-retries` | No | `0` | Max number of retries that should be attempted on 503 errors from stackdriver. | -| `stackdriver.http-timeout` | No | `10s` | How long should stackdriver_exporter wait for a result from the Stackdriver API. | -| `stackdriver.max-backoff=` | No | | Max time between each request in an exp backoff scenario. | -| `stackdriver.backoff-jitter` | No | `1s` | The amount of jitter to introduce in a exp backoff scenario. | -| `stackdriver.retry-statuses` | No | `503` | The HTTP statuses that should trigger a retry. | -| `web.config.file` | No | | [EXPERIMENTAL] Path to configuration file that can enable TLS or authentication. | -| `web.listen-address` | No | `:9255` | Address to listen on for web interface and telemetry Repeatable for multiple addresses. | -| `web.systemd-socket` | No | | Use systemd socket activation listeners instead of port listeners (Linux only). | -| `web.stackdriver-telemetry-path` | No | `/metrics` | Path under which to expose Stackdriver metrics. | -| `web.telemetry-path` | No | `/metrics` | Path under which to expose Prometheus metrics | +| Flag | Required | Default | Description | +|-------------------------------| -------- |---------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `google.project-id` | No | GCloud SDK auto-discovery | Comma seperated list of Google Project IDs | +| `google.projects.filter` | No | | GCloud projects filter expression. See more [here](https://cloud.google.com/sdk/gcloud/reference/projects/list). | +| `monitoring.metrics-ingest-delay` | No | | Offsets metric collection by a delay appropriate for each metric type, e.g. because bigquery metrics are slow to appear | +| `monitoring.drop-delegated-projects` | No | No | Drop metrics from attached projects and fetch `project_id` only. | +| `monitoring.metrics-type-prefixes` | Yes | | Comma separated Google Stackdriver Monitoring Metric Type prefixes (see [example][metrics-prefix-example] and [available metrics][metrics-list]) | +| `monitoring.metrics-interval` | No | `5m` | Metric's timestamp interval to request from the Google Stackdriver Monitoring Metrics API. Only the most recent data point is used | +| `monitoring.metrics-offset` | No | `0s` | Offset (into the past) for the metric's timestamp interval to request from the Google Stackdriver Monitoring Metrics API, to handle latency in published metrics | +| `monitoring.filters` | No | | Formatted string to allow filtering on certain metrics type | +| `monitoring.aggregate-deltas` | No | | If enabled will treat all DELTA metrics as an in-memory counter instead of a gauge. Be sure to read [what to know about aggregating DELTA metrics](#what-to-know-about-aggregating-delta-metrics) | +| `monitoring.aggregate-deltas-ttl` | No | `30m` | How long should a delta metric continue to be exported and stored after GCP stops producing it. Read [slow moving metrics](#slow-moving-metrics) to understand the problem this attempts to solve | +| `monitoring.descriptor-cache-ttl` | No | `0s` | How long should the metric descriptors for a prefixed be cached for | +| `monitoring.enable-promhttp-custom-logger` | No | False | If enabled will create a custom error logging handler for promhttp | +| `stackdriver.max-retries` | No | `0` | Max number of retries that should be attempted on 503 errors from stackdriver. | +| `stackdriver.http-timeout` | No | `10s` | How long should stackdriver_exporter wait for a result from the Stackdriver API. | +| `stackdriver.max-backoff=` | No | | Max time between each request in an exp backoff scenario. | +| `stackdriver.backoff-jitter` | No | `1s` | The amount of jitter to introduce in a exp backoff scenario. | +| `stackdriver.retry-statuses` | No | `503` | The HTTP statuses that should trigger a retry. | +| `web.config.file` | No | | [EXPERIMENTAL] Path to configuration file that can enable TLS or authentication. | +| `web.listen-address` | No | `:9255` | Address to listen on for web interface and telemetry Repeatable for multiple addresses. | +| `web.systemd-socket` | No | | Use systemd socket activation listeners instead of port listeners (Linux only). | +| `web.stackdriver-telemetry-path` | No | `/metrics` | Path under which to expose Stackdriver metrics. | +| `web.telemetry-path` | No | `/metrics` | Path under which to expose Prometheus metrics | ### TLS and basic authentication diff --git a/stackdriver_exporter.go b/stackdriver_exporter.go index 65fe631..b7c3792 100644 --- a/stackdriver_exporter.go +++ b/stackdriver_exporter.go @@ -125,6 +125,10 @@ var ( monitoringDescriptorCacheOnlyGoogle = kingpin.Flag( "monitoring.descriptor-cache-only-google", "Only cache descriptors for *.googleapis.com metrics", ).Default("true").Bool() + + monitoringEnablePromHttpCustomLogger = kingpin.Flag( + "monitoring.enable-promhttp-custom-logger", "Enable custom logger for promhttp", + ).Default("false").Bool() ) func init() { @@ -236,7 +240,14 @@ func (h *handler) innerHandler(filters map[string]bool) http.Handler { } // Delegate http serving to Prometheus client library, which will call collector.Collect. - return promhttp.HandlerFor(gatherers, promhttp.HandlerOpts{}) + opts := promhttp.HandlerOpts{} + if *monitoringEnablePromHttpCustomLogger { + h.logger.Log("msg", "Enabling custom logger for promhttp") + opts = promhttp.HandlerOpts{ + ErrorLog: NewPromHttpCustomLogger(h.logger), + } + } + return promhttp.HandlerFor(gatherers, opts) } // filterMetricTypePrefixes filters the initial list of metric type prefixes, with the ones coming from an individual @@ -365,3 +376,17 @@ func parseMetricExtraFilters() []collectors.MetricFilter { } return extraFilters } + +type customPromErrorLogger struct { + logger log.Logger +} + +func (l *customPromErrorLogger) Println(v ...interface{}) { + level.Error(l.logger).Log("msg", fmt.Sprint(v...)) +} + +func NewPromHttpCustomLogger(logger log.Logger) *customPromErrorLogger { + return &customPromErrorLogger{ + logger: logger, + } +} From fe9a710736713b90f8e47b072ad05e5670cb2960 Mon Sep 17 00:00:00 2001 From: pokom Date: Mon, 18 Mar 2024 09:47:00 -0400 Subject: [PATCH 2/5] Use level.Info(...) instead of directly calling logger Signed-off-by: pokom --- stackdriver_exporter.go | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/stackdriver_exporter.go b/stackdriver_exporter.go index b7c3792..c0f22f6 100644 --- a/stackdriver_exporter.go +++ b/stackdriver_exporter.go @@ -242,7 +242,7 @@ func (h *handler) innerHandler(filters map[string]bool) http.Handler { // Delegate http serving to Prometheus client library, which will call collector.Collect. opts := promhttp.HandlerOpts{} if *monitoringEnablePromHttpCustomLogger { - h.logger.Log("msg", "Enabling custom logger for promhttp") + level.Info(h.logger).Log("msg", "Enabling custom logger for promhttp") opts = promhttp.HandlerOpts{ ErrorLog: NewPromHttpCustomLogger(h.logger), } From 2bea4c50b33c0fe371ea3e33c85d12c2e58475fb Mon Sep 17 00:00:00 2001 From: pokom Date: Mon, 3 Jun 2024 13:15:33 -0400 Subject: [PATCH 3/5] Simplify options to use stdlog interface Signed-off-by: pokom --- stackdriver_exporter.go | 28 ++-------------------------- 1 file changed, 2 insertions(+), 26 deletions(-) diff --git a/stackdriver_exporter.go b/stackdriver_exporter.go index c0f22f6..7f1aa09 100644 --- a/stackdriver_exporter.go +++ b/stackdriver_exporter.go @@ -15,6 +15,7 @@ package main import ( "fmt" + stdlog "log" "net/http" "os" "strings" @@ -125,10 +126,6 @@ var ( monitoringDescriptorCacheOnlyGoogle = kingpin.Flag( "monitoring.descriptor-cache-only-google", "Only cache descriptors for *.googleapis.com metrics", ).Default("true").Bool() - - monitoringEnablePromHttpCustomLogger = kingpin.Flag( - "monitoring.enable-promhttp-custom-logger", "Enable custom logger for promhttp", - ).Default("false").Bool() ) func init() { @@ -238,15 +235,8 @@ func (h *handler) innerHandler(filters map[string]bool) http.Handler { registry, } } - + opts := promhttp.HandlerOpts{ErrorLog: stdlog.New(log.NewStdlibAdapter(level.Error(h.logger)), "", 0)} // Delegate http serving to Prometheus client library, which will call collector.Collect. - opts := promhttp.HandlerOpts{} - if *monitoringEnablePromHttpCustomLogger { - level.Info(h.logger).Log("msg", "Enabling custom logger for promhttp") - opts = promhttp.HandlerOpts{ - ErrorLog: NewPromHttpCustomLogger(h.logger), - } - } return promhttp.HandlerFor(gatherers, opts) } @@ -376,17 +366,3 @@ func parseMetricExtraFilters() []collectors.MetricFilter { } return extraFilters } - -type customPromErrorLogger struct { - logger log.Logger -} - -func (l *customPromErrorLogger) Println(v ...interface{}) { - level.Error(l.logger).Log("msg", fmt.Sprint(v...)) -} - -func NewPromHttpCustomLogger(logger log.Logger) *customPromErrorLogger { - return &customPromErrorLogger{ - logger: logger, - } -} From 0e4f42d01ab976a5bebf6243b618babc685ae8f6 Mon Sep 17 00:00:00 2001 From: pokom Date: Mon, 3 Jun 2024 13:17:42 -0400 Subject: [PATCH 4/5] Remove unused flag from readme Signed-off-by: pokom --- README.md | 1 - 1 file changed, 1 deletion(-) diff --git a/README.md b/README.md index 080b780..ba9403b 100644 --- a/README.md +++ b/README.md @@ -89,7 +89,6 @@ If you are still using the legacy [Access scopes][access-scopes], the `https://w | `monitoring.aggregate-deltas` | No | | If enabled will treat all DELTA metrics as an in-memory counter instead of a gauge. Be sure to read [what to know about aggregating DELTA metrics](#what-to-know-about-aggregating-delta-metrics) | | `monitoring.aggregate-deltas-ttl` | No | `30m` | How long should a delta metric continue to be exported and stored after GCP stops producing it. Read [slow moving metrics](#slow-moving-metrics) to understand the problem this attempts to solve | | `monitoring.descriptor-cache-ttl` | No | `0s` | How long should the metric descriptors for a prefixed be cached for | -| `monitoring.enable-promhttp-custom-logger` | No | False | If enabled will create a custom error logging handler for promhttp | | `stackdriver.max-retries` | No | `0` | Max number of retries that should be attempted on 503 errors from stackdriver. | | `stackdriver.http-timeout` | No | `10s` | How long should stackdriver_exporter wait for a result from the Stackdriver API. | | `stackdriver.max-backoff=` | No | | Max time between each request in an exp backoff scenario. | From 9bb10a7861303adcb04b68f817e6af4e1323dcd5 Mon Sep 17 00:00:00 2001 From: pokom Date: Mon, 3 Jun 2024 13:23:52 -0400 Subject: [PATCH 5/5] Undo formatting changes to README table Signed-off-by: pokom --- README.md | 46 +++++++++++++++++++++++----------------------- 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/README.md b/README.md index ba9403b..2103dd7 100644 --- a/README.md +++ b/README.md @@ -76,29 +76,29 @@ If you are still using the legacy [Access scopes][access-scopes], the `https://w ### Flags -| Flag | Required | Default | Description | -|-------------------------------| -------- |---------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `google.project-id` | No | GCloud SDK auto-discovery | Comma seperated list of Google Project IDs | -| `google.projects.filter` | No | | GCloud projects filter expression. See more [here](https://cloud.google.com/sdk/gcloud/reference/projects/list). | -| `monitoring.metrics-ingest-delay` | No | | Offsets metric collection by a delay appropriate for each metric type, e.g. because bigquery metrics are slow to appear | -| `monitoring.drop-delegated-projects` | No | No | Drop metrics from attached projects and fetch `project_id` only. | -| `monitoring.metrics-type-prefixes` | Yes | | Comma separated Google Stackdriver Monitoring Metric Type prefixes (see [example][metrics-prefix-example] and [available metrics][metrics-list]) | -| `monitoring.metrics-interval` | No | `5m` | Metric's timestamp interval to request from the Google Stackdriver Monitoring Metrics API. Only the most recent data point is used | -| `monitoring.metrics-offset` | No | `0s` | Offset (into the past) for the metric's timestamp interval to request from the Google Stackdriver Monitoring Metrics API, to handle latency in published metrics | -| `monitoring.filters` | No | | Formatted string to allow filtering on certain metrics type | -| `monitoring.aggregate-deltas` | No | | If enabled will treat all DELTA metrics as an in-memory counter instead of a gauge. Be sure to read [what to know about aggregating DELTA metrics](#what-to-know-about-aggregating-delta-metrics) | -| `monitoring.aggregate-deltas-ttl` | No | `30m` | How long should a delta metric continue to be exported and stored after GCP stops producing it. Read [slow moving metrics](#slow-moving-metrics) to understand the problem this attempts to solve | -| `monitoring.descriptor-cache-ttl` | No | `0s` | How long should the metric descriptors for a prefixed be cached for | -| `stackdriver.max-retries` | No | `0` | Max number of retries that should be attempted on 503 errors from stackdriver. | -| `stackdriver.http-timeout` | No | `10s` | How long should stackdriver_exporter wait for a result from the Stackdriver API. | -| `stackdriver.max-backoff=` | No | | Max time between each request in an exp backoff scenario. | -| `stackdriver.backoff-jitter` | No | `1s` | The amount of jitter to introduce in a exp backoff scenario. | -| `stackdriver.retry-statuses` | No | `503` | The HTTP statuses that should trigger a retry. | -| `web.config.file` | No | | [EXPERIMENTAL] Path to configuration file that can enable TLS or authentication. | -| `web.listen-address` | No | `:9255` | Address to listen on for web interface and telemetry Repeatable for multiple addresses. | -| `web.systemd-socket` | No | | Use systemd socket activation listeners instead of port listeners (Linux only). | -| `web.stackdriver-telemetry-path` | No | `/metrics` | Path under which to expose Stackdriver metrics. | -| `web.telemetry-path` | No | `/metrics` | Path under which to expose Prometheus metrics | +| Flag | Required | Default | Description | +| ----------------------------------- | -------- |---------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `google.project-id` | No | GCloud SDK auto-discovery | Comma seperated list of Google Project IDs | +| `google.projects.filter` | No | | GCloud projects filter expression. See more [here](https://cloud.google.com/sdk/gcloud/reference/projects/list). | +| `monitoring.metrics-ingest-delay` | No | | Offsets metric collection by a delay appropriate for each metric type, e.g. because bigquery metrics are slow to appear | +| `monitoring.drop-delegated-projects` | No | No | Drop metrics from attached projects and fetch `project_id` only. | +| `monitoring.metrics-type-prefixes` | Yes | | Comma separated Google Stackdriver Monitoring Metric Type prefixes (see [example][metrics-prefix-example] and [available metrics][metrics-list]) | +| `monitoring.metrics-interval` | No | `5m` | Metric's timestamp interval to request from the Google Stackdriver Monitoring Metrics API. Only the most recent data point is used | +| `monitoring.metrics-offset` | No | `0s` | Offset (into the past) for the metric's timestamp interval to request from the Google Stackdriver Monitoring Metrics API, to handle latency in published metrics | +| `monitoring.filters` | No | | Formatted string to allow filtering on certain metrics type | +| `monitoring.aggregate-deltas` | No | | If enabled will treat all DELTA metrics as an in-memory counter instead of a gauge. Be sure to read [what to know about aggregating DELTA metrics](#what-to-know-about-aggregating-delta-metrics) | +| `monitoring.aggregate-deltas-ttl` | No | `30m` | How long should a delta metric continue to be exported and stored after GCP stops producing it. Read [slow moving metrics](#slow-moving-metrics) to understand the problem this attempts to solve | +| `monitoring.descriptor-cache-ttl` | No | `0s` | How long should the metric descriptors for a prefixed be cached for | +| `stackdriver.max-retries` | No | `0` | Max number of retries that should be attempted on 503 errors from stackdriver. | +| `stackdriver.http-timeout` | No | `10s` | How long should stackdriver_exporter wait for a result from the Stackdriver API. | +| `stackdriver.max-backoff=` | No | | Max time between each request in an exp backoff scenario. | +| `stackdriver.backoff-jitter` | No | `1s` | The amount of jitter to introduce in a exp backoff scenario. | +| `stackdriver.retry-statuses` | No | `503` | The HTTP statuses that should trigger a retry. | +| `web.config.file` | No | | [EXPERIMENTAL] Path to configuration file that can enable TLS or authentication. | +| `web.listen-address` | No | `:9255` | Address to listen on for web interface and telemetry Repeatable for multiple addresses. | +| `web.systemd-socket` | No | | Use systemd socket activation listeners instead of port listeners (Linux only). | +| `web.stackdriver-telemetry-path` | No | `/metrics` | Path under which to expose Stackdriver metrics. | +| `web.telemetry-path` | No | `/metrics` | Path under which to expose Prometheus metrics | ### TLS and basic authentication