Commit 04af087

feat(stackdriver_exporter): Add ErrorLogger for promhttp
feat(stackdriver_exporter): Add ErrorLogger for promhttp
I recently ran into #103 and #166 in production, and it took quite some time to recognize there was a problem with `stackdriver_exporter`, because nothing was logged to indicate problems gathering metrics. From my perspective the pod was healthy and online, and I could curl `/metrics` to get results. Grafana Agent, however, was getting errors when scraping, specifically errors like:

```
[from Gatherer #2] collected metric "stackdriver_gce_instance_compute_googleapis_com_instance_disk_write_bytes_count" { label:{name:"device_name" value:"REDACTED_FOR_SECURITY"} label:{name:"device_type" value:"permanent"} label:{name:"instance_id" value:"2924941021702260446"} label:{name:"instance_name" value:"REDACTED_FOR_SECURITY"} label:{name:"project_id" value:"REDACTED_FOR_SECURITY"} label:{name:"storage_type" value:"pd-ssd"} label:{name:"unit" value:"By"} label:{name:"zone" value:"us-central1-a"} counter:{value:0} timestamp_ms:1698871080000} was collected before with the same name and label values
```

To help identify the root cause, this commit adds the ability to opt into logging errors that come from the handler. Specifically, it introduces the struct `customPromErrorLogger`, which implements the `promhttp.Logger` interface, and a new flag, `monitoring.enable-promhttp-custom-logger`. When the flag is set to true, an instance of `customPromErrorLogger` is created and used as the `ErrorLog` value in `promhttp.HandlerOpts{}`. Otherwise, `stackdriver_exporter` works as it did before and does not log errors encountered while collecting metrics.

- refs #103, #166
1 parent 8b01e7d commit 04af087

2 files changed: +50 −24 lines

README.md (+24 −23; the diff re-aligns the whole table, so it is shown condensed here with the one added row marked)

```diff
@@ -76,29 +76,30 @@ If you are still using the legacy [Access scopes][access-scopes], the `https://w
 
 ### Flags
 
 | Flag | Required | Default | Description |
 |------|----------|---------|-------------|
 | `google.project-id` | No | GCloud SDK auto-discovery | Comma seperated list of Google Project IDs |
 | `google.projects.filter` | No | | GCloud projects filter expression. See more [here](https://cloud.google.com/sdk/gcloud/reference/projects/list). |
 | `monitoring.metrics-ingest-delay` | No | | Offsets metric collection by a delay appropriate for each metric type, e.g. because bigquery metrics are slow to appear |
 | `monitoring.drop-delegated-projects` | No | No | Drop metrics from attached projects and fetch `project_id` only. |
 | `monitoring.metrics-type-prefixes` | Yes | | Comma separated Google Stackdriver Monitoring Metric Type prefixes (see [example][metrics-prefix-example] and [available metrics][metrics-list]) |
 | `monitoring.metrics-interval` | No | `5m` | Metric's timestamp interval to request from the Google Stackdriver Monitoring Metrics API. Only the most recent data point is used |
 | `monitoring.metrics-offset` | No | `0s` | Offset (into the past) for the metric's timestamp interval to request from the Google Stackdriver Monitoring Metrics API, to handle latency in published metrics |
 | `monitoring.filters` | No | | Formatted string to allow filtering on certain metrics type |
 | `monitoring.aggregate-deltas` | No | | If enabled will treat all DELTA metrics as an in-memory counter instead of a gauge. Be sure to read [what to know about aggregating DELTA metrics](#what-to-know-about-aggregating-delta-metrics) |
 | `monitoring.aggregate-deltas-ttl` | No | `30m` | How long should a delta metric continue to be exported and stored after GCP stops producing it. Read [slow moving metrics](#slow-moving-metrics) to understand the problem this attempts to solve |
 | `monitoring.descriptor-cache-ttl` | No | `0s` | How long should the metric descriptors for a prefixed be cached for |
+| `monitoring.enable-promhttp-custom-logger` | No | False | If enabled will create a custom error logging handler for promhttp |
 | `stackdriver.max-retries` | No | `0` | Max number of retries that should be attempted on 503 errors from stackdriver. |
 | `stackdriver.http-timeout` | No | `10s` | How long should stackdriver_exporter wait for a result from the Stackdriver API. |
 | `stackdriver.max-backoff=` | No | | Max time between each request in an exp backoff scenario. |
 | `stackdriver.backoff-jitter` | No | `1s` | The amount of jitter to introduce in a exp backoff scenario. |
 | `stackdriver.retry-statuses` | No | `503` | The HTTP statuses that should trigger a retry. |
 | `web.config.file` | No | | [EXPERIMENTAL] Path to configuration file that can enable TLS or authentication. |
 | `web.listen-address` | No | `:9255` | Address to listen on for web interface and telemetry Repeatable for multiple addresses. |
 | `web.systemd-socket` | No | | Use systemd socket activation listeners instead of port listeners (Linux only). |
 | `web.stackdriver-telemetry-path` | No | `/metrics` | Path under which to expose Stackdriver metrics. |
 | `web.telemetry-path` | No | `/metrics` | Path under which to expose Prometheus metrics |
 
 ### TLS and basic authentication
```
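A hypothetical invocation combining the new flag with existing ones might look like the following (the project ID and metric prefix are placeholders, not values from this commit):

```shell
stackdriver_exporter \
  --google.project-id=my-project \
  --monitoring.metrics-type-prefixes='compute.googleapis.com/instance' \
  --monitoring.enable-promhttp-custom-logger
```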
stackdriver_exporter.go (+26 −1)

```diff
@@ -125,6 +125,10 @@ var (
 	monitoringDescriptorCacheOnlyGoogle = kingpin.Flag(
 		"monitoring.descriptor-cache-only-google", "Only cache descriptors for *.googleapis.com metrics",
 	).Default("true").Bool()
+
+	monitoringEnablePromHttpCustomLogger = kingpin.Flag(
+		"monitoring.enable-promhttp-custom-logger", "Enable custom logger for promhttp",
+	).Default("false").Bool()
 )
 
 func init() {
@@ -236,7 +240,14 @@ func (h *handler) innerHandler(filters map[string]bool) http.Handler {
 	}
 
 	// Delegate http serving to Prometheus client library, which will call collector.Collect.
-	return promhttp.HandlerFor(gatherers, promhttp.HandlerOpts{})
+	opts := promhttp.HandlerOpts{}
+	if *monitoringEnablePromHttpCustomLogger {
+		h.logger.Log("msg", "Enabling custom logger for promhttp")
+		opts = promhttp.HandlerOpts{
+			ErrorLog: NewPromHttpCustomLogger(h.logger),
+		}
+	}
+	return promhttp.HandlerFor(gatherers, opts)
 }
 
 // filterMetricTypePrefixes filters the initial list of metric type prefixes, with the ones coming from an individual
@@ -365,3 +376,17 @@ func parseMetricExtraFilters() []collectors.MetricFilter {
 	}
 	return extraFilters
 }
+
+type customPromErrorLogger struct {
+	logger log.Logger
+}
+
+func (l *customPromErrorLogger) Println(v ...interface{}) {
+	level.Error(l.logger).Log("msg", fmt.Sprint(v...))
+}
+
+func NewPromHttpCustomLogger(logger log.Logger) *customPromErrorLogger {
+	return &customPromErrorLogger{
+		logger: logger,
+	}
+}
```
