RFC: Standardize Collector Error Handling #869

@stephan-rayner

Background

Collector error handling is inconsistent across the codebase at both initialization and scrape time. The design intent is:

  • A failing collector should not block other collectors from the same provider
  • A provider can fail if none of its collectors initialize successfully
  • Failures must be observable without relying solely on log monitoring

In practice, 13 collectors implement at least four distinct behaviours at each phase, and there is no agreed standard. This issue exists to reach consensus on what that standard should be across both phases.

Note: This issue is about operational metrics - metrics that describe the health of the exporter itself (e.g. cloudcost_exporter_collector_scrape_errors_total) - not the cost rate metrics that collectors emit (e.g. cloudcost_aws_ec2_instance_cpu_usd_per_core_hour).


Current State

Initialization

| Collector | New() returns error? | What happens on failure |
|---|---|---|
| AWS EC2 | YES | Returns error; provider logs & skips |
| AWS S3 | YES | Partial - bad regions logged as warning, no error returned |
| GCP GCS | YES | Returns error; provider logs & skips |
| GCP GKE | YES | Returns error; provider logs & skips |
| GCP CLB | YES | Returns error; provider logs & skips |
| GCP Cloud SQL | YES | Returns error; provider logs & skips |
| Azure AKS | YES | Returns error; provider fails entirely |
| AWS RDS | NO | Never fails on init - defers all errors to scrape time |
| AWS NAT Gateway | NO | Never fails on init - background refresh, errors logged |
| AWS ELB | NO | Never fails on init - defers all errors to scrape time |
| AWS MSK | NO | Never fails on init - background refresh, errors logged |
| AWS VPC | NO | Swallows init error, logs it, always returns a collector |
| GCP VPC | NO | Swallows init error, logs it, always returns a collector |

Four Patterns in Practice:

  1. Returns error -> provider logs & skips the collector (6 collectors)
  2. Returns error -> provider fails entirely (1 collector: Azure AKS)
  3. Never fails on init -> defers to scrape or background refresh (4 collectors)
  4. Swallows init error -> logs it but always returns a healthy-looking collector (2 collectors)

Scrape Time

Quick Clarification of "Partial metrics on failure" Column:
The column called "Partial metrics on failure" refers to whether a collector that hits an error mid-scrape still emits some cost rate metrics (e.g. cloudcost_gcp_gke_instance_cpu_usd_per_core_hour) before giving up.

For example, if the GKE collector fails to list instances in us-east1 but succeeds in us-west1 and europe-west1, a YES means you still get metrics for those two regions. A NO means you get nothing at all for that scrape.
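
To make the YES case concrete, here is a minimal sketch of a per-region loop that keeps the results from healthy regions and still reports what failed. The names `regionLister`, `collectRegions`, and `flakyLister` are hypothetical and not taken from the codebase, and `errors.Join` assumes Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
)

// regionLister is a hypothetical stand-in for a cloud API call that lists
// instances in one region.
type regionLister func(region string) ([]string, error)

// collectRegions queries every region, keeps results from the regions that
// succeed, and joins the per-region errors instead of stopping at the first.
func collectRegions(regions []string, list regionLister) ([]string, error) {
	var instances []string
	var errs []error
	for _, region := range regions {
		found, err := list(region)
		if err != nil {
			errs = append(errs, fmt.Errorf("region %s: %w", region, err))
			continue // a "YES" collector carries on with the remaining regions
		}
		instances = append(instances, found...)
	}
	// errors.Join returns nil when there are no errors to join.
	return instances, errors.Join(errs...)
}

// flakyLister simulates the GKE example: us-east1 fails, other regions succeed.
func flakyLister(region string) ([]string, error) {
	if region == "us-east1" {
		return nil, errors.New("list instances: permission denied")
	}
	return []string{region + "/instance-1"}, nil
}
```

Calling `collectRegions([]string{"us-east1", "us-west1", "europe-west1"}, flakyLister)` yields two instances plus a non-nil error. A NO collector would instead return on the first failing region and emit nothing.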

| Collector | Collect() returns error? | Partial metrics on failure? | Stale/cached values served? |
|---|---|---|---|
| AWS EC2 | Only if client missing; region errors logged & skipped | YES | YES (pricing maps via background ticker) |
| AWS S3 | YES | NO | YES (last successful billing data retained) |
| AWS RDS | YES (pricing validation only); region errors logged & skipped | YES | YES (in-memory pricing cache) |
| AWS NAT Gateway | NO - always returns nil | NO | YES (snapshot from background ticker) |
| AWS ELB | YES | YES | YES (conditional on scrape interval) |
| AWS VPC | YES (context cancellation only); pricing misses logged & skipped | YES | YES (background ticker, 24h) |
| AWS MSK | YES (context cancellation only); unpriceable clusters skipped | YES | YES (snapshot from background ticker) |
| GCP GCS | YES (service lookup); export errors logged & skipped | YES | YES (interval-throttled refresh) |
| GCP GKE | YES (zone listing); per-zone errors logged & skipped | YES | YES (pricing map via background ticker) |
| GCP CLB | YES (forwarding rule fetch); per-region errors logged & skipped | YES | YES (pricing map via background ticker, 24h) |
| GCP VPC | YES (context cancellation only); pricing misses logged at debug | YES | YES (background ticker, 24h) |
| GCP Cloud SQL | YES; project-level error blocks entire collection | YES (per-instance) | YES (SKU cache via background ticker, 24h) |
| Azure AKS | NO - always returns nil; unpriceable VMs/disks logged & skipped | YES | YES (3 separate background tickers) |

Four Patterns in Practice:

  1. Returns error on hard failures only; silently skips per-region or per-resource errors (most collectors)
  2. Never returns an error; all failures are silent (AWS NAT Gateway, Azure AKS)
  3. Returns error that blocks the entire collection for that provider service (GCP Cloud SQL on project list failure)
  4. Returns error only on context cancellation, not on data failures (AWS VPC, AWS MSK, GCP VPC)

Unlike initialization, scrape-time failures are recurring. A collector that fails on init fails once. A collector that fails on scrape fails on every scrape interval until the underlying issue is resolved.


The Core Question

At both phases, the same two goals pull against each other:

Resilience: a failing collector should not prevent healthy collectors from running.

Observability: a failing collector must be detectable - ideally via a metric so it can be alerted on, not just a log line.

The current debate is whether to surface failures via an error return (allowing the provider to propagate or track it) or via a logged metric increment (keeping function signatures clean while still making failures queryable). This question applies equally to init and scrape, and the answer should be consistent across both.


Options

Option A - Return Error, Provider Logs and Skips

At init, New() returns an error and the provider logs it and skips the collector. At scrape, Collect() returns an error and the provider logs it. Other collectors are unaffected at both phases. Failures are only observable via logs.

Example: If the EC2 collector fails to initialize, the AWS provider logs the error and continues without EC2. S3, RDS, and other collectors still run. The failure is only visible in the exporter logs - there is no metric to alert on.

  • Pro:
    • Resilient.
    • Clean Go error handling.
    • Consistent with the majority of existing collectors.
  • Con:
    • Silent in Mimir.
    • No metric to alert on.
    • Log-based alerting is fragile.
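
A rough sketch of what Option A could look like at init, combined with the design intent that a provider fails only when none of its collectors initialize. `Collector`, `factory`, `initCollectors`, and the helper constructors are hypothetical names for illustration, not the exporter's actual types; `log/slog` assumes Go 1.21 or later.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
)

// Collector is a minimal, hypothetical stand-in for the exporter's collector type.
type Collector interface {
	Name() string
}

// factory builds one collector and may fail, mirroring a New() that returns an error.
type factory struct {
	name  string
	build func() (Collector, error)
}

// errNoCollectors is returned when every collector failed to initialize.
var errNoCollectors = errors.New("no collectors initialized")

// initCollectors logs and skips failing collectors (Option A) and fails the
// provider only when nothing initialized at all.
func initCollectors(logger *slog.Logger, factories []factory) ([]Collector, error) {
	var ready []Collector
	for _, f := range factories {
		c, err := f.build()
		if err != nil {
			// The log line is the only trace of the failure under Option A.
			logger.Error("collector init failed, skipping", "collector", f.name, "error", err)
			continue
		}
		ready = append(ready, c)
	}
	if len(ready) == 0 {
		return nil, errNoCollectors
	}
	return ready, nil
}

// The helpers below exist only to exercise the sketch.
type named string

func (n named) Name() string { return string(n) }

var demoLogger = slog.Default()

func ok(name string) factory {
	return factory{name: name, build: func() (Collector, error) { return named(name), nil }}
}

func failing(name string) factory {
	return factory{name: name, build: func() (Collector, error) {
		return nil, fmt.Errorf("%s: credentials missing", name)
	}}
}
```

With `[]factory{failing("ec2"), ok("s3"), ok("rds")}` the provider starts with S3 and RDS and the EC2 failure appears only in the log output, which is exactly the observability gap listed under Con.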

Option B - Return Error, Log, Skip, and Increment a Metric

Same as A, but the provider also increments an error counter (e.g. cloudcost_exporter_collector_init_errors_total, cloudcost_exporter_collector_scrape_errors_total) labelled by collector name.

Example: If the EC2 collector fails to initialize, the AWS provider logs the error, skips EC2, and increments cloudcost_exporter_collector_init_errors_total{collector="ec2"}. S3, RDS, and other collectors still run. An alert can fire on that counter without any log monitoring.

  • Pro:
    • Resilient and alertable.
    • Errors are queryable via Mimir.
    • Consistent observability across both phases.
  • Con:
    • Requires new or extended metrics.
    • The gatherer may already track some of this; needs investigation before adding duplication.
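
A sketch of the scrape-time half of Option B, using only the standard library. `errorCounter` is a stand-in for a counter vector labelled by collector name (in the real exporter this would presumably be a Prometheus counter such as cloudcost_exporter_collector_scrape_errors_total); `scraper`, `scrapeAll`, and the helpers are hypothetical names.

```go
package main

import (
	"errors"
	"log"
	"sync"
)

// errorCounter is a stdlib stand-in for a counter vector labelled by
// collector name. It only ever increases.
type errorCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func newErrorCounter() *errorCounter {
	return &errorCounter{counts: map[string]int{}}
}

// Inc adds one to the counter for the given collector.
func (c *errorCounter) Inc(collector string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[collector]++
}

// Value returns the current count for the given collector.
func (c *errorCounter) Value(collector string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[collector]
}

// scraper is a hypothetical collector whose collect function may return an error.
type scraper struct {
	name    string
	collect func() error
}

// scrapeAll runs every collector. A failure is logged and counted against
// that collector, and the remaining collectors still run. It returns the
// names of the collectors that succeeded.
func scrapeAll(scrapers []scraper, scrapeErrors *errorCounter) []string {
	var succeeded []string
	for _, s := range scrapers {
		if err := s.collect(); err != nil {
			log.Printf("collector %s scrape failed: %v", s.name, err)
			scrapeErrors.Inc(s.name)
			continue
		}
		succeeded = append(succeeded, s.name)
	}
	return succeeded
}

// Helpers used only to exercise the sketch.
func failWith(msg string) func() error {
	return func() error { return errors.New(msg) }
}

func succeed() error { return nil }
```

Because a scrape failure recurs on every interval, the counter for a broken collector keeps climbing, so an alert on its rate can fire without any log monitoring. The same counter shape would apply at init with a separate metric name.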

Option C - Never Fail, Defer All Errors

At init, New() always succeeds. At scrape, Collect() re-emits stale cached values or serves background-refreshed data rather than returning an error.

Example: If the EC2 collector's pricing API is down at init, it starts anyway with an empty pricing map. At scrape time it serves whatever cached data it has, or emits no metrics for regions it cannot price. There is no error anywhere in the system - the collector appears healthy.

  • Pro:
    • Simplest signatures.
    • No partial-initialization states.
    • No gaps in metric output.
  • Con:
    • A broken collector silently appears healthy.
    • Stale values are misleading.
    • Alerting on cost changes becomes unreliable.
    • Already the pattern for NAT Gateway and MSK, but inconsistent with the majority.
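
The failure mode in the Con list can be seen in a few lines. This is a simplified, single-goroutine sketch with hypothetical names (`pricingCache`, `fetchOK`, `fetchDown`), not the code of any existing collector; a real cache shared with a background ticker would also need locking.

```go
package main

import "errors"

// pricingCache is a hypothetical in-memory cache refreshed in the background.
type pricingCache struct {
	prices map[string]float64 // region -> USD per core hour
}

// refresh replaces the cached prices only when the fetch succeeds. On failure
// the error is dropped and the old values stay in place (Option C).
func (p *pricingCache) refresh(fetch func() (map[string]float64, error)) {
	fresh, err := fetch()
	if err != nil {
		return // swallowed: nothing downstream learns that the refresh failed
	}
	p.prices = fresh
}

// collect always succeeds and reports whatever the cache currently holds,
// which may be stale or empty.
func (p *pricingCache) collect() (map[string]float64, error) {
	return p.prices, nil
}

// fetchOK and fetchDown simulate a healthy and an unavailable pricing API.
func fetchOK() (map[string]float64, error) {
	return map[string]float64{"us-east-1": 0.04}, nil
}

func fetchDown() (map[string]float64, error) {
	return nil, errors.New("pricing API unavailable")
}
```

After `refresh(fetchOK)` followed by `refresh(fetchDown)`, `collect` still returns the old price with a nil error, and a cache that never refreshed successfully returns an empty map with a nil error. In both cases the caller cannot tell that anything went wrong.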

Option D - Fail Fast, Fail the Provider or Scrape

Any collector failure fails the entire provider at init, or fails the entire scrape at scrape time.

Example: If the EC2 collector fails to initialize, the entire AWS provider fails to start - S3, RDS, and all other AWS collectors stop running too. At scrape time, a single EC2 error causes Mimir to mark the entire scrape as failed and discard all metrics from that provider.

  • Pro:
    • No silent failures.
    • Obvious signal that something is wrong.
  • Con:
    • One broken collector takes down all metrics for that provider.
    • Contradicts the stated design intent.

Open Questions

  1. What is the right observability mechanism? Logs, Mimir metric, or both? If a metric, is there an existing one that fits or does a new one need to be defined?

  2. Should the standard be the same for init and scrape? The consequence of a silent failure differs: an init failure is a one-time event, a scrape failure recurs indefinitely.

  3. What should happen when a collector that defers init work to the background (NAT Gateway, MSK) fails its first refresh? Is "never fail on init" acceptable for these, or should they also return an error?

  4. Should the provider fail if zero collectors initialize successfully? This is the one case where a provider-level failure may still be appropriate regardless of the chosen option.

  5. How should partial scrapes be handled? If a collector emits 5 metrics then fails, should those 5 be kept or discarded?

  6. What is the acceptable staleness window? If a collector fails repeatedly, at what point should an alert fire? This may inform whether a counter or a gauge (time since last successful scrape) is the right metric shape.
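
To help weigh question 6, here is a small sketch that tracks both candidate metric shapes for one collector: an error counter and a last-success timestamp from which "time since last successful scrape" can be derived. `scrapeHealth`, `simulate`, and `scrapeInterval` are hypothetical names, and the one-minute interval is an assumption chosen for illustration.

```go
package main

import (
	"errors"
	"time"
)

// scrapeInterval is an assumed scrape interval used by the simulation below.
const scrapeInterval = time.Minute

// scrapeHealth tracks both candidate metric shapes for one collector.
type scrapeHealth struct {
	errorsTotal int       // counter shape: only ever increases
	lastSuccess time.Time // gauge shape: time of the last good scrape
}

// observe records the outcome of one scrape that finished at time now.
func (h *scrapeHealth) observe(now time.Time, err error) {
	if err != nil {
		h.errorsTotal++
		return
	}
	h.lastSuccess = now
}

// staleFor reports how long it has been since the last successful scrape.
// A collector that has never succeeded reports a very large duration.
func (h *scrapeHealth) staleFor(now time.Time) time.Duration {
	return now.Sub(h.lastSuccess)
}

// exceeds reports whether the collector has been stale for longer than window.
func (h *scrapeHealth) exceeds(now time.Time, window time.Duration) bool {
	return h.staleFor(now) > window
}

// simulate runs one successful scrape followed by the given number of failed
// scrapes, one per interval, and returns the error count and the staleness.
func simulate(failures int, interval time.Duration) (int, time.Duration) {
	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	h := &scrapeHealth{}
	h.observe(start, nil)
	now := start
	for i := 0; i < failures; i++ {
		now = now.Add(interval)
		h.observe(now, errors.New("scrape failed"))
	}
	return h.errorsTotal, h.staleFor(now)
}
```

The counter answers "how often does this fail", while the timestamp answers "how long has the data been stale" directly, so a staleness window can be expressed as a single threshold. Which shape fits better depends on the answer to the staleness question itself.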


Proposed Direction

I propose we pursue Option B: it keeps the exporter resilient to collector failures and gives us the most observability. It is a little more work than Option A, since no collector implements it today, but I think it is the best choice.

As for deployment, I suggest rolling this out in dev first and leaving it there for a while to get a feel for the error volume. This came out of a meeting that also discussed alert fatigue, so it would be unfortunate if the change induced more of it. Getting the tuning and thresholds right will be important and may take some revision; running in dev for a while gives us time to do that without creating noise for on-callers.
