## Background
Collector error handling is inconsistent across the codebase at both initialization and scrape time. The design intent is:
- A failing collector should not block other collectors from the same provider
- A provider can fail if none of its collectors initialize successfully
- Failures must be observable without relying solely on log monitoring
In practice, 13 collectors implement at least four distinct behaviours at each phase, and there is no agreed standard. This issue exists to reach consensus on what that standard should be across both phases.
Note: This issue is about operational metrics - metrics that describe the health of the exporter itself (e.g. `cloudcost_exporter_collector_scrape_errors_total`) - not the cost rate metrics that collectors emit (e.g. `cloudcost_aws_ec2_instance_cpu_usd_per_core_hour`).
## Current State

### Initialization
| Collector | `New()` returns error? | What happens on failure |
|---|---|---|
| AWS EC2 | YES | Returns error; provider logs & skips |
| AWS S3 | YES | Partial - bad regions logged as warning, no error returned |
| GCP GCS | YES | Returns error; provider logs & skips |
| GCP GKE | YES | Returns error; provider logs & skips |
| GCP CLB | YES | Returns error; provider logs & skips |
| GCP Cloud SQL | YES | Returns error; provider logs & skips |
| Azure AKS | YES | Returns error; provider fails entirely |
| AWS RDS | NO | Never fails on init - defers all errors to scrape time |
| AWS NAT Gateway | NO | Never fails on init - background refresh, errors logged |
| AWS ELB | NO | Never fails on init - defers all errors to scrape time |
| AWS MSK | NO | Never fails on init - background refresh, errors logged |
| AWS VPC | NO | Swallows init error, logs it, always returns a collector |
| GCP VPC | NO | Swallows init error, logs it, always returns a collector |
**Four Patterns in Practice:**
- Returns error -> provider logs & skips the collector (6 collectors)
- Returns error -> provider fails entirely (1 collector: Azure AKS)
- Never fails on init -> defers to scrape or background refresh (4 collectors)
- Swallows init error -> logs it but always returns a healthy-looking collector (2 collectors)
### Scrape Time

**Quick Clarification of the "Partial metrics on failure" Column:**

The "Partial metrics on failure" column refers to whether a collector that hits an error mid-scrape still emits some cost rate metrics (e.g. `cloudcost_gcp_gke_instance_cpu_usd_per_core_hour`) before giving up.

For example, if the GKE collector fails to list instances in `us-east1` but succeeds in `us-west1` and `europe-west1`, a YES means you still get metrics for those two regions. A NO means you get nothing at all for that scrape.
| Collector | `Collect()` returns error? | Partial metrics on failure? | Stale/cached values served? |
|---|---|---|---|
| AWS EC2 | Only if client missing; region errors logged & skipped | YES | YES (pricing maps via background ticker) |
| AWS S3 | YES | NO | YES (last successful billing data retained) |
| AWS RDS | YES (pricing validation only); region errors logged & skipped | YES | YES (in-memory pricing cache) |
| AWS NAT Gateway | NO - always returns nil | NO | YES (snapshot from background ticker) |
| AWS ELB | YES | YES | YES (conditional on scrape interval) |
| AWS VPC | YES (context cancellation only); pricing misses logged & skipped | YES | YES (background ticker, 24h) |
| AWS MSK | YES (context cancellation only); unpriceable clusters skipped | YES | YES (snapshot from background ticker) |
| GCP GCS | YES (service lookup); export errors logged & skipped | YES | YES (interval-throttled refresh) |
| GCP GKE | YES (zone listing); per-zone errors logged & skipped | YES | YES (pricing map via background ticker) |
| GCP CLB | YES (forwarding rule fetch); per-region errors logged & skipped | YES | YES (pricing map via background ticker, 24h) |
| GCP VPC | YES (context cancellation only); pricing misses logged at debug | YES | YES (background ticker, 24h) |
| GCP Cloud SQL | YES; project-level error blocks entire collection | YES (per-instance) | YES (SKU cache via background ticker, 24h) |
| Azure AKS | NO - always returns nil; unpriceable VMs/disks logged & skipped | YES | YES (3 separate background tickers) |
**Four Patterns in Practice:**
- Returns error on hard failures only; silently skips per-region or per-resource errors (most collectors)
- Never returns an error; all failures are silent (AWS NAT Gateway, Azure AKS)
- Returns error that blocks the entire collection for that provider service (GCP Cloud SQL on project list failure)
- Returns error only on context cancellation, not on data failures (AWS VPC, AWS MSK, GCP VPC)
Unlike initialization, scrape-time failures are recurring. A collector that fails on init fails once. A collector that fails on scrape fails on every scrape interval until the underlying issue is resolved.
## The Core Question
At both phases, the same two tensions apply:
- **Resilience:** a failing collector should not prevent healthy collectors from running.
- **Observability:** a failing collector must be detectable - ideally via a metric so it can be alerted on, not just a log line.
The current debate is whether to surface failures via an error return (allowing the provider to propagate or track it) or via a logged metric increment (keeping function signatures clean while still making failures queryable). This question applies equally to init and scrape, and the answer should be consistent across both.
## Options

### Option A - Return Error, Provider Logs and Skips

At init, `New()` returns an error and the provider logs it and skips the collector. At scrape, `Collect()` returns an error and the provider logs it. Other collectors are unaffected at both phases. Failures are only observable via logs.

Example: If the EC2 collector fails to initialize, the AWS provider logs the error and continues without EC2. S3, RDS, and other collectors still run. The failure is only visible in the exporter logs - there is no metric to alert on.
- Pro:
  - Resilient.
  - Clean Go error handling.
  - Consistent with the majority of existing collectors.
- Con:
  - Silent in Mimir.
  - No metric to alert on.
  - Log-based alerting is fragile.
### Option B - Return Error, Log, Skip, and Increment a Metric

Same as A, but the provider also increments an error counter (e.g. `cloudcost_exporter_collector_init_errors_total`, `cloudcost_exporter_collector_scrape_errors_total`) labelled by collector name.

Example: If the EC2 collector fails to initialize, the AWS provider logs the error, skips EC2, and increments `cloudcost_exporter_collector_init_errors_total{collector="ec2"}`. S3, RDS, and other collectors still run. An alert can fire on that counter without any log monitoring.
- Pro:
  - Resilient and alertable.
  - Errors are queryable via Mimir.
  - Consistent observability across both phases.
- Con:
  - Requires new or extended metrics.
  - The gatherer may already track some of this; needs investigation before adding duplication.
### Option C - Never Fail, Defer All Errors

At init, `New()` always succeeds. At scrape, `Collect()` re-emits stale cached values or serves background-refreshed data rather than returning an error.

Example: If the EC2 collector's pricing API is down at init, it starts anyway with an empty pricing map. At scrape time it serves whatever cached data it has, or emits no metrics for regions it cannot price. There is no error anywhere in the system - the collector appears healthy.
- Pro:
  - Simplest signatures.
  - No partial-initialization states.
  - No gaps in metric output.
- Con:
  - A broken collector silently appears healthy.
  - Stale values are misleading.
  - Alerting on cost changes becomes unreliable.
  - Already the pattern for NAT Gateway and MSK, but inconsistent with the majority.
### Option D - Fail Fast, Fail the Provider or Scrape
Any collector failure fails the entire provider at init, or fails the entire scrape at scrape time.
Example: If the EC2 collector fails to initialize, the entire AWS provider fails to start - S3, RDS, and all other AWS collectors stop running too. At scrape time, a single EC2 error causes Mimir to mark the entire scrape as failed and discard all metrics from that provider.
- Pro:
  - No silent failures.
  - Obvious signal that something is wrong.
- Con:
  - One broken collector takes down all metrics for that provider.
  - Contradicts the stated design intent.
## Open Questions

- What is the right observability mechanism? Logs, Mimir metric, or both? If a metric, is there an existing one that fits or does a new one need to be defined?
- Should the standard be the same for init and scrape? The consequence of a silent failure differs: an init failure is a one-time event, a scrape failure recurs indefinitely.
- What should happen when a collector that defers init work to the background (NAT Gateway, MSK) fails its first refresh? Is "never fail on init" acceptable for these, or should they also return an error?
- Should the provider fail if zero collectors initialize successfully? This is the one case where a provider-level failure may still be appropriate regardless of the chosen option.
- How should partial scrapes be handled? If a collector emits 5 metrics then fails, should those 5 be kept or discarded?
- What is the acceptable staleness window? If a collector fails repeatedly, at what point should an alert fire? This may inform whether a counter or a gauge (time since last successful scrape) is the right metric shape.
## Proposed Direction

I propose we pursue Option B: it keeps the exporter resilient to collector failures and gives us the most observability. It is somewhat more work than Option A, since no collector implements this pattern today, but I think it is the best choice.
As for deployment, rolling this out in dev and leaving it there for a while makes sense to me, to get a feel for the error volume. Since this came out of a meeting that also discussed alert fatigue, it would be unfortunate if we induced alert fatigue ourselves. Getting the tuning and thresholds right will also be important and may take several revisions. Running in dev for a while gives us time to get that tuning right without generating noise for on-callers.