RFC: Standardize Collector Error Handling


## Background

Collector error handling is inconsistent across the codebase at both initialization and scrape time. The design intent is:

- A failing collector should not block other collectors from the same provider
- A provider can fail if none of its collectors initialize successfully
- Failures must be observable without relying solely on log monitoring

In practice, 13 collectors implement at least four distinct behaviours at each phase, and there is no agreed standard. This issue exists to reach consensus on what that standard should be across both phases.

> **Note:** This issue is about operational metrics - metrics that describe the health of the exporter itself (e.g. `cloudcost_exporter_collector_scrape_errors_total`) - not the cost rate metrics that collectors emit (e.g. `cloudcost_aws_ec2_instance_cpu_usd_per_core_hour`).

---

## Current State

### Initialization

| Collector | `New()` returns error? | What happens on failure |
|---|---|---|
| AWS EC2 | YES | Returns error; provider logs & skips |
| AWS S3 | YES | Partial - bad regions logged as warning, no error returned |
| GCP GCS | YES | Returns error; provider logs & skips |
| GCP GKE | YES | Returns error; provider logs & skips |
| GCP CLB | YES | Returns error; provider logs & skips |
| GCP Cloud SQL | YES | Returns error; provider logs & skips |
| Azure AKS | YES | Returns error; **provider fails entirely** |
| AWS RDS | NO | Never fails on init - defers all errors to scrape time |
| AWS NAT Gateway | NO | Never fails on init - background refresh, errors logged |
| AWS ELB | NO | Never fails on init - defers all errors to scrape time |
| AWS MSK | NO | Never fails on init - background refresh, errors logged |
| AWS VPC | NO | Swallows init error, logs it, always returns a collector |
| GCP VPC | NO | Swallows init error, logs it, always returns a collector |

**Four Patterns in Practice:**
1. Returns error -> provider logs & skips the collector (6 collectors)
2. Returns error -> provider fails entirely (1 collector: Azure AKS)
3. Never fails on init -> defers to scrape or background refresh (4 collectors)
4. Swallows init error -> logs it but always returns a healthy-looking collector (2 collectors)

### Scrape Time

**Quick clarification of the "Partial metrics on failure" column:**
"Partial metrics on failure" indicates whether a collector that hits an error mid-scrape still emits some cost rate metrics (e.g. `cloudcost_gcp_gke_instance_cpu_usd_per_core_hour`) before giving up.

For example, if the GKE collector fails to list instances in `us-east1` but succeeds in `us-west1` and `europe-west1`, a YES means you still get metrics for those two regions. A NO means you get nothing at all for that scrape.
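
The difference between YES and NO can be sketched in a few lines of Go. This is a minimal illustration, not code from the exporter: `sample`, `listFn`, `collectPartial`, `collectAllOrNothing`, and `fakeList` are all hypothetical names, and the fake lister simply fails for `us-east1`.

```go
package main

import (
	"errors"
	"fmt"
)

// sample is a hypothetical stand-in for one emitted cost rate metric.
type sample struct {
	Region string
	Value  float64
}

// listFn is a hypothetical per-region lister; a real collector would call a cloud API here.
type listFn func(region string) ([]sample, error)

// collectPartial models "Partial metrics on failure? YES": samples from
// healthy regions are kept, and failing regions are recorded and skipped.
func collectPartial(regions []string, list listFn) (out []sample, failed []string) {
	for _, r := range regions {
		s, err := list(r)
		if err != nil {
			failed = append(failed, r)
			continue
		}
		out = append(out, s...)
	}
	return out, failed
}

// collectAllOrNothing models "Partial metrics on failure? NO": the first
// failing region discards everything gathered so far in this scrape.
func collectAllOrNothing(regions []string, list listFn) ([]sample, error) {
	var out []sample
	for _, r := range regions {
		s, err := list(r)
		if err != nil {
			return nil, fmt.Errorf("region %s: %w", r, err)
		}
		out = append(out, s...)
	}
	return out, nil
}

// fakeList fails for us-east1 and returns one sample for any other region.
func fakeList(region string) ([]sample, error) {
	if region == "us-east1" {
		return nil, errors.New("list instances: permission denied")
	}
	return []sample{{Region: region, Value: 0.03}}, nil
}

func main() {
	regions := []string{"us-east1", "us-west1", "europe-west1"}

	partial, failed := collectPartial(regions, fakeList)
	fmt.Println("partial:", len(partial), "samples, failed regions:", failed)

	_, err := collectAllOrNothing(regions, fakeList)
	fmt.Println("all-or-nothing:", err)
}
```

Note that `collectPartial` returns the failed regions instead of only logging them, so the caller could still count or report them.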

| Collector | `Collect()` returns error? | Partial metrics on failure? | Stale/cached values served? |
|---|---|---|---|
| AWS EC2 | Only if client missing; region errors logged & skipped | YES | YES (pricing maps via background ticker) |
| AWS S3 | YES | NO | YES (last successful billing data retained) |
| AWS RDS | YES (pricing validation only); region errors logged & skipped | YES | YES (in-memory pricing cache) |
| AWS NAT Gateway | NO - always returns nil | NO | YES (snapshot from background ticker) |
| AWS ELB | YES | YES | YES (conditional on scrape interval) |
| AWS VPC | YES (context cancellation only); pricing misses logged & skipped | YES | YES (background ticker, 24h) |
| AWS MSK | YES (context cancellation only); unpriceable clusters skipped | YES | YES (snapshot from background ticker) |
| GCP GCS | YES (service lookup); export errors logged & skipped | YES | YES (interval-throttled refresh) |
| GCP GKE | YES (zone listing); per-zone errors logged & skipped | YES | YES (pricing map via background ticker) |
| GCP CLB | YES (forwarding rule fetch); per-region errors logged & skipped | YES | YES (pricing map via background ticker, 24h) |
| GCP VPC | YES (context cancellation only); pricing misses logged at debug | YES | YES (background ticker, 24h) |
| GCP Cloud SQL | YES; project-level error blocks entire collection | YES (per-instance) | YES (SKU cache via background ticker, 24h) |
| Azure AKS | NO - always returns nil; unpriceable VMs/disks logged & skipped | YES | YES (3 separate background tickers) |

**Four Patterns in Practice:**
1. Returns error on hard failures only; silently skips per-region or per-resource errors (most collectors)
2. Never returns an error; all failures are silent (AWS NAT Gateway, Azure AKS)
3. Returns error that blocks the entire collection for that provider service (GCP Cloud SQL on project list failure)
4. Returns error only on context cancellation, not on data failures (AWS VPC, AWS MSK, GCP VPC)

Unlike initialization, scrape-time failures are recurring. A collector that fails on init fails once. A collector that fails on scrape fails on every scrape interval until the underlying issue is resolved.

---

## The Core Question

At both phases, the same two goals are in tension:

**Resilience**: a failing collector should not prevent healthy collectors from running.

**Observability**: a failing collector must be detectable - ideally via a metric so it can be alerted on, not just a log line.

The current debate is whether to surface failures via an error return (allowing the provider to propagate or track it) or via a logged metric increment (keeping function signatures clean while still making failures queryable). This question applies equally to init and scrape, and the answer should be consistent across both.

---

## Options

### Option A - Return Error, Provider Logs and Skips
At init, `New()` returns an error and the provider logs it and skips the collector. At scrape, `Collect()` returns an error and the provider logs it. Other collectors are unaffected at both phases. Failures are only observable via logs.

> **Example:** If the EC2 collector fails to initialize, the AWS provider logs the error and continues without EC2. S3, RDS, and other collectors still run. The failure is only visible in the exporter logs - there is no metric to alert on.

- **Pro:**
  - Resilient.
  - Clean Go error handling.
  - Consistent with the majority of existing collectors.
- **Con:**
  - Silent in Mimir.
  - No metric to alert on.
  - Log-based alerting is fragile.

### Option B - Return Error, Log, Skip, and Increment a Metric
Same as A, but the provider also increments an error counter (e.g. `cloudcost_exporter_collector_init_errors_total`, `cloudcost_exporter_collector_scrape_errors_total`) labelled by collector name.

> **Example:** If the EC2 collector fails to initialize, the AWS provider logs the error, skips EC2, and increments `cloudcost_exporter_collector_init_errors_total{collector="ec2"}`. S3, RDS, and other collectors still run. An alert can fire on that counter without any log monitoring.

- **Pro:**
  - Resilient and alertable.
  - Errors are queryable via Mimir.
  - Consistent observability across both phases.
- **Con:**
  - Requires new or extended metrics.
  - The gatherer may already track some of this; needs investigation before adding duplication.

### Option C - Never Fail, Defer All Errors
At init, `New()` always succeeds. At scrape, `Collect()` re-emits stale cached values or serves background-refreshed data rather than returning an error.

> **Example:** If the EC2 collector's pricing API is down at init, it starts anyway with an empty pricing map. At scrape time it serves whatever cached data it has, or emits no metrics for regions it cannot price. There is no error anywhere in the system - the collector appears healthy.

- **Pro:**
  - Simplest signatures.
  - No partial-initialization states.
  - No gaps in metric output.
- **Con:**
  - A broken collector silently appears healthy.
  - Stale values are misleading.
  - Alerting on cost changes becomes unreliable.
  - Already the pattern for NAT Gateway and MSK, but inconsistent with the majority.

### Option D - Fail Fast, Fail the Provider or Scrape
Any collector failure fails the entire provider at init, or fails the entire scrape at scrape time.

> **Example:** If the EC2 collector fails to initialize, the entire AWS provider fails to start - S3, RDS, and all other AWS collectors stop running too. At scrape time, a single EC2 error causes Mimir to mark the entire scrape as failed and discard all metrics from that provider.

- **Pro:**
  - No silent failures.
  - Obvious signal that something is wrong.
- **Con:**
  - One broken collector takes down all metrics for that provider.
  - Contradicts the stated design intent.

---

## Open Questions

1. **What is the right observability mechanism?** Logs, Mimir metric, or both? If a metric, is there an existing one that fits or does a new one need to be defined?

2. **Should the standard be the same for init and scrape?** The consequence of a silent failure differs: an init failure is a one-time event, whereas a scrape failure recurs indefinitely.

3. **What should happen when a collector that defers init work to the background (NAT Gateway, MSK) fails its first refresh?** Is "never fail on init" acceptable for these, or should they also return an error?

4. **Should the provider fail if zero collectors initialize successfully?** This is the one case where a provider-level failure may still be appropriate regardless of the chosen option.

5. **How should partial scrapes be handled?** If a collector emits 5 metrics then fails, should those 5 be kept or discarded?

6. **What is the acceptable staleness window?** If a collector fails repeatedly, at what point should an alert fire? This may inform whether a counter or a gauge (time since last successful scrape) is the right metric shape.

---

## References

- https://github.com/grafana/cloudcost-exporter/pull/863
- https://github.com/grafana/cloudcost-exporter/issues/716
- https://github.com/grafana/cloudcost-exporter/pull/664#discussion_r2549625724

## Proposed Direction

I propose we pursue Option B: it keeps the exporter resilient to collector failures and gives us the most observability. It is a little more work than Option A, since no collector implements it today, but I think it is the best choice.

As for deployment, I suggest rolling this out in dev first and leaving it there for a while to get a feel for the error volume. This RFC came out of a meeting that also discussed alert fatigue, so it would be unfortunate if the change itself induced more of it. Getting the tuning and thresholds right will be important and may take some revision; running in dev for a while gives us time to do that without creating noise for on-callers.
