Fix: Standardise Collector Error Propagation #870

@stephan-rayner

Background

Several collectors silently swallow errors during scrapes: they log a warning, skip the affected resource, and return `nil` from `Collect()`. From the exporter's perspective the scrape succeeded, so `cloudcost_exporter_collector_error` never increments and SLO dashboards show the collector as healthy. Hundreds of VMs or volumes can go unpriced with no alertable signal.

The `collectormetrics` wrapper in `pkg/gatherer` already increments the error counter whenever `Collect()` returns a non-nil error; the infrastructure for observable failures exists. The gap is that collectors need to surface errors rather than absorb them.

Note: This issue is about operational metrics - metrics that describe the health of the exporter itself (e.g. `cloudcost_exporter_collector_scrape_errors_total`) - not the cost rate metrics that collectors emit.


Option Definitions (from #869)

Option A - Return Error, Provider Logs and Skips
At init, `New()` returns an error and the provider logs it and skips the collector. At scrape, `Collect()` returns an error and the provider logs it. Other collectors are unaffected at both phases. Failures are only observable via logs.

Option B - Return Error, Log, Skip, and Increment a Metric (chosen option)
Same as A, but the provider also increments an error counter labelled by collector name. If the EC2 collector fails to initialize, the AWS provider logs the error, skips EC2, and increments `cloudcost_exporter_collector_init_errors_total{collector="ec2"}`. An alert can fire on that counter without any log monitoring.

Option C - Never Fail, Defer All Errors
`New()` always succeeds. `Collect()` re-emits stale cached values or serves background-refreshed data rather than returning an error. A broken collector silently appears healthy.

Option D - Fail Fast, Fail the Provider or Scrape
Any collector failure fails the entire provider at init, or fails the entire scrape at scrape time.


Problem

Several collectors currently behave like Option C at scrape time: they swallow errors in `Collect()` and return `nil`, so the `cloudcost_exporter_collector_error` counter never increments. SLO dashboards show these collectors as healthy even when they silently fail to price resources.

The `collectormetrics.Collect()` wrapper in `pkg/gatherer` already increments the error counter when `Collect()` returns a non-nil error. The gap is entirely on the collector side: collectors need to return errors rather than swallowing them.
