Background
Several collectors silently swallow errors during scrapes: they log a warning, skip the affected resource, and return `nil` from `Collect()`. From the exporter's perspective the scrape succeeded, so `cloudcost_exporter_collector_error` never increments and SLO dashboards show the collector as healthy. Hundreds of VMs or volumes can go unpriced with no alertable signal.
The `collectormetrics` wrapper in `pkg/gatherer` already increments the error counter whenever `Collect()` returns a non-nil error; the infrastructure for observable failures exists. The gap is that collectors need to surface errors rather than absorb them.
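The wrapper's contract can be sketched roughly as follows. This is a simplified stand-in, not the real `collectormetrics` code: the interface, the `failingCollector` type, and the plain map standing in for the Prometheus counter are all illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// Collector is a minimal stand-in for the exporter's collector interface.
type Collector interface {
	Name() string
	Collect() error
}

// errorCounter stands in for the Prometheus error counter, labelled by
// collector name.
var errorCounter = map[string]int{}

// collectWithMetrics wraps Collect() and increments the error counter on a
// non-nil error. The collector only has to return the error; the wrapper
// makes the failure observable.
func collectWithMetrics(c Collector) error {
	if err := c.Collect(); err != nil {
		errorCounter[c.Name()]++
		return err
	}
	return nil
}

// failingCollector simulates a collector whose scrape fails.
type failingCollector struct{}

func (failingCollector) Name() string   { return "ec2" }
func (failingCollector) Collect() error { return errors.New("pricing API unavailable") }

func main() {
	_ = collectWithMetrics(failingCollector{})
	fmt.Println(errorCounter["ec2"]) // prints 1: the failure is now countable
}
```

The point is that the counting already happens outside the collectors, so no collector needs its own metrics plumbing to participate.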
Note: This issue is about operational metrics (metrics that describe the health of the exporter itself, e.g. `cloudcost_exporter_collector_scrape_errors_total`), not the cost rate metrics that collectors emit.
Option Definitions (from #869)
Option A - Return Error, Provider Logs and Skips
At init, `New()` returns an error and the provider logs it and skips the collector. At scrape, `Collect()` returns an error and the provider logs it. Other collectors are unaffected at both phases. Failures are only observable via logs.
Option B - Return Error, Log, Skip, and Increment a Metric <- Chosen Option
Same as A, but the provider also increments an error counter labelled by collector name. If the EC2 collector fails to initialize, the AWS provider logs the error, skips EC2, and increments `cloudcost_exporter_collector_init_errors_total{collector="ec2"}`. An alert can fire on that counter without any log monitoring.
Option C - Never Fail, Defer All Errors
`New()` always succeeds. `Collect()` re-emits stale cached values or serves background-refreshed data rather than returning an error. A broken collector silently appears healthy.
Option D - Fail Fast, Fail the Provider or Scrape
Any collector failure fails the entire provider at init, or fails the entire scrape at scrape time.
Problem
Several collectors currently implement Option C: they swallow errors in `Collect()`, return `nil`, and the `cloudcost_exporter_collector_error` counter never increments. SLO dashboards show these collectors as healthy even when they silently fail to price resources.
The `collectormetrics.Collect()` wrapper in `pkg/gatherer` already increments the error counter when `Collect()` returns a non-nil error. The gap is entirely on the collector side: collectors need to return errors rather than swallowing them.
Possibly Related Issues