
Conversation

Contributor

@leonorfmartins leonorfmartins commented Nov 25, 2025

What this does

The Gatherer interface comes from the prometheus package and allows us to inspect the metrics being collected and check for errors.
We wrap each collector's scrape and record its duration in a native histogram that looks like this:

cloudcost_exporter_collector_duration_seconds{
  collector=<my_collector>
} <scrape_duration>

This histogram can then be used to plot charts for each collector's health (e.g. error rate and duration). Since it's a histogram, it will also be easier to build SLOs on top of it.

Besides that, there are two other counter metrics:

  • cloudcost_exporter_collector_total
  • cloudcost_exporter_collector_error_total

which increment on each scrape and on each error, respectively.

We can also leverage the gatherer to assert more precisely on metric expectations in tests.
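
To make the mechanism above concrete, here is a minimal sketch (the metric names match this PR; the package, variable, and function names are placeholders I'm assuming, not the actual code):

package scrape

// Sketch only: time each collector scrape, record it in the duration
// histogram, and bump the two counters. These vars would be registered
// on the exporter's registry.

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	scrapeDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "cloudcost_exporter_collector_duration_seconds",
		Help: "Duration of a collector scrape in seconds",
		// A bucket factor > 1 enables native histogram buckets in recent client_golang versions.
		NativeHistogramBucketFactor: 1.1,
	}, []string{"collector"})

	scrapesTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "cloudcost_exporter_collector_total",
		Help: "Total number of collector scrapes",
	}, []string{"collector"})

	scrapeErrorsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "cloudcost_exporter_collector_error_total",
		Help: "Total number of collector scrape errors",
	}, []string{"collector"})
)

// timeScrape is a stand-in for the wrapper around a collector's Collect call.
func timeScrape(collector string, scrape func() error) (time.Duration, bool) {
	start := time.Now()
	err := scrape()
	duration := time.Since(start)

	scrapeDuration.WithLabelValues(collector).Observe(duration.Seconds())
	scrapesTotal.WithLabelValues(collector).Inc()
	if err != nil {
		scrapeErrorsTotal.WithLabelValues(collector).Inc()
	}
	return duration, err != nil
}

With the two counters, an error-rate SLO could then be derived from something like rate(cloudcost_exporter_collector_error_total[5m]) / rate(cloudcost_exporter_collector_total[5m]).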

Test

  • GCP:
# TYPE cloudcost_exporter_collector_duration_seconds histogram
cloudcost_exporter_collector_duration_seconds_bucket{collector="GCS",le="+Inf"} 1
cloudcost_exporter_collector_duration_seconds_sum{collector="GCS"} 0.000228667
cloudcost_exporter_collector_duration_seconds_count{collector="GCS"} 1
cloudcost_exporter_collector_duration_seconds_bucket{collector="cloudsql",le="+Inf"} 1
cloudcost_exporter_collector_duration_seconds_sum{collector="cloudsql"} 5.3625e-05
cloudcost_exporter_collector_duration_seconds_count{collector="cloudsql"} 1
cloudcost_exporter_collector_duration_seconds_bucket{collector="gcp_gke",le="+Inf"} 1
cloudcost_exporter_collector_duration_seconds_sum{collector="gcp_gke"} 2.125e-06
cloudcost_exporter_collector_duration_seconds_count{collector="gcp_gke"} 1
# TYPE cloudcost_exporter_collector_total counter
cloudcost_exporter_collector_total{collector="GCS"} 1
cloudcost_exporter_collector_total{collector="cloudsql"} 1
cloudcost_exporter_collector_total{collector="gcp_gke"} 1
  • AWS:
# TYPE cloudcost_exporter_collector_duration_seconds histogram
cloudcost_exporter_collector_duration_seconds_bucket{collector="aws_elb",le="+Inf"} 1
cloudcost_exporter_collector_duration_seconds_sum{collector="aws_elb"} 1.4375e-05
cloudcost_exporter_collector_duration_seconds_count{collector="aws_elb"} 1
cloudcost_exporter_collector_duration_seconds_bucket{collector="aws_rds",le="+Inf"} 1
cloudcost_exporter_collector_duration_seconds_sum{collector="aws_rds"} 1.75e-06
cloudcost_exporter_collector_duration_seconds_count{collector="aws_rds"} 1
# TYPE cloudcost_exporter_collector_total counter
cloudcost_exporter_collector_total{collector="aws_elb"} 3
cloudcost_exporter_collector_total{collector="aws_rds"} 3
  • Azure:
# HELP cloudcost_exporter_collector_duration_seconds Duration of a collector scrape in seconds
# TYPE cloudcost_exporter_collector_duration_seconds histogram
cloudcost_exporter_collector_duration_seconds_bucket{collector="azure_aks",le="+Inf"} 1
cloudcost_exporter_collector_duration_seconds_sum{collector="azure_aks"} 0.000239417
cloudcost_exporter_collector_duration_seconds_count{collector="azure_aks"} 1
# HELP cloudcost_exporter_collector_last_scrape_duration_seconds Duration of the last scrape in seconds.
# TYPE cloudcost_exporter_collector_last_scrape_duration_seconds gauge
cloudcost_exporter_collector_last_scrape_duration_seconds{collector="azure_aks",provider="azure"} 0.000239417
# TYPE cloudcost_exporter_collector_total counter
cloudcost_exporter_collector_total{collector="azure_aks"} 1

Relates to https://github.com/grafana/deployment_tools/issues/413121

Contributor

@nikimanoledaki nikimanoledaki left a comment


I found the documentation for the promhttp package really helpful for learning more about the Gatherer and the different instrumentation options for the handlers: https://pkg.go.dev/github.com/prometheus/client_golang/prometheus/promhttp

It's also interesting to learn more about how kube-state-metrics does this since it is also a custom Prometheus exporter: https://github.com/kubernetes/kube-state-metrics/blob/main/pkg/app/server.go#L90
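
For reference, the Gatherer-based handler wiring described in those docs looks roughly like this (a sketch based on the promhttp documentation, not this repo's actual server code):

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// A dedicated registry is also a prometheus.Gatherer.
	reg := prometheus.NewRegistry()
	// ... register the exporter's collectors on reg ...

	// HandlerFor exposes whatever the Gatherer returns; InstrumentMetricHandler
	// adds promhttp_metric_handler_* metrics about the /metrics requests themselves.
	handler := promhttp.InstrumentMetricHandler(
		reg,
		promhttp.HandlerFor(reg, promhttp.HandlerOpts{
			ErrorHandling: promhttp.ContinueOnError,
		}),
	)

	http.Handle("/metrics", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}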

@leonorfmartins
Contributor Author

Thanks for the pointer, Niki! It got me wondering if we should also return the total number of metrics, like KSM does... on the other hand, since we are now emitting one metric per collector, I guess we could just count the number of metrics 🤔

@leonorfmartins leonorfmartins marked this pull request as ready for review November 26, 2025 15:49
@leonorfmartins leonorfmartins requested a review from a team November 26, 2025 15:49
Contributor

cindy commented Dec 1, 2025

Hey Leonor! This looks great. It looks like we can use the gatherer to validate our metrics as well, and possibly log when there are no metrics or when our metrics look wonky (like negative values where there shouldn't be any). This will be super useful! I also like the example you listed for failing to gather metrics in KSM:

func GatherAndCount(g prometheus.Gatherer) (int, error) {
	got, err := g.Gather()
	if err != nil {
		return 0, fmt.Errorf("gathering metrics failed: %w", err)
	}

I'm wondering whether the is_error label is useful. Since we're getting latency for our metrics, we can use that to create a latency SLO, but the error count will not be as useful to us in a bucket/histogram. I wonder if there's a better way to capture errors? Is it possible to create another metric that gives us a counter for successes and for failures? This blog has some suggestions on how that would look; it seems like we should have something like cloudcost_exporter_collector_total and cloudcost_exporter_collector_error_total. What do you think? That would allow us to create an error rate SLO.
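
As a hedged sketch of that validation idea (assuming a registry reg and an slog logger; none of this is code from the PR), the gathered metric families could be checked like this:

package scrape

import (
	"fmt"
	"log/slog"

	"github.com/prometheus/client_golang/prometheus"
)

// checkMetrics uses the Gatherer output to spot missing or "wonky" metrics,
// e.g. unexpected negative gauge values.
func checkMetrics(reg prometheus.Gatherer, logger *slog.Logger) error {
	mfs, err := reg.Gather()
	if err != nil {
		return fmt.Errorf("gathering metrics failed: %w", err)
	}
	if len(mfs) == 0 {
		logger.Warn("no metrics were gathered")
	}
	for _, mf := range mfs {
		for _, m := range mf.GetMetric() {
			if g := m.GetGauge(); g != nil && g.GetValue() < 0 {
				logger.Warn("unexpected negative gauge value",
					"metric", mf.GetName(), "value", g.GetValue())
			}
		}
	}
	return nil
}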

now := time.Now()
defer wg.Done()

duration, hasError := gatherer.CollectWithGatherer(collectCtx, c, ch, a.logger)
Contributor

@nikimanoledaki nikimanoledaki Dec 17, 2025


Are we collecting metrics twice here via CollectWithGatherer? 🤔

Contributor

nikimanoledaki commented Dec 17, 2025

I’m trying to understand the strategy around the operational metrics with this change. :)

Checking my understanding of this change: there is already a Gatherer interface implemented for the Prometheus collector at the CSP level. This PR would add a Gatherer for the custom collectors at the resource level. Let me know if that's right or if any of my starting assumptions are wrong 👍

Looking at how these new metrics fit with the existing operational metrics, we have the following at the moment:

  • We have some metrics starting with cloudcost_exporter_<csp>:
    • AWS: cloudcost_exporter_aws_collector_success{collector="S3"|"aws_ec2"}
    • We don’t have the same^ for gcp e.g. no metrics that start with cloudcost_exporter_gcp
  • We have the following cloudcost_exporter_collector_* metrics (screenshot)

Could we document how the new operational metrics (cloudcost_exporter_collector_total & cloudcost_exporter_collector_duration_seconds_count|sum|bucket) are different from the existing operational metrics (cloudcost_exporter_<csp> & cloudcost_exporter_collector_*)? Or, which ones we are replacing and for what purpose? I know you explained some of this before. It would be great to align on this again. 😊 Thank you @leonorfmartins!! 🌟

Contributor

nikimanoledaki commented Dec 17, 2025

Specifically would be great to document this here: https://github.com/grafana/cloudcost-exporter/blob/main/docs/metrics/providers.md

And/or add a new metrics/collectors.md?

Contributor Author

leonorfmartins commented Dec 17, 2025

yes, I can definitely add documentation for these metrics 😃 I'll just give some context here, for the sake of quickly answering your questions:

Are we collecting metrics twice here?

Yes, temporarily, especially all the metrics you are mentioning below such as cloudcost_exporter_collector_success or cloudcost_collector_last_scrape_error and cloudcost_collector_last_scrape_time. They are still being used by some of our dashboards, but we have seen that they are not very reliable. Still, until we have an improved dashboard to monitor all our collectors, I wouldn't want to get rid of them, as having some collectors monitored is better than having none. As you already pointed out, the existing metrics don't even cover all provider collectors.
I can communicate this better with a comment in the code highlighting this temporary situation as well 👍

So, the plan is to eventually have only the metrics we are adding in this PR. Metrics such as

  • cloudcost_exporter_collector_success
  • cloudcost_collector_last_scrape_error
  • cloudcost_collector_last_scrape_time
  • cloudcost_exporter_<csp>

are to be deprecated, as they don't give us an accurate way to monitor our collectors.

Contributor

@nikimanoledaki nikimanoledaki left a comment


Thanks @leonorfmartins, I've left some suggestions 😊 LGTM after this 👍

Comment on lines +87 to +94
}

if _, err := tempRegistry.Gather(); err != nil {
	hasError = true
	logger.LogAttrs(ctx, slog.LevelError, "did not detect gatherer",
		slog.String("collector", c.Name()),
		slog.String("message", err.Error()),
	)
Contributor


This call to tempRegistry.Gather() won't work because we are calling c.Collect() directly earlier, which means that the metrics are sent to a ch channel. 👀 Essentially it repeats the metric collection, which is either expensive or yields no metrics.

It can't be used as a backup for c.Collect either, since calling tempRegistry.Gather() would call c.Collect() again and fail for the same reason.

We can remove this safely.

Suggested change
}
if _, err := tempRegistry.Gather(); err != nil {
	hasError = true
	logger.LogAttrs(ctx, slog.LevelError, "did not detect gatherer",
		slog.String("collector", c.Name()),
		slog.String("message", err.Error()),
	)

Contributor Author


Ok, this took me a while to understand as well, but let me try to explain why removing gather really defeats the purpose of this PR and why it doesn't collect metrics twice:

This call to tempRegistry.Gather() won't work because we are calling c.Collect() directly earlier, which means that the metrics are sent to a ch channel.

Gather doesn't collect metrics; it just reads the state of the metrics registered in the temporary registry.

We need to call c.Collect because that call is the only one actually collecting metrics, e.g. doing the API call. Gather is called on tempRegistry, which is the registry Gather will use to check the metrics' status and see if there are any errors. The purpose of the gather is not to be used as a backup but to "ensure that the returned slice is valid and self-consistent so that it can be used for valid exposition. As an exception to the strict consistency requirements described for metric.Desc, Gather will tolerate different sets of label names for metrics of the same metric family." (source)

Basically, it doesn't collect again; it just checks the temp registry where the metrics are and verifies whether they are ok or not.

How?

I'm not a Prometheus expert so I'm not sure I can answer all the questions, but from what I can understand, looking at the Gather implementation, here's the part where the metric is actually checked: https://github.com/prometheus/client_golang/blob/2cd067eb23c940d3a1335ebc75eaf94e3037d8a9/prometheus/registry.go#L456

It calls collector.Collect. Collector is a prometheus interface with Collect and Describe methods. For a plain metric, Collect just reads the metric itself. So, it doesn't actually collect anything.
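
A small illustration of that point (assumption: a standalone example, not this PR's code): registering a plain counter in a registry and calling Gather only serializes its current value; no external scrape happens.

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()
	c := prometheus.NewCounter(prometheus.CounterOpts{Name: "demo_total", Help: "demo"})
	reg.MustRegister(c)

	c.Add(3)

	// Gather calls the counter's Collect, which only emits its stored value.
	mfs, err := reg.Gather()
	if err != nil {
		fmt.Println("gather error:", err)
		return
	}
	fmt.Println(mfs[0].GetMetric()[0].GetCounter().GetValue()) // 3 — no scrape or API call happened
}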

To sum up, why I think using Gather is interesting: it just inspects how metrics are being collected, without actually collecting them. It seems suitable for what we want, since we need to monitor whether there were any errors.

I'm not sure if I was able to answer your questions, let me know!

@leonorfmartins leonorfmartins enabled auto-merge (squash) January 6, 2026 11:41