
Conversation

Contributor

@leonorfmartins leonorfmartins commented Nov 25, 2025

What this does

The Gatherer interface comes from the prometheus package and allows us to inspect the metrics being collected and check for errors.
We wrap each collector's scrape and record its duration in a native histogram that looks like this:

cloudcost_exporter_collector_duration_seconds{
  collector=<my_collector>
} <scrape_duration>

This histogram can then be used to plot charts for each collector's health (e.g. error rate and duration). Since it's a histogram, it will also be easier to build SLOs on top of it.

Besides that, there are two other counter metrics:

  • cloudcost_exporter_collector_total
  • cloudcost_exporter_collector_error_total

which increment on each scrape and on each error, respectively.

We can also leverage the gatherer to assert more precisely on metric expectations in tests.
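
To make the mechanism above concrete, here is a minimal sketch (the metric names match this PR; the package, variable, and function names are placeholders I'm assuming, not the actual code):

package scrape

// Sketch only: time each collector scrape, record it in the duration
// histogram, and bump the two counters. These vars would be registered
// on the exporter's registry.

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	scrapeDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "cloudcost_exporter_collector_duration_seconds",
		Help: "Duration of a collector scrape in seconds",
		// A bucket factor > 1 enables native histogram buckets in recent client_golang versions.
		NativeHistogramBucketFactor: 1.1,
	}, []string{"collector"})

	scrapesTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "cloudcost_exporter_collector_total",
		Help: "Total number of collector scrapes",
	}, []string{"collector"})

	scrapeErrorsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "cloudcost_exporter_collector_error_total",
		Help: "Total number of collector scrape errors",
	}, []string{"collector"})
)

// timeScrape is a stand-in for the wrapper around a collector's Collect call.
func timeScrape(collector string, scrape func() error) (time.Duration, bool) {
	start := time.Now()
	err := scrape()
	duration := time.Since(start)

	scrapeDuration.WithLabelValues(collector).Observe(duration.Seconds())
	scrapesTotal.WithLabelValues(collector).Inc()
	if err != nil {
		scrapeErrorsTotal.WithLabelValues(collector).Inc()
	}
	return duration, err != nil
}

With the two counters, an error-rate SLO could then be derived from something like rate(cloudcost_exporter_collector_error_total[5m]) / rate(cloudcost_exporter_collector_total[5m]).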

Test

  • GCP:
# TYPE cloudcost_exporter_collector_duration_seconds histogram
cloudcost_exporter_collector_duration_seconds_bucket{collector="GCS",le="+Inf"} 1
cloudcost_exporter_collector_duration_seconds_sum{collector="GCS"} 0.000228667
cloudcost_exporter_collector_duration_seconds_count{collector="GCS"} 1
cloudcost_exporter_collector_duration_seconds_bucket{collector="cloudsql",le="+Inf"} 1
cloudcost_exporter_collector_duration_seconds_sum{collector="cloudsql"} 5.3625e-05
cloudcost_exporter_collector_duration_seconds_count{collector="cloudsql"} 1
cloudcost_exporter_collector_duration_seconds_bucket{collector="gcp_gke",le="+Inf"} 1
cloudcost_exporter_collector_duration_seconds_sum{collector="gcp_gke"} 2.125e-06
cloudcost_exporter_collector_duration_seconds_count{collector="gcp_gke"} 1
# TYPE cloudcost_exporter_collector_total counter
cloudcost_exporter_collector_total{collector="GCS"} 1
cloudcost_exporter_collector_total{collector="cloudsql"} 1
cloudcost_exporter_collector_total{collector="gcp_gke"} 1
  • AWS:
# TYPE cloudcost_exporter_collector_duration_seconds histogram
cloudcost_exporter_collector_duration_seconds_bucket{collector="aws_elb",le="+Inf"} 1
cloudcost_exporter_collector_duration_seconds_sum{collector="aws_elb"} 1.4375e-05
cloudcost_exporter_collector_duration_seconds_count{collector="aws_elb"} 1
cloudcost_exporter_collector_duration_seconds_bucket{collector="aws_rds",le="+Inf"} 1
cloudcost_exporter_collector_duration_seconds_sum{collector="aws_rds"} 1.75e-06
cloudcost_exporter_collector_duration_seconds_count{collector="aws_rds"} 1
# TYPE cloudcost_exporter_collector_total counter
cloudcost_exporter_collector_total{collector="aws_elb"} 3
cloudcost_exporter_collector_total{collector="aws_rds"} 3
  • Azure:
# HELP cloudcost_exporter_collector_duration_seconds Duration of a collector scrape in seconds
# TYPE cloudcost_exporter_collector_duration_seconds histogram
cloudcost_exporter_collector_duration_seconds_bucket{collector="azure_aks",le="+Inf"} 1
cloudcost_exporter_collector_duration_seconds_sum{collector="azure_aks"} 0.000239417
cloudcost_exporter_collector_duration_seconds_count{collector="azure_aks"} 1
# HELP cloudcost_exporter_collector_last_scrape_duration_seconds Duration of the last scrape in seconds.
# TYPE cloudcost_exporter_collector_last_scrape_duration_seconds gauge
cloudcost_exporter_collector_last_scrape_duration_seconds{collector="azure_aks",provider="azure"} 0.000239417
# TYPE cloudcost_exporter_collector_total counter
cloudcost_exporter_collector_total{collector="azure_aks"} 1

Relates to https://github.com/grafana/deployment_tools/issues/413121

Contributor

@nikimanoledaki nikimanoledaki left a comment


I found the documentation for the promhttp package really helpful for learning more about the Gatherer and the different instrumentation options for the handlers: https://pkg.go.dev/github.com/prometheus/client_golang/prometheus/promhttp

It's also interesting to learn more about how kube-state-metrics does this since it is also a custom Prometheus exporter: https://github.com/kubernetes/kube-state-metrics/blob/main/pkg/app/server.go#L90
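
For reference, the Gatherer-based handler wiring described in those docs looks roughly like this (a sketch based on the promhttp documentation, not this repo's actual server code):

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// A dedicated registry is also a prometheus.Gatherer.
	reg := prometheus.NewRegistry()
	// ... register the exporter's collectors on reg ...

	// HandlerFor exposes whatever the Gatherer returns; InstrumentMetricHandler
	// adds promhttp_metric_handler_* metrics about the /metrics requests themselves.
	handler := promhttp.InstrumentMetricHandler(
		reg,
		promhttp.HandlerFor(reg, promhttp.HandlerOpts{
			ErrorHandling: promhttp.ContinueOnError,
		}),
	)

	http.Handle("/metrics", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}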

@leonorfmartins
Contributor Author

Thanks for the pointer, Niki! It got me wondering if we should also return the total number of metrics, like KSM does... on the other hand, since we are now emitting one metric per collector, I guess we could just count the number of metrics 🤔

@leonorfmartins leonorfmartins marked this pull request as ready for review November 26, 2025 15:49
@leonorfmartins leonorfmartins requested a review from a team November 26, 2025 15:49
Contributor

cindy commented Dec 1, 2025

Hey Leonor! This looks great. It looks like we can use the gatherer to validate our metrics as well, and possibly log when there are no metrics or when our metrics look wonky (like negative values where there shouldn't be any). This will be super useful! I also like the example you listed for failing to gather metrics in KSM:

func GatherAndCount(g prometheus.Gatherer) (int, error) {
	got, err := g.Gather()
	if err != nil {
		return 0, fmt.Errorf("gathering metrics failed: %w", err)
	}

I'm wondering whether the is_error label is useful. Since we're getting latency for our metrics, we can use that to create a latency SLO, but the error count will not be as useful to us in a bucket/histogram. I wonder if there's a better way to capture errors? Is it possible to create another metric that gives us a counter for successes and for failures? This blog has some suggestions on how that would look; it seems like we should have something like cloudcost_exporter_collector_total and cloudcost_exporter_collector_error_total. What do you think? That would allow us to create an error rate SLO.
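
As a hedged sketch of that validation idea (assuming a registry reg and an slog logger; none of this is code from the PR), the gathered metric families could be checked like this:

package scrape

import (
	"fmt"
	"log/slog"

	"github.com/prometheus/client_golang/prometheus"
)

// checkMetrics uses the Gatherer output to spot missing or "wonky" metrics,
// e.g. unexpected negative gauge values.
func checkMetrics(reg prometheus.Gatherer, logger *slog.Logger) error {
	mfs, err := reg.Gather()
	if err != nil {
		return fmt.Errorf("gathering metrics failed: %w", err)
	}
	if len(mfs) == 0 {
		logger.Warn("no metrics were gathered")
	}
	for _, mf := range mfs {
		for _, m := range mf.GetMetric() {
			if g := m.GetGauge(); g != nil && g.GetValue() < 0 {
				logger.Warn("unexpected negative gauge value",
					"metric", mf.GetName(), "value", g.GetValue())
			}
		}
	}
	return nil
}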

now := time.Now()
defer wg.Done()

duration, hasError := gatherer.CollectWithGatherer(collectCtx, c, ch, a.logger)
Contributor

@nikimanoledaki nikimanoledaki Dec 17, 2025


Are we collecting metrics twice here via CollectWithGatherer? 🤔

Contributor

nikimanoledaki commented Dec 17, 2025

I’m trying to understand the strategy around the operational metrics with this change. :)

Checking my understanding of this change: there is already a Gatherer interface implemented for the Prometheus collector at the CSP level. This PR would add a Gatherer for the custom collectors at the resource level. Let me know if that's right or if any of my starting assumptions are wrong 👍

Looking at how these new metrics fit with the existing operational metrics, we have the following at the moment:

  • We have some metrics starting with cloudcost_exporter_<csp>:
    • AWS: cloudcost_exporter_aws_collector_success{collector="S3"|"aws_ec2"}
    • We don’t have the same^ for gcp e.g. no metrics that start with cloudcost_exporter_gcp
  • We have the following cloudcost_exporter_collector_* metrics (screenshot)

Could we document how the new operational metrics (cloudcost_exporter_collector_total & cloudcost_exporter_collector_duration_seconds_count|sum|bucket) are different from the existing operational metrics (cloudcost_exporter_<csp> & cloudcost_exporter_collector_*)? Or, which ones we are replacing and for what purpose? I know you explained some of this before. It would be great to align on this again. 😊 Thank you @leonorfmartins!! 🌟

Contributor

nikimanoledaki commented Dec 17, 2025

Specifically would be great to document this here: https://github.com/grafana/cloudcost-exporter/blob/main/docs/metrics/providers.md

And/or add a new metrics/collectors.md?

Contributor Author

leonorfmartins commented Dec 17, 2025

yes, I can definitely add documentation for these metrics 😃 I'll just give some context here, for the sake of quickly answering your questions:

Are we collecting metrics twice here?

Yes, temporarily, especially all the metrics you are mentioning below such as cloudcost_exporter_collector_success or cloudcost_collector_last_scrape_error and cloudcost_collector_last_scrape_time. They are still being used by some of our dashboards, but we have seen that they are not very reliable. Still, until we have an improved dashboard to monitor all our collectors, I wouldn't want to get rid of them, as having some collectors monitored is better than having none. As you already pointed out, the existing metrics don't even cover all provider collectors.
I can communicate this better with a comment in the code highlighting this temporary situation as well 👍

So, the plan is to eventually have only the metrics we are adding in this PR. Metrics such as

  • cloudcost_exporter_collector_success
  • cloudcost_collector_last_scrape_error
  • cloudcost_collector_last_scrape_time
  • cloudcost_exporter_<csp>

are to be deprecated, as they don't give us an accurate way to monitor our collectors.

Contributor

@nikimanoledaki nikimanoledaki left a comment


Thanks @leonorfmartins, I've left some suggestions 😊 LGTM after this 👍

Comment on lines +87 to +94
}

if _, err := tempRegistry.Gather(); err != nil {
	hasError = true
	logger.LogAttrs(ctx, slog.LevelError, "did not detect gatherer",
		slog.String("collector", c.Name()),
		slog.String("message", err.Error()),
	)
Contributor


This call to tempRegistry.Gather() won't work because we are calling c.Collect() directly earlier, which means that the metrics are sent to a ch channel. 👀 Essentially it repeats the metric collection, which is either expensive or yields no metrics.

It can't be used as a backup for c.Collect either, since calling tempRegistry.Gather() would call c.Collect() again and fail for the same reason.

We can remove this safely.

Suggested change
}
if _, err := tempRegistry.Gather(); err != nil {
	hasError = true
	logger.LogAttrs(ctx, slog.LevelError, "did not detect gatherer",
		slog.String("collector", c.Name()),
		slog.String("message", err.Error()),
	)

Contributor Author


Ok, this took me a while to understand as well, but let me try to explain why removing gather really defeats the purpose of this PR and why it doesn't collect metrics twice:

This call to tempRegistry.Gather() won't work because we are calling c.Collect() directly earlier, which means that the metrics are sent to a ch channel.

Gather doesn't collect metrics; it just reads the state of the metrics registered in the temporary registry.

We need to call c.Collect because that call is the only one actually collecting metrics, e.g. doing the API call. Gather is called on tempRegistry, which is the registry Gather will use to check the metrics' status and see if there are any errors. The purpose of the gather is not to be used as a backup but to "ensure that the returned slice is valid and self-consistent so that it can be used for valid exposition. As an exception to the strict consistency requirements described for metric.Desc, Gather will tolerate different sets of label names for metrics of the same metric family." (source)

Basically, it doesn't collect again; it just checks the temp registry where the metrics are and verifies whether they are ok or not.

How?

I'm not a Prometheus expert so I'm not sure I can answer all the questions, but from what I can understand, looking at the Gather implementation, here's the part where the metric is actually checked: https://github.com/prometheus/client_golang/blob/2cd067eb23c940d3a1335ebc75eaf94e3037d8a9/prometheus/registry.go#L456

It calls collector.Collect. Collector is a prometheus interface with Collect and Describe methods. For a plain metric, Collect just reads the metric itself. So, it doesn't actually collect anything.
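
A small illustration of that point (assumption: a standalone example, not this PR's code): registering a plain counter in a registry and calling Gather only serializes its current value; no external scrape happens.

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()
	c := prometheus.NewCounter(prometheus.CounterOpts{Name: "demo_total", Help: "demo"})
	reg.MustRegister(c)

	c.Add(3)

	// Gather calls the counter's Collect, which only emits its stored value.
	mfs, err := reg.Gather()
	if err != nil {
		fmt.Println("gather error:", err)
		return
	}
	fmt.Println(mfs[0].GetMetric()[0].GetCounter().GetValue()) // 3 — no scrape or API call happened
}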

To sum up, why I think using Gather is interesting: it just inspects how metrics are being collected, without actually collecting them. It seems suitable for what we want, since we need to monitor whether there were any errors.

I'm not sure if I was able to answer your questions, let me know!

@leonorfmartins leonorfmartins enabled auto-merge (squash) January 6, 2026 11:41