Add histogram using gatherer interface to detect errors when collecting prometheus metrics #711
base: main
Conversation
Force-pushed from ef69e3b to 175359e.
I found the documentation for the promhttp package really helpful for learning more about the Gatherer and the different instrumentation options for the handlers: https://pkg.go.dev/github.com/prometheus/client_golang/prometheus/promhttp
It's also interesting to learn more about how kube-state-metrics does this since it is also a custom Prometheus exporter: https://github.com/kubernetes/kube-state-metrics/blob/main/pkg/app/server.go#L90
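As a side note, here is a minimal sketch of how a custom Gatherer is typically wired into promhttp; the registry variable and handler options below are illustrative and not taken from this exporter:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// A dedicated registry implements both prometheus.Registerer and prometheus.Gatherer.
	reg := prometheus.NewRegistry()
	// Custom collectors would be registered here, e.g. reg.MustRegister(myCollector).

	// HandlerFor exposes whatever the Gatherer returns; ContinueOnError serves
	// the metrics that could be gathered even if gathering partially fails.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{
		ErrorHandling: promhttp.ContinueOnError,
	}))
	_ = http.ListenAndServe(":8080", nil)
}
```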
Thanks for the pointer, Niki! It got me wondering if we should also return the total number of metrics, like KSM does... on the other hand, since we are now emitting one metric per collector, I guess we could just count the number of metrics 🤔
Hey Leonor! This looks great. It looks like we can use the gatherer to validate our metrics as well, and possibly log when there are no metrics or when our metrics look wonky (like negative values where there shouldn't be any). This will be super useful! I also like the example you listed for failing to list metrics in KSM:

```go
func GatherAndCount(g prometheus.Gatherer) (int, error) {
	got, err := g.Gather()
	if err != nil {
		return 0, fmt.Errorf("gathering metrics failed: %w", err)
	}
```

I'm wondering the …
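For what it's worth, a hedged sketch of what counting the gathered metrics could look like; the helper below is illustrative and not copied from KSM or this PR:

```go
import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// countGatheredMetrics gathers from g and counts the individual metrics
// across all metric families (one entry per label combination).
func countGatheredMetrics(g prometheus.Gatherer) (int, error) {
	families, err := g.Gather()
	if err != nil {
		return 0, fmt.Errorf("gathering metrics failed: %w", err)
	}
	total := 0
	for _, mf := range families {
		total += len(mf.GetMetric())
	}
	return total, nil
}
```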
Force-pushed from 608b335 to c1ab549.
```go
now := time.Now()
defer wg.Done()

duration, hasError := gatherer.CollectWithGatherer(collectCtx, c, ch, a.logger)
```
Are we collecting metrics twice here via CollectWithGatherer? 🤔
Temporarily yes, until we deprecate the old metrics; see the comment right below: https://github.com/grafana/cloudcost-exporter/pull/711/changes#diff-557e359d6147fd859c120871f3c4b242f8781a32fb05b6a343978ff33c3b8e2fR255
I'm trying to understand the strategy around the operational metrics with this change. :) Checking my understanding of this change: there is already a … Looking at how these new metrics fit with the existing operational metrics, we have the following at the moment: …

Could we document how the new operational metrics (…) fit in with the existing ones?
Specifically, it would be great to document this here: https://github.com/grafana/cloudcost-exporter/blob/main/docs/metrics/providers.md and/or add a new …
Yes, I can definitely add documentation for these metrics 😃 I'll just give some context here, for the sake of quickly answering your questions:

Yes, temporarily, especially all the metrics you are mentioning below, such as …. The plan is to eventually keep only the metrics we are adding in this PR. Metrics such as … are to be deprecated, as they don't accurately provide a way for us to monitor our collectors.
Thanks @leonorfmartins, I've left some suggestions 😊 LGTM after this 👍
```go
}

if _, err := tempRegistry.Gather(); err != nil {
	hasError = true
	logger.LogAttrs(ctx, slog.LevelError, "did not detect gatherer",
		slog.String("collector", c.Name()),
		slog.String("message", err.Error()),
	)
```
This call to tempRegistry.Gather() won't work because we are calling c.Collect() directly earlier, which means the metrics are sent to a ch channel. 👀 Essentially it repeats the metric collection, which is either expensive or yields no metrics.

It can't be used as a backup for c.Collect, since calling tempRegistry.Gather() would call c.Collect() again and fail for the same reason.

We can remove this safely.
Ok, this took me a while to understand as well, but let me try to explain why removing Gather really defeats the purpose of this PR, and why it doesn't collect metrics twice:

> This call to tempRegistry.Gather() won't work because we are calling c.Collect() directly earlier, which means that the metrics are sent to a ch channel.

Gather doesn't collect metrics; it just reads the state of the metrics registered in the temporary registry.

We need to call c.Collect because that call is the only one actually collecting metrics, e.g. doing the API calls. Gather is called on tempRegistry, which is the registry Gather uses to check the status of the metrics and see if there are any errors. The purpose of Gather is not to be a backup but to "ensure that the returned slice is valid and self-consistent so that it can be used for valid exposition. As an exception to the strict consistency requirements described for metric.Desc, Gather will tolerate different sets of label names for metrics of the same metric family." (source)

Basically, it doesn't collect again; it just inspects the temp registry where the metrics are and checks whether they are ok or not.

How?

I'm not a Prometheus expert so I'm not sure I can answer all the questions, but from what I can understand, looking at the Gather implementation, here's the part where the metric is actually checked: https://github.com/prometheus/client_golang/blob/2cd067eb23c940d3a1335ebc75eaf94e3037d8a9/prometheus/registry.go#L456

It calls collector.Collect. Collector is a Prometheus interface with Describe and Collect methods, and for the plain metrics registered in the temp registry, Collect just reads the metric's current state. So it doesn't actually collect anything from the cloud APIs.

To sum up, why I think using Gather is interesting: it inspects how metrics are being collected without actually collecting them, which seems suitable for what we want, since we need to monitor whether there were any errors.

I'm not sure if I was able to answer your questions, let me know!
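To make that concrete, here is a rough sketch of the pattern described above; the interface shape, function signature, metric registration, and log message are assumptions for illustration and may differ from the actual CollectWithGatherer in this PR:

```go
import (
	"context"
	"log/slog"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Collector is a stand-in for this repo's collector interface (hypothetical shape).
type Collector interface {
	Name() string
	Collect(ch chan<- prometheus.Metric) error
}

// collectWithGatherer sketches the idea: c.Collect does the real work (API calls,
// sending metrics to ch), while the temporary registry's Gather only validates the
// state of whatever was registered into it; it does not re-run the collection.
func collectWithGatherer(ctx context.Context, c Collector, ch chan<- prometheus.Metric, logger *slog.Logger) (time.Duration, bool) {
	start := time.Now()
	hasError := false

	// Only the metrics we want to validate are registered here,
	// e.g. tempRegistry.MustRegister(someMetricVec) -- not the collector itself.
	tempRegistry := prometheus.NewRegistry()

	// The only place metrics are actually collected.
	if err := c.Collect(ch); err != nil {
		hasError = true
	}

	// Gather reads the current state of the registered metrics and reports
	// consistency/exposition errors; it does not call c.Collect again.
	if _, err := tempRegistry.Gather(); err != nil {
		hasError = true
		logger.LogAttrs(ctx, slog.LevelError, "gatherer detected invalid metrics",
			slog.String("collector", c.Name()),
			slog.String("message", err.Error()),
		)
	}

	return time.Since(start), hasError
}
```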
Co-authored-by: Niki <[email protected]>

What this does
The Gatherer interface comes from the prometheus package and allows us to inspect the metrics being collected and check for errors.

We scan each collector and push a native histogram that looks like this: …

This histogram can then be used to plot charts for each collector's health (e.g. error rate and duration). Since it's a histogram, it will also be easier to build SLOs on top of it.
Besides it, there are 2 other counter metrics:
- cloudcost_exporter_collector_total
- cloudcost_exporter_collector_error_total

which will increment on each scrape and on errors, respectively.
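As a hedged sketch of what the declarations might look like: the counter names below come from this PR, but the histogram name, labels, help strings, and native-histogram options are assumptions for illustration.

```go
import "github.com/prometheus/client_golang/prometheus"

var (
	// Hypothetical name and labels for the per-collector native histogram.
	collectorDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "cloudcost_exporter_collector_duration_seconds", // assumed name
		Help: "Duration of each collector scrape.",
		// A non-zero bucket factor enables native histograms in client_golang.
		NativeHistogramBucketFactor: 1.1,
	}, []string{"collector", "error"})

	// Counter names taken from the PR description.
	collectorScrapes = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "cloudcost_exporter_collector_total",
		Help: "Total number of collector scrapes.",
	}, []string{"collector"})

	collectorErrors = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "cloudcost_exporter_collector_error_total",
		Help: "Total number of collector scrape errors.",
	}, []string{"collector"})
)
```

Recording a scrape would then be a matter of incrementing collectorScrapes (and collectorErrors on failure) and observing the elapsed time on collectorDuration with the collector's name as a label.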
We can also leverage the gatherer to more precisely assert on metric expectations.
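For example, a sketch of how such an assertion could look using client_golang's testutil package; the registry setup, label values, and help string below are stand-ins, not code from this PR:

```go
import (
	"strings"
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func TestCollectorCounter(t *testing.T) {
	// Stand-in for the exporter's real registry and counter.
	reg := prometheus.NewRegistry()
	scrapes := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "cloudcost_exporter_collector_total",
		Help: "Total number of collector scrapes.",
	}, []string{"collector"})
	reg.MustRegister(scrapes)
	scrapes.WithLabelValues("aws").Inc()

	expected := `
# HELP cloudcost_exporter_collector_total Total number of collector scrapes.
# TYPE cloudcost_exporter_collector_total counter
cloudcost_exporter_collector_total{collector="aws"} 1
`
	// GatherAndCompare gathers from reg and diffs the named metric family
	// against the expected text-format exposition.
	if err := testutil.GatherAndCompare(reg, strings.NewReader(expected),
		"cloudcost_exporter_collector_total"); err != nil {
		t.Fatal(err)
	}
}
```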
Test
Relates to https://github.com/grafana/deployment_tools/issues/413121