Add prometheus metrics by jbiers · Pull Request #648 · rancher/backup-restore-operator

jbiers · 2025-01-17T21:10:53Z

Issue:

Solves #353. Also relates to SURE-8367.

charts/rancher-backup/values.yaml

pkg/controllers/backup/controller.go

pkg/monitoring/metrics.go

pkg/controllers/restore/controller.go

charts/rancher-backup/templates/service-monitor.yaml

alexandreLamarre · 2025-01-29T19:47:34Z

Also @jbiers, the rancher/hull tests need to be updated for the new chart flag.

pkg/controllers/backup/controller.go

alexandreLamarre · 2025-02-07T16:18:14Z

pkg/monitoring/metrics.go

IMO this is too granular. For the time being I'd recommend setting this to something less granular, like: 500, 1000, 2500, 5000, 7500, 10000.

Metrics like these are useful for customer performance debugging, so the ideal bucket boundaries will vary with each customer's environment.

Ideally we would use exponential histograms (native histograms) to auto-tune bucket boundaries. However, these were only introduced in Prometheus v2.40.X and still require a feature flag on the Prometheus side, so we shouldn't enable them by default.

The other option is to have a Helm chart value that customizes the bucket boundaries (probably the best option until exponential/native histograms are stabilized).

All that being said, auto-tuned or configurable bucket boundaries won't be super useful in the initial metrics introduction, so I'll create a tech debt issue for this.
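For whoever picks up the configurable-buckets idea, here is a minimal sketch of how a comma-separated bucket list coming from a chart value could be parsed and validated. It is an assumption, not what this PR implements: `parseBucketsMs` and `defaultBucketsMs` are hypothetical names, and the defaults simply reuse the coarser millisecond boundaries suggested in this review.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// defaultBucketsMs reuses the coarser boundaries suggested in this review,
// in milliseconds. Hypothetical name, not taken from the PR.
var defaultBucketsMs = []float64{500, 1000, 2500, 5000, 7500, 10000}

// parseBucketsMs is a hypothetical helper that turns a comma-separated
// string (for example, one passed down from a chart value) into histogram
// bucket boundaries. An empty string falls back to the defaults.
func parseBucketsMs(raw string) ([]float64, error) {
	parts := strings.Split(raw, ",")
	buckets := make([]float64, 0, len(parts))
	for _, p := range parts {
		p = strings.TrimSpace(p)
		if p == "" {
			continue // tolerate stray commas and surrounding whitespace
		}
		v, err := strconv.ParseFloat(p, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid bucket boundary %q: %w", p, err)
		}
		if math.IsNaN(v) || math.IsInf(v, 0) || v <= 0 {
			return nil, fmt.Errorf("bucket boundary must be a positive finite number, got %q", p)
		}
		// Histogram buckets must be sorted in strictly increasing order.
		if n := len(buckets); n > 0 && v <= buckets[n-1] {
			return nil, fmt.Errorf("bucket boundaries must be strictly increasing: %v follows %v", v, buckets[n-1])
		}
		buckets = append(buckets, v)
	}
	if len(buckets) == 0 {
		return defaultBucketsMs, nil
	}
	return buckets, nil
}
```

Under these assumptions, `parseBucketsMs("500, 1000,,2500")` yields three boundaries, while `parseBucketsMs("1000,500")` returns an error because the values are not increasing. The resulting slice could then be handed to whatever histogram constructor the operator uses.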

Some further reading for anyone interested:

https://prometheus.io/docs/specs/native_histograms/
https://opentelemetry.io/docs/specs/otel/metrics/data-model/
https://opentelemetry.io/docs/specs/otel/compatibility/prometheus_and_openmetrics/#exponential-histograms

Actually @jbiers, we should probably increase the bucket upper limits here, since 10000ms is just 10s. For clusters with a lot of resources I'd expect every single backup to take longer than 10s, so the metric becomes useless.

Perhaps the new upper bound should be 10m? @mallardduck, you've got a better idea of what a good upper bound should be in prod deployments.

Added new buckets for 30s, 1min and 2min. This is something we can investigate further, though.
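To illustrate why the 10s upper bound was a problem, here is a small standalone sketch of how a classic cumulative histogram counts observations. `cumulativeCounts` is a hypothetical helper written for this comment, and the sample durations are made up for illustration; it only mimics the bucket semantics and does not use the Prometheus client library.

```go
package main

// cumulativeCounts mimics the counting rule of a classic cumulative
// histogram: counts[i] is the number of observations less than or equal to
// boundsMs[i], and the final element is the implicit +Inf bucket, which
// always equals the total number of observations.
// boundsMs is assumed to be sorted in increasing order.
func cumulativeCounts(boundsMs, observationsMs []float64) []int {
	counts := make([]int, len(boundsMs)+1)
	for _, o := range observationsMs {
		for i, b := range boundsMs {
			if o <= b {
				counts[i]++
			}
		}
		counts[len(boundsMs)]++ // +Inf bucket
	}
	return counts
}
```

With illustrative backup durations of 12s, 45s and 90s (12000, 45000 and 90000 ms), the boundaries 500 through 10000 leave every finite bucket at 0 and put all three observations in +Inf, so nothing can be said about the distribution. With boundaries of 10000, 30000, 60000 and 120000 ms, the counts become 0, 1, 2, 3 (and 3 for +Inf), which is enough to estimate percentiles.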

Add very basic unit tests for metrics that are complex to predict in e2e

jbiers · 2025-02-11T19:58:42Z

Followed @mallardduck's idea in having those time-related metrics be tested via unit tests since it's tricky to do via integration tests. (Thanks for the suggestion and the PR 🤝)

I'm removing the WIP label from the PR and requesting reviews again, as I believe this is in a good place now, with all proposed backup/restore metrics implemented and tested as far as possible.

I'd consider as future goals somewhat related to this issue:

[RFE] Implement Prometheus alerting rules.
[RFE] Build Grafana dashboards to monitor these metrics
[Tech Debt] Extend metrics unit testing
[RFE] Allow for histogram bucket customization via values
[RFE] Providing optional pprof profiles

mallardduck

Just a few notes...

pkg/operator/start.go

pkg/monitoring/metrics.go

pkg/operator/start.go

e2e/backup/suite_test.go

e2e/backup/restore_test.go

e2e/backup/backup_test.go

mallardduck

LGTM

jbiers requested a review from a team as a code owner January 17, 2025 21:10

jbiers changed the title ~~[WIP] Add promtheus metrics~~ [WIP] Add prometheus metrics Jan 17, 2025