Skip to content

Add prometheus metrics#648

Merged
jbiers merged 26 commits intorancher:mainfrom
jbiers:add-promtheus-metrics
Feb 12, 2025
Merged

Add prometheus metrics#648
jbiers merged 26 commits intorancher:mainfrom
jbiers:add-promtheus-metrics

Conversation

@jbiers
Copy link
Contributor

@jbiers jbiers commented Jan 17, 2025

Issue:

Solves #353. Also relates to SURE-8367.

@jbiers jbiers requested a review from a team as a code owner January 17, 2025 21:10
@jbiers jbiers changed the title [WIP] Add promtheus metrics [WIP] Add prometheus metrics Jan 17, 2025
@alexandreLamarre
Copy link
Contributor

Also @jbiers need to update the rancher/hull tests for the new chart flag

@jbiers

This comment was marked as outdated.

@jbiers jbiers force-pushed the add-promtheus-metrics branch from 69c85ae to ca508d0 Compare February 7, 2025 03:13
Copy link
Contributor

@alexandreLamarre alexandreLamarre Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this is too granular, for the time being I'd recommend,setting this to something less granular like : 500,1000, 2500, 5000, 7500,,10000.


Metrics like these are useful for customer performance debugging, so the ideal bucket boundaries will need to change based on their environments

Ideally we would use exponential histograms (native histograms) to auto-tune bucket boundaries, however this was only introduced in v2.40.X for prometheus and still requires a feature flag on the prometheus side to enable, so we shouldn't have this be enabled by default.

The other option is to have a helm template value that customizes the bucket boundary (probably the best option until expo/native histograms are stabilized)

All that being said having auto/configurable bucket boundaries won't be super useful in the initial metrics introduction
so I'll create a tech debt issue for this

Some further reading for anyone interested:

https://prometheus.io/docs/specs/native_histograms/
https://opentelemetry.io/docs/specs/otel/metrics/data-model/
https://opentelemetry.io/docs/specs/otel/compatibility/prometheus_and_openmetrics/#exponential-histograms

Copy link
Contributor

@alexandreLamarre alexandreLamarre Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually @jbiers we should probably increase the bucket upper limits here since 1000ms is just 10s for clusters with a lot of resources I'd expect every single backup to take longer than 10s... so the metric becomes useless

Perhaps the new upper bound should be 10m? @mallardduck you've got a better idea of what a good upper bound should be in prod deployments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added new buckets for 30s, 1min and 2min. This is something we can investigate further though

@jbiers jbiers force-pushed the add-promtheus-metrics branch 16 times, most recently from d2fddf7 to 9042994 Compare February 10, 2025 20:59
@jbiers jbiers force-pushed the add-promtheus-metrics branch from 95e061c to dd28984 Compare February 11, 2025 19:20
@jbiers jbiers changed the title [WIP] Add prometheus metrics Add prometheus metrics Feb 11, 2025
@jbiers
Copy link
Contributor Author

jbiers commented Feb 11, 2025

Followed @mallardduck's idea in having those time-related metrics be tested via unit tests since it's tricky to do via integration tests. (Thanks for the suggestion and the PR 🤝)

I'm removing the WIP label from the PR and requesting reviews again, as I suppose this is in a good place now with all proposed backup/restore metrics implemented and tested as possible.

I'd consider as future goals somewhat related to this issue:

  • [RFE] Implement Prometheus alerting rules.
  • [RFE] Build Grafana dashboards to monitor these metrics
  • [Tech Debt] Extend metrics unit testing
  • [RFE] Allow for histogram bucket customization via values
  • [RFE] Providing optional pprof profiles

Copy link
Member

@mallardduck mallardduck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few notes...

Copy link
Member

@mallardduck mallardduck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants