Also @jbiers, we need to update the rancher/hull tests for the new chart flag.
69c85ae to ca508d0
pkg/monitoring/metrics.go
Outdated
IMO this is too granular; for the time being I'd recommend setting this to something less granular, like: 500, 1000, 2500, 5000, 7500, 10000.
Metrics like these are useful for customer performance debugging, so the ideal bucket boundaries will need to change based on their environments.
Ideally we would use exponential histograms (native histograms) to auto-tune bucket boundaries. However, these were only introduced in Prometheus v2.40.x and still require a feature flag on the Prometheus side, so we shouldn't enable them by default.
The other option is to have a Helm chart value that customizes the bucket boundaries (probably the best option until exponential/native histograms are stabilized).
All that being said, having auto/configurable bucket boundaries won't be super useful in the initial metrics introduction, so I'll create a tech debt issue for this.
Some further reading for anyone interested:
https://prometheus.io/docs/specs/native_histograms/
https://opentelemetry.io/docs/specs/otel/metrics/data-model/
https://opentelemetry.io/docs/specs/otel/compatibility/prometheus_and_openmetrics/#exponential-histograms
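For whoever picks up the configurable-boundaries idea, here is a minimal sketch of how a chart value could be turned into bucket boundaries. It assumes the value reaches the operator as a comma-separated string; `parseBuckets` and `defaultBucketsMs` are hypothetical names for illustration and are not part of this PR.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// defaultBucketsMs mirrors the coarser boundaries suggested in this review
// (milliseconds). Hypothetical name, used only for this sketch.
var defaultBucketsMs = []float64{500, 1000, 2500, 5000, 7500, 10000}

// parseBuckets turns a comma-separated list such as "500,1000,2500" into
// histogram upper bounds. An empty string falls back to a copy of
// defaultBucketsMs. Boundaries must be finite and strictly increasing.
func parseBuckets(raw string) ([]float64, error) {
	if strings.TrimSpace(raw) == "" {
		return append([]float64(nil), defaultBucketsMs...), nil
	}
	parts := strings.Split(raw, ",")
	buckets := make([]float64, 0, len(parts))
	for _, p := range parts {
		v, err := strconv.ParseFloat(strings.TrimSpace(p), 64)
		if err != nil {
			return nil, fmt.Errorf("invalid bucket boundary %q: %w", p, err)
		}
		if math.IsNaN(v) || math.IsInf(v, 0) {
			return nil, fmt.Errorf("bucket boundary %q must be finite", p)
		}
		if len(buckets) > 0 && v <= buckets[len(buckets)-1] {
			return nil, fmt.Errorf("bucket boundaries must be strictly increasing: %v follows %v",
				v, buckets[len(buckets)-1])
		}
		buckets = append(buckets, v)
	}
	return buckets, nil
}

func main() {
	buckets, err := parseBuckets("500, 1000, 2500, 5000, 7500, 10000")
	if err != nil {
		panic(err)
	}
	fmt.Println(buckets)
}
```

Rejecting empty entries and out-of-order values up front means a typo in the chart value fails at startup rather than silently producing a misleading histogram.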
Actually @jbiers, we should probably increase the bucket upper limits here, since 10000ms is just 10s. For clusters with a lot of resources I'd expect every single backup to take longer than 10s, so the metric becomes useless.
Perhaps the new upper bound should be 10m? @mallardduck, you've got a better idea of what a good upper bound should be in prod deployments.
Added new buckets for 30s, 1min and 2min. This is something we can investigate further, though.
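To illustrate why the upper bound matters, here is a rough sketch of how a classic (non-native) Prometheus histogram counts observations cumulatively. The durations and boundary values are made up for illustration and are not necessarily the ones in `pkg/monitoring/metrics.go`; `cumulativeCounts` is a hypothetical helper, not client library code.

```go
package main

import "fmt"

// cumulativeCounts mimics how a classic Prometheus histogram counts
// observations: each bucket counts every value less than or equal to its
// upper bound, and a final implicit +Inf bucket counts everything.
func cumulativeCounts(bounds, observations []float64) []uint64 {
	counts := make([]uint64, len(bounds)+1) // last slot is the +Inf bucket
	for _, v := range observations {
		for i, b := range bounds {
			if v <= b {
				counts[i]++
			}
		}
		counts[len(bounds)]++
	}
	return counts
}

func main() {
	// Hypothetical backup durations in seconds, for illustration only.
	durations := []float64{4, 25, 45, 90}

	// Assumed boundaries in seconds: one set capped at 10s, one extended
	// with 30s, 1min and 2min buckets.
	capped := []float64{0.5, 1, 2.5, 5, 7.5, 10}
	extended := append(append([]float64{}, capped...), 30, 60, 120)

	fmt.Println("top bound 10s: ", cumulativeCounts(capped, durations))
	fmt.Println("top bound 120s:", cumulativeCounts(extended, durations))
}
```

With the 10s cap, three of the four sample durations are only visible in the +Inf bucket, so nothing distinguishes a 25s backup from a 90s one. The extended boundaries separate them.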
d2fddf7 to 9042994
95e061c to dd28984
Add very basic unit tests for metrics that are complex to predict in e2e
Followed @mallardduck's idea of testing those time-related metrics via unit tests, since that's tricky to do via integration tests. (Thanks for the suggestion and the PR 🤝) I'm removing the WIP label from the PR and requesting reviews again, as I think this is in a good place now, with all proposed backup/restore metrics implemented and tested as far as possible. I'd consider as future goals somewhat related to this issue:
Issue:
Solves #353. Also relates to SURE-8367.