It might make sense to expose system state via canonical Prometheus metrics. Let's not do this only for the sake of "adding metrics", but instead properly think through what is going to be of value for health monitoring, alerting, and debugging.
Some thoughts:
- The controller pod might be the component of choice for exposing metrics about global system state: the current ComputeDomain count, the transient error count, the state of any individual ComputeDomain, ...
- Maybe each plugin pod should also expose a Prometheus endpoint with metrics about itself.
- Think through the entire pipeline: how do we point canonical scrapers at these endpoints? Maybe with the ServiceMonitor primitive from Prometheus Operator?
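
To make the discussion a bit more concrete, here is a minimal sketch of what a controller-side metrics endpoint could emit. It uses only the Go standard library and writes the Prometheus text exposition format by hand. The `Registry` type and the metric names (`compute_domains`, `compute_domain_transient_errors_total`) are made up for illustration and are not an agreed-upon design. A real implementation would most likely use the official Prometheus Go client library instead.

```go
package main

import (
	"fmt"
	"net/http"
	"sort"
	"strings"
	"sync"
)

// metric holds one sample. kind is "gauge" or "counter".
type metric struct {
	kind  string
	value float64
}

// Registry is a hypothetical, minimal stand-in for a real metrics library.
type Registry struct {
	mu      sync.Mutex
	metrics map[string]metric
}

func NewRegistry() *Registry {
	return &Registry{metrics: make(map[string]metric)}
}

// SetGauge records the current value of a gauge, e.g. the ComputeDomain count.
func (r *Registry) SetGauge(name string, v float64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.metrics[name] = metric{kind: "gauge", value: v}
}

// IncCounter increments a monotonic counter, e.g. transient errors seen.
func (r *Registry) IncCounter(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	m := r.metrics[name]
	m.kind = "counter"
	m.value++
	r.metrics[name] = m
}

// Render returns all metrics in the Prometheus text exposition format,
// sorted by name so that the output is deterministic.
func (r *Registry) Render() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	names := make([]string, 0, len(r.metrics))
	for name := range r.metrics {
		names = append(names, name)
	}
	sort.Strings(names)
	var b strings.Builder
	for _, name := range names {
		m := r.metrics[name]
		fmt.Fprintf(&b, "# TYPE %s %s\n%s %g\n", name, m.kind, name, m.value)
	}
	return b.String()
}

// ServeHTTP lets the registry be mounted as a /metrics handler.
func (r *Registry) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, r.Render())
}

func main() {
	reg := NewRegistry()
	reg.SetGauge("compute_domains", 3)                     // hypothetical metric name
	reg.IncCounter("compute_domain_transient_errors_total") // hypothetical metric name
	fmt.Print(reg.Render())
	// To serve it (port is an assumption):
	//   http.Handle("/metrics", reg)
	//   http.ListenAndServe(":8080", nil)
}
```

Which metrics actually belong here, and with which labels, is exactly the open question above. The sketch only shows the mechanics of exposing a gauge and a counter.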
The real task here is to do quite a bit more thinking and planning before building anything, because what to build is not obvious at all.