ComputeDomain: explore exposing Prometheus metrics

It might make sense to expose system state via canonical Prometheus metrics. Let's not do this only for the sake of "adding metrics", but instead properly think through what is going to be of _value_ for health monitoring, alerting, and debugging.

Some thoughts:
- The controller pod might be the component of choice for exposing metrics about global system state: the current ComputeDomain count, the transient error count, the state of any individual ComputeDomain, ...
- Maybe each plugin pod should also serve a Prometheus endpoint with metrics about itself.
- Think through the entire pipeline: how do we point canonical scrapers to these endpoints? Maybe with the ServiceMonitor primitive from Prometheus Operator?

The real task here is to do quite a bit more thinking and planning before building anything, because what to build isn't obvious at all.

ComputeDomain: explore exposing Prometheus metrics #352
