ComputeDomain: explore exposing Prometheus metrics #352

@jgehrcke

It might make sense to expose system state via canonical Prometheus metrics. Let's not do this only for the sake of "adding metrics", but instead properly think through what is going to be of value for health monitoring, alerting, and debugging.

Some thoughts:

  • The controller pod might be the component of choice for exposing metrics about global system state: the current ComputeDomain count, the transient error count, the state of each individual ComputeDomain, ...
  • Maybe each plugin pod should also expose a Prometheus endpoint with metrics about itself.
  • Think through the entire pipeline: how do we point canonical scrapers at these endpoints? Maybe with the ServiceMonitor primitive from Prometheus Operator?
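
To make the first two bullets a bit more concrete, here is a rough sketch of what a controller-side metrics endpoint could look like. It is only an illustration, not a proposal for the final design: the type `controllerMetrics`, the metric names `compute_domains` and `compute_domain_transient_errors_total`, and the `METRICS_ADDR` environment variable are all made up for this example. To stay dependency-free it writes the Prometheus text exposition format by hand; a real implementation would more likely use the official Go client library.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"strings"
	"sync"
)

// controllerMetrics is a hypothetical holder for controller-level state.
// The fields mirror two of the ideas above: a ComputeDomain count (gauge)
// and a transient error count (counter).
type controllerMetrics struct {
	mu              sync.Mutex
	computeDomains  int
	transientErrors int
}

// SetComputeDomains records the current number of ComputeDomains.
func (m *controllerMetrics) SetComputeDomains(n int) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.computeDomains = n
}

// IncTransientErrors bumps the cumulative transient error count by one.
func (m *controllerMetrics) IncTransientErrors() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.transientErrors++
}

// Render returns the metrics in the Prometheus text exposition format.
// Metric names here are assumptions made for this sketch.
func (m *controllerMetrics) Render() string {
	m.mu.Lock()
	defer m.mu.Unlock()
	var b strings.Builder
	b.WriteString("# HELP compute_domains Current number of ComputeDomains.\n")
	b.WriteString("# TYPE compute_domains gauge\n")
	fmt.Fprintf(&b, "compute_domains %d\n", m.computeDomains)
	b.WriteString("# HELP compute_domain_transient_errors_total Transient errors seen so far.\n")
	b.WriteString("# TYPE compute_domain_transient_errors_total counter\n")
	fmt.Fprintf(&b, "compute_domain_transient_errors_total %d\n", m.transientErrors)
	return b.String()
}

// ServeHTTP lets the metrics holder act as the handler for a /metrics path.
func (m *controllerMetrics) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
	fmt.Fprint(w, m.Render())
}

func main() {
	m := &controllerMetrics{}
	m.SetComputeDomains(3)
	m.IncTransientErrors()
	fmt.Print(m.Render())

	// Only start a listener when METRICS_ADDR (hypothetical) is set, e.g. ":8080".
	if addr := os.Getenv("METRICS_ADDR"); addr != "" {
		http.Handle("/metrics", m)
		if err := http.ListenAndServe(addr, nil); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
}
```

A plugin pod could expose a similar handler with metrics about itself. How scrapers discover these endpoints (for example via a ServiceMonitor) is a separate question that this sketch does not address.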

The real task here is to do quite a bit more thinking and planning before building anything, because what to build isn't obvious at all.

Labels: debuggability (issue/pr related to the ability to debug the system)

Status: Done