Spike: Investigate limitations of metrics and the Prometheus ecosystem implemented in Fleet

The current implementation of the Fleet dashboards has metrics for the following custom resources of Fleet:

- GitRepo
- Cluster
- ClusterGroup
- Bundle
- BundleDeployment (!)

In addition to those which we already have, Gitjob metrics are currently being reviewed.

All of the metrics exposed for those custom resources, as well as the gitjob metrics, which do not directly expose the status of a custom resource, carry labels for both the namespace and the name of the resource. Prometheus handles each unique combination of label values as a separate time series that requires its own storage, and each series also increases the size of the payload fetched from exporters on every scrape. Some metrics produce only one sample per resource; others, such as histograms, produce two (the sum and the count) plus one per configured bucket. In addition, if Prometheus has to aggregate metrics while processing PromQL queries, the processing load grows with the number of series involved. Such queries are issued by Grafana dashboards and, if auto-refresh is configured, are repeated every n seconds.
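To make the cardinality argument above concrete, here is a minimal sketch of how the number of time series could be estimated. The metric names and bucket count are hypothetical placeholders, not the metrics Fleet actually exposes; the only rule taken from the text is that a plain metric yields one series per resource and a histogram yields two plus one per bucket (`buckets` is assumed to count every exposed bucket, including `+Inf`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricFamily:
    """One metric exposed per resource. All names used below are hypothetical."""
    name: str
    kind: str         # "gauge", "counter" or "histogram"
    buckets: int = 0  # exposed bucket series (including +Inf); histograms only


def series_per_resource(metric: MetricFamily) -> int:
    """Series one resource (one namespace/name label pair) adds for this metric."""
    if metric.kind == "histogram":
        # _sum and _count, plus one _bucket series per bucket
        return 2 + metric.buckets
    return 1


def total_series(metrics: list[MetricFamily], resource_count: int) -> int:
    """Total series when every resource exposes every metric in the list."""
    return resource_count * sum(series_per_resource(m) for m in metrics)


# Hypothetical example: one gauge and one 10-bucket histogram per GitRepo.
example_metrics = [
    MetricFamily("example_gitrepo_ready", "gauge"),
    MetricFamily("example_reconcile_duration_seconds", "histogram", buckets=10),
]
print(total_series(example_metrics, resource_count=100))  # 100 * (1 + 12) = 1300
```

The estimate grows linearly with the number of resources, so the supported number of resources directly bounds the series count.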

Depending on how many GitRepos a user has and how they are configured, the number of metrics that Prometheus has to scrape, store, and process for queries increases.

At a certain point, this can cause performance problems in Prometheus. If the same Prometheus instance is used for other purposes as well, this can disrupt the monitoring of other applications. Such Prometheus queries can also increase the load on the host on which Prometheus is running.

In this spike, we would like to investigate how many metrics are exposed for certain custom resources and what their dimensions are, so that we can estimate the number of samples Prometheus needs to ingest and to process for queries.

For that purpose, we also need to think about how many resources, e.g. GitRepos, are supported to exist at the same time.

As for resolving potential problems, there are several options that could mitigate them. We should investigate to what extent each of them can be applied effectively, considering the supported number of resources.

## Mitigation Options

- Documentation: Recommend not using the Prometheus instance for monitoring anything outside the scope of Rancher, and using a separate Prometheus instance for that instead
- Documentation: Disable metrics
- Documentation: Increase the scrape interval
- Documentation: Document the current limitations (as in saying that many Fleet resources may cause performance issues)
- Feature: Flip a switch for Fleet to provide less granular metrics
- Feature: Include aggregated metrics and document the configuration to drop unnecessary metrics before ingestion
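As a rough illustration of how two of these options might compare, the sketch below estimates the average ingestion rate when scraping less often and when per-resource labels are replaced by a coarser aggregation. All numbers are hypothetical assumptions, including the choice to aggregate by namespace; the only relationship used is that each series yields one sample per scrape.

```python
def samples_per_second(series: int, scrape_interval_s: float) -> float:
    """Average ingestion rate: every series yields one sample per scrape."""
    if scrape_interval_s <= 0:
        raise ValueError("scrape interval must be positive")
    return series / scrape_interval_s


def aggregated_series(series_per_group: int, group_count: int) -> int:
    """Series left when per-resource labels are replaced by a coarser grouping."""
    return series_per_group * group_count


# Hypothetical inputs, not measured values.
per_resource = 13   # series exposed per GitRepo
gitrepos = 1000     # number of GitRepos
namespaces = 20     # groups left when aggregating by namespace only

detailed = per_resource * gitrepos
coarse = aggregated_series(per_resource, namespaces)

for label, series in (("per resource", detailed), ("per namespace", coarse)):
    for interval in (15, 60):
        rate = samples_per_second(series, interval)
        print(f"{label:>13} @ {interval:>2}s: {series:>6} series, {rate:8.1f} samples/s")
```

In this model, scraping less often lowers the ingestion rate but leaves the number of stored series and the query cost per evaluation unchanged, whereas aggregation reduces both.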

Those concerns came up in #3649 (corresponding issue is #3161).
