|
| 1 | +# KEP-1833: Enable Prometheus Metrics for Local Queues |
| 2 | + |
| 3 | +<!-- toc --> |
| 4 | +- [Summary](#summary) |
| 5 | +- [Motivation](#motivation) |
| 6 | + - [Goals](#goals) |
| 7 | + - [Non-Goals](#non-goals) |
| 8 | +- [Proposal](#proposal) |
| 9 | + - [User Stories (Optional)](#user-stories-optional) |
| 10 | + - [Story 1](#story-1) |
| 11 | + - [Story 2](#story-2) |
| 12 | + - [Story 3](#story-3) |
| 13 | +- [Design Details](#design-details) |
| 14 | + - [API changes:](#api-changes) |
| 15 | + - [List of metrics for Local Queues:](#list-of-metrics-for-local-queues) |
| 16 | + - [Test Plan](#test-plan) |
| 17 | + - [Unit Tests](#unit-tests) |
| 18 | + - [Integration tests](#integration-tests) |
| 19 | + - [Graduation Criteria](#graduation-criteria) |
| 20 | +- [Implementation History](#implementation-history) |
| 21 | +- [Drawbacks](#drawbacks) |
| 22 | +<!-- /toc --> |
| 23 | + |
| 24 | +## Summary |
| 25 | + |
| 26 | +This enhancement exposes Prometheus metrics for local queues, giving users detailed insight into workload |
| 27 | +processing within individual namespaces or tenants. |
| 28 | + |
| 29 | +## Motivation |
| 30 | + |
| 31 | +Metrics for local queues are valuable to batch users because they provide visibility into, and historical trends for, |
| 32 | +their workloads. Currently, metrics are available only for ClusterQueues, which does not give batch users |
| 33 | +the necessary insight into their own workloads. |
| 34 | + |
| 35 | +### Goals |
| 36 | + |
| 37 | +1. Introduce the API changes required to enable Local Queue metrics. |
| 38 | +2. List the Prometheus metrics that would be exposed for Local Queues. |
| 39 | + |
| 40 | +### Non-Goals |
| 41 | + |
| 42 | +1. Discuss implementation details, such as where in the codebase these metrics are collected. |
| 43 | +2. Discuss metric visibility and the RBAC required to expose the metrics securely to namespace admins. |
| 44 | + |
| 45 | +## Proposal |
| 46 | + |
| 47 | +This proposal enables the collection of metrics for local queues that are useful |
| 48 | +to batch users and cluster administrators. |
| 49 | + |
| 50 | +### User Stories (Optional) |
| 51 | + |
| 52 | +#### Story 1 |
| 53 | + |
| 54 | +As a batch user of Kueue, I want to access metrics for local queues running workloads restricted to my namespace so that |
| 55 | +I can monitor and analyze the performance and trends of my workloads. |
| 56 | + |
| 57 | +#### Story 2 |
| 58 | + |
| 59 | +As an administrator of Kueue, I want to enable batch users of a given namespace to collect metrics for the workloads |
| 60 | +within that namespace so that they have visibility and insight into their own workload metrics. |
| 61 | + |
| 62 | +#### Story 3 |
| 63 | + |
| 64 | +As an administrator of Kueue, I want to filter fine-grained local queue metrics by |
| 65 | +namespace for specific tenants so that I can effectively manage and optimize resource usage and performance for different tenants. |
| 66 | + |
| 67 | +## Design Details |
| 68 | + |
| 69 | +### API changes: |
| 70 | + |
| 71 | +The [Configuration API](https://github.com/kubernetes-sigs/kueue/blob/7ec127b05c8a0c8268e623de61914472dc5bff29/apis/config/v1beta1/configuration_types.go#L30) |
| 72 | +currently provides the ability to enable collection of metrics for cluster queues. This API will be extended to include options for enabling metrics |
| 73 | +collection for local queues. |
| 74 | + |
| 75 | +The `ControllerMetrics` type, which holds the metrics configuration options, will be extended as follows: |
| 76 | + |
| 77 | +```go |
| 78 | +type ControllerManager struct { |
| 79 | + ... |
| 80 | + |
| 81 | + // Metrics contains the controller metrics configuration |
| 82 | + // +optional |
| 83 | + Metrics ControllerMetrics `json:"metrics,omitempty"` |
| 84 | + ... |
| 85 | +} |
| 86 | + |
| 87 | +// ControllerMetrics defines the metrics configs. |
| 88 | +type ControllerMetrics struct { |
| 89 | + ... |
| 90 | + |
| 91 | + // LocalQueueMetrics configures whether LocalQueue metrics are exposed and for which local queues. |
| 92 | + // +optional |
| 93 | + LocalQueueMetrics *LocalQueueMetrics `json:"localQueueMetrics,omitempty"` |
| 94 | +} |
| 95 | + |
| 96 | +// LocalQueueMetrics defines the configuration options for local queue metrics. |
| 97 | +// If both selectors are left empty and Enable is true, metrics are exposed for all local queues across namespaces. |
| 98 | +type LocalQueueMetrics struct { |
| 99 | + // Enable is a knob to allow metrics to be exposed for local queues. Defaults to false. |
| 100 | + Enable bool `json:"enable,omitempty"` |
| 101 | + |
| 102 | + // NamespaceSelector can be used to select namespaces in which the local queues should |
| 103 | + // report metrics. |
| 104 | + NamespaceSelector *metav1.LabelSelector `json:"namespaceSelector,omitempty"` |
| 105 | + |
| 106 | + // LocalQueueSelector can be used to choose the local queues that need metrics to be collected. |
| 107 | + LocalQueueSelector *metav1.LabelSelector `json:"localQueueSelector,omitempty"` |
| 108 | +} |
| 109 | +``` |
| 110 | + |
| 111 | +To limit metric cardinality and allow selecting which local queues report metrics, the following |
| 112 | +knobs will be available in `LocalQueueMetrics`: |
| 113 | + |
| 114 | +| `Enable` | `NamespaceSelector` | `LocalQueueSelector` | Description | |
| 115 | +|----------|---------------------|----------------------|--------------------------------------------------------------------------------------------------------------------| |
| 116 | +| False | - | - | Metrics will not be exposed. | |
| 117 | +| True | - | - | Metrics for all local queues will be exposed. | |
| 118 | +| True | Specified | - | All LocalQueues in the specific namespaces that match the selector have metrics enabled. | |
| 119 | +| True | - | Specified | All LocalQueues matching the label selector have metrics enabled. | |
| 120 | +| True | Specified | Specified | Both the selectors are applied to local queues (logical AND) to filter the ones whose metrics have to be enabled. | |
| 121 | +| False | Specified | Specified | The selectors are disregarded, metrics will not be exposed. | |
| 122 | + |
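The decision table above can be sketched as a single predicate. This is a minimal illustration only, not the proposed implementation: `labelSelector`, `localQueueMetrics`, and `shouldExposeMetrics` are hypothetical names, and the selector is a simplified stand-in for `metav1.LabelSelector` that supports exact-match labels only. Treating a nil selector as "not specified, matches everything" is an assumption taken from the `-` rows of the table.

```go
package main

import "fmt"

// labelSelector is a simplified, hypothetical stand-in for metav1.LabelSelector.
// It supports exact-match labels only (no match expressions).
type labelSelector struct {
	matchLabels map[string]string
}

// matches reports whether labels satisfy the selector. A nil selector is
// treated as "not specified" and matches everything (assumption from the table).
func (s *labelSelector) matches(labels map[string]string) bool {
	if s == nil {
		return true
	}
	for k, want := range s.matchLabels {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// localQueueMetrics mirrors the shape of the proposed LocalQueueMetrics config.
type localQueueMetrics struct {
	enable             bool
	namespaceSelector  *labelSelector
	localQueueSelector *labelSelector
}

// shouldExposeMetrics applies the table: nothing is exposed unless enable is
// true, and specified selectors are combined with a logical AND.
func shouldExposeMetrics(cfg *localQueueMetrics, nsLabels, lqLabels map[string]string) bool {
	if cfg == nil || !cfg.enable {
		return false
	}
	return cfg.namespaceSelector.matches(nsLabels) && cfg.localQueueSelector.matches(lqLabels)
}

func main() {
	cfg := &localQueueMetrics{
		enable:            true,
		namespaceSelector: &labelSelector{matchLabels: map[string]string{"team": "ml"}},
	}
	fmt.Println(shouldExposeMetrics(cfg, map[string]string{"team": "ml"}, nil))
	fmt.Println(shouldExposeMetrics(cfg, map[string]string{"team": "web"}, nil))
}
```

Evaluating the `Enable` flag first keeps the last table row (selectors set, `Enable` false) trivially correct: the selectors are never consulted.
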
| 123 | +### List of metrics for Local Queues: |
| 124 | + |
| 125 | +In the first iteration, the following metrics will report information on Local Queue statuses: |
| 126 | + |
| 127 | +| Metrics Name | Prometheus Type | Description | |
| 128 | +|------------------------------------------------|-----------------|-----------------------------------------------------------------------------------------------------| |
| 129 | +| local_queue_pending_workloads | Gauge | The number of pending workloads. | |
| 130 | +| local_queue_reserved_workloads_total | Counter | Total number of workloads in the LocalQueue reserving quota. | |
| 131 | +| local_queue_admitted_workloads_total | Counter | Total number of admitted workloads. | |
| 132 | +| local_queue_resource_usage | Gauge | Total quantity of used quota per resource for a Local Queue. | |
| 133 | +| local_queue_evicted_workloads_total | Counter | The total number of evicted workloads in Local Queue. | |
| 134 | local_queue_reserved_wait_time_seconds | Histogram | The time from when a workload was created or re-queued until it got a quota reservation in the local queue. |
| 135 | local_queue_admission_checks_wait_time_seconds | Histogram | The time from when a workload got its quota reservation until admission in the local queue. |
| 136 | local_queue_admission_wait_time_seconds | Histogram | The time from when a workload was created or re-queued until admission. |
| 137 | local_queue_status | Gauge | Reports the status of the LocalQueue. |
| 138 | + |
| 139 | +Each of these metrics will be augmented with relevant Prometheus labels, indicating the local queue name, namespace, |
| 140 | +and any other unique identifiers as required during implementation. They will be exported in the controller namespace, |
| 141 | +alongside cluster queue metrics, at the same endpoint. |
| 142 | + |
| 143 | +### Test Plan |
| 144 | + |
| 145 | +[X] I/we understand the owners of the involved components may require updates to |
| 146 | +existing tests to make this code solid enough prior to committing the changes necessary |
| 147 | +to implement this enhancement. |
| 148 | + |
| 149 | +#### Unit Tests |
| 150 | + |
| 151 | +Unit tests for Prometheus metrics already exist: https://github.com/kubernetes-sigs/kueue/blob/main/pkg/metrics/metrics_test.go. |
| 152 | +Additional unit tests will be added to cover the new local queue metrics. |
| 153 | + |
| 154 | +- `pkg/metrics/`: `2024-07-19` - `48.2%` |
| 155 | + |
| 156 | +#### Integration tests |
| 157 | + |
| 158 | +The integration tests will cover the following scenarios: |
| 159 | + |
| 160 | +1. Metrics for local queues are accurately reported throughout the lifecycle of workloads in local queues. |
| 161 | +2. Metrics are removed when a local queue is deleted from the cache. |
| 162 | + |
| 163 | +### Graduation Criteria |
| 164 | + |
| 165 | +## Implementation History |
| 166 | + |
| 167 | +## Drawbacks |
| 168 | + |
| 169 | +Enabling local queue metrics for every local queue in every namespace can lead to high metric cardinality |
| 170 | +and system overload. To mitigate this, configuration options are provided to selectively enable metrics |
| 171 | +reporting for specific local queues. |