Commit fd67975

[Feature] Enable prometheus metrics for local queues (#2516)
* [Feature] Enable prometheus metrics for local queues

  This PR introduces an enhancement to enable collection of prometheus metrics for local queues.

  Addresses issue: #1833

  Signed-off-by: Varsha Prasad Narsing <[email protected]>

* Address reviews

  This commit addresses reviews by adding additional metrics for local queue.

  Signed-off-by: Varsha Prasad Narsing <[email protected]>

---------

Signed-off-by: Varsha Prasad Narsing <[email protected]>
1 parent 937ba45 commit fd67975

2 files changed: +193 -0 lines changed

+171
@@ -0,0 +1,171 @@
# KEP-1833: Enable Prometheus Metrics for Local Queues

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
    - [Story 3](#story-3)
- [Design Details](#design-details)
  - [API changes:](#api-changes)
  - [List of metrics for Local Queues:](#list-of-metrics-for-local-queues)
  - [Test Plan](#test-plan)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
<!-- /toc -->

## Summary

This enhancement exposes local queue metrics to users, giving detailed insight into workload
processing for individual namespaces / tenants.

## Motivation

Metrics related to local queues are invaluable for batch users, providing essential visibility and historical trends
about their workloads. Currently, metrics are available only for ClusterQueues, which does not give batch users
the insight they need into their own workloads.

### Goals

1. Introduce the API changes required to enable Local Queue metrics.
2. List the Prometheus metrics that would be exposed for Local Queues.

### Non-Goals

1. Discuss the implementation details of where in the codebase these metrics are collected.
2. Discuss metric visibility and the RBAC required to expose the metrics securely to namespace admins.

## Proposal

The proposal is to enable the collection of metrics for local queues that would be useful
for batch users and cluster administrators.

### User Stories (Optional)

#### Story 1

As a batch user of Kueue, I want to access metrics for local queues running workloads restricted to my namespace so that
I can monitor and analyze the performance and trends of my workloads.

#### Story 2

As an administrator of Kueue, I want to let the batch users of a given namespace collect metrics for the workloads
in that namespace so that they have visibility and insights into their own workload metrics.

#### Story 3

As an administrator of Kueue, I want to filter and gain insights on fine-grained metrics relevant to a local queue by
namespace for specific tenants so that I can effectively manage and optimize resource usage and performance for different tenants.

## Design Details

### API changes:

The [Configuration API](https://github.com/kubernetes-sigs/kueue/blob/7ec127b05c8a0c8268e623de61914472dc5bff29/apis/config/v1beta1/configuration_types.go#L30)
currently provides the ability to enable collection of metrics for cluster queues. This API will be extended to include options for enabling metrics
collection for local queues.

The `ControllerMetrics` struct, which holds the options for configuring metrics, will be extended as follows:

```go
type ControllerManager struct {
	// ...

	// Metrics contains the controller metrics configuration
	// +optional
	Metrics ControllerMetrics `json:"metrics,omitempty"`
	// ...
}

// ControllerMetrics defines the metrics configs.
type ControllerMetrics struct {
	// ...

	// LocalQueueMetrics configures whether LocalQueue metrics are enabled, along with their options.
	// +optional
	LocalQueueMetrics *LocalQueueMetrics `json:"localQueueMetrics,omitempty"`
}

// LocalQueueMetrics defines the configuration options for local queue metrics.
// If the selectors are left empty, metrics are exposed for all local queues across namespaces.
type LocalQueueMetrics struct {
	// Enable is a knob to allow metrics to be exposed for local queues. Defaults to false.
	Enable bool `json:"enable,omitempty"`

	// NamespaceSelector can be used to select namespaces in which the local queues should
	// report metrics.
	NamespaceSelector *metav1.LabelSelector `json:"namespaceSelector,omitempty"`

	// LocalQueueSelector can be used to choose the local queues that need metrics to be collected.
	LocalQueueSelector *metav1.LabelSelector `json:"localQueueSelector,omitempty"`
}
```

To reduce cardinality, and enable selection of metrics for local queues, the following
knobs will be available for `LocalQueueMetrics`:

| `Enable` | `NamespaceSelector` | `LocalQueueSelector` | Description                                                                                                       |
|----------|---------------------|----------------------|-------------------------------------------------------------------------------------------------------------------|
| False    | -                   | -                    | Metrics will not be exposed.                                                                                      |
| True     | -                   | -                    | Metrics for all local queues will be exposed.                                                                     |
| True     | Specified           | -                    | All LocalQueues in the specific namespaces that match the selector have metrics enabled.                          |
| True     | -                   | Specified            | All LocalQueues matching the label selector have metrics enabled.                                                 |
| True     | Specified           | Specified            | Both the selectors are applied to local queues (logical AND) to filter the ones whose metrics have to be enabled. |
| False    | Specified           | Specified            | The selectors are disregarded, metrics will not be exposed.                                                       |

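The knob combinations in the table above can be sketched as a single predicate. This is a minimal illustration, not the Kueue implementation: `selector`, `localQueueMetrics`, and `shouldReport` are hypothetical names, and `selector` is a simplified stand-in for `metav1.LabelSelector` that supports exact-match labels only. The sketch treats an unset selector as "match everything", which is what the `-` rows of the table describe.

```go
package main

import "fmt"

// selector is a hypothetical, simplified stand-in for metav1.LabelSelector.
// It supports exact-match labels only; a nil or empty selector matches everything.
type selector map[string]string

// matches reports whether every key/value pair in the selector is present in labels.
func (s selector) matches(labels map[string]string) bool {
	for k, want := range s {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// localQueueMetrics mirrors the shape of the proposed LocalQueueMetrics config (assumption).
type localQueueMetrics struct {
	Enable             bool
	NamespaceSelector  selector
	LocalQueueSelector selector
}

// shouldReport applies the knob table: Enable gates everything, and the two
// selectors are combined with a logical AND.
func shouldReport(cfg localQueueMetrics, nsLabels, lqLabels map[string]string) bool {
	if !cfg.Enable {
		return false // selectors are disregarded when metrics are disabled
	}
	return cfg.NamespaceSelector.matches(nsLabels) && cfg.LocalQueueSelector.matches(lqLabels)
}

func main() {
	cfg := localQueueMetrics{
		Enable:             true,
		NamespaceSelector:  selector{"team": "ml"},
		LocalQueueSelector: selector{"metrics": "on"},
	}
	nsLabels := map[string]string{"team": "ml"}

	fmt.Println(shouldReport(cfg, nsLabels, map[string]string{"metrics": "on"})) // true
	fmt.Println(shouldReport(cfg, nsLabels, map[string]string{}))                // false
}
```

A real implementation would need to decide explicitly how an unset `*metav1.LabelSelector` is handled, since the sketch simply assumes it selects every namespace or local queue.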
### List of metrics for Local Queues:

In the first iteration, the following metrics will report information on Local Queue statuses:

| Metrics Name                                   | Prometheus Type | Description                                                                                              |
|------------------------------------------------|-----------------|----------------------------------------------------------------------------------------------------------|
| local_queue_pending_workloads                  | Gauge           | The number of pending workloads.                                                                         |
| local_queue_reserved_workloads_total           | Counter         | Total number of workloads in the LocalQueue reserving quota.                                             |
| local_queue_admitted_workloads_total           | Counter         | Total number of admitted workloads.                                                                      |
| local_queue_resource_usage                     | Gauge           | Total quantity of used quota per resource for a Local Queue.                                             |
| local_queue_evicted_workloads_total            | Counter         | The total number of evicted workloads in Local Queue.                                                    |
| local_queue_reserved_wait_time_seconds         | Histogram       | The time from when a workload was created or re-queued until it got quota reservation in the local queue. |
| local_queue_admission_checks_wait_time_seconds | Histogram       | The time from when a workload got the quota reservation until admission in the local queue.              |
| local_queue_admission_wait_time_seconds        | Histogram       | The time from when a workload was created or re-queued until admission.                                  |
| local_queue_status                             | Gauge           | Reports the status of the LocalQueue.                                                                    |

Each of these metrics will be augmented with relevant Prometheus labels, indicating the local queue name, namespace,
and any other unique identifiers as required during implementation. They will be exported in the controller namespace,
alongside cluster queue metrics, at the same endpoint.

### Test Plan

[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

#### Unit Tests

There are existing unit tests for prometheus metrics: https://github.com/kubernetes-sigs/kueue/blob/main/pkg/metrics/metrics_test.go.
Unit tests will be added to cover any additional local queue metrics.

- `pkg/metrics/`: `2024-07-19` - `48.2%`

#### Integration tests

The integration tests will cover the following scenarios:

1. Metrics for local queues are accurately reported throughout the lifecycle of workloads in local queues.
2. Metrics are removed when a local queue is deleted from the cache.

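The second scenario, that a deleted local queue leaves no stale series behind, can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `seriesKey` and `pendingGauge` are hypothetical types standing in for a labeled Prometheus gauge keyed by local queue name and namespace, and they are not part of Kueue or of any Prometheus client library.

```go
package main

import (
	"fmt"
	"sync"
)

// seriesKey identifies one time series by the labels the KEP names:
// the local queue name and its namespace.
type seriesKey struct {
	Name      string
	Namespace string
}

// pendingGauge is a hypothetical in-memory stand-in for a labeled gauge
// such as local_queue_pending_workloads.
type pendingGauge struct {
	mu     sync.Mutex
	values map[seriesKey]int
}

func newPendingGauge() *pendingGauge {
	return &pendingGauge{values: make(map[seriesKey]int)}
}

// Set records the current number of pending workloads for one local queue.
func (g *pendingGauge) Set(k seriesKey, v int) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.values[k] = v
}

// Delete drops the series for a local queue, as expected when the queue
// is removed from the cache.
func (g *pendingGauge) Delete(k seriesKey) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.values, k)
}

// Get returns the recorded value and whether the series still exists.
func (g *pendingGauge) Get(k seriesKey) (int, bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	v, ok := g.values[k]
	return v, ok
}

func main() {
	g := newPendingGauge()
	lq := seriesKey{Name: "team-a-queue", Namespace: "team-a"}

	g.Set(lq, 3)
	v, ok := g.Get(lq)
	fmt.Println(v, ok)

	// Deleting the local queue must remove its series rather than leave a stale value.
	g.Delete(lq)
	_, ok = g.Get(lq)
	fmt.Println(ok)
}
```

An integration test for this scenario would make the equivalent check against the real metrics endpoint: after the local queue is deleted, no series carrying its name and namespace labels should remain.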
### Graduation Criteria

## Implementation History

## Drawbacks

In certain scenarios, enabling local queue metrics for every local queue in every namespace can lead to high metric
cardinality and system overload. To mitigate this, configuration options are provided to selectively enable metrics
reporting for specific local queues.
+22
@@ -0,0 +1,22 @@
title:
kep-number: 1833
authors:
  - "@varshaprasad96"
status: provisional
creation-date: 2024-07-02
reviewers:
  - "@PBundyra"
  - "@astefanutti"
  - "@alculquicondor"
  - "@tenzen-y"

approvers:
  - "@alculquicondor"
  - "@tenzen-y"

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: v0.9
