
Commit f52123b

Add KEP: Gateway Metric Aggregator

Signed-off-by: kerthcet <[email protected]>

1 parent 4f16a96 commit f52123b

File tree

3 files changed: +364 -0 lines changed
@@ -0,0 +1,326 @@
# Proposal-376: Gateway Metric Aggregator

<!--
This is the title of your Proposal. Keep it short, simple, and descriptive. A good
title can help communicate what the Proposal is and should be considered as part of
any review.
-->

<!--
A table of contents is helpful for quickly jumping to sections of a Proposal and for
highlighting any additional information provided beyond the standard Proposal
template.

Ensure the TOC is wrapped with
<code>&lt;!-- toc --&gt;&lt;!-- /toc --&gt;</code>
tags, and then generate with `hack/update-toc.sh`.
-->

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [e2e tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary

<!--
This section is incredibly important for producing high-quality, user-focused
documentation such as release notes or a development roadmap. It should be
possible to collect this information before implementation begins, in order to
avoid requiring implementors to split their attention between writing release
notes and implementing the feature itself. Proposal editors and SIG Docs
should help to ensure that the tone and content of the `Summary` section is
useful for a wide audience.

A good summary is probably at least a paragraph in length.

Both in this section and below, follow the guidelines of the [documentation
style guide]. In particular, wrap lines to a reasonable length, to make it
easier for reviewers to cite specific portions, and to minimize diff churn on
updates.
-->

Metric-based scheduling is common in many systems, including Kubernetes. For GenAI, it becomes more complex because of the heavy computational requirements of the models. This proposal outlines the design of a metric aggregator that can efficiently handle the unique challenges posed by GenAI workloads.

## Motivation

<!--
This section is for explicitly listing the motivation, goals, and non-goals of
this Proposal. Describe why the change is important and the benefits to users. The
motivation section can optionally provide links to [experience reports] to
demonstrate the interest in a Proposal within the wider InftyAI community.

[experience reports]: https://github.com/golang/go/wiki/ExperienceReports
-->

With traditional services, the final result is generated in a very short time, so common algorithms like round-robin or least-connection are enough.

In inference services, however, result generation is often very slow because of the heavy matrix-multiplication computations. This is an essential difference from traditional services, so we need more advanced algorithms to make wise scheduling decisions, for example based on the inference engine's queue size, KV cache size, or a combination of such metrics.

All these indicators should be collected from the inference engines for further analysis, which is why a metric aggregator is needed.
### Goals

<!--
List the specific goals of the Proposal. What is it trying to achieve? How will we
know that this has succeeded?
-->

- A simple implementation with a random-selector plugin
- Extensibility for different consumers in the cluster, like the LoRA autoscaler or the AI gateway
- Metrics visualization support, e.g. with Grafana
- Metric management support, especially the GC policy

### Non-Goals

<!--
What is out of scope for this Proposal? Listing non-goals helps to focus discussion
and make progress.
-->

- Different scheduling algorithm implementations in the AI gateway
- LoRA-aware scheduling implementation; this will be left to another KEP
- Performance considerations in big clusters; these are left to the Beta level
## Proposal

<!--
This is where we get down to the specifics of what the proposal actually is.
This should have enough detail that reviewers can understand exactly what
you're proposing, but should not include things like API designs or
implementation. What is the desired outcome and how do we measure success?
The "Design Details" section below is for the real
nitty-gritty.
-->

### User Stories (Optional)

<!--
Detail the things that people will be able to do if this Proposal is implemented.
Include as much detail as possible so that people can understand the "how" of
the system. The goal here is to make this feel real for users without getting
bogged down.
-->

#### Story 1

As a user, I hope my LLM request can be routed to the least-busy instance, so that I get the result as soon as possible.

#### Story 2

As a RAG user, the retrieved documents are sometimes the same across requests, so I hope my request can be routed to the instance with the most available KV cache to avoid repetitive computation. This is known as prefix-cache-aware scheduling.
### Notes/Constraints/Caveats (Optional)

<!--
What are the caveats to the proposal?
What are some important details that didn't come across above?
Go in to as much detail as necessary here.
This might be a good place to talk about core concepts and how they relate.
-->

Metrics-based routing should meet a baseline requirement: even if the metrics are unavailable or outdated, the system should still work, although request responses may be slower. For example, metrics-based LoRA scheduling is unfit here, because once the metrics point to the wrong instance, we may hit a 500 server error, which is unacceptable.
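The baseline requirement above can be captured in a small fallback rule: use the metrics-based pick only when the metrics are fresh, otherwise degrade to the random-selector behavior listed in the Goals. A minimal Go sketch, where the function and parameter names are illustrative assumptions, not settled API:

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickEndpoint is a hypothetical selection helper: it returns the
// metrics-based least-busy Pod only when the metrics are fresh, and
// falls back to random selection otherwise, so routing keeps working
// even when metrics are missing or expired in Redis.
func pickEndpoint(pods []string, leastBusy string, metricsFresh bool) string {
	if metricsFresh && leastBusy != "" {
		return leastBusy
	}
	return pods[rand.Intn(len(pods))] // degraded but still functional
}

func main() {
	pods := []string{"pod-a", "pod-b", "pod-c"}
	fmt.Println(pickEndpoint(pods, "pod-b", true)) // prints pod-b
	fmt.Println(pickEndpoint(pods, "", false))     // prints one of the pods at random
}
```

The point of the sketch is only that the metrics path is an optimization layered on top of a selector that never fails outright.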
### Risks and Mitigations

<!--
What are the risks of this proposal, and how do we mitigate? Think broadly.
For example, consider both security and how this will impact the larger
InftyAI ecosystem.

How will security be reviewed, and by whom?

How will UX be reviewed, and by whom?

Consider including folks who also work outside the SIG or subproject.
-->

The metrics might be outdated or even impossible to fetch, in which case the router may make suboptimal decisions. But as mentioned above, the system will still work, just with slower responses.

## Design Details

<!--
This section should contain enough information that the specifics of your
change are understandable. This may include API specs (though not always
required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->

The overall flow looks like:

![flow](./flow.png)
### Steps

Let's break down the flow into several steps:

- Step 1: We collect the metrics from the inference workloads. We choose `PUSH` mode here to put less pressure on the gateway side; otherwise the gateway would have to iterate over all the Pods, which would obviously lead to performance issues.
- Step 2: The gateway plugin parses the metrics and stores them in Redis, for HA considerations and cache sharing. Once an instance goes down, we can still retrieve the metrics from Redis, and if we run multiple instances, they can share the metrics with each other via Redis. Since the Envoy AI gateway already uses Redis for rate limiting, we'll reuse that Redis instance here.
- Step 3 & 4: When traffic comes in, the router retrieves the metrics from Redis and makes routing decisions based on different algorithms, like queue-size-aware scheduling.
- Step 5: The router sends the request to the selected instance; the instance returns the result to the router, which finally returns it to the user.
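The pushed sample in Steps 1 and 2 could be a small JSON payload like the one sketched below. The `EngineMetrics` struct and its field names are illustrative assumptions, not a settled schema:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// EngineMetrics is a hypothetical shape for the sample each sidecar
// pushes to the gateway plugin; the real schema is not decided yet.
type EngineMetrics struct {
	PodName          string `json:"podName"`
	ModelName        string `json:"modelName"`
	RunningQueueSize int    `json:"runningQueueSize"`
	WaitingQueueSize int    `json:"waitingQueueSize"`
}

// buildPayload marshals one metrics snapshot; the sidecar would POST
// this to the gateway plugin at the chosen interval (100ms at first).
func buildPayload(m EngineMetrics) ([]byte, error) {
	return json.Marshal(m)
}

func main() {
	payload, _ := buildPayload(EngineMetrics{
		PodName:          "llama-7b-0",
		ModelName:        "llama-7b",
		RunningQueueSize: 2,
		WaitingQueueSize: 5,
	})
	fmt.Println(string(payload))
	// prints {"podName":"llama-7b-0","modelName":"llama-7b","runningQueueSize":2,"waitingQueueSize":5}
}
```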
### Additional components introduced

- Pod Sidecar: a sidecar container is required for each inference workload. Native sidecar containers were introduced in Kubernetes 1.28 as an alpha feature and are enabled by default since 1.29, see [details](https://kubernetes.io/blog/2023/08/25/native-sidecar-containers/). The sidecar is responsible for collecting the metrics and pushing them to the AI gateway. We'll set the reporting interval to 100ms at first.
- Redis: a Redis instance is required for metric storage and sharing. We can use an existing Redis instance in the cluster, or deploy a new one if none is available.
- Gateway Plugin: a new plugin, or specifically [DynamicLoadBalancingBackend](https://github.com/envoyproxy/ai-gateway/blob/be2b479b04bc7a219b0c8239143bfbabebdcd615/filterapi/filterconfig.go#L199-L208) in the Envoy AI gateway, to pick the best-fit Pod endpoints. However, we may be blocked by the upstream issue [here](https://github.com/envoyproxy/ai-gateway/issues/604); we'll work with the Envoy AI Gateway team to resolve it as soon as possible. The final design may impact our implementation a bit, but probably not much.
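As an illustration, the sidecar could be declared as a Kubernetes native sidecar, i.e. an init container with `restartPolicy: Always`. The container names, images, and flags below are placeholders, not decided components:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-workload
spec:
  initContainers:
    # Native sidecar: an init container with restartPolicy: Always keeps
    # running alongside the main container (Kubernetes >= 1.29 by default).
    - name: metrics-pusher                        # hypothetical name
      image: example.com/metrics-pusher:latest    # placeholder image
      restartPolicy: Always
      args:
        - --gateway-endpoint=http://ai-gateway.gateway-system:8080  # placeholder
        - --interval=100ms
  containers:
    - name: inference-engine
      image: example.com/inference-engine:latest  # placeholder image
```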
### Data Structure

The data structure can vary based on the metrics we want to collect; let's take the queue size as an example.

Because Redis is a KV store, we'll use a ZSET to store the results, with `LeastBusy::ModelName` as the key, the Pod name as the member, and `runningQueueSize * 0.3 + waitingQueueSize * 0.7` as the score. The weight of waitingQueueSize is higher because the metric is a delayed indicator. RunningQueueSize and WaitingQueueSize are two metrics that most inference engines support.

We'll also set the expiration time to 500ms, in case delayed metric reporting leads to a hotspot issue.

Note: this algorithm is not final; we'll have more discussions with the community to find the best one.
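To make the scoring concrete, here is a minimal Go sketch of the score computation, with the corresponding Redis commands shown as comments. The function name is an illustrative assumption; only the weights and the ZSET layout come from the text above:

```go
package main

import "fmt"

const (
	runningWeight = 0.3 // weight for runningQueueSize, from the proposal
	waitingWeight = 0.7 // waitingQueueSize weighs more: metrics are delayed indicators
)

// score computes the ZSET score for one Pod; a lower score means less busy.
func score(runningQueueSize, waitingQueueSize int) float64 {
	return float64(runningQueueSize)*runningWeight + float64(waitingQueueSize)*waitingWeight
}

func main() {
	// Conceptually, the gateway plugin would then run:
	//   ZADD LeastBusy::<modelName> <score> <podName>
	//   PEXPIRE LeastBusy::<modelName> 500
	// and the router would pick the least-busy member with:
	//   ZRANGE LeastBusy::<modelName> 0 0
	fmt.Println(score(2, 5)) // prints 4.1
}
```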
### Test Plan

<!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.

All code is expected to have adequate tests (eventually with coverage
expectations).

[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.
##### Prerequisite testing updates

<!--
Based on reviewers feedback describe what additional tests need to be added prior
implementing this enhancement to ensure the enhancements have also solid foundations.
-->

##### Unit tests

<!--
In principle every added code should have complete unit test coverage, so providing
the exact set of tests will not bring additional value.
However, if complete unit test coverage is not possible, explain the reason of it
together with explanation why this is acceptable.
-->

<!--
Additionally, for Alpha try to enumerate the core package you will be touching
to implement this enhancement and provide the current unit coverage for those
in the form of:
- <package>: <date> - <current test coverage>

This can inform certain test coverage improvements that we want to do before
extending the production code to implement this enhancement.
-->

- Hard to predict now since this is a new component, but we'll do our best to make sure all the functionalities are covered.
##### Integration tests

<!--
Integration tests allow control of the configuration parameters used to start the binaries under test.
This is different from e2e tests which do not allow configuration of parameters.
Doing this allows testing non-default options and multiple different and potentially conflicting command line options.
-->

<!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.

For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html
-->

- Fake the metrics to make sure the router picks the right instance.

##### e2e tests

<!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.

For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html

We expect no non-infra related flakes in the last month as a GA graduation criteria.
-->

- Add one e2e test to make sure the whole system can be launched via the Helm chart.
- For performance, we'll rely on benchmarks rather than e2e tests.
### Graduation Criteria

<!--
Clearly define what it means for the feature to be implemented and
considered stable.

If the feature you are introducing has high complexity, consider adding graduation
milestones with these graduation criteria:
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
- [Feature gate][feature gate] lifecycle
- [Deprecation policy][deprecation-policy]

[feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
-->

Beta:

- No performance issues in big clusters; we may use a DaemonSet to report metrics.
- Support storage backends other than a KV store, since plain key-value pairs might not be enough for more complex scenarios.
## Implementation History

<!--
Major milestones in the lifecycle of a Proposal should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first llmaz release where an initial version of the Proposal was available
- the version of llmaz where the Proposal graduated to general availability
- when the Proposal was retired or superseded
-->

- 2025-05-08: Proposal initialized and submitted for review

## Drawbacks

<!--
Why should this Proposal _not_ be implemented?
-->

## Alternatives

<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
Binary file (219 KB) not shown.
@@ -0,0 +1,38 @@
1+
title: Gateway Metric Aggregator
2+
proposal-number: 376
3+
authors:
4+
- kerthcet
5+
status: implementable
6+
creation-date: 2025-04-25
7+
reviewers:
8+
- cr7258
9+
- googs1025
10+
approvers:
11+
- TBD
12+
13+
see-also: []
14+
15+
replaces: []
16+
17+
# The target maturity stage in the current dev cycle for this proposal.
18+
stage: beta
19+
20+
# The most recent milestone for which work toward delivery of this proposal has been
21+
# done. This can be the current (upcoming) milestone, if it is being actively
22+
# worked on.
23+
latest-milestone: "v0.2"
24+
25+
# The milestone at which this feature was, or is targeted to be, at each stage.
26+
milestone:
27+
alpha: "v0.2"
28+
beta: TBD
29+
stable: TBD
30+
31+
# The following PRR answers are required at alpha release
32+
# List the feature gate name and the components for which it must be enabled
33+
feature-gates: []
34+
35+
disable-supported: true
36+
37+
# The following PRR answers are required at beta release
38+
metrics: []
