[feature] Added Active-Request-Scorer #297
Conversation
Force-pushed from 80d0d08 to 212363d
The force-pushes dealt with the DCO sign-off and then undid the rebase that was attempted through the web GUI.
Thank you for reviewing; the comments were addressed. Ready for another round if needed.
elevran left a comment
I'll leave it to your discretion whether or not to add duration parsing.
/approve
Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com>
@elevran thanks Etai - added duration parsing and rebased.
/approve
@vMaroon will this scorer work for prefill-decode disaggregation, i.e. selecting the prefill instance and the decode instance that have the fewest active requests?

@VadimEisenberg yes - though note that it does not work as expected when streaming is enabled at the moment; see this tracker: kubernetes-sigs/gateway-api-inference-extension#1483
```go
// PreRequest: build the cache entry for the scheduled target pod.
entry := requestEntry{targetPod.NamespacedName.String(), request.RequestId}
```

```go
// PostResponse: remove the entry and decrement the pod's active count.
if _, found := s.requestCache.GetAndDelete(entry.String()); found {
	s.decrementPodCount(entry.PodName)
}
```
@vMaroon From reading the code, it seems PreRequest adds the first target pod of each of the ProfileResults, for example of both decode and prefill, while PostResponse removes only the primary target pod (the decode one). The target pod in PostResponse is set to the target pod of RequestCtx, which is the first target pod of the primary ProfileResult.
Ahh - apologies, I misunderstood. Then you're right, it does not track at that granularity in P/D.
We can definitely bump this up in priority if you provide more info. For example, given two-point PostResponse hooks (cc @kfswain), the start hook could mark the prefill as done and the end hook could update the decode.
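A rough sketch of that two-hook idea (the hook names, signatures, and overall shape here are assumptions for illustration, not the actual gateway-api-inference-extension API):

```go
package main

import "fmt"

// activeRequestScorer keeps a per-pod count of in-flight requests.
// This is a hypothetical sketch of the two-point PostResponse idea above,
// not the plugin's real implementation.
type activeRequestScorer struct {
	counts map[string]int // active requests per pod name
}

func (s *activeRequestScorer) decrementPodCount(pod string) {
	if s.counts[pod] > 0 {
		s.counts[pod]--
	}
}

// PostResponseStart would fire when the first response bytes arrive; by then
// the prefill pod is done, so its slot can be released.
func (s *activeRequestScorer) PostResponseStart(prefillPod string) {
	s.decrementPodCount(prefillPod)
}

// PostResponseEnd would fire when the response finishes streaming, releasing
// the decode pod's slot.
func (s *activeRequestScorer) PostResponseEnd(decodePod string) {
	s.decrementPodCount(decodePod)
}

func main() {
	s := &activeRequestScorer{counts: map[string]int{"prefill-0": 1, "decode-0": 1}}
	s.PostResponseStart("prefill-0")
	s.PostResponseEnd("decode-0")
	fmt.Println(s.counts) // both pods' counts released
}
```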
@VadimEisenberg PostResponse intentionally receives the target pod that actually SERVED the request.
There is another factor that wasn't taken into consideration in this discussion:
the EPP protocol allows defining multiple (prioritized) endpoints as candidates for serving. More details here:
https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/004-endpoint-picker-protocol#destination-endpoint
This essentially tells Envoy that if the first endpoint in the list failed to serve the request, Envoy should try the next endpoint in the list.
For all scorers (or other plugins) that maintain per-request state, it is useful to know which endpoint actually served the request, not which one was ranked first (and didn't necessarily serve successfully).
This information should be reported by gateways back to the EPP:
https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/004-endpoint-picker-protocol#destination-endpoint-served
The above is still under development, and therefore we currently use a picker configured with MaxEndpoints=1.
Let me know if that makes sense.
@nirrozenbaum I see. My problem currently is that the active-request-scorer will not work correctly with regard to prefill pods: prefill pods will be incremented in PreRequest but will not be decremented in PostResponse. They will only be removed eventually by the TTL.
@VadimEisenberg right.
I think that until the PostResponse issue is fixed, this scorer should either count only decode results, or it cannot be used as is.
cc: @vMaroon
For P/D I think we can make do with the start/end PostResponse hooks, and the scorer can use the request ID to maintain a happy-path association of one prefill pod and one decode pod per request.
Summary
Introduced a new load-based scorer that looks at the active requests being served through the EPP instance. Each request is tracked individually with its own TTL to ensure accurate timeout handling. Pods with fewer active requests receive higher scores.
The scorer maintains two data structures for efficiency: a TTL cache of per-request entries and a per-pod counter of active requests.
Scores are normalized to the range 0-1, where pods with fewer active requests get higher scores.
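As a rough illustration of the scoring, a min/max normalization over per-pod active-request counts might look like this (a sketch only; `normalizeScores` and the exact formula are assumptions, not the plugin's actual code):

```go
package main

import "fmt"

// normalizeScores maps per-pod active-request counts to scores in [0, 1],
// giving the pod with the fewest active requests the highest score.
// This is a hypothetical min/max normalization; the real scorer's formula
// may differ.
func normalizeScores(active map[string]int) map[string]float64 {
	scores := make(map[string]float64, len(active))
	if len(active) == 0 {
		return scores
	}
	min, max := -1, -1
	for _, c := range active {
		if min == -1 || c < min {
			min = c
		}
		if c > max {
			max = c
		}
	}
	for pod, c := range active {
		if max == min {
			scores[pod] = 1.0 // all pods equally loaded
			continue
		}
		scores[pod] = float64(max-c) / float64(max-min)
	}
	return scores
}

func main() {
	fmt.Println(normalizeScores(map[string]int{"pod-a": 0, "pod-b": 2, "pod-c": 4}))
}
```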