[feature] Added Active-Request-Scorer #297

Merged
github-actions[bot] merged 5 commits into llm-d:main from vMaroon:main
Aug 25, 2025
Conversation

@vMaroon
Member

@vMaroon vMaroon commented Aug 18, 2025

Summary

Introduced a new load-based scorer that looks at the active requests being served through the epp instance. Each request is tracked individually with its own TTL to ensure accurate timeout handling. Pods with fewer active requests receive higher scores.

The scorer maintains two data structures for efficiency:

  • A cache of individual requests with TTL for automatic cleanup
  • A count map for fast O(1) lookup during scoring

Scores are normalized to a range of 0-1, where pods with fewer active requests get higher scores.
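As a rough illustration of the scoring described above, here is a minimal sketch in Go. All names are hypothetical, and the TTL mechanics of the request cache are omitted; the point is the two structures and the min-max normalization to [0, 1]:

```go
package main

import "fmt"

// Hypothetical sketch of the scorer's two data structures: a per-request
// cache (here a plain map keyed by request ID; the real scorer attaches a
// TTL for automatic cleanup) and a per-pod count map for O(1) lookup.
type activeRequestScorer struct {
	requestCache map[string]string // request ID -> pod name (TTL omitted)
	podCounts    map[string]int    // pod name -> number of active requests
}

// score normalizes to [0, 1]: the pod with the fewest active requests
// scores 1, the pod with the most scores 0.
func (s *activeRequestScorer) score(pods []string) map[string]float64 {
	minC, maxC := int(^uint(0)>>1), 0
	for _, p := range pods {
		c := s.podCounts[p]
		if c < minC {
			minC = c
		}
		if c > maxC {
			maxC = c
		}
	}
	scores := make(map[string]float64, len(pods))
	for _, p := range pods {
		if maxC == minC {
			scores[p] = 1.0 // all candidate pods are equally loaded
			continue
		}
		scores[p] = float64(maxC-s.podCounts[p]) / float64(maxC-minC)
	}
	return scores
}

func main() {
	s := &activeRequestScorer{
		requestCache: map[string]string{},
		podCounts:    map[string]int{"pod-a": 0, "pod-b": 3, "pod-c": 6},
	}
	fmt.Println(s.score([]string{"pod-a", "pod-b", "pod-c"}))
}
```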

@vMaroon
Member Author

vMaroon commented Aug 18, 2025

The force-pushes dealt with DCO sign-off and then undid the rebase that was attempted through the web GUI.

@vMaroon
Member Author

vMaroon commented Aug 21, 2025

Thank you for reviewing. Comments were addressed - ready for another round if needed.

elevran
elevran previously approved these changes Aug 24, 2025
Collaborator

@elevran elevran left a comment


Leaving it to your discretion whether to add duration parsing or not.
/approve

Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com>
@vMaroon
Member Author

vMaroon commented Aug 24, 2025

@elevran thanks Etai - added duration parsing and rebased.

@elevran
Collaborator

elevran commented Aug 25, 2025

/approve
/lgtm

@github-actions github-actions bot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) Aug 25, 2025
@github-actions github-actions bot merged commit 673ddd9 into llm-d:main Aug 25, 2025
5 checks passed
@VadimEisenberg

@vMaroon will this scorer work for prefill-decode disaggregation? That is, selecting the prefill instance and decode instance with the fewest active requests?

@vMaroon
Member Author

vMaroon commented Sep 6, 2025

@VadimEisenberg yes - though note it does not work as expected when streaming is enabled at the moment, see this tracker: kubernetes-sigs/gateway-api-inference-extension#1483

entry := requestEntry{targetPod.NamespacedName.String(), request.RequestId}

if _, found := s.requestCache.GetAndDelete(entry.String()); found {
	s.decrementPodCount(entry.PodName)
}

@VadimEisenberg VadimEisenberg Sep 7, 2025


@vMaroon From reading the code, it seems PreRequest adds the first target pod of each of the ProfileResults (for example, of both decode and prefill), while PostResponse removes only the primary target pod (the decode's).

The target pod in PostResponse is set to the target pod of RequestCtx, which is the first target pod from the primary ProfileResult:

https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/8b154baffd35e1c4ad1dcd131f1cbcac04ddc304/pkg/epp/requestcontrol/director.go#L253C2-L263C34
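The asymmetry described here can be sketched in a few lines of Go. All names are hypothetical; the point is that the prefill pod's count is incremented but never decremented, so it is only released by TTL expiry:

```go
package main

import "fmt"

// Minimal sketch of the asymmetry under discussion: PreRequest increments
// the first target pod of every profile result (e.g. both prefill and
// decode), while PostResponse decrements only the primary profile's pod.
type counter struct{ counts map[string]int }

func (c *counter) preRequest(profilePods map[string]string) {
	for _, pod := range profilePods { // one target pod per profile
		c.counts[pod]++
	}
}

func (c *counter) postResponse(primaryPod string) {
	c.counts[primaryPod]-- // only the primary (decode) pod is decremented
}

func main() {
	c := &counter{counts: map[string]int{}}
	c.preRequest(map[string]string{"prefill": "pod-p", "decode": "pod-d"})
	c.postResponse("pod-d")
	fmt.Println(c.counts) // pod-p's count stays elevated until TTL cleanup
}
```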

Member Author


Ahh - apologies, I misunderstood. Then you're right, it does not track at that granularity in P/D.

We can definitely bump this up in priority if you provide more info. For example, given two-point PostResponse hooks (cc @kfswain), the start could mark prefill as done and the end could update decode.


@vMaroon @kfswain maybe PostResponse should get the target pods of all the profiles, and not only of the primary profile? Or can PostResponse receive SchedulingResult as a parameter, symmetrically to PreRequest?

Collaborator


@VadimEisenberg PostResponse intentionally receives the target pod that actually SERVED the request.
There is another factor that wasn't taken into consideration in this discussion -
the EPP protocol allows defining multiple (prioritized) endpoints as candidates for serving.
More details here:
https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/004-endpoint-picker-protocol#destination-endpoint

This essentially tells Envoy that if the first endpoint in the list fails to serve the request, it should try the next endpoint in the list.
For all scorers (or other plugins) that maintain per-request state, it is useful to know which endpoint actually served the request, not which one was ranked first (and didn't necessarily serve successfully).
This information should be reported by Gateways back to the EPP:
https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/004-endpoint-picker-protocol#destination-endpoint-served

The above is still under development, and therefore we currently use the picker with a configuration of MaxEndpoints=1.

Let me know if that makes sense.


@nirrozenbaum I see. My problem currently is that active-request-scorer will not work correctly with regard to prefill pods. Prefill pods will be incremented in the PreRequest, but will not be decremented in PostResponse. They will be removed by TTL eventually.

Collaborator


@VadimEisenberg right.
I think that until the PostResponse issue is fixed, this scorer should either count only decode results, or it cannot be used as is.
cc: @vMaroon

Member Author


For P/D I think we can make do with the start/end PostResponse hooks, and the scorer can use the request ID to maintain a happy-path association of one prefill and one decode pod per request.
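A minimal sketch of this idea, with all names hypothetical: key the state by request ID, track one prefill and one decode pod, release the prefill pod at the start-of-response hook and the decode pod at the end-of-response hook:

```go
package main

import "fmt"

// Hypothetical sketch of the happy-path P/D association proposed above.
type pdEntry struct{ prefillPod, decodePod string }

type pdTracker struct {
	byRequest map[string]*pdEntry // request ID -> its prefill/decode pods
	counts    map[string]int      // pod name -> active-request count
}

func (t *pdTracker) preRequest(reqID, prefillPod, decodePod string) {
	t.byRequest[reqID] = &pdEntry{prefillPod, decodePod}
	t.counts[prefillPod]++
	t.counts[decodePod]++
}

// onResponseStart: prefill is considered done once response bytes arrive.
func (t *pdTracker) onResponseStart(reqID string) {
	if e := t.byRequest[reqID]; e != nil && e.prefillPod != "" {
		t.counts[e.prefillPod]--
		e.prefillPod = ""
	}
}

// onResponseEnd: decode is done when the response stream completes.
func (t *pdTracker) onResponseEnd(reqID string) {
	if e := t.byRequest[reqID]; e != nil {
		t.counts[e.decodePod]--
		delete(t.byRequest, reqID)
	}
}

func main() {
	t := &pdTracker{byRequest: map[string]*pdEntry{}, counts: map[string]int{}}
	t.preRequest("req-1", "pod-p", "pod-d")
	t.onResponseStart("req-1") // releases pod-p
	t.onResponseEnd("req-1")   // releases pod-d
	fmt.Println(t.counts)
}
```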

