[feature] Added Active-Request-Scorer #297
Conversation
Force-pushed from 80d0d08 to 212363d
The force-pushes dealt with the DCO sign-off and then undid the rebase that was attempted through the web GUI.
Thank you for reviewing; the comments were addressed. Ready for another round if needed.
elevran left a comment
I'll leave it to your discretion whether or not to add duration parsing.
/approve
Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com>
@elevran thanks Etai - added duration parsing and rebased.
/approve
@vMaroon will this scorer work for prefill-decode disaggregation, i.e. selecting the prefill instance and the decode instance that have the fewest active requests?

@VadimEisenberg yes - though note that it does not work as expected when streaming is enabled at the moment; see this tracker: kubernetes-sigs/gateway-api-inference-extension#1483
```go
// PreRequest: build the cache entry for the scheduled target pod.
entry := requestEntry{targetPod.NamespacedName.String(), request.RequestId}
```

```go
// PostResponse: remove the entry and decrement the pod's active count.
if _, found := s.requestCache.GetAndDelete(entry.String()); found {
	s.decrementPodCount(entry.PodName)
}
```
@vMaroon From reading the code, it seems PreRequest adds the first target pod of each of the ProfileResults, for example of both decode and prefill, while PostResponse removes only the primary target pod (the decode one). The target pod in PostResponse is set to the target pod of RequestCtx, which is the first target pod of the primary ProfileResult.
Ahh - apologies, I misunderstood. Then you're right, it does not track at that granularity in P/D.
We can definitely bump this up in priority if you provide more info. For example, given two-point PostResponse hooks (cc @kfswain), the start hook could mark the prefill as done and the end hook could update the decode.
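A rough sketch of that two-hook idea (the hook names, signatures, and overall shape here are assumptions for illustration, not the actual gateway-api-inference-extension API):

```go
package main

import "fmt"

// activeRequestScorer keeps a per-pod count of in-flight requests.
// This is a hypothetical sketch of the two-point PostResponse idea above,
// not the plugin's real implementation.
type activeRequestScorer struct {
	counts map[string]int // active requests per pod name
}

func (s *activeRequestScorer) decrementPodCount(pod string) {
	if s.counts[pod] > 0 {
		s.counts[pod]--
	}
}

// PostResponseStart would fire when the first response bytes arrive; by then
// the prefill pod is done, so its slot can be released.
func (s *activeRequestScorer) PostResponseStart(prefillPod string) {
	s.decrementPodCount(prefillPod)
}

// PostResponseEnd would fire when the response finishes streaming, releasing
// the decode pod's slot.
func (s *activeRequestScorer) PostResponseEnd(decodePod string) {
	s.decrementPodCount(decodePod)
}

func main() {
	s := &activeRequestScorer{counts: map[string]int{"prefill-0": 1, "decode-0": 1}}
	s.PostResponseStart("prefill-0")
	s.PostResponseEnd("decode-0")
	fmt.Println(s.counts) // both pods' counts released
}
```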
@VadimEisenberg PostResponse intentionally receives the target pod that actually SERVED the request.
There is another factor that wasn't taken into consideration in this discussion:
the EPP protocol allows defining multiple (prioritized) endpoints as candidates for serving. More details here:
https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/004-endpoint-picker-protocol#destination-endpoint
This essentially tells Envoy that if the first endpoint in the list failed to serve the request, Envoy should try the next endpoint in the list.
For all scorers (or other plugins) that maintain per-request state, it is useful to know which endpoint actually served the request, not which one was ranked first (and didn't necessarily serve successfully).
This information should be reported by gateways back to the EPP:
https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/004-endpoint-picker-protocol#destination-endpoint-served
The above is still under development, and therefore we currently use a picker configured with MaxEndpoints=1.
Let me know if that makes sense.
@nirrozenbaum I see. My problem currently is that the active-request-scorer will not work correctly with regard to prefill pods: prefill pods will be incremented in PreRequest but will not be decremented in PostResponse. They will only be removed eventually by the TTL.
@VadimEisenberg right.
I think that until the PostResponse issue is fixed, this scorer should either count only decode results, or it cannot be used as is.
cc: @vMaroon
For P/D I think we can make do with the start/end PostResponse hooks, and the scorer can use the request ID to maintain a happy-path association of one prefill pod and one decode pod per request.
Summary
Introduced a new load-based scorer that looks at the active requests being served through the EPP instance. Each request is tracked individually with its own TTL to ensure accurate timeout handling. Pods with fewer active requests receive higher scores.
The scorer maintains two data structures for efficiency: a TTL cache of per-request entries and a per-pod counter of active requests.
Scores are normalized to the range 0-1, where pods with fewer active requests get higher scores.
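As a rough illustration of the scoring, a min/max normalization over per-pod active-request counts might look like this (a sketch only; `normalizeScores` and the exact formula are assumptions, not the plugin's actual code):

```go
package main

import "fmt"

// normalizeScores maps per-pod active-request counts to scores in [0, 1],
// giving the pod with the fewest active requests the highest score.
// This is a hypothetical min/max normalization; the real scorer's formula
// may differ.
func normalizeScores(active map[string]int) map[string]float64 {
	scores := make(map[string]float64, len(active))
	if len(active) == 0 {
		return scores
	}
	min, max := -1, -1
	for _, c := range active {
		if min == -1 || c < min {
			min = c
		}
		if c > max {
			max = c
		}
	}
	for pod, c := range active {
		if max == min {
			scores[pod] = 1.0 // all pods equally loaded
			continue
		}
		scores[pod] = float64(max-c) / float64(max-min)
	}
	return scores
}

func main() {
	fmt.Println(normalizeScores(map[string]int{"pod-a": 0, "pod-b": 2, "pod-c": 4}))
}
```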