Conversation
Metrics with LatencyPredictionScorer: [results table/image not captured in this export]

Metrics without LatencyPredictionScorer, using the PD guide: [results table/image not captured in this export]
For PD:
Summary: The SLO-aware latency prediction scorer delivers significant latency improvements across all percentiles.
The slight throughput reduction (-5%) is expected: the predictor prioritizes load-balanced, low-latency pods over maximizing absolute throughput, resulting in better QoS and reliability.
@RishabhSaini what is the baseline here (WITHOUT Latency Predictor)? How was the EPP set up for that scenario?
Can we summarize the approach we are taking here? Is it using TTFT prediction to pick the prefill endpoint and ITL to pick the decode endpoint? Also, can we have the scorers in IGW instead of here?
The base scorer (PredictedLatency) lives in GAIE and contains the generic LatencyPredictionScorer logic. The P/D-specific wrapper (PDSLOAwareRouter) lives in llm-d-inference-scheduler because it contains P/D disaggregation logic specific to llm-d.
This maintains a clean separation: GAIE provides the mechanism (how to score and predict), while llm-d-inference-scheduler provides the policy (how to handle P/D-specific concerns such as pod type labels and dual-pod tracking).
Uses GAIE's LatencyPredictorScorer for both profiles with
@RishabhSaini Since GAIE provides the mechanism (how to score and predict), can the entire logic for handling P/D-specific concerns (pod type labels and dual-pod tracking) be moved to GAIE? Then we would only configure the scorer appropriately in the llm-d YAMLs, by selecting the right profile and associating the right predictedlatency scorer config with each profile.
pkg/sidecar/proxy/timing_writer.go
// responseHeaderPrefillTTFTMs reports the actual prefill TTFT in milliseconds to EPP
// for training data collection. EPP's SLOAwareRouter extracts this header in the
// ResponseReceived hook and records training data for the prefill pod.
responseHeaderPrefillTTFTMs = "x-prefill-ttft-ms"
I am a little confused why this is necessary. Can't we just reuse the requestcontrol hooks to measure TTFT and TPOT, just as we are doing in the EPP? We should already have the information as to which pods (prefill or decode) the TTFT/TPOT values are coming from.
The EPP, from its hooks, only sees:
PreRequest(decodeEndpoint): sending the request to the decode pod
ResponseReceived(decodeEndpoint): receiving the response from the decode pod
The EPP never directly communicates with the prefill pod. That happens inside the routing-proxy sidecar for the P/D llm-d deployment.
This diagram explains it well: https://github.com/llm-d/llm-d-inference-scheduler/blob/main/docs/disagg_pd.md#architectural-details
With this header, the routing-proxy measures prefill latency internally and injects it into the response as "x-prefill-ttft-ms". The EPP extracts this and records training data for the prefill endpoint.
I see, so the requestcontrol plugins in GAIE cannot handle the disagg scenario?
https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/framework/interface/requestcontrol/plugins.go
Actually, the requestcontrol hooks only need to grab the metrics from the right endpoint: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/framework/plugins/scheduling/scorer/predictedlatency/requestcontrol_hooks.go
The only caveat is that we can only grab the TTFT as returned by the decode pod, which includes the network hop, but I think that is the true TTFT and we should include it in our training.
The EPP can query the prefill pod's aggregate metrics (via the Prometheus/vLLM metrics endpoint). But since the EPP does not track individual prefill pod requests, per-request timing is not possible with the current P/D architecture; only the routing-proxy sidecar has access to the local per-request measurements.
The reason we need a routing sidecar on the decode pod is to coordinate the KV cache transfer (Decode -> Prefill -> Decode) for a request.
@kaushikmitr OK, as discussed in today's sync, I explored Experimental_DefaultPrefillProfile and was able to get rid of handling addRequestToQueue and removeRequestFromQueue for the Prefill endpoint type. So extending the requestcontrol hooks PreRequest and ResponseComplete is no longer needed in llm-d-inference-scheduler; these are now handled within GAIE.
However, you would still need to return x-prefill-ttft-ms so it can be read by the ResponseReceived hook. This will be included in the training entry for the Prefill profile.
- Add PDPredictionRequestBuilder to populate PodType from llm-d.ai/role labels
- Add pd-slo-aware-scorer plugin wrapping slo_aware_router with P/D builder
- Register pd-slo-aware-scorer in plugin registry
- Add example EPP config for P/D SLO-aware scheduling (pd-slo-epp-config.yaml)
- Add comprehensive guide on P/D SLO scheduling (docs/pd-slo-aware-scheduling.md)

Enables separate latency prediction models for prefill vs decode workloads.
This PR is no longer needed. The role-aware latency prediction functionality has been moved to GAIE's predicted-latency scorer. Instead of the wrapper added here, deployments can now configure the base GAIE scorer with:

- type: predicted-latency-scorer
  parameters:
    endpointRoleLabel: "llm-d.ai/role"

See GAIE PR: kubernetes-sigs/gateway-api-inference-extension#2145
Blocked on GAIE