
LatencyPredictionScorer for PD #564

Closed
RishabhSaini wants to merge 6 commits into llm-d:main from RishabhSaini:sloAwareRouter

Conversation

@RishabhSaini commented Jan 14, 2026

@github-actions

🚨 Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.

@RishabhSaini (Author)

Metrics With LatencyPredictionScorer:

metrics:
  latency:
    inter_token_latency:
      max: 9.048826742492267
      mean: 6.968183873615024
      min: 0.012664986936838036
      mode: 6.990179919556484
      p0p1: 0.05423622643387558
      p1: 0.07692599456582293
      p10: 6.246632377573308
      p25: 6.92957519684862
      p5: 2.8817461641042823
      p50: 7.268913640271897
      p75: 7.679152008671089
      p90: 8.08750185473212
      p95: 8.268135506034698
      p99: 8.56772845223446
      p99p9: 8.748382529956382
      stddev: 1.5514035084191853
      units: ms/token
    request_latency:
      max: 5.730879068374634
      mean: 1.941128066895235
      min: 0.9017930030822754
      mode: 1.4029786586761475
      p0p1: 0.9767541885375977
      p1: 1.1386914253234863
      p10: 1.2835934162139893
      p25: 1.412358045578003
      p5: 1.2434842586517334
      p50: 1.8323948383331299
      p75: 2.319676399230957
      p90: 2.6365718841552734
      p95: 3.1370344161987305
      p99: 3.9784305095672607
      p99p9: 5.106745481491089
      stddev: 0.6338312467179671
      units: ms
    time_per_output_token:
      max: 38.193580309549965
      mean: 13.046563672782563
      min: 6.013398170471191
      mode: 9.552504221598307
      p0p1: 6.964163780212402
      p1: 7.771388689676921
      p10: 8.654205004374186
      p25: 9.47872002919515
      p5: 8.368279139200846
      p50: 12.337344487508139
      p75: 15.565009117126465
      p90: 17.673921585083008
      p95: 21.02405462807756
      p99: 26.988070710261066
      p99p9: 34.04345989227295
      stddev: 4.255576616120952
      units: ms/token
    time_to_first_token:
      max: 5265.411138534546
      mean: 910.783347086975
      min: 105.5290699005127
      mode: 351.94921493530273
      p0p1: 120.76783180236816
      p1: 152.2815227508545
      p10: 222.08595275878906
      p25: 313.3974075317383
      p5: 194.6730613708496
      p50: 835.9270095825195
      p75: 1244.2352771759033
      p90: 1664.6668910980225
      p95: 2222.6803302764893
      p99: 3607.8834533691406
      p99p9: 4754.289865493774
      stddev: 730.3735751206832
      units: ms
  requests:
    failures: 9
    incomplete: 72
    input_length:
      max: 1000.0
      mean: 953.9056718845293
      min: 833.0
      mode: 1000.0
      p0p1: 857.0
      p1: 885.0
      p10: 915.0
      p25: 934.0
      p5: 904.0
      p50: 955.0
      p75: 976.0
      p90: 995.0
      p95: 1000.0
      p99: 1000.0
      p99p9: 1000.0
      stddev: 29.042069546929625
      units: count
    output_length:
      max: 150.0
      mean: 148.69343362472048
      min: 116.0
      mode: 150.0
      p0p1: 123.0
      p1: 132.0
      p10: 145.0
      p25: 150.0
      p5: 141.0
      p50: 150.0
      p75: 150.0
      p90: 150.0
      p95: 150.0
      p99: 150.0
      p99p9: 150.0
      stddev: 3.6972679348630924
      units: count
    total: 5000
  throughput:
    output_tokens_per_sec: 7462.120410565094
    requests_per_sec: 49.190000000000005
    total_tokens_per_sec: 55333.49448800091
  time:
    duration: 100.0
    start: 1770255687.344836
    stop: 1770255787.344836
scenario:
  load:
    args:
      backend: openai_http
      backend_kwargs: null
      cooldown: null
      data:
      - '{''prompt_tokens_min'': 800, ''prompt_tokens_max'': 1000, ''prompt_tokens'':
        955, ''prompt_tokens_stdev'': 31, ''output_tokens_min'': 50, ''output_tokens_max'':
        150, ''output_tokens'': 162, ''output_tokens_stdev'': 13, ''samples'': -1}'
      data_args: []
      data_collator: generative
      data_column_mapper: generative_column_mapper
      data_num_workers: 1
      data_request_formatter: text_completions
      data_sampler: null
      data_samples: -1
      dataloader_kwargs: null
      max_error_rate: null
      max_errors: null
      max_global_error_rate: null
      max_requests: null
      max_seconds: 100
      model: openai/gpt-oss-120b
      output_dir: /requests/guidellm_1770255657_infra-pd-inference-gateway-istio-120b-base_1
      outputs:
      - results.json
      over_saturation: null
      prefer_response_metrics: true
      processor: null
      processor_args: null
      profile: constant
      rampup: 0.0
      random_seed: 42
      rate:
      - 50.0
      sample_requests: 10
      target: http://infra-pd-inference-gateway-istio.llm-d-pd-r.svc.cluster.local:80
      warmup: null
    metadata:
      stage: 0
    name: guidellm
  model:
    name: openai/gpt-oss-120b
version: '0.1'

@RishabhSaini (Author)

Metrics without LatencyPredictionScorer using PD Guide:

metrics:
  latency:
    inter_token_latency:
      max: 10.973376715743301
      mean: 8.34087350700517
      min: 5.026508497711796
      mode: 7.377678915958277
      p0p1: 5.432762555628015
      p1: 6.434498514447894
      p10: 7.235563841442134
      p25: 7.547504949889727
      p5: 7.06044459502969
      p50: 8.008612526787651
      p75: 9.048196293363635
      p90: 10.183751177625592
      p95: 10.422745006996513
      p99: 10.656614431598843
      p99p9: 10.8541176236909
      stddev: 1.0909205958278747
      units: ms/token
    request_latency:
      max: 11.338996648788452
      mean: 3.3373313657432906
      min: 0.9317543506622314
      mode: 4.807129859924316
      p0p1: 1.064784288406372
      p1: 1.686671495437622
      p10: 1.922729730606079
      p25: 2.211632251739502
      p5: 1.8284733295440674
      p50: 2.5789663791656494
      p75: 3.6431615352630615
      p90: 5.019238233566284
      p95: 8.351550817489624
      p99: 11.17785096168518
      p99p9: 11.310513019561768
      stddev: 1.9632946205919786
      units: ms
    time_per_output_token:
      max: 83.19439129395919
      mean: 22.4418730932295
      min: 6.2100521723429365
      mode: 15.962241490681967
      p0p1: 7.097036043802897
      p1: 11.412003835042318
      p10: 12.919944127400717
      p25: 14.873595237731934
      p5: 12.301375071207682
      p50: 17.359120051066082
      p75: 24.418218930562336
      p90: 33.76402219136556
      p95: 55.823912620544434
      p99: 74.82348124186198
      p99p9: 79.08243997722653
      stddev: 13.184826022030684
      units: ms/token
    time_to_first_token:
      max: 9867.053031921387
      mean: 2104.9683612219137
      min: 107.48100280761719
      mode: 2871.61922454834
      p0p1: 114.22514915466309
      p1: 552.6218414306641
      p10: 763.0441188812256
      p25: 1079.9410343170166
      p5: 660.6545448303223
      p50: 1397.4738121032715
      p75: 2270.2620029449463
      p90: 3585.4697227478027
      p95: 6787.7771854400635
      p99: 9686.982154846191
      p99p9: 9814.833879470825
      stddev: 1863.9344383988955
      units: ms
  requests:
    failures: 8
    incomplete: 137
    input_length:
      max: 1000.0
      mean: 953.8436341161929
      min: 833.0
      mode: 1000.0
      p0p1: 857.0
      p1: 885.0
      p10: 915.0
      p25: 934.0
      p5: 904.0
      p50: 955.0
      p75: 976.0
      p90: 995.0
      p95: 1000.0
      p99: 1000.0
      p99p9: 1000.0
      stddev: 29.03158247254098
      units: count
    output_length:
      max: 150.0
      mean: 148.6864441697569
      min: 116.0
      mode: 150.0
      p0p1: 123.0
      p1: 132.0
      p10: 145.0
      p25: 150.0
      p5: 141.0
      p50: 150.0
      p75: 150.0
      p90: 150.0
      p95: 150.0
      p99: 150.0
      p99p9: 150.0
      stddev: 3.7057169172928504
      units: count
    total: 4999
  throughput:
    output_tokens_per_sec: 7861.616706010213
    requests_per_sec: 48.540000000000006
    total_tokens_per_sec: 58294.95036153687
  time:
    duration: 100.0
    start: 1770256285.8556619
    stop: 1770256385.8556619
scenario:
  load:
    args:
      backend: openai_http
      backend_kwargs: null
      cooldown: null
      data:
      - '{''prompt_tokens_min'': 800, ''prompt_tokens_max'': 1000, ''prompt_tokens'':
        955, ''prompt_tokens_stdev'': 31, ''output_tokens_min'': 50, ''output_tokens_max'':
        150, ''output_tokens'': 162, ''output_tokens_stdev'': 13, ''samples'': -1}'
      data_args: []
      data_collator: generative
      data_column_mapper: generative_column_mapper
      data_num_workers: 1
      data_request_formatter: text_completions
      data_sampler: null
      data_samples: -1
      dataloader_kwargs: null
      max_error_rate: null
      max_errors: null
      max_global_error_rate: null
      max_requests: null
      max_seconds: 100
      model: openai/gpt-oss-120b
      output_dir: /requests/guidellm_1770256257_infra-pd-inference-gateway-istio-120b-base_1
      outputs:
      - results.json
      over_saturation: null
      prefer_response_metrics: true
      processor: null
      processor_args: null
      profile: constant
      rampup: 0.0
      random_seed: 42
      rate:
      - 50.0
      sample_requests: 10
      target: http://infra-pd-inference-gateway-istio.llm-d-pd-r.svc.cluster.local:80
      warmup: null
    metadata:
      stage: 0
    name: guidellm
  model:
    name: openai/gpt-oss-120b
version: '0.1'

@RishabhSaini (Author) commented Feb 5, 2026

For PD:

| Metric | WITHOUT Latency Predictor | WITH Latency Predictor | Difference |
|---|---|---|---|
| Success Rate | 97.1% (4854/4999) | 98.4% (4919/5000) | +1.3% |
| TTFT Mean | 2105 ms | 911 ms | -56.7% |
| TTFT P50 | 1397 ms | 836 ms | -40.2% |
| TTFT P99 | 9687 ms | 3608 ms | -62.8% |
| TPOT Mean | 22.4 ms | 13.0 ms | -41.9% |
| TPOT P50 | 17.4 ms | 12.3 ms | -28.9% |
| TPOT P99 | 74.8 ms | 27.0 ms | -63.9% |
| Throughput | 7862 tok/s | 7462 tok/s | -5.1% |

Summary

The SLO-aware latency prediction scorer delivers significant latency improvements across all percentiles:

  • ~57% faster mean TTFT
  • ~42% faster mean TPOT
  • ~63% reduction in P99 tail latencies
  • Higher success rate (98.4% vs 97.1%)

The slight throughput reduction (-5%) is expected: the predictor prioritizes load-balanced, low-latency pods over maximizing absolute throughput, trading a small amount of raw throughput for better QoS and reliability.

@kaushikmitr


@RishabhSaini what is the baseline here (WITHOUT Latency Predictor)? How was the EPP set up for that scenario?

@ahg-g commented Feb 6, 2026

Can we summarize the approach we are taking here? Is it using TTFT prediction to pick the prefill endpoint and ITL to pick the decode endpoint?

Also, can we have the scorers in IGW instead of here?

@RishabhSaini (Author) commented Feb 7, 2026

> Also, can we have the scorers in IGW instead of here?

The base scorer (PredictedLatency) lives in GAIE and contains the generic LatencyPredictionScorer logic. The P/D-specific wrapper (PDSLOAwareRouter) lives in llm-d-inference-scheduler because it contains P/D disaggregation logic specific to llm-d:

  • Track both prefill and decode pods in running request lists (runningRequestLists) for accurate load visibility
  • Extract prefill timing from x-prefill-ttft-ms header (added by routing-proxy sidecar)
  • Record training data for prefill pods using GAIE's RecordTrainingForProfile() API

This maintains clean separation: GAIE provides the mechanism (how to score and predict), while llm-d-inference-scheduler provides the policy (how to handle P/D-specific concerns like pod type labels and dual-pod tracking).
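
To make the header-extraction path concrete, here is a minimal sketch with stand-in types; RecordTrainingForProfile is the GAIE API named above, but its signature and the request/response types here are assumptions, not the real interfaces:

```go
package scorer

import "strconv"

// Stand-in types; the real GAIE/llm-d-inference-scheduler interfaces differ.
type response struct{ headers map[string]string }

type request struct{ prefillPod string }

type trainer interface {
	// Named after GAIE's RecordTrainingForProfile() API mentioned above;
	// the actual signature is an assumption.
	RecordTrainingForProfile(profile, pod string, ttftMs float64)
}

// PDSLOAwareRouter wraps the base scorer with P/D-specific handling.
type PDSLOAwareRouter struct{ t trainer }

// responseReceived sketches the hook logic: extract the prefill TTFT that the
// routing-proxy sidecar injected into the response, and record it as a
// training sample for the prefill pod.
func (s *PDSLOAwareRouter) responseReceived(req *request, resp *response) {
	raw, ok := resp.headers["x-prefill-ttft-ms"]
	if !ok {
		return // non-P/D response: nothing to record for prefill
	}
	ttftMs, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return // malformed header value; skip the sample
	}
	s.t.RecordTrainingForProfile("prefill", req.prefillPod, ttftMs)
}
```

The base scorer already handles decode-side samples generically, so the wrapper only has to contribute the prefill observation.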

@RishabhSaini (Author)

> Can we summarize the approach we are taking here?

Uses GAIE's LatencyPredictorScorer for both profiles, with a pod_type categorical feature that keeps training data separated per role within the ML model. The prefill profile learns TTFT-dominant behavior (prompt processing), while the decode profile learns TTFT+TPOT (startup + generation).
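
As a rough sketch of that idea (the builder and field names are illustrative, not the actual GAIE feature schema), the pod_type feature can be derived straight from the llm-d.ai/role pod label used elsewhere in this PR:

```go
package scorer

// predictionRequest is a stand-in for the predictor's input features. The
// PodType categorical feature lets one predictor keep prefill and decode
// samples separate, so each role learns its own latency behavior.
type predictionRequest struct {
	PromptTokens int
	PodType      string // "prefill" or "decode"
}

// buildPredictionRequest derives PodType from the llm-d.ai/role pod label.
func buildPredictionRequest(podLabels map[string]string, promptTokens int) predictionRequest {
	return predictionRequest{
		PromptTokens: promptTokens,
		PodType:      podLabels["llm-d.ai/role"],
	}
}
```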

@kaushikmitr commented Feb 9, 2026

@RishabhSaini Since GAIE provides the mechanism (how to score and predict), can the entire logic for handling P/D-specific concerns (pod type labels and dual-pod tracking) be moved to GAIE, so we only configure the scorer in the llm-d YAMLs by selecting the right profile and associating the right predictedlatency scorer config with each profile?

// responseHeaderPrefillTTFTMs reports the actual prefill TTFT in milliseconds to EPP
// for training data collection. EPP's SLOAwareRouter extracts this header in the
// ResponseReceived hook and records training data for the prefill pod.
responseHeaderPrefillTTFTMs = "x-prefill-ttft-ms"


I am a little confused why this is necessary. Can't we just reuse the requestcontrol hooks to measure TTFT and TPOT, just as we do in EPP? We should already have the information about which pods (prefill or decode) the TTFT/TPOT values are coming from.

@RishabhSaini (Author) Feb 9, 2026

The EPP, from its hooks, only sees:
- PreRequest(decodeEndpoint): sending the request to the decode pod
- ResponseReceived(decodeEndpoint): receiving the response from the decode pod

EPP never directly communicates with the prefill pod; that happens inside the routing-proxy sidecar in the P/D llm-d deployment.

This diagram explains it well: https://github.com/llm-d/llm-d-inference-scheduler/blob/main/docs/disagg_pd.md#architectural-details

With this header, the routing-proxy measures prefill latency internally and injects it into the response as x-prefill-ttft-ms. EPP extracts this header and records training data for the prefill endpoint.
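
A minimal sketch of what that sidecar-side measurement could look like; the actual routing-proxy implementation differs, and only the header name comes from this PR:

```go
package proxy

import (
	"net/http"
	"strconv"
	"time"
)

// callPrefillAndReport times the proxy's internal call to the prefill pod and
// reports the measured prefill TTFT to EPP via the x-prefill-ttft-ms response
// header, which the ResponseReceived hook then reads for training.
func callPrefillAndReport(w http.ResponseWriter, doPrefill func() error) error {
	start := time.Now()
	if err := doPrefill(); err != nil {
		return err // prefill call failed; nothing to report
	}
	ttftMs := float64(time.Since(start)) / float64(time.Millisecond)
	w.Header().Set("x-prefill-ttft-ms", strconv.FormatFloat(ttftMs, 'f', 3, 64))
	return nil
}
```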

@kaushikmitr Feb 10, 2026

I see, so the requestcontrol plugins in GAIE cannot handle the disaggregated scenario?
https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/framework/interface/requestcontrol/plugins.go
Actually, the requestcontrol hooks only need to grab the metrics from the right endpoint: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/framework/plugins/scheduling/scorer/predictedlatency/requestcontrol_hooks.go

The only caveat is that we can only grab TTFT as returned by the decode pod, which includes the network hop, but I think that is the true TTFT and we should include it in our training.

@RishabhSaini (Author) Feb 10, 2026

EPP can query the prefill pod's aggregate metrics (via the Prometheus/vLLM metrics endpoint), but since EPP does not track individual prefill pod requests, per-request timing is not possible with the current P/D architecture. Only the routing-proxy sidecar has access to the local per-request measurements.
The reason we need a routing sidecar on the decode pod is to coordinate the KV cache transfer (Decode->Prefill->Decode) for a request.

@RishabhSaini (Author) Feb 10, 2026

@kaushikmitr OK, as discussed in today's sync, I explored the Experimental_DefaultPrefillProfile and was able to get rid of handling addRequestToQueue and removeRequestFromQueue for the prefill endpoint type. So extending the requestcontrol hooks PreRequest and ResponseComplete is no longer needed in llm-d-inference-scheduler; these are now handled within GAIE.
However, you would still need to return x-prefill-ttft-ms so it can be read by the ResponseReceived hook. This will be included in the training entry for the prefill profile.

RishabhSaini force-pushed the sloAwareRouter branch 3 times, most recently from 6d86938 to b43cfc0 on February 11, 2026 at 00:37
  - Add PDPredictionRequestBuilder to populate PodType from llm-d.ai/role labels
  - Add pd-slo-aware-scorer plugin wrapping slo_aware_router with P/D builder
  - Register pd-slo-aware-scorer in plugin registry
  - Add example EPP config for P/D SLO-aware scheduling (pd-slo-epp-config.yaml)
  - Add comprehensive guide on P/D SLO scheduling (docs/pd-slo-aware-scheduling.md)

  Enables separate latency prediction models for prefill vs decode workloads.
@RishabhSaini (Author) commented Feb 11, 2026

@kaushikmitr @ahg-g

This PR is no longer needed. The role-aware latency prediction functionality has been moved to GAIE's predicted-latency-scorer with a configurable endpointRoleLabel parameter in the EPP config. This eliminates the need for llm-d-specific scorer code.

Additionally, the prefill-ttft-ms header approach has been replaced with a direct timestamp calculation (ResponseReceived - RequestReceived) in GAIE, which captures the full prefill latency including scheduling, queuing, KV cache processing, and network transfer time.
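
In other words, the recorded prefill sample reduces to a timestamp delta between the two hooks; a trivial sketch, with assumed parameter names:

```go
package scorer

import "time"

// prefillLatencyMs is illustrative of the GAIE calculation described above:
// the wall-clock delta between the RequestReceived and ResponseReceived hooks,
// which inherently includes scheduling, queueing, KV cache processing, and
// network transfer time.
func prefillLatencyMs(requestReceived, responseReceived time.Time) float64 {
	return float64(responseReceived.Sub(requestReceived)) / float64(time.Millisecond)
}
```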

Instead, deployments can now configure the base GAIE scorer with:

- type: predicted-latency-scorer
  parameters:
    endpointRoleLabel: "llm-d.ai/role"

See GAIE PR: kubernetes-sigs/gateway-api-inference-extension#2145
