SLO Aware Routing PD Disaggregation Support #511

RishabhSaini · 2025-12-12T19:25:32Z

Add PD-SLO Scheduling with Independent Prefill/Decode Optimization

Implements SLO-aware scheduling for PD (Prefill-Decode) disaggregated architecture with independent latency prediction and tiered epsilon-greedy selection.

Architecture

PD-SLO Optimizer (pkg/plugins/scorer/pd_slo_optimizer.go, pd_slo_selection.go): Independent pod scoring

Calculates headroom: bufferedSLO - predictedLatency for TTFT and TPOT separately
Buffered SLO = SLO × sloBufferFactor (default 0.9)
Tiered epsilon-greedy: exploits positive-headroom pods 99% of the time and explores the negative-headroom tier 1% of the time (configurable)
Weighted selection: Within each tier, selects pods proportionally to headroom scores (not uniform random)
Hierarchical negative tier: Prefers pods with zero running requests, then applies blended TTFT/TPOT deficit weighting
Uses latency predictor from GAIE (via latencypredictor package)
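
The headroom and tiered selection rules above can be sketched roughly as follows. This is a minimal illustration, not the code in pd_slo_selection.go: the Candidate type, the function names, the single headroom value per pod, and the 1/(1+deficit) weighting for the negative tier are all assumptions made for the example, and the zero-running-requests preference is left out.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Candidate is a hypothetical pod paired with its headroom in milliseconds.
type Candidate struct {
	Name     string
	Headroom float64
}

// headroom returns bufferedSLO - predictedLatency, where
// bufferedSLO = slo * bufferFactor (the PR's default factor is 0.9).
func headroom(sloMs, predictedMs, bufferFactor float64) float64 {
	return sloMs*bufferFactor - predictedMs
}

// weightedPick returns an index chosen with probability proportional to
// weights[i]. rnd must return a uniform value in [0, 1). If every weight is
// zero it falls back to a uniform choice. weights must be non-empty.
func weightedPick(weights []float64, rnd func() float64) int {
	total := 0.0
	for _, w := range weights {
		total += w
	}
	if total <= 0 {
		return int(rnd() * float64(len(weights)))
	}
	r := rnd() * total
	for i, w := range weights {
		r -= w
		if r < 0 {
			return i
		}
	}
	return len(weights) - 1
}

// selectPod splits candidates into a positive tier (headroom >= 0) and a
// negative tier, explores the negative tier with probability epsilon, and
// then picks within the chosen tier by weight.
func selectPod(cands []Candidate, epsilon float64, rnd func() float64) (string, bool) {
	var pos, neg []Candidate
	for _, c := range cands {
		if c.Headroom >= 0 {
			pos = append(pos, c)
		} else {
			neg = append(neg, c)
		}
	}
	tier, negative := pos, false
	if len(pos) == 0 || (len(neg) > 0 && rnd() < epsilon) {
		tier, negative = neg, true
	}
	if len(tier) == 0 {
		return "", false
	}
	weights := make([]float64, len(tier))
	for i, c := range tier {
		if negative {
			// Assumption: a smaller deficit earns a larger weight.
			weights[i] = 1 / (1 - c.Headroom)
		} else {
			weights[i] = c.Headroom
		}
	}
	return tier[weightedPick(weights, rnd)].Name, true
}

func main() {
	cands := []Candidate{
		{Name: "pod-a", Headroom: headroom(200, 120, 0.9)}, // comfortable headroom
		{Name: "pod-b", Headroom: headroom(200, 170, 0.9)}, // small headroom
		{Name: "pod-c", Headroom: headroom(200, 230, 0.9)}, // deficit
	}
	name, _ := selectPod(cands, 0.01, rand.Float64)
	fmt.Println("selected:", name)
}
```

Passing the random source as a function keeps the selection logic deterministic under test; with epsilon set to 0 the negative tier is used only when no pod has non-negative headroom.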

PD-SLO Profile Handler (pkg/plugins/profile/pd_slo_profile_handler.go): Coordinates dual-profile execution

Detects SLO headers (x-slo-ttft-ms, x-slo-tpot-ms)
Threshold-based PD decision: skips prefill when fewer than N tokens are non-cached, as estimated from the prefix cache hit percentage
Reads prefix cache state via prefixPluginName parameter
Runs prefill → decode profiles independently (not as joint pairs)
Sets x-prefiller-pod header for selected prefill pod
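
As a rough illustration of the threshold check, the sketch below estimates the non-cached token count from a cache hit ratio and compares it with a threshold. The function names, the ratio expressed in [0, 1], and the example threshold of 256 are assumptions for the example; the handler's actual parameters are not shown in this description.

```go
package main

import "fmt"

// nonCachedTokens estimates how many prompt tokens still need prefill, given
// a prefix cache hit ratio that is clamped to [0, 1].
func nonCachedTokens(promptTokens int, cacheHitRatio float64) int {
	if cacheHitRatio < 0 {
		cacheHitRatio = 0
	}
	if cacheHitRatio > 1 {
		cacheHitRatio = 1
	}
	cached := int(float64(promptTokens) * cacheHitRatio)
	return promptTokens - cached
}

// shouldSkipPrefill reports whether the request can go straight to a decode
// pod because fewer than threshold tokens are non-cached.
func shouldSkipPrefill(promptTokens int, cacheHitRatio float64, threshold int) bool {
	return nonCachedTokens(promptTokens, cacheHitRatio) < threshold
}

func main() {
	const threshold = 256 // hypothetical value of N
	fmt.Println(shouldSkipPrefill(1000, 0.75, threshold)) // 250 non-cached tokens
	fmt.Println(shouldSkipPrefill(1000, 0.50, threshold)) // 500 non-cached tokens
}
```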

Telemetry Collection (pkg/sidecar/proxy/connector_nixlv2.go, status_response_writer.go):

Sidecar measures prefill TTFT timing
Injects x-prefill-ttft-ms header via headerInjectorResponseWriter wrapper
Decode pod reports actual TTFT/TPOT for model training
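
A response-writer wrapper of the kind described above might look roughly like this. It is a sketch using only net/http, not the sidecar's headerInjectorResponseWriter: the headerInjector type, the injectTTFT helper, and the ttftMs callback standing in for the real timing measurement are assumptions for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
)

// headerInjector wraps an http.ResponseWriter and sets one extra header
// immediately before the response headers are written.
type headerInjector struct {
	http.ResponseWriter
	ttftMs      func() int64 // hypothetical source of the measured prefill TTFT
	wroteHeader bool
}

// WriteHeader injects x-prefill-ttft-ms once, then delegates.
func (w *headerInjector) WriteHeader(status int) {
	if !w.wroteHeader {
		w.wroteHeader = true
		w.Header().Set("x-prefill-ttft-ms", strconv.FormatInt(w.ttftMs(), 10))
	}
	w.ResponseWriter.WriteHeader(status)
}

// Write covers handlers that never call WriteHeader explicitly.
func (w *headerInjector) Write(b []byte) (int, error) {
	if !w.wroteHeader {
		w.WriteHeader(http.StatusOK)
	}
	return w.ResponseWriter.Write(b)
}

// injectTTFT wraps next so every response carries the TTFT header.
func injectTTFT(next http.Handler, ttftMs func() int64) http.Handler {
	return http.HandlerFunc(func(rw http.ResponseWriter, r *http.Request) {
		next.ServeHTTP(&headerInjector{ResponseWriter: rw, ttftMs: ttftMs}, r)
	})
}

// observedHeader runs a stub backend through the wrapper and returns the
// header value a client would see.
func observedHeader(ttftMs int64) string {
	backend := http.HandlerFunc(func(rw http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(rw, "ok")
	})
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/v1/completions", nil)
	injectTTFT(backend, func() int64 { return ttftMs }).ServeHTTP(rec, req)
	return rec.Result().Header.Get("x-prefill-ttft-ms")
}

func main() {
	fmt.Println("x-prefill-ttft-ms:", observedHeader(42))
}
```

Because the header is set inside WriteHeader, the value is read as late as possible, after the backend has started responding. A wrapper used for streamed responses would likely also need to forward http.Flusher, which this sketch omits.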

Metrics (pkg/metrics/metrics.go): Four Prometheus metrics track pod selections by headroom outcome, predictor call status, telemetry recording rates, and the headroom distribution (as a histogram).

Scheduling Flow

Request → Parse SLOs → Calculate prefix cache hit % → Threshold check (skip prefill if mostly cached) → Run prefill profile (select best prefill pod via weighted epsilon-greedy) → Run decode profile (select best decode pod via weighted epsilon-greedy) → Set prefiller header → Route to decode pod
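
The ordering of the flow above, including the skip branch, can be sketched as below. The Decision type and the injected pickPrefill and pickDecode functions are hypothetical stand-ins for the prefill and decode profiles; only the x-prefiller-pod header name comes from this PR.

```go
package main

import "fmt"

// Decision is the hypothetical outcome of one scheduling pass.
type Decision struct {
	PrefillPod string // empty when prefill is skipped
	DecodePod  string
	Headers    map[string]string
}

// schedule runs the stages in the order described in the flow: threshold
// check, prefill profile, decode profile, then the prefiller header.
func schedule(nonCached, threshold int, pickPrefill, pickDecode func() string) Decision {
	d := Decision{Headers: map[string]string{}}
	if nonCached >= threshold {
		d.PrefillPod = pickPrefill()
	}
	d.DecodePod = pickDecode()
	if d.PrefillPod != "" {
		d.Headers["x-prefiller-pod"] = d.PrefillPod
	}
	return d
}

func main() {
	prefill := func() string { return "prefill-0" }
	decode := func() string { return "decode-1" }
	fmt.Printf("%+v\n", schedule(1000, 256, prefill, decode))
	fmt.Printf("%+v\n", schedule(100, 256, prefill, decode))
}
```

The two profiles are called independently, so the decode choice never depends on which prefill pod was picked.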

Fallback Behavior

Returns nil scores when SLO headers absent (defers to other scorers)
Skips prefill when non-cached tokens < threshold
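
The nil-score fallback could look something like the sketch below. The SLO type, parseSLO, scorePods, and the injected scoreFn are hypothetical names; requiring both headers and using a plain case-sensitive map in place of real HTTP headers are simplifying assumptions of the example.

```go
package main

import (
	"fmt"
	"strconv"
)

// SLO holds the per-request latency targets in milliseconds.
type SLO struct {
	TTFTMs float64
	TPOTMs float64
}

// parseSLO reads x-slo-ttft-ms and x-slo-tpot-ms. ok is false when either
// header is missing or not a number (an assumption of this sketch).
func parseSLO(headers map[string]string) (slo SLO, ok bool) {
	ttft, err := strconv.ParseFloat(headers["x-slo-ttft-ms"], 64)
	if err != nil {
		return SLO{}, false
	}
	tpot, err := strconv.ParseFloat(headers["x-slo-tpot-ms"], 64)
	if err != nil {
		return SLO{}, false
	}
	return SLO{TTFTMs: ttft, TPOTMs: tpot}, true
}

// scorePods returns nil when no SLO is present so that other scorers decide;
// otherwise it applies scoreFn (a hypothetical per-pod scorer) to each pod.
func scorePods(headers map[string]string, pods []string, scoreFn func(SLO, string) float64) map[string]float64 {
	slo, ok := parseSLO(headers)
	if !ok {
		return nil
	}
	out := make(map[string]float64, len(pods))
	for _, p := range pods {
		out[p] = scoreFn(slo, p)
	}
	return out
}

func main() {
	flat := func(SLO, string) float64 { return 1 }
	pods := []string{"decode-0", "decode-1"}
	withSLO := map[string]string{"x-slo-ttft-ms": "500", "x-slo-tpot-ms": "50"}
	fmt.Println(scorePods(map[string]string{}, pods, flat) == nil)
	fmt.Println(len(scorePods(withSLO, pods, flat)))
}
```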

Dependencies

Updates GAIE to a custom fork with SLO-aware routing support (github.com/RishabhSaini/gateway-api-inference-extension), which supports pod_type in the training and prediction servers.

Links to related PRs:
llm-d/llm-d#442
kubernetes-sigs/gateway-api-inference-extension#1993


github-actions · 2025-12-12T19:25:42Z

🚨 Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits,
please see GitHub Documentation.

RishabhSaini added 6 commits December 12, 2025 08:37

slo pd support

07d7d21

pd-slo-profile handler has fallback if no slo-headers provided. Threshold for PD applies to both the with and without slo

5136d34

make fixes for compilation errors: PredictoInterface to concrete type and lifecyle management

b912de0

move to gaie main branch in go.mod

e61db0a

prefill predictor and decode predictor

0302f0f

switch gaie to slo-pd-route

d0e1fde

github-project-automation bot added this to llm-d-inference-scheduler Dec 12, 2025

make readme succinct

7a0684b

RishabhSaini added 18 commits December 15, 2025 09:35

feed telemetry to training server

b44dea1

telemetry fix issues using ctx

37f6b79

telemetry now registered in pair optimizer plugin

e0d541c

remove unused imports

65043a6

add logging

d44ac37

add more logging for fixing prefill telemetry

e5ccfdb

add ttft timing from prefill pod as a header to epp

67d592e

fix import issues

4e2490c

add respo to epp header

e60aebe

inject custom header into proxied response

4eb1be3

remove unneccesart imports

9b3db74

remove join optimization

1c8f542

remove unused fn decl

42d89bd

fix com[pi;e errors

21ba062

fix tokenlength issues

761c1ea

add telemetry on decisions

7d32dd8

fix telemetry namin conflicts

81550c1

add histogram to telemetry for headroom

198caab

RishabhSaini added 11 commits December 16, 2025 11:27

add epsilon-greedy with weighted exploration instead of greedy selection in exploration vs exploitation

1baf85f

fix compile issues with epsilon-greedy

80e1208

get fresh metrics across request lifecycle

80c3aa2

fix import

6bb8403

align scorer with gaie

032f617

only one trainig and predcition server needed

5b27c00

update gaie

8b16765

update gaie

bbb4abc

update gaie

436ecaf

update gaie

b9bf1e1

remove unneccesary comments and files

a8bb477
