
@RishabhSaini RishabhSaini commented Dec 12, 2025

Add PD-SLO Scheduling with Independent Prefill/Decode Optimization

Implements SLO-aware scheduling for the PD (Prefill-Decode) disaggregated architecture, with independent latency prediction for each stage and tiered epsilon-greedy pod selection.

Architecture

PD-SLO Optimizer (pkg/plugins/scorer/pd_slo_optimizer.go, pd_slo_selection.go): Independent pod scoring

  • Calculates headroom: bufferedSLO - predictedLatency for TTFT and TPOT separately
  • Buffered SLO = SLO × sloBufferFactor (default 0.9)
  • Tiered epsilon-greedy: with probability 0.99, exploits pods with positive headroom; with probability 0.01, explores the negative-headroom tier (probabilities configurable)
  • Weighted selection: Within each tier, selects pods proportionally to headroom scores (not uniform random)
  • Hierarchical negative tier: Prefers pods with zero running requests, then applies blended TTFT/TPOT deficit weighting
  • Uses latency predictor from GAIE (via latencypredictor package)

PD-SLO Profile Handler (pkg/plugins/profile/pd_slo_profile_handler.go): Coordinates dual-profile execution

  • Detects SLO headers (x-slo-ttft-ms, x-slo-tpot-ms)
  • Threshold-based PD decision: skips prefill when the number of non-cached tokens, derived from the prefix cache hit percentage, is below a threshold N
  • Reads prefix cache state via prefixPluginName parameter
  • Runs prefill → decode profiles independently (not as joint pairs)
  • Sets x-prefiller-pod header for selected prefill pod
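
A minimal sketch of the threshold check, assuming the hit percentage is available as a fraction in [0, 1] and the threshold is a token count. The function names and the example threshold of 256 are hypothetical and are not taken from pd_slo_profile_handler.go.

```go
package main

import "fmt"

// nonCachedTokens estimates how many prompt tokens still need prefill,
// given the fraction of the prompt already covered by the prefix cache.
func nonCachedTokens(totalTokens int, cacheHitFraction float64) int {
	if cacheHitFraction < 0 {
		cacheHitFraction = 0
	}
	if cacheHitFraction > 1 {
		cacheHitFraction = 1
	}
	cached := int(float64(totalTokens) * cacheHitFraction)
	return totalTokens - cached
}

// shouldRunPrefill reports whether the prefill profile should run.
// Prefill is skipped when fewer than threshold tokens are non-cached.
func shouldRunPrefill(totalTokens int, cacheHitFraction float64, threshold int) bool {
	return nonCachedTokens(totalTokens, cacheHitFraction) >= threshold
}

func main() {
	const threshold = 256 // hypothetical value
	for _, hit := range []float64{0.5, 0.75} {
		fmt.Printf("hit=%.2f nonCached=%d runPrefill=%v\n",
			hit, nonCachedTokens(1000, hit), shouldRunPrefill(1000, hit, threshold))
	}
}
```

When the check returns false, only the decode profile would run and the decode pod handles the small remaining prefill itself.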

Telemetry Collection (pkg/sidecar/proxy/connector_nixlv2.go, status_response_writer.go):

  • Sidecar measures prefill TTFT timing
  • Injects x-prefill-ttft-ms header via headerInjectorResponseWriter wrapper
  • Decode pod reports actual TTFT/TPOT for model training

Metrics (pkg/metrics/metrics.go): Four Prometheus metrics track pod selections by headroom outcome, predictor call status, telemetry recording rates, and the headroom distribution (as a histogram).

Scheduling Flow

Request → Parse SLOs → Calculate prefix cache hit % → Threshold check (skip prefill if mostly cached) → Run prefill profile (select best prefill pod via weighted epsilon-greedy) → Run decode profile (select best decode pod via weighted epsilon-greedy) → Set prefiller header → Route to decode pod

Fallback Behavior

  • Returns nil scores when SLO headers are absent (defers to other scorers)
  • Skips prefill when non-cached tokens < threshold

Dependencies

Updates the GAIE dependency to a custom fork with SLO-aware routing support (github.com/RishabhSaini/gateway-api-inference-extension), which supports pod_type in the training and prediction server.

Links to related PRs:
llm-d/llm-d#442
kubernetes-sigs/gateway-api-inference-extension#1993

@github-actions
🚨 Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits,
please see GitHub Documentation.
