
@kaushikmitr kaushikmitr commented Nov 5, 2025

This pull request expands the latency prediction infrastructure and improves the system's configurability and scalability. The main changes are: increasing the number of prediction server sidecars from 3 to 10, updating all relevant configuration and service definitions accordingly, making the sampling parameters configurable via environment variables, and updating several image references to newer versions. The most important changes, grouped by theme:

Latency Prediction Infrastructure Expansion:

  • Increased the number of latency prediction server sidecars from 3 to 10 in the inferencepool-resources-lp.yaml manifest, including all necessary container definitions, ports, probes, resources, and dedicated storage volumes for each new server (prediction-server-4 through prediction-server-10).
  • Updated the PREDICTION_SERVER_URL environment variable to include all 10 prediction servers, ensuring the main application can route requests to all available predictors.
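
As a rough sketch, the updated env var would enumerate all ten sidecars. The listen ports below (8001 through 8010) are illustrative assumptions, not values taken from the actual manifest:

```yaml
# Hypothetical excerpt from inferencepool-resources-lp.yaml; the sidecar
# ports are assumptions for illustration only.
- name: PREDICTION_SERVER_URL
  value: "http://localhost:8001,http://localhost:8002,http://localhost:8003,http://localhost:8004,http://localhost:8005,http://localhost:8006,http://localhost:8007,http://localhost:8008,http://localhost:8009,http://localhost:8010"
```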

Configuration and Plugin Updates:

  • Added a new plugin type prefix-cache-scorer to the scheduling configuration and included it in both the default and slo scheduling profiles.
  • Introduced a new environment variable LATENCY_QUANTILE_ALPHA for latency prediction configuration, allowing the quantile used for latency estimates to be tuned.
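
A minimal sketch of what these two additions might look like. The field layout follows the common EndpointPickerConfig shape, and the profile contents and the default alpha value are assumptions, not copied from the PR:

```yaml
# Scheduling config sketch (field names and profile contents assumed).
plugins:
- type: prefix-cache-scorer
schedulingProfiles:
- name: default
  plugins:
  - pluginRef: prefix-cache-scorer
- name: slo
  plugins:
  - pluginRef: prefix-cache-scorer
---
# Env var controlling the quantile used for latency estimates,
# e.g. 0.9 for the 90th percentile (the value here is an assumption).
- name: LATENCY_QUANTILE_ALPHA
  value: "0.9"
```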

Image and Version Updates:

  • Updated container image references for the main EPP container, training server, and all prediction servers to use the new latencypredictor-v3 images from a different registry, reflecting a move to newer builds and possibly a new environment.

Sampling Parameter Improvements:

  • Made the Poisson sampling parameters (DefaultSamplingMean and MaxSampledTokens) configurable via environment variables, replacing hardcoded values and improving flexibility for tuning latency prediction sampling.

Logging and Debugging:

  • Moved the log level for composite score calculations in the SLO-aware router from DEBUG to TRACE, so these details are emitted only at the highest verbosity and no longer add noise at ordinary debug levels.
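
The effect of that change can be sketched with the klog/logr verbosity convention used by many Kubernetes components, where a message is emitted only when the configured verbosity meets its level. The numeric values for DEBUG and TRACE below are conventional assumptions, not confirmed from this project's code:

```go
package main

import "fmt"

// Conventional klog/logr verbosity levels (assumed, not taken from the PR).
const (
	DEBUG = 4
	TRACE = 6
)

// shouldLog mimics logr's logger.V(level).Enabled(): a message is emitted
// only when the configured verbosity is at least the message's level.
func shouldLog(verbosity, level int) bool {
	return verbosity >= level
}

func main() {
	// Composite-score logs at TRACE are suppressed at DEBUG verbosity and
	// appear only once verbosity is raised to TRACE or above.
	fmt.Println(shouldLog(DEBUG, TRACE)) // false
	fmt.Println(shouldLog(TRACE, TRACE)) // true
}
```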

These changes collectively enhance the scalability, flexibility, and observability of the latency prediction system.

@BenjaminBraunDev BenjaminBraunDev merged commit 899add9 into BenjaminBraunDev:slo-aware-routing-stage-3 Nov 7, 2025
2 of 4 checks passed