Important
Terminology Change: The Inference Scheduler has been renamed to llm-d Router to better reflect its role as the intelligent entry point for inference requests in the llm-d stack.
Important
API & Code Consolidation: Core EPP code and the InferenceObjective and InferenceModelRewrite APIs have been merged into this repository from Gateway API Inference Extension (GIE). The GIE repository now exclusively hosts the InferencePool API—an extension of the Kubernetes Gateway API—and defines the Endpoint Picker Protocol.
The llm-d Router is the intelligent entry point for inference traffic, delivering LLM load and prefix-cache aware routing, request prioritization, and advanced flow control across diverse request formats to fulfill complex serving objectives. It can be deployed in Standalone Mode for lightweight setups or integrated with cloud provider managed load balancing solutions via the Kubernetes Gateway API.
The router achieves its intelligence through an Endpoint Picker (EPP) that integrates with production-grade proxies (such as Envoy) via the ext-proc protocol, injecting real-time signals into the data plane to optimize request placement.
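At the protocol level, a placement decision is essentially a header mutation: the EPP answers the proxy's ext-proc request-headers message with the endpoint it has chosen, and the proxy forwards the request there. The Go sketch below illustrates the shape of such a reply using the go-control-plane ext-proc types; the header name and endpoint value are illustrative assumptions rather than this repository's actual implementation.

```go
package main

import (
	"fmt"

	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	extprocv3 "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
)

// pickEndpointResponse builds an ext-proc reply that steers one request to a
// chosen model-server endpoint by mutating the request headers. The header
// name below is an assumption based on the Endpoint Picker Protocol; check
// the protocol spec for the authoritative name.
func pickEndpointResponse(endpoint string) *extprocv3.ProcessingResponse {
	return &extprocv3.ProcessingResponse{
		Response: &extprocv3.ProcessingResponse_RequestHeaders{
			RequestHeaders: &extprocv3.HeadersResponse{
				Response: &extprocv3.CommonResponse{
					HeaderMutation: &extprocv3.HeaderMutation{
						SetHeaders: []*corev3.HeaderValueOption{{
							Header: &corev3.HeaderValue{
								Key:      "x-gateway-destination-endpoint",
								RawValue: []byte(endpoint),
							},
						}},
					},
				},
			},
		},
	}
}

func main() {
	// Hypothetical pod IP and port of the selected model server.
	resp := pickEndpointResponse("10.0.0.12:8000")
	fmt.Println(resp)
}
```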
This repository hosts the following core components:
- Endpoint Picker (EPP): The intelligent routing engine that serves as the "brain" of the router. It evaluates incoming requests against the current state of the InferencePool, considering factors like KV-cache locality, current load, and priority to make optimal placement decisions (a simplified scoring sketch follows this list). It integrates with L7 proxies via the ext-proc protocol.
- Request Management APIs: These resources directly influence the EPP's request handling behavior:
- InferenceObjective: Configures the EPP's scheduling goals for specific requests, including priority levels and performance targets.
- InferenceModelRewrite: Directs the EPP to perform model name rewriting, enabling flexible traffic management for A/B testing and canary rollouts.
- Disaggregation Sidecar: A coordination component deployed alongside model servers (typically as a sidecar to the decode worker). It orchestrates complex multi-stage inference lifecycles, such as P/D (Prefill/Decode) and E/P/D (Encode/Prefill/Decode), by communicating with specialized encode and prefill workers to manage KV-cache and embedding transfers. For more details, see the Disaggregation Documentation.
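To make the EPP's scoring idea concrete, here is a deliberately simplified Go sketch of how an endpoint picker might trade prefix-cache locality against load, with the request's priority (as would be configured via an InferenceObjective) relaxing the load penalty. The signal names, weights, and priority handling are hypothetical; the real filter and scorer plugins are described in the Architecture Documentation.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate captures per-endpoint signals the EPP might weigh when placing a
// request. Field names and weights are illustrative, not the actual scorer
// plugins in this repository.
type candidate struct {
	Address        string
	PrefixCacheHit float64 // fraction of the prompt prefix already in KV-cache (0..1)
	QueueDepth     int     // requests currently waiting on this server
}

// score combines the signals into a single number; higher is better.
// requestPriority would come from the request's InferenceObjective and here
// (hypothetically) halves the load penalty for high-priority traffic.
func score(c candidate, requestPriority int) float64 {
	loadPenalty := float64(c.QueueDepth) / 10.0
	if requestPriority > 0 {
		loadPenalty *= 0.5
	}
	return 2.0*c.PrefixCacheHit - loadPenalty
}

// pick returns the best-scoring endpoint for the request.
func pick(cands []candidate, requestPriority int) candidate {
	sort.Slice(cands, func(i, j int) bool {
		return score(cands[i], requestPriority) > score(cands[j], requestPriority)
	})
	return cands[0]
}

func main() {
	pool := []candidate{
		{Address: "10.0.0.11:8000", PrefixCacheHit: 0.9, QueueDepth: 6},
		{Address: "10.0.0.12:8000", PrefixCacheHit: 0.1, QueueDepth: 1},
	}
	fmt.Println("chosen:", pick(pool, 1).Address)
}
```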
The llm-d Router supports two primary deployment modes, as specified in the Kubernetes Gateway API Inference Extension:
Standalone Mode: A lightweight deployment where the proxy (e.g., Envoy) runs as a sidecar to the EPP in the same pod. This mode is ideal for clusters without Gateway API infrastructure or for basic testing and local evaluations.
Gateway API Integration: The recommended mode for production environments, leveraging the official Gateway API. In this mode, the EPP acts as a backend for an InferencePool, which is referenced by an HTTPRoute on a shared Gateway. This enables advanced traffic management, multi-cluster load balancing, and shared infrastructure for both inference and traditional workloads.
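As an illustration of the Gateway API integration mode, the sketch below builds an HTTPRoute whose backend reference points at an InferencePool rather than a plain Service, using the upstream Gateway API Go types. The API group string and resource names are assumptions for illustration; consult the InferencePool documentation for the exact group and version in your installation.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	gwapiv1 "sigs.k8s.io/gateway-api/apis/v1"
)

func main() {
	// The group and kind below are assumptions; verify them against the
	// InferencePool API version you have installed.
	group := gwapiv1.Group("inference.networking.k8s.io")
	kind := gwapiv1.Kind("InferencePool")

	// An HTTPRoute on a shared Gateway whose backend is an InferencePool,
	// so the EPP picks the concrete model-server endpoint per request.
	route := gwapiv1.HTTPRoute{
		ObjectMeta: metav1.ObjectMeta{Name: "llm-route", Namespace: "default"},
		Spec: gwapiv1.HTTPRouteSpec{
			CommonRouteSpec: gwapiv1.CommonRouteSpec{
				ParentRefs: []gwapiv1.ParentReference{{Name: "shared-gateway"}},
			},
			Rules: []gwapiv1.HTTPRouteRule{{
				BackendRefs: []gwapiv1.HTTPBackendRef{{
					BackendRef: gwapiv1.BackendRef{
						BackendObjectReference: gwapiv1.BackendObjectReference{
							Group: &group,
							Kind:  &kind,
							Name:  "my-inference-pool",
						},
					},
				}},
			}},
		},
	}
	fmt.Printf("%s -> %s/%s\n", route.Name, kind, route.Spec.Rules[0].BackendRefs[0].Name)
}
```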
For more details on the router architecture, routing logic, and different plugins (filters and scorers), see the Architecture Documentation.
Note
The project provides tools for automatic Envoy installation. However, if you install or configure Envoy yourself, note that the only supported request_body_mode and response_body_mode is FULL_DUPLEX_STREAMED.
Our community meeting is held bi-weekly on Wednesdays at 10 AM PDT (Google Meet, Meeting Notes).
We currently use the #sig-inference-scheduler channel in the llm-d Slack workspace for communications.
For large changes, please create an issue first describing the change so the maintainers can assess it and work through the details with you. See DEVELOPMENT.md for details on how to work with the codebase.
Contributions are welcome!