
llm-d Router

Important

Terminology Change: The Inference Scheduler has been renamed to llm-d Router to better reflect its role as the intelligent entry point for inference requests in the llm-d stack.

Important

API & Code Consolidation: Core EPP code and the InferenceObjective and InferenceModelRewrite APIs have been merged into this repository from Gateway API Inference Extension (GIE). The GIE repository now exclusively hosts the InferencePool API—an extension of the Kubernetes Gateway API—and defines the Endpoint Picker Protocol.

The llm-d Router is the intelligent entry point for inference traffic, delivering LLM load and prefix-cache aware routing, request prioritization, and advanced flow control across diverse request formats to fulfill complex serving objectives. It can be deployed in Standalone Mode for lightweight setups or integrated with cloud provider managed load balancing solutions via the Kubernetes Gateway API.

The router achieves its intelligence through an Endpoint Picker (EPP) that integrates with production-grade proxies (such as Envoy) via the ext-proc protocol, injecting real-time signals into the data plane to optimize request placement.

(Diagram: llm-d Router Architecture)

Core Components and APIs

This repository hosts the following core components:

  • Endpoint Picker (EPP): The intelligent routing engine that serves as the "brain" of the router. It evaluates incoming requests against the current state of the InferencePool, considering factors like KV-cache locality, current load, and priority to make optimal placement decisions. It integrates with L7 proxies via the ext-proc protocol.
  • Request Management APIs: These resources directly influence the EPP's request handling behavior:
    • InferenceObjective: Configures the EPP's scheduling goals for specific requests, including priority levels and performance targets.
    • InferenceModelRewrite: Directs the EPP to perform model name rewriting, enabling flexible traffic management for A/B testing and canary rollouts.
  • Disaggregation Sidecar: A coordination component deployed alongside model servers (typically as a sidecar to the decode worker). It orchestrates complex multi-stage inference lifecycles, such as P/D (Prefill/Decode) and E/P/D (Encode/Prefill/Decode), by communicating with specialized encode and prefill workers to manage KV-cache and embedding transfers. For more details, see the Disaggregation Documentation.
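As a rough illustration of how the Request Management APIs are expressed, a priority objective might look like the manifest below. The field names, API group, and version are assumptions for illustration only; consult the API reference in this repository for the actual schema.

```yaml
# Hypothetical sketch — apiVersion and field names are illustrative, not the exact schema.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: chat-high-priority
spec:
  priority: 10        # higher values are scheduled ahead of lower ones
  poolRef:            # the InferencePool whose requests this objective governs
    name: llama-pool
```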

Modes of Operation

The llm-d Router supports two primary deployment modes as specified in the Kubernetes Gateway API Inference Extensions:

1. Standalone Mode

A lightweight deployment where the proxy (e.g., Envoy) runs as a sidecar to the EPP in the same pod. This mode is ideal for clusters without Gateway API infrastructure or for basic testing and local evaluations.

2. Gateway Mode (Inference Gateway)

The recommended mode for production environments, leveraging the official Gateway API. In this mode, the EPP acts as a backend for an InferencePool, which is referenced by an HTTPRoute on a shared Gateway. This enables advanced traffic management, multi-cluster load balancing, and shared infrastructure for both inference and traditional workloads.
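In Gateway mode, the wiring looks roughly like the sketch below: an HTTPRoute on a shared Gateway forwards matching traffic to an InferencePool backend, and the EPP picks the concrete endpoint behind that pool. The names and the backend group/version shown are assumptions; check the Gateway API Inference Extension documentation for the exact apiVersion.

```yaml
# Sketch of Gateway-mode wiring — resource names and the backendRef group are illustrative.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
    - name: shared-gateway          # the shared Gateway this route attaches to
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool       # the EPP selects the actual endpoint behind this pool
          name: llama-pool
```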

For more details on the router architecture, routing logic, and different plugins (filters and scorers), see the Architecture Documentation.
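To give a feel for what a scorer plugin does, here is a toy sketch that blends a load signal with a prefix-cache signal and picks the best replica. The type names, weights, and signals are invented for illustration and do not reflect the repository's actual plugin interfaces.

```go
package main

import (
	"fmt"
	"sort"
)

// Endpoint is a simplified view of a model-server replica as a scorer
// might see it. Field names here are illustrative, not the real types.
type Endpoint struct {
	Name      string
	QueueLen  int  // outstanding requests (load signal)
	PrefixHit bool // whether this replica likely holds the prompt's KV-cache prefix
}

// scoreEndpoint combines a load score and a prefix-cache score into a
// single weighted value in [0, 1]; higher is better.
func scoreEndpoint(e Endpoint) float64 {
	const maxQueue = 16.0
	load := 1.0 - float64(e.QueueLen)/maxQueue
	if load < 0 {
		load = 0
	}
	prefix := 0.0
	if e.PrefixHit {
		prefix = 1.0
	}
	// Equal weights, chosen arbitrarily for illustration.
	return 0.5*load + 0.5*prefix
}

// pickEndpoint returns the highest-scoring replica.
func pickEndpoint(eps []Endpoint) Endpoint {
	sort.Slice(eps, func(i, j int) bool {
		return scoreEndpoint(eps[i]) > scoreEndpoint(eps[j])
	})
	return eps[0]
}

func main() {
	eps := []Endpoint{
		{Name: "decode-0", QueueLen: 12, PrefixHit: true},
		{Name: "decode-1", QueueLen: 2, PrefixHit: false},
		{Name: "decode-2", QueueLen: 3, PrefixHit: true},
	}
	fmt.Println("selected:", pickEndpoint(eps).Name) // prints "selected: decode-2"
}
```

The real EPP evaluates many more signals and runs its filters and scorers as composable plugins; this sketch only captures the shape of the score-then-pick step.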


Note

The project provides tools for automatic Envoy installation. However, if you install or configure Envoy yourself, note that the only supported request_body_mode and response_body_mode is FULL_DUPLEX_STREAMED.
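For a manually configured Envoy, the requirement above corresponds to an ext_proc filter fragment along these lines; the cluster name is a placeholder for wherever your EPP's ext-proc gRPC service is reachable.

```yaml
# Minimal ext_proc filter fragment showing the required body modes.
http_filters:
  - name: envoy.filters.http.ext_proc
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExternalProcessor
      grpc_service:
        envoy_grpc:
          cluster_name: epp-cluster   # placeholder: cluster pointing at the EPP's ext-proc port
      processing_mode:
        request_body_mode: FULL_DUPLEX_STREAMED
        response_body_mode: FULL_DUPLEX_STREAMED
```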

Contributing

Our community meeting is held bi-weekly on Wednesdays at 10 AM PDT (Google Meet; see the Meeting Notes).

We use the #sig-inference-scheduler channel in the llm-d Slack workspace for communications.

For large changes, please create an issue first describing the change so the maintainers can assess it and work through the details with you. See DEVELOPMENT.md for details on how to work with the codebase.

Contributions are welcome!
