Important
Terminology Change: The Inference Scheduler has been renamed to llm-d Router to better reflect its role as the intelligent entry point for inference requests in the llm-d stack.
Important
API & Code Consolidation: Core EPP code and the InferenceObjective and InferenceModelRewrite APIs have been merged into this repository from Gateway API Inference Extension (GIE). The GIE repository now exclusively hosts the InferencePool API—an extension of the Kubernetes Gateway API—and defines the Endpoint Picker Protocol.
The llm-d Router is the intelligent entry point for inference traffic, delivering LLM load and prefix-cache aware routing, request prioritization, and advanced flow control across diverse request formats to fulfill complex serving objectives. It can be deployed in Standalone Mode for lightweight setups or integrated with cloud provider managed load balancing solutions via the Kubernetes Gateway API.
The router achieves its intelligence through an Endpoint Picker (EPP) that integrates with production-grade proxies (such as Envoy) via the ext-proc protocol, injecting real-time signals into the data plane to optimize request placement.
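At the protocol level, a placement decision is essentially a header mutation: the EPP answers the proxy's ext-proc request-headers message with the endpoint it has chosen, and the proxy forwards the request there. The Go sketch below illustrates the shape of such a reply using the go-control-plane ext-proc types; the header name and endpoint value are illustrative assumptions rather than this repository's actual implementation.

```go
package main

import (
	"fmt"

	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	extprocv3 "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
)

// pickEndpointResponse builds an ext-proc reply that steers one request to a
// chosen model-server endpoint by mutating the request headers. The header
// name below is an assumption based on the Endpoint Picker Protocol; check
// the protocol spec for the authoritative name.
func pickEndpointResponse(endpoint string) *extprocv3.ProcessingResponse {
	return &extprocv3.ProcessingResponse{
		Response: &extprocv3.ProcessingResponse_RequestHeaders{
			RequestHeaders: &extprocv3.HeadersResponse{
				Response: &extprocv3.CommonResponse{
					HeaderMutation: &extprocv3.HeaderMutation{
						SetHeaders: []*corev3.HeaderValueOption{{
							Header: &corev3.HeaderValue{
								Key:      "x-gateway-destination-endpoint",
								RawValue: []byte(endpoint),
							},
						}},
					},
				},
			},
		},
	}
}

func main() {
	// Hypothetical pod IP and port of the selected model server.
	resp := pickEndpointResponse("10.0.0.12:8000")
	fmt.Println(resp)
}
```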
This repository hosts the following core components:
- Endpoint Picker (EPP): The intelligent routing engine that serves as the "brain" of the router. It evaluates incoming requests against the current state of the InferencePool, considering factors like KV-cache locality, current load, and priority to make optimal placement decisions (a simplified scoring sketch follows this list). It integrates with L7 proxies via the ext-proc protocol.
- Request Management APIs: These resources directly influence the EPP's request handling behavior:
- InferenceObjective: Configures the EPP's scheduling goals for specific requests, including priority levels and performance targets.
- InferenceModelRewrite: Directs the EPP to perform model name rewriting, enabling flexible traffic management for A/B testing and canary rollouts.
- Disaggregation Sidecar: A coordination component deployed alongside model servers (typically as a sidecar to the decode worker). It orchestrates complex multi-stage inference lifecycles, such as P/D (Prefill/Decode) and E/P/D (Encode/Prefill/Decode), by communicating with specialized encode and prefill workers to manage KV-cache and embedding transfers. For more details, see the Disaggregation Documentation.
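To make the EPP's scoring idea concrete, here is a deliberately simplified Go sketch of how an endpoint picker might trade prefix-cache locality against load, with the request's priority (as would be configured via an InferenceObjective) relaxing the load penalty. The signal names, weights, and priority handling are hypothetical; the real filter and scorer plugins are described in the Architecture Documentation.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate captures per-endpoint signals the EPP might weigh when placing a
// request. Field names and weights are illustrative, not the actual scorer
// plugins in this repository.
type candidate struct {
	Address        string
	PrefixCacheHit float64 // fraction of the prompt prefix already in KV-cache (0..1)
	QueueDepth     int     // requests currently waiting on this server
}

// score combines the signals into a single number; higher is better.
// requestPriority would come from the request's InferenceObjective and here
// (hypothetically) halves the load penalty for high-priority traffic.
func score(c candidate, requestPriority int) float64 {
	loadPenalty := float64(c.QueueDepth) / 10.0
	if requestPriority > 0 {
		loadPenalty *= 0.5
	}
	return 2.0*c.PrefixCacheHit - loadPenalty
}

// pick returns the best-scoring endpoint for the request.
func pick(cands []candidate, requestPriority int) candidate {
	sort.Slice(cands, func(i, j int) bool {
		return score(cands[i], requestPriority) > score(cands[j], requestPriority)
	})
	return cands[0]
}

func main() {
	pool := []candidate{
		{Address: "10.0.0.11:8000", PrefixCacheHit: 0.9, QueueDepth: 6},
		{Address: "10.0.0.12:8000", PrefixCacheHit: 0.1, QueueDepth: 1},
	}
	fmt.Println("chosen:", pick(pool, 1).Address)
}
```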
The llm-d Router supports two primary deployment modes, as specified in the Kubernetes Gateway API Inference Extension:
Standalone Mode: A lightweight deployment where the proxy (e.g., Envoy) runs as a sidecar to the EPP in the same pod. This mode is ideal for clusters without Gateway API infrastructure or for basic testing and local evaluations.
Gateway API Integration: The recommended mode for production environments, leveraging the official Gateway API. In this mode, the EPP acts as a backend for an InferencePool, which is referenced by an HTTPRoute on a shared Gateway. This enables advanced traffic management, multi-cluster load balancing, and shared infrastructure for both inference and traditional workloads.
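As an illustration of the Gateway API integration mode, the sketch below builds an HTTPRoute whose backend reference points at an InferencePool rather than a plain Service, using the upstream Gateway API Go types. The API group string and resource names are assumptions for illustration; consult the InferencePool documentation for the exact group and version in your installation.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	gwapiv1 "sigs.k8s.io/gateway-api/apis/v1"
)

func main() {
	// The group and kind below are assumptions; verify them against the
	// InferencePool API version you have installed.
	group := gwapiv1.Group("inference.networking.k8s.io")
	kind := gwapiv1.Kind("InferencePool")

	// An HTTPRoute on a shared Gateway whose backend is an InferencePool,
	// so the EPP picks the concrete model-server endpoint per request.
	route := gwapiv1.HTTPRoute{
		ObjectMeta: metav1.ObjectMeta{Name: "llm-route", Namespace: "default"},
		Spec: gwapiv1.HTTPRouteSpec{
			CommonRouteSpec: gwapiv1.CommonRouteSpec{
				ParentRefs: []gwapiv1.ParentReference{{Name: "shared-gateway"}},
			},
			Rules: []gwapiv1.HTTPRouteRule{{
				BackendRefs: []gwapiv1.HTTPBackendRef{{
					BackendRef: gwapiv1.BackendRef{
						BackendObjectReference: gwapiv1.BackendObjectReference{
							Group: &group,
							Kind:  &kind,
							Name:  "my-inference-pool",
						},
					},
				}},
			}},
		},
	}
	fmt.Printf("%s -> %s/%s\n", route.Name, kind, route.Spec.Rules[0].BackendRefs[0].Name)
}
```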
For more details on the router architecture, routing logic, and different plugins (filters and scorers), see the Architecture Documentation.
Note
The project provides tools for automatic Envoy installation. However, if you install or configure Envoy yourself, note that the only supported request_body_mode and response_body_mode is FULL_DUPLEX_STREAMED.
Our community meeting is held bi-weekly on Wednesdays at 10 AM PDT (Google Meet, Meeting Notes).
We currently use the #sig-inference-scheduler channel in the llm-d Slack workspace for communications.
For large changes, please create an issue first describing the change so the maintainers can assess it and work through the details with you. See DEVELOPMENT.md for details on how to work with the codebase.
Contributions are welcome!