llm-d is an extensible architecture designed to route inference requests efficiently across model-serving pods. A central component of this architecture is the Inference Gateway, which builds on the Kubernetes-native Gateway API Inference Extension to enable scalable, flexible, and pluggable routing of requests.
The design enables:
- Support for multiple base models within a shared cluster (see serving multiple inference pools)
- Efficient routing based on KV cache locality, session affinity, load, and model metadata
- Disaggregated Prefill/Decode (P/D) execution
- Pluggable filters, scorers, and scrapers for extensible routing
- Route inference requests to optimal pods based on:
- Base model compatibility
- KV cache reuse
- Load balancing
- Support multi-model deployments on heterogeneous hardware
- Enable runtime extensibility with pluggable logic (filters, scorers, scrapers)
- Community-aligned implementation using GIE and Envoy + External Processing (EPP)
The inference scheduler is built on top of:
- Envoy as a programmable data plane
- EPP (External Processing Plugin) using GIE
- BBR (External Processing Plugin) using GIE
Routing decisions are governed by dynamic components:
- Filters: Exclude pods based on static or dynamic criteria
- Scorers: Assign scores to candidate pods
- Scrapers: Collect pod metadata and metrics for scorers
These components are maintained in the llm-d-inference-scheduler repository and can evolve independently.
A sample filter plugin guide is provided to illustrate how one could extend the
Inference Gateway functionality to address unique requirements.
- Pluggability: No core changes are needed to add new scorers or filters
- Isolation: Each component operates independently
-
Filtering
- Pods in an
InferencePoolgo through a sequential chain of filters - Pods may be excluded based on criteria like model compatibility, resource usage, or custom logic
- Pods in an
-
Scoring
- Filtered pods are scored using a weighted set of scorers
- Scorers currently run sequentially (future: parallel execution)
- Scorers access a shared datastore populated by scrapers
-
Pod Selection
- The highest-scored pod is selected
- If multiple pods share the same score, one is selected at random
Pre-callScoringPost-choiceAfter-response
The set of lifecycle hooks (plugins) that are used by the inference scheduler is determined by how it is configured. The configuration is in the form of YAML text, which can either be in a file or specified in-line as a parameter. The configuration defines the set of plugins to be instantiated along with their parameters. Each plugin is also given a name, enabling the same plugin type to be instantiated multiple times, if needed. Also defined is a set of SchedulingProfiles, which determine the set of plugins to be used when scheduling a request. The set of plugins instantiated must also include a Profile Handler, which determines which SchedulingProfiles will be used for a particular request and how their results will be processed.
The configuration text has the following form:
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- ....
- ....
schedulingProfiles:
- ....
- ....The first two lines of the configuration are constant and must appear as is.
The plugins section defines the set of plugins that will be instantiated and their parameters. Each entry in this section has the following form:
- name: aName
type: a-type
parameters:
param1: val1
param2: val2The fields in a plugin entry are:
- name (optional): provides a name by which the plugin instance can be referenced. If this field is omitted, the plugin's type will be used as its name.
- type: specifies the type of the plugin to be instantiated.
- parameters (optional): defines the set of parameters used to configure the plugin in question. The actual set of parameters varies from plugin to plugin.
The schedulingProfiles section defines the set of scheduling profiles that can be used in scheduling requests to pods. The number of scheduling profiles one defines, depends on the use case. For simple serving of requests, one is enough. For disaggregated prefill, two profiles are required. Each entry in this section has the following form:
- name: aName
plugins:
- pluginRef: plugin1
- pluginRef: plugin2
weight: 50The fields in a schedulingProfile entry are:
- name: specifies the scheduling profile's name.
- plugins: specifies the set of plugins to be used when this scheduling profile is chosen for a request.
- pluginRef: reference to the name of the plugin instance to be used
- weight: weight to be used if the referenced plugin is a scorer.
A complete configuration might look like this:
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: precise-prefix-cache-scorer
parameters:
indexerConfig:
tokenProcessorConfig:
blockSize: 5
kvBlockIndexConfig:
maxPrefixBlocksToMatch: 256
- type: decode-filter
- type: max-score-picker
- type: single-profile-handler
schedulingProfiles:
- name: default
plugins:
- pluginRef: decode-filter
- pluginRef: max-score-picker
- pluginRef: precise-prefix-cache-scorer
weight: 50If the configuration is in a file, the EPP command line argument --configFile should be used
to specify the full path of the file in question. If the configuration is passed as in-line
text the EPP command line argument --configText should be used.
This section describes how to setup the various plugins available with the llm-d-inference-scheduler
Sets a header for use in disaggregated prefill/decode
- Type:
prefill-header-handler - Parameters:
prefillProfile: specifies the name of the profile used for the prefill scheduling. Only needed if the prefill profile is not namedprefill.
Selects the profiles to use when running with disaggregated prefill/decode
- Type:
pd-profile-handler - Parameters:
decodeProfile: specifies the name of the profile used for the decode scheduling. Only needed if the decode profile is not nameddecode.prefillProfile: specifies the name of the profile used for the prefill scheduling. Only needed if the prefill profile is not namedprefill.deciderPluginName: specifies the name of the decider plugin. Decider determines whether disaggregated PD should be executedprimaryPort: the base port number used for data parallel communication.
Note: When using this plugin you must also have a PrefixCachePlugin configured in the prefill and decode scheduling profiles.
Type: prefix-based-pd-decider
Parameters
nonCachedTokens: length, in token, of the uncached part of the user input above which disaggregated PD is triggered.
Note: prepareDataPlugins feature gate should be enabled
Example
kind: EndpointPickerConfig
featureGates:
- prepareDataPlugins
plugins:
- type: prefix-based-pd-decider
parameters:
nonCachedTokens: 4
- type: pd-profile-handler
parameters:
primaryPort: 8000
deciderPluginName: prefix-based-pd-deciderFilters out pods using a standard Kubernetes label selector.
Note: Only the matching labels feature of Kubernetes label selectors is supported.
- Type:
by-label-selector - Parameters: A standard Kubernetes label selector.
matchLabels: map of{key,value}pairs. If more than one pair are in the map, all of the keys are checked and the results are combined with AND logic.
Example configuration with the above parameters set:
plugins:
- type: by-label-selector
parameters:
matchLabels:
inference-role: decode
hardware-type: H100In this example:
- Only pods that have both labels
inference-role=decodeandhardware-type=H100
will be selected. - Pods missing either label, or having a different value (e.g.,
inference-role=prefill), are filtered out. - The matching logic follows standard Kubernetes label selector semantics: all key-value pairs in
matchLabelsmust match (AND logic).
Filters out pods that do not have a specific label with one of the allowed values. Pods missing the label are either filtered out or retained based on the allowsNoLabel setting.
- Type:
by-label - Parameters:
label(string, required): The name of the Kubernetes label to inspect on each pod.validValues(list of strings, required unlessallowsNoLabel=true): A list of acceptable label values. A pod is kept if its label value matches any entry in this list.allowsNoLabel(boolean, optional, default:false): Iftrue, pods that do not have the specified label at all will be included in the candidate set. Iffalse(default), such pods are filtered out.
Example configuration with the above parameters set:
plugins:
- type: by-label
parameters:
label: "inference-role"
validValues: ["decode", "both"]
allowsNoLabel: falseIn this example:
- Only pods labeled for decoding (
inference-role=decode) or supporting both stages (inference-role=both) are selected. - Pods missing the
inference-rolelabel are not considered for decode scheduling.
Filters out pods that are not marked either as decode or both prefill and decode. The filter looks for
the label llm-d.ai/role, with a value of either decode or both. In addition pods that are missing
the label will not be filtered out.
- Type:
decode-filter - Parameters: None
Filters out pods that are not marked as prefill. The filter looks for the label llm-d.ai/role, with a value of prefill.
- Type:
prefill-filter - Parameters: None
The precise-prefix-cache-scorer scores a request based on KV-cache localities.
Similarly to the IGW prefix-cache-scorer, it provides a score based on the number of
matching KV-cache blocks between the request's prompt and the KV-cache contents of each pod.
However, unlike the IGW prefix-cache-scorer, which relies on estimations based on scheduling history,
the precise-prefix-cache-scorer tracks the real-time KV-cache states across the vLLM instances to
provide more accurate scoring.
When enabled, the scorer will use the llm-d-kv-cache to track the KV-cache states
across the vLLM instances. It will use the kvcache.Indexer to score the pods based on the
number of matching blocks in the KV-cache. It will also use the kvevents.Pool to subscribe
to the KV-Events emitted by the vLLM instances and update the KV-cache states in near-real-time.
Configuration:
- Type:
precise-prefix-cache-scorer - Parameters:
tokenProcessorConfig: Configuration for thekvblock.TokenProcessor.indexerConfig: Configuration for thekvcache.Indexer.kvEventsConfig: Configuration for thekvevents.Pool.
See list of parameters at llm-d-kv-cache/docs/configuration.md.
Note that in most cases you will only need to set:
- Model name in the
tokenizersPoolConfigto match the model used in the vLLM deployment. - HuggingFace token for the
tokenizersPoolConfigor thetokenizersCacheDirto a mounted directory containing the tokenizers.- For the HuggingFace token, the inference-scheduler also accepts the environment variable
HF_TOKEN- this is the practical option for security.
- For the HuggingFace token, the inference-scheduler also accepts the environment variable
- IMPORTANT: Token processor's block-size and hash-seed to match those used in the vLLM deployment.
KVBlockIndexmetrics to true if you wish to enable metrics for the KV-Block Index (admissions, evictions, lookups and hits).
Example configuration with the above parameters set:
plugins:
- type: precise-prefix-cache-scorer
parameters:
tokenProcessorConfig:
blockSize: 64 # must match vLLM block size
hashSeed: "12345" # must match vLLM PYTHONHASHSEED env var
indexerConfig:
kvBlockIndexConfig:
enableMetrics: true
tokenizersPoolConfig:
modelName: hf-repo/model-name
hf:
huggingFaceToken: your_hf_token_here # automatically set by `HF_TOKEN` environment variableExample configuration for automatic pod discovery in active-active multi-replica scheduler deployments:
- type: precise-prefix-cache-scorer
parameters:
tokenProcessorConfig:
blockSize: 64
hashSeed: "42"
indexerConfig:
tokenizersPoolConfig:
modelName: "Qwen/Qwen3-32B"
hf:
tokenizersCacheDir: "/tmp/tokenizers"
kvEventsConfig:
topicFilter: "kv@"
concurrency: 4
discoverPods: true # enables automatic pod discovery for active-active HA
podDiscoveryConfig:
socketPort: 5556Where the vLLM engines are configured to emit KV-Events on port 5556 as follows:
--kv-events-config "{\"enable_kv_cache_events\":true,\"publisher\":\"zmq\",\"endpoint\":\"tcp://*:5556\",\"topic\":\"kv@${POD_IP}@Qwen/Qwen3-32B\"}"Example configuration with all parameters set:
plugins:
- type: precise-prefix-cache-scorer
parameters:
tokenProcessorConfig:
blockSize: 16
hashSeed: "12345"
kvEventsConfig:
topicFilter: "kv@"
concurrency: 4
discoverPods: true # enables automatic pod discovery for active-active HA
podDiscoveryConfig:
socketPort: 5556
indexerConfig:
prefixStoreConfig:
cacheSize: 500000
blockSize: 256
kvBlockIndexConfig:
inMemoryConfig:
size: 100000000
podCacheSize: 10
enableMetrics: true
tokenizersPoolConfig:
modelName: hf-repo/model-name
workersCount: 8
hf:
huggingFaceToken: your_hf_token_here # automatically set by `HF_TOKEN` environment variable
tokenizersCacheDir: /tmp/tokenizersScores pods based on their load, based on the number of requests concurrently being processed. A threshold is provided which is used to determine what is considered an overloaded pod.
Scores are given to the pods in the range of 0-1. Currently the metrics contain the number of requests waiting in the queue, there is no information about number of requests that can be processed in the given pod immediately.
Pods with an empty waiting requests queue are scored with 0.5.
Pods with requests in the queue will get score between 0.5 and 0.
- Type:
load-aware-scorer - Parameters:
threshold: specifies the threshold at which a pod is considered overloaded.
Scores pods based on the number of active requests being served per pod. Each request is tracked individually with its own TTL to ensure accurate timeout handling. Pods with fewer active requests receive higher scores.
Scores are normalized to a range of 0-1, where pods with fewer active requests get higher scores.
- Type:
active-request-scorer - Parameters:
requestTimeout: specifies the timeout for requests in seconds. Once a request is "in-flight" for this duration, it is considered timed out and automatically removed.
Scores the candidate pods by giving a higher score to the pods that were previously used for the same session.
- Type:
session-affinity-scorer - Parameters: None
Scores pods based on least recently used (LRU) ordering for cold requests (requests with no KV cache hits). This helps evenly distribute cache growth across pods, since cold requests result in new KV blocks being created.
The scorer integrates with a prefix cache plugin to determine if a request has cache hits:
- For cold requests (no cache hits): Ranks pods by LRU order, with never-used or least recently used pods receiving higher scores (up to 1.0) and most recently used pods receiving lower scores (approaching 0.0)
- For warm requests (cache hits): Returns neutral scores (0.5) for all pods to avoid interfering with cache locality optimization
The LRU tracking is specific to cold requests only - pods are added to the LRU cache when they serve a cold request, not when they serve requests with cache hits.
- Type:
no-hit-lru-scorer - Parameters:
prefixPluginName(optional): The name of the prefix cache plugin to read state from. Defaults toprefix-cache-scorer.lruSize(optional): The maximum number of pods to track in the LRU cache. Defaults to 1024.
Example configuration:
plugins:
- type: precise-prefix-cache-scorer
parameters:
indexerConfig:
tokenProcessorConfig:
blockSize: 5
kvBlockIndexConfig:
maxPrefixBlocksToMatch: 256
- type: no-hit-lru-scorer
parameters:
lruSize: 2048
- type: decode-filter
- type: max-score-picker
- type: single-profile-handler
schedulingProfiles:
- name: default
plugins:
- pluginRef: decode-filter
- pluginRef: max-score-picker
- pluginRef: precise-prefix-cache-scorer
weight: 2
- pluginRef: no-hit-lru-scorer
weight: 1Note: This scorer is designed to work alongside a prefix cache scorer (such as prefix-cache-scorer or
precise-prefix-cache-scorer). If no prefix cache state is available, all requests are treated as cold.
When integrating with a prefix-cache scorer, the prefix-cache scorer should be defined first in the scheduling profile.
The following is an example of what a configuration for disaggregated Prefill/Decode might look like:
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: prefill-header-handler
- type: prefix-cache-scorer
parameters:
hashBlockSize: 5
maxPrefixBlocksToMatch: 256
lruCapacityPerServer: 31250
- type: prefill-filter
- type: decode-filter
- type: max-score-picker
- type: pd-profile-handler
parameters:
threshold: 10
hashBlockSize: 5
schedulingProfiles:
- name: prefill
plugins:
- pluginRef: prefill-filter
- pluginRef: max-score-picker
- pluginRef: prefix-cache-scorer
weight: 50
- name: decode
plugins:
- pluginRef: decode-filter
- pluginRef: max-score-picker
- pluginRef: prefix-cache-scorer
weight: 50Several things should be noted:
- The
PrefillHeader,PdProfileHandler,DecodeFilter,PrefillFilterand thePrefixCachePluginplugins must be in the list of plugins instantiated. - There must be two scheduler profiles defined.
- The scheduler profile for prefill, must include the
PrefillFilter - The scheduler profile for decode, must include the
DecodeFilter
- Scrapers collect metrics (e.g., memory usage, active adapters)
- Data is injected into the shared datastore for scorers
- Scoring can rely on numerical metrics or metadata (model ID, adapter tags)
When enabled, the router:
- Selects one pod for Prefill (prompt processing)
- Selects another pod for Decode (token generation)
The vLLM sidecar handles orchestration between Prefill and Decode stages. It allows:
- Queuing
- Local memory management
- Experimental protocol compatibility
Note: The detailed P/D design is available in this document: Disaggregated Prefill/Decode in llm-d
- Single
InferencePooland singleEPPdue to Envoy limitations - Model-based filtering can be handled within EPP
- Currently only one base model is supported
Note
The InferenceModel CRD is in the process of being significantly changed in IGW.
Once finalized, these changes would be reflected in llm-d as well.

