
Commit 1723402

add architecture.md reference to writing a new plugin (#280)
Signed-off-by: Etai Lev Ran <elevran@gmail.com>
1 parent 54fa924 commit 1723402

File tree

1 file changed (+44 additions, -22 deletions)


docs/architecture.md

Lines changed: 44 additions & 22 deletions
@@ -4,7 +4,9 @@

## Overview

**llm-d** is an extensible architecture designed to route inference requests efficiently across model-serving pods.
A central component of this architecture is the **Inference Gateway**, which builds on the Kubernetes-native
**Gateway API Inference Extension** to enable scalable, flexible, and pluggable routing of requests.

The design enables:

@@ -51,6 +53,8 @@ Routing decisions are governed by dynamic components:

- **Scrapers**: Collect pod metadata and metrics for scorers

These components are maintained in the `llm-d-inference-scheduler` repository and can evolve independently.
A [sample filter plugin guide](./create_new_filter.md) is provided to illustrate how one could extend the
Inference Gateway functionality to address unique requirements.

---

@@ -92,14 +96,16 @@ These components are maintained in the `llm-d-inference-scheduler` repository an

## Configuration

The set of lifecycle hooks (plugins) that are used by the inference scheduler is determined by how
it is configured. The configuration is in the form of YAML text, which can either be in a file or
specified in-line as a parameter. The configuration defines the set of plugins to be instantiated
along with their parameters. Each plugin is also given a name, enabling the same plugin type to be
instantiated multiple times, if needed. Also defined is a set of SchedulingProfiles, which determine
the set of plugins to be used when scheduling a request. The set of plugins instantiated must also
include a Profile Handler, which determines which SchedulingProfiles will be used for a particular
request and how their results will be processed.

The configuration text has the following form:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
…
```
@@ -113,15 +119,17 @@ schedulingProfiles:

The first two lines of the configuration are constant and must appear as is.

The plugins section defines the set of plugins that will be instantiated and their parameters.
Each entry in this section has the following form:

```yaml
- name: aName
  type: a-type
  parameters:
    param1: val1
    param2: val2
```

The fields in a plugin entry are:

- **name** (optional): provides a name by which the plugin instance can be referenced. If this
  field is omitted, the plugin's type will be used as its name.
@@ -133,6 +141,7 @@ The schedulingProfiles section defines the set of scheduling profiles that can b

requests to pods. The number of scheduling profiles one defines depends on the use case. For simple
serving of requests, one is enough. For disaggregated prefill, two profiles are required. Each entry
in this section has the following form:

```yaml
- name: aName
  plugins:
…
```
@@ -147,6 +156,7 @@ The fields in a schedulingProfile entry are:

- **weight**: weight to be used if the referenced plugin is a scorer.

A complete configuration might look like this:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
…
```
@@ -168,8 +178,9 @@ schedulingProfiles:

```yaml
…
      weight: 50
```

If the configuration is in a file, the EPP command line argument `--configFile` should be used
to specify the full path of the file in question. If the configuration is passed as in-line
text, the EPP command line argument `--configText` should be used.
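As an illustration, the two invocation modes can be sketched as follows. Only the `--configFile` and `--configText` flags come from the text above; the `epp` binary name and the file path are assumptions, so substitute your actual EPP executable or container entrypoint:

```shell
# Write a minimal scheduler configuration to a file.
# (Only the first two constant lines are shown; real configs add
# plugins and schedulingProfiles sections.)
cat > /tmp/epp-config.yaml <<'EOF'
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
EOF

# Option 1: reference the configuration by full path.
# epp --configFile /tmp/epp-config.yaml

# Option 2: pass the same YAML as in-line text.
# epp --configText "$(cat /tmp/epp-config.yaml)"
```

Both options feed the same YAML to the scheduler; the file form is usually mounted from a ConfigMap, while the in-line form suits quick experiments.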

---

@@ -189,7 +200,7 @@ Sets a header for use in disaggregated prefill/decode

#### PdProfileHandler

Selects the profiles to use when running with disaggregated prefill/decode

- **Type**: `pd-profile-handler`
- **Parameters**:
@@ -216,7 +227,9 @@ Filters out pods using a standard Kubernetes label selector.

#### DecodeFilter

Filters out pods that are not marked either as decode or both prefill and decode. The filter looks for
the label `llm-d.ai/role`, with a value of either `decode` or `both`. In addition, pods that are missing
the label are not filtered out.

- **Type**: `decode-filter`
- **Parameters**: None
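The filtering rule above can be sketched as a small predicate. This is illustrative only; the actual plugin lives in `llm-d-inference-scheduler` and implements that repository's plugin interfaces rather than a bare function:

```go
package main

import "fmt"

const roleLabel = "llm-d.ai/role"

// keepForDecode reports whether a pod passes the decode filter:
// pods labeled "decode" or "both" are kept, as are pods missing the
// label entirely; only pods with some other role are filtered out.
func keepForDecode(labels map[string]string) bool {
	role, ok := labels[roleLabel]
	if !ok {
		return true // missing label: not filtered out
	}
	return role == "decode" || role == "both"
}

func main() {
	fmt.Println(keepForDecode(map[string]string{roleLabel: "decode"}))  // true
	fmt.Println(keepForDecode(map[string]string{roleLabel: "prefill"})) // false
	fmt.Println(keepForDecode(map[string]string{}))                     // true
}
```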
@@ -253,12 +266,13 @@ The estimation is based on scheduling history.

##### `cache_tracking` mode:

This mode scores requests based on the actual KV-cache states across the vLLM instances.
It is more accurate than both `SessionAffinity` and `PrefixCachePlugin` in `estimate` mode,
but incurs additional computation overhead and KV-Events streaming to track the KV-cache states.

When enabled, the scorer will use the `llm-d-kv-cache-manager` to track the KV-cache states
across the vLLM instances. It will use the `kvcache.Indexer` to score the pods based on the
number of matching blocks in the KV-cache. It will also use the `kvevents.Pool` to subscribe
to the KV-Events emitted by the vLLM instances and update the KV-cache states in near-real-time.

263277
Configuration:
264278

@@ -271,12 +285,13 @@

See list of parameters at [llm-d-kv-cache-manager/docs/configuration.md](https://github.com/llm-d/llm-d-kv-cache-manager/blob/fa85b60207ba0a09daf23071e10ccb62d7977b40/docs/configuration.md).

Note that in most cases you will only need to set:
- HuggingFace token for the `tokenizersPoolConfig` or the `tokenizersCacheDir` to a mounted directory containing the tokenizers.
- For the HuggingFace token, the inference-scheduler also accepts the environment variable `HF_TOKEN`; this is the practical option for security.
- **IMPORTANT**: Token processor's block-size and hash-seed to match those used in the vLLM deployment.
- `KVBlockIndex` metrics to true if you wish to enable metrics for the KV-Block Index (admissions, evictions, lookups and hits).

Example configuration with the above parameters set:

```yaml
plugins:
- type: prefix-cache-scorer
…
```
@@ -292,6 +307,7 @@ plugins:

Example configuration with all parameters set:

```yaml
plugins:
…
```
@@ -352,6 +368,7 @@ used for the same session.

### Sample Disaggregated Prefill/Decode Configuration

The following is an example of what a configuration for disaggregated Prefill/Decode might look like:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
…
```
@@ -386,7 +403,7 @@ schedulingProfiles:

Several things should be noted:
1. The `PrefillHeader`, `PdProfileHandler`, `DecodeFilter`, `PrefillFilter` and the `PrefixCachePlugin`
   plugins must be in the list of plugins instantiated.
2. There must be two scheduler profiles defined.
3. The scheduler profile for prefill must include the `PrefillFilter`.
4. The scheduler profile for decode must include the `DecodeFilter`.
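The notes above can be condensed into a skeleton of the two profiles. This is a hedged sketch, not the full sample configuration: the plugin names and the `pluginRef` reference syntax here are assumptions for illustration, and only the two-profile structure, the filter placement, and the scorer weight come from the text above:

```yaml
# Illustrative skeleton only; see the full sample configuration above.
schedulingProfiles:
- name: prefill
  plugins:
  - pluginRef: prefill-filter       # prefill profile must include the PrefillFilter
- name: decode
  plugins:
  - pluginRef: decode-filter        # decode profile must include the DecodeFilter
  - pluginRef: prefix-cache-scorer
    weight: 50
```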
@@ -415,6 +432,7 @@ The **vLLM sidecar** handles orchestration between Prefill and Decode stages. It

- Experimental protocol compatibility

> **Note**: The detailed P/D design is available in this document: [Disaggregated Prefill/Decode in llm-d](./dp.md)

---

## InferencePool & InferenceModel Design
@@ -425,6 +443,10 @@ The **vLLM sidecar** handles orchestration between Prefill and Decode stages. It

- Model-based filtering can be handled within EPP
- Currently only one base model is supported

> [!NOTE]
> The `InferenceModel` CRD is in the process of being significantly changed in IGW.
> Once finalized, these changes will be reflected in llm-d as well.

---

## References
