# docs/architecture.md
## Overview

**llm-d** is an extensible architecture designed to route inference requests efficiently across model-serving pods. A central component of this architecture is the **Inference Gateway**, which builds on the Kubernetes-native **Gateway API Inference Extension** to enable scalable, flexible, and pluggable routing of requests.

The design enables:

Routing decisions are governed by dynamic components:

- **Scrapers**: Collect pod metadata and metrics for scorers
These components are maintained in the `llm-d-inference-scheduler` repository and can evolve independently.
A [sample filter plugin guide](./create_new_filter.md) is provided to illustrate how one could extend the Inference Gateway functionality to address unique requirements.

---
## Configuration

The set of lifecycle hooks (plugins) that are used by the inference scheduler is determined by how it is configured. The configuration is in the form of YAML text, which can either be in a file or specified in-line as a parameter. The configuration defines the set of plugins to be instantiated along with their parameters. Each plugin is also given a name, enabling the same plugin type to be instantiated multiple times, if needed. Also defined is a set of SchedulingProfiles, which determine the set of plugins to be used when scheduling a request. The set of plugins instantiated must also include a Profile Handler, which determines which SchedulingProfiles will be used for a particular request and how their results will be processed.

If the configuration is in a file, the EPP command line argument `--configFile` should be used to specify the full path of the file in question. If the configuration is passed as in-line text, the EPP command line argument `--configText` should be used.
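
As an illustrative sketch only: a configuration of the kind described above might look like the following. The plugin types `decode-filter` and `pd-profile-handler` appear later in this document; the profile name, `pluginRef` wiring, and weight value are hypothetical, so consult the `llm-d-inference-scheduler` repository for the authoritative schema and plugin list.

```yaml
# Hypothetical sketch of an inference-scheduler plugin configuration,
# saved to a file and passed via --configFile (or in-line via --configText).
plugins:
- type: decode-filter          # plugin type from this document
- type: pd-profile-handler     # the required Profile Handler
schedulingProfiles:
- name: default                # hypothetical profile name
  plugins:
  - pluginRef: decode-filter   # plugins referenced by name per profile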
---
#### PdProfileHandler

Selects the profiles to use when running with disaggregated prefill/decode

- **Type**: `pd-profile-handler`
- **Parameters**:
#### DecodeFilter

Filters out pods that are not marked either as decode or both prefill and decode. The filter looks for the label `llm-d.ai/role`, with a value of either `decode` or `both`. In addition, pods that are missing the label will not be filtered out.

- **Type**: `decode-filter`
- **Parameters**: None
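
The selection rule above can be sketched in a few lines. This is a hypothetical helper using plain dictionaries, not the actual implementation, which lives in the `llm-d-inference-scheduler` repository:

```python
ROLE_LABEL = "llm-d.ai/role"

def decode_filter(pods):
    """Keep pods labeled decode or both, plus pods missing the role label."""
    kept = []
    for pod in pods:
        role = pod.get("labels", {}).get(ROLE_LABEL)
        # Pods without the label are NOT filtered out.
        if role is None or role in ("decode", "both"):
            kept.append(pod)
    return kept
```

Only pods explicitly labeled with a different role (such as `prefill`) are excluded.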
The estimation is based on scheduling history.

##### `cache_tracking` mode:
This mode scores requests based on the actual KV-cache states across the vLLM instances.
It is more accurate than both `SessionAffinity` and `PrefixCachePlugin` in `estimate` mode, but incurs additional computation overhead and KV-Events streaming to track the KV-cache states.

When enabled, the scorer will use the `llm-d-kv-cache-manager` to track the KV-cache states across the vLLM instances. It will use the `kvcache.Indexer` to score the pods based on the number of matching blocks in the KV-cache. It will also use the `kvevents.Pool` to subscribe to the KV-Events emitted by the vLLM instances and update the KV-cache states in near-real-time.
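
The block-matching idea can be illustrated with a small sketch. The chained-hash scheme and function names below are simplified assumptions for illustration, not the actual `kvcache.Indexer` API:

```python
import hashlib

BLOCK_SIZE = 16  # assumption: must match the vLLM deployment's block size

def prefix_block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Chain-hash full token blocks so each hash identifies the entire
    prefix ending at that block, mirroring prefix-cache block keys."""
    hashes, prev = [], b""
    for i in range(0, len(token_ids) - block_size + 1, block_size):
        chunk = ",".join(map(str, token_ids[i:i + block_size])).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes

def score_pods(prompt_hashes, pod_cached_blocks):
    """Score each pod by its longest run of cached prefix blocks."""
    scores = {}
    for pod, cached in pod_cached_blocks.items():
        matched = 0
        for h in prompt_hashes:
            if h not in cached:
                break
            matched += 1
        scores[pod] = matched
    return scores
```

A pod holding more consecutive prefix blocks of the prompt scores higher, so the scheduler prefers pods whose KV-cache already contains the longest matching prefix.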
Configuration:
See list of parameters at [llm-d-kv-cache-manager/docs/configuration.md](https://github.com/llm-d/llm-d-kv-cache-manager/blob/fa85b60207ba0a09daf23071e10ccb62d7977b40/docs/configuration.md).
Note that in most cases you will only need to set:
- HuggingFace token for the `tokenizersPoolConfig` or the `tokenizersCacheDir` to a mounted directory containing the tokenizers.
- For the HuggingFace token, the inference-scheduler also accepts the environment variable `HF_TOKEN`; this is the practical option for security.
- **IMPORTANT**: Token processor's block-size and hash-seed to match those used in the vLLM deployment.
- `KVBlockIndex` metrics to `true` if you wish to enable metrics for the KV-Block Index (admissions, evictions, lookups and hits).
Example configuration with the above parameters set: