@@ -1,7 +1,13 @@
-The llm-d-fast-model-actuation repository contains work on one of the
-many areas of work that contribute to fast model actuation. This area
-concerns exploiting techniques in which an inference server process
-dramatically changes its properties and behavior over time.
+The llm-d-fast-model-actuation repository is part of the
+[llm-d](https://github.com/llm-d) ecosystem for serving large
+language models on Kubernetes. Fast model actuation (FMA) lives in
+the [llm-d-incubation](https://github.com/llm-d-incubation)
+organization, where new llm-d components are developed before graduation.
+
+This repository contains work on one of the many areas that
+contribute to fast model actuation. This area concerns exploiting
+techniques in which an inference server process dramatically changes
+its properties and behavior over time.
 
 There are two sorts of changes contemplated here. Both are currently
 realized only for vLLM and NVIDIA's GPU Operator, but we hope that
@@ -38,26 +44,45 @@ _server-requesting Pod_, which describes a desired inference server |
 but does not actually run it, and (b) a _server-providing Pod_, which
 actually runs the inference server(s).
 
-The topics above are realized by two software components, as follows.
+The topics above are realized by the following software components.
 
-- A vLLM instance launcher, the persistent management process
-  mentioned above. This is written in Python and the source code is in
-  the [inference_server/launcher](inference_server/launcher)
-  directory.
-
-- A "dual-pods" controller, which manages the server-providing Pods
+- A **dual-pods controller**, which manages the server-providing Pods
   in reaction to the server-requesting Pods that other manager(s)
   create and delete. This controller is written in the Go programming
   language and this repository's contents follow the usual conventions
   for one containing Go code.
 
-We are currently in the midst of a development roadmap with three
-milestones. We are currently polishing off milestone 2, which involves
-using vLLM sleep/wake but not the launcher. The final milestone, 3,
-adds the use of the launcher.
+- A **vLLM instance launcher**, the persistent management process
+  mentioned above. This is written in Python and the source code is in
+  the [inference_server/launcher](inference_server/launcher)
+  directory.
+
+- A **launcher-populator** controller, which watches LauncherConfig
+  and LauncherPopulationPolicy custom resources and ensures that the
+  right number of launcher pods exist on each node. This controller is
+  also written in Go.
+
+These controllers are deployed together via a unified Helm chart at
+[charts/fma-controllers](charts/fma-controllers). The chart also
+installs the shared RBAC resources and optional ValidatingAdmissionPolicies.
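+As a sketch only (the release name and namespace here are
+hypothetical; consult the chart's values file for the real options),
+installation looks like:
+
+```shell
+# "fma" and "fma-system" are illustrative names, not prescribed ones.
+helm install fma charts/fma-controllers \
+  --namespace fma-system --create-namespace
+```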
+
+The repository defines three Custom Resource Definitions (CRDs):
+
+- **InferenceServerConfig** — declares the properties of an inference
+  server (image, command, resources) that server-providing Pods use.
+- **LauncherConfig** — declares the configuration for a launcher
+  process (image, resources, ports) that manages vLLM instances.
+- **LauncherPopulationPolicy** — declares the desired population of
+  launcher pods per node.
+
+These CRD definitions live in [config/crd](config/crd) and the Go
+types are in [pkg/api](pkg/api).
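+As an illustration only, a minimal InferenceServerConfig might look
+like the following sketch; the API group/version is a placeholder,
+and the authoritative schema lives in [config/crd](config/crd):
+
+```yaml
+# Hypothetical manifest: apiVersion is a placeholder, and the spec
+# fields are inferred from the description above (image, command, resources).
+apiVersion: fma.llm-d.ai/v1alpha1   # placeholder group/version
+kind: InferenceServerConfig
+metadata:
+  name: vllm-default
+spec:
+  image: vllm/vllm-openai:latest    # illustrative image reference
+  command: ["vllm", "serve"]
+  resources:
+    limits:
+      nvidia.com/gpu: "1"
+```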
 
-**NOTE**: we are in the midst of a terminology shift, from
-  "server-running Pod" to "server-providing Pod".
+The development roadmap has three milestones. Milestone 2, which
+introduced vLLM sleep/wake without the launcher, is finished.
+Milestone 3 is under development; it adds launcher-based model
+swapping, in which a persistent launcher process on each node
+manages vLLM instances.
 
 For further design documentation, see [the docs
 directory](docs/README.md).