@@ -1,7 +1,13 @@
-The llm-d-fast-model-actuation repository contains work on one of the
-many areas of work that contribute to fast model actuation. This area
-concerns exploiting techniques in which an inference server process
-dramatically changes its properties and behavior over time.
+The llm-d-fast-model-actuation repository is part of the
+[llm-d](https://github.com/llm-d) ecosystem for serving large
+language models on Kubernetes. Fast model actuation (FMA) lives in
+the [llm-d-incubation](https://github.com/llm-d-incubation)
+organization, where new llm-d components are developed before graduation.
+
+This repository contains work on one of the many areas that
+contribute to fast model actuation. This area concerns exploiting
+techniques in which an inference server process dramatically changes
+its properties and behavior over time.
 
 There are two sorts of changes contemplated here. Both are currently
 realized only for vLLM and NVIDIA's GPU Operator, but we hope that
@@ -38,26 +44,45 @@ _server-requesting Pod_, which describes a desired inference server |
 but does not actually run it, and (b) a _server-providing Pod_, which
 actually runs the inference server(s).
 
-The topics above are realized by two software components, as follows.
+The topics above are realized by the following software components.
 
-- A vLLM instance launcher, the persistent management process
-  mentioned above. This is written in Python and the source code is in
-  the [inference_server/launcher](inference_server/launcher)
-  directory.
-
-- A "dual-pods" controller, which manages the server-providing Pods
+- A **dual-pods controller**, which manages the server-providing Pods
   in reaction to the server-requesting Pods that other manager(s)
   create and delete. This controller is written in the Go programming
   language and this repository's contents follow the usual conventions
   for one containing Go code.
 
-We are currently in the midst of a development roadmap with three
-milestones. We are currently polishing off milestone 2, which involves
-using vLLM sleep/wake but not the launcher. The final milestone, 3,
-adds the use of the launcher.
+- A **vLLM instance launcher**, the persistent management process
+  mentioned above. This is written in Python and the source code is in
+  the [inference_server/launcher](inference_server/launcher)
+  directory.
+
+- A **launcher-populator** controller, which watches LauncherConfig
+  and LauncherPopulationPolicy custom resources and ensures that the
+  right number of launcher pods exist on each node. This controller is
+  also written in Go.
+
+These controllers are deployed together via a unified Helm chart at
+[charts/fma-controllers](charts/fma-controllers). The chart also
+installs the shared RBAC resources and optional ValidatingAdmissionPolicies.
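+As a sketch only (the release name and namespace here are
+hypothetical; consult the chart's values file for the real options),
+installation looks like:
+
+```shell
+# "fma" and "fma-system" are illustrative names, not prescribed ones.
+helm install fma charts/fma-controllers \
+  --namespace fma-system --create-namespace
+```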
+
+The repository defines three Custom Resource Definitions (CRDs):
+
+- **InferenceServerConfig** — declares the properties of an inference
+  server (image, command, resources) that server-providing Pods use.
+- **LauncherConfig** — declares the configuration for a launcher
+  process (image, resources, ports) that manages vLLM instances.
+- **LauncherPopulationPolicy** — declares the desired population of
+  launcher pods per node.
+
+These CRD definitions live in [config/crd](config/crd) and the Go
+types are in [pkg/api](pkg/api).
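+As an illustration only, a minimal InferenceServerConfig might look
+like the following sketch; the API group/version is a placeholder,
+and the authoritative schema lives in [config/crd](config/crd):
+
+```yaml
+# Hypothetical manifest: apiVersion is a placeholder, and the spec
+# fields are inferred from the description above (image, command, resources).
+apiVersion: fma.llm-d.ai/v1alpha1   # placeholder group/version
+kind: InferenceServerConfig
+metadata:
+  name: vllm-default
+spec:
+  image: vllm/vllm-openai:latest    # illustrative image reference
+  command: ["vllm", "serve"]
+  resources:
+    limits:
+      nvidia.com/gpu: "1"
+```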
 
-**NOTE**: we are in the midst of a terminology shift, from
-  "server-running Pod" to "server-providing Pod".
+The development roadmap has three milestones. Milestone 2, which
+introduced vLLM sleep/wake without the launcher, is finished.
+Milestone 3 is under development; it adds launcher-based model
+swapping, in which a persistent launcher process on each node
+manages vLLM instances.
 
 For further design documentation, see [the docs
 directory](docs/README.md).