# docs: Update benchmark scenarios matrix and integration strategy (#381)
## Purpose

The goal is to quantify and compare how quickly a model-serving duo (server-requesting
and server-providing pods), when integrated with the Workload Variant Autoscaler (WVA),
becomes available under three actuation conditions, listed in order of decreasing
latency:
- **Cold start**: creating a new vLLM instance without using a launcher
- **Warm start**: creating a new vLLM instance in an existing launcher pod
- **Hot start**: waking a sleeping vLLM instance on an existing launcher pod

These metrics will guide future optimizations for the **Dual-Pods Controller (DPC)**.
Ultimately, the goal is *high predictability*, defined as achieving close to a 100% hit
rate of awakening available, sleeping pods on cluster GPUs, as a function of total
inference server requests, for common user scenarios.
## Baseline Startup Latency

**Objective:**
Measure the time from **deployment (server-request submission)** to **dual-pod readiness**.

### Inputs
| Parameter | Type | Required | Default | Description |
| ------------------ | ------ | -------- | --------------------------------------- | ------------------------------------------------------------------------------------------------------ |
| `--namespace` | `str` | **Yes** | — | OpenShift namespace in which to run the benchmark |
| `--yaml` | `str` | **Yes** | — | Path to the server-requesting YAML template file |
| `--image` | `str` | **Yes**\* | — | Image repository for the requester pod. Required *only if* the `CONTAINER_IMG_REG` env var is **not** set |
| `--tag` | `str` | **Yes**\* | — | Image tag for the requester pod. Required *only if* the `CONTAINER_IMG_VERSION` env var is **not** set |
| `--cleanup` | `bool` | No | `True` | Whether to clean up created resources after the benchmark |
| `--iterations` | `int` | No | `1` | Number of times to run each benchmark scenario |
| `--cluster-domain` | `str` | No | `fmaas-platform-eval.fmaas.res.ibm.com` | Cluster domain for the Prometheus GPU metrics query |
| `--model-path` | `str` | No | `None` | Path to a JSON file containing models for the scenario (used only in the `new_variant` scenario) |
| `--scenario` | `str` | No | `"scaling"` | Benchmark scenario to run: `baseline`, `scaling`, or `new_variant` |
### Outputs

| Output | Description |
| ------------------- | -------------------------------------------------------------------------- |
| `startup_time` | Total time from deployment to readiness |
| `availability_mode` | Indicates whether the vLLM instance was started cold or resumed from sleep |
**Example Usage**

```bash
python3 inference_server/benchmark/benchmark_base.py \
  --namespace <str> \
  --yaml <str> \
  --cleanup <bool, default: True> \
  --iterations <int, default: 1> \
  --cluster-domain <str> \
  --model-path <str> \
  --scenario <str, default: scaling> \
  --image <str> \
  --tag <str>
```
**Output Example (Subject to Change)**

```
2025-12-01 13:59:52,031 - INFO - scale-request-3-1764615426-4pztx-dual-lhv7s:scale-request-3-1764615426-v9jkh bound through a HIT.
2025-12-01 13:59:52,053 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 13:59:52,496 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 13:59:53,930 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 13:59:53,962 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 13:59:53,972 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 13:59:55,900 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 14:00:03,738 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 14:00:33,850 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 14:01:03,904 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 14:01:22,404 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 14:01:22,405 - INFO - Requester Pod:scale-request-3-1764615426-ktcqd ready after 108s on node:fmaas-vllm-d-wv25b-worker-h100-3-hmtwn using GPU:GPU-ca8ae222-50e0-b69e-16f2-e49dac1afe28
2025-12-01 14:01:22,405 - INFO - scale-request-3-1764615426-ktcqd-dual-pptzq:scale-request-3-1764615426-ktcqd bound through a COLD START.
2025-12-01 14:01:22,405 - INFO - ✅ All pods {'scale-request-3-1764615426-hvxjg', 'scale-request-3-1764615426-v9jkh', 'scale-request-3-1764615426-ktcqd'} Ready after 108.97s
replicaset.apps "scale-request-3-1764615426" deleted
pod "scale-request-3-1764615426-9hlb2-dual-dgcg2" deleted
pod "scale-request-3-1764615426-hvxjg-dual-59hc8" deleted
pod "scale-request-3-1764615426-4pztx-dual-lhv7s" deleted
pod "scale-request-3-1764615426-ktcqd-dual-pptzq" deleted
2025-12-01 14:01:32,868 - INFO - ---------------------------------------------------------------------

Total Runs: 15
Successful Runs: 15
Failed Runs: 0
Requester Pods
  Min: 9s,
  Max: 318s
  Average: 125.4s
  Median: 115s
Hits: 3/6 (50%)
Hit Wake-up Times
  Min: 9s,
  Max: 18s
  Average: 13.0s
```
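The summary statistics at the bottom of the log can be reproduced with a few lines of standard-library Python. This is an illustrative sketch, not the tool's actual implementation; the run-record field names (`startup_s`, `hit`, `success`) are hypothetical.

```python
from statistics import mean, median

def summarize(runs):
    """Aggregate per-run benchmark results into a summary like the one above.

    `runs` is a list of dicts with hypothetical keys:
      - "startup_s": requester-pod startup time in seconds
      - "hit": True if the requester bound to a sleeping instance (wake-up)
      - "success": True if the run completed
    """
    times = [r["startup_s"] for r in runs if r["success"]]
    hits = [r for r in runs if r["success"] and r["hit"]]
    return {
        "total": len(runs),
        "successful": sum(r["success"] for r in runs),
        "min_s": min(times),
        "max_s": max(times),
        "avg_s": round(mean(times), 1),
        "median_s": median(times),
        "hit_rate": f"{len(hits)}/{len(times)}",
    }

runs = [
    {"startup_s": 9, "hit": True, "success": True},
    {"startup_s": 108, "hit": False, "success": True},
    {"startup_s": 18, "hit": True, "success": True},
]
print(summarize(runs))
```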
## Measurement Layers
FMA benchmarking uses a layered measurement model. Layer 1 is what FMA benchmarks own
directly. Layer 2 bridges actuation to inference readiness (i.e., the latency for inference
requests to get responses back in an FMA-enabled context). Layer 3 is out of FMA's
direct scope but is referenced for completeness and for handoff to other frameworks.

| Layer | Focus | Metrics | Measured By |
| ----- | ----- | ------- | ----------- |
| **L1: Actuation** | Requester pod readiness | T_actuation (requester creation to readiness), T_wake (DPC wakes a sleeping vLLM instance), Hit_rate (GPU hits), T_launcher (launcher creates a new vLLM instance) | llm-d-benchmark new harness |
| **L2: Inference Readiness** | First inference response | T_e2e (requester creation to first inference response), T_first_token (requester ready to first inference response) | llm-d-benchmark nop/inference-perf harness |
| **L3: Steady-State** | Throughput/latency | T_actuation (requester creation to readiness), TPOT (time per output token), throughput, queue depth, KV cache usage, replica stability | llm-d-benchmark / WVA |
> **Review discussion:**
>
> **Collaborator:** I find L3 surprising and confusing. The difference from L1 to L2 is not like the difference from L2 to L3. The difference from L1 to L2 is more of the same thing: more latency (picking a later event that stops the clock). I expected L3 to differ in the same way, perhaps by starting the clock earlier (e.g., when an inference client sends a request). But mostly L3 is about different things, not latency. I also do not see how WVA plays a role in any of the items listed for L3 (nor L2 nor L1), which is a surprise; I expected to see WVA involved somewhere.
>
> **Author:** L3 is intentionally different in kind from L1/L2. L1 and L2 are stop-the-clock latency measurements that FMA owns. L3 is about steady-state performance metrics (TPOT, throughput, queue depth, KV cache usage, replica stability) that become relevant when FMA is integrated with WVA and the broader llm-d stack. The inclusion of T_actuation in L3 is the bridge: it lets WVA and llm-d-benchmark consumers see how FMA actuation latency affects their steady-state metrics. The pitch is: look at how your metrics perform with FMA included. I had Claude double check that these metrics are currently being captured: TPOT, throughput, KV cache usage, and queue depth are in llm-d-benchmark's schema v0.2 and […]
>
> **Collaborator:** Why is there one and only one metric that appears in multiple layers?
>
> **Author:** As stated in my previous comment, T_actuation is the bridge that we care most about. It's the one metric that FMA uniquely contributes to the L3 picture, letting WVA and llm-d-benchmark consumers see how actuation latency affects their steady-state metrics. The other L1/L2 metrics are either FMA-internal (T_wake, Hit_rate, T_launcher) or already captured by L3 tools in their own terms (TTFT maps to T_first_token).
**Metric definitions:**

- **T_actuation**: Time from requester pod creation (ReplicaSet scale-up) to requester pod readiness (the `/ready` probe passes), which implies the DPC has bound the requester to a server-providing pod and the vLLM instance is serving.
- **T_wake**: Request-response time for the DPC's `/wake_up` call to a sleeping vLLM instance on the server-providing pod. A part of T_actuation when a hot start occurs.
- **Hit_rate**: Fraction of requesters that get bound to an existing sleeping pod on the correct GPU (a hit) vs. requiring a cold start (i.e., a new vLLM instance in an existing launcher pod, or a new launcher pod plus a new vLLM instance).
- **T_launcher**: Time from the launcher receiving a create request to the new vLLM instance reporting healthy. Includes the benefit of vLLM module preloading.
- **T_e2e**: Total time from requester pod creation to first successful inference response. Spans the full path: requester scheduling, DPC binding, instance wake-up or launcher instance creation, vLLM ready, first inference (T_actuation + T_first_token).
- **T_first_token**: Time from requester pod readiness to the first successful inference response received through the server-providing pod's vLLM instance (time-to-first-token, post-actuation).
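The relationships among these latency metrics can be made concrete with a small sketch. This is illustrative only (the timeline field names are assumptions, not the harness's actual schema); it encodes the identity T_e2e = T_actuation + T_first_token from the definitions above.

```python
from dataclasses import dataclass

@dataclass
class ActuationTimeline:
    """Event timestamps (seconds since an arbitrary epoch) for one requester.

    Field names are hypothetical, chosen to mirror the metric definitions.
    """
    pod_created: float     # ReplicaSet scale-up creates the requester pod
    pod_ready: float       # /ready probe passes (DPC bound, vLLM serving)
    first_response: float  # first successful inference response

    @property
    def t_actuation(self) -> float:
        return self.pod_ready - self.pod_created

    @property
    def t_first_token(self) -> float:
        return self.first_response - self.pod_ready

    @property
    def t_e2e(self) -> float:
        # By construction, T_e2e = T_actuation + T_first_token.
        return self.first_response - self.pod_created

tl = ActuationTimeline(pod_created=0.0, pod_ready=12.0, first_response=14.5)
assert tl.t_e2e == tl.t_actuation + tl.t_first_token
```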
## Benchmarking Scenarios

| Scenario | Description |
| ---------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| **Fast Replica Scale Up** | As a ModelService Owner, I can scale up the number of active replicas for a variant from 0 to 1 or 1 to N with minimal latency |
| **Introducing New Variant** | As a ModelService Owner, I can deploy a newly released variant in anticipation of user requests |
| **Resource Request Justification** | As a Workload Owner, I can stress-test my namespace's resources to justify more resource requests (routes, gateways, GPUs) from the cluster operator |
| **Maintenance Planning** | As a Cluster Operator, I can validate that cluster performance is similar or better after node maintenance and upgrades |
## Benchmarking Matrix

### Actuation Paths (columns)

The columns represent the different paths FMA can take to satisfy a new server-requesting pod,
using the team's established terminology:
| Actuation Path | What It Measures | llm-d-benchmark Config |
| -------------- | ---------------- | ---------------------- |
| **Cold Start** | No launcher, no sleeping pods. Raw Kubernetes deploy-to-ready latency (non-FMA baseline, or FMA milestone 2 without a launcher). | `-t standalone` comparison baseline |
| **Warm Start** | A launcher pod exists (pre-created by the LPC) but no sleeping instance is available on the assigned GPU. The launcher creates a new vLLM instance with the benefit of module preloading. | `-t fma` with default `LLMDBENCH_FMA_LAUNCHER_*` env vars |
| **Hot Start** | A sleeping vLLM instance exists on the correct GPU. The DPC sends `/wake_up`. Best-case actuation path. | `-t fma` with `LLMDBENCH_VLLM_COMMON_ENABLE_SLEEP_MODE=true`, `LLMDBENCH_FMA_DUAL_POD_SLEEPER_LIMIT>=1` |
> **Note on simulation:** Any of the above paths can be exercised with mock GPUs
> (the `llm-d-inference-sim` image or launcher `--mock-mode`) for CI pipelines and scenario
> prototyping. Simulation is an orthogonal testing mode, not a separate actuation path.
>
> **Note on caching:** Model tensor caching (via PVC) and CUDA graph compilation caching
> are orthogonal to the actuation paths above. Either or both can be enabled for any path
> besides Hot Start (where the instance is already loaded). Caching configuration is
> controlled via `LLMDBENCH_VLLM_COMMON_EXTRA_PVC_NAME` and `LLMDBENCH_VLLM_COMMON_VLLM_CACHE_ROOT`
> in the `fma.sh` scenario.
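As a concrete illustration, a Hot Start run might be configured with the environment variables named above. The variable names come from this document; the values are hypothetical, a sketch under assumptions rather than a verified working configuration.

```shell
# Hypothetical Hot Start configuration for an fma.sh-style scenario.
# Variable names are taken from the table and notes above; values are illustrative.

# Enable vLLM sleep mode so instances can be put to sleep and woken.
export LLMDBENCH_VLLM_COMMON_ENABLE_SLEEP_MODE=true

# Allow at least one sleeping instance for the DPC to wake.
export LLMDBENCH_FMA_DUAL_POD_SLEEPER_LIMIT=1

# Optional, orthogonal caching knobs (not needed for Hot Start itself,
# since the instance is already loaded):
export LLMDBENCH_VLLM_COMMON_EXTRA_PVC_NAME=model-cache-pvc
export LLMDBENCH_VLLM_COMMON_VLLM_CACHE_ROOT=/cache/vllm
```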
### Matrix

Cell annotations indicate which measurement layers apply:

- **L1** -- Layer 1 actuation metrics (T_actuation, T_wake, Hit_rate, T_launcher)
- **L1+L2** -- Actuation metrics plus inference readiness (T_first_token, T_e2e)
- **P** -- Planned but not yet implemented
- **--** -- Not applicable to this combination

| Scenario | Cold Start | Warm Start | Hot Start |
| ---------------------------------- | :--------: | :--------: | :-------: |
| **Fast Replica Scale Up** | L1+L2 | L1+L2 | L1+L2 |
| **Introducing New Variant** | L1+L2 | L1+L2 | -- |
| **Resource Request Justification** | L1 | L1 | L1 |
| **Maintenance Planning** | L1+L2 | L1+L2 | L1+L2 |
### Scenario Rationale

| FMA Scenario | Why Included | llm-d-benchmark Applicability |
| ---------------------------------- | ------------ | ----------------------------- |
| **Fast Replica Scale Up** | Core FMA value proposition: how fast can the DPC bring replicas online? Directly measures the benefit of sleep/wake and launcher preloading over cold starts. | `fma.sh` scenario with varying `LLMDBENCH_VLLM_COMMON_REPLICAS`; `inference-scheduling` guide with `-t fma` |
| **Introducing New Variant** | Measures actuation latency for a previously unseen model (cold cache, no sleeping instances for this model). Assumes the LPC has already created the needed launchers. Captures the "day 1" deployment experience. | `fma.sh` with `LLMDBENCH_DEPLOY_MODEL_LIST` variations; nop harness for pure actuation timing |
| **Resource Request Justification** | Stress-tests namespace resources across multiple concurrent models/variants to produce data for capacity planning and resource justification to cluster operators. | `fma.sh` with a multi-model list; DoE experiment with replica/model treatments |
| **Maintenance Planning** | Regression baseline: run the same scenarios before and after node maintenance or upgrades. Detects performance regressions in actuation latency. | Any guide scenario as a regression baseline with `-t fma`; compare pre/post results |
### Actuation Path Rationale

| Actuation Path | Why Included |
| -------------- | ------------ |
| **Cold Start** | Baseline without FMA (or FMA milestone 2 without a launcher). Establishes the raw Kubernetes deploy-to-ready latency that all FMA paths should improve upon. |
| **Warm Start** | Measures the launcher's contribution when no sleeping instance is available. The LPC has pre-created launcher pods, and the launcher creates a new vLLM instance with the module preloading benefit. |
| **Hot Start** | Best-case FMA path. The DPC sends `/wake_up` to a sleeping vLLM instance on the correct GPU. Measures sleep-to-wake latency. |
## Integration Strategy

### Current State

FMA is being integrated as a third deploy method (`-t fma`) in [llm-d-benchmark](https://github.com/llm-d/llm-d-benchmark),
alongside `standalone` and `modelservice`. This work is tracked on the
[`fma` branch](https://github.com/manoelmarques/llm-d-benchmark/tree/fma) and includes:
- **`scenarios/examples/fma.sh`** -- Scenario configuration with sleep mode enabled, a model caching PVC, and FMA image references.
- **`setup/steps/07_deploy_fma_models.py`** -- Standup step that deploys InferenceServerConfig, LauncherConfig, LauncherPopulationPolicy, and requester ReplicaSet CRs. Installs FMA CRDs and the `fma-controllers` Helm chart. Waits for dual-pod controller and launcher-populator readiness.
- **`setup/env.sh`** -- 35+ new `LLMDBENCH_FMA_*` environment variables covering chart version, image registry/tags, dual-pod configuration, launcher configuration, and requester resource limits.
- **`setup/run.sh`** -- FMA endpoint discovery via Kubernetes service labels (`stood-up-via=fma`).
- **`setup/teardown.sh`** -- Ordered teardown: FMA custom resources first, then wait for the dual-pods controller to remove finalizers, then uninstall the FMA Helm release.

The existing llm-d-benchmark harnesses (nop, inference-perf, vllm-benchmark) can run after
FMA standup to measure L2 and L3 metrics.
### Next Steps

1. **Upstream the `fma` branch** into llm-d-benchmark, aligning with the declarative
   Python architecture in [PR #848](https://github.com/llm-d/llm-d-benchmark/pull/848).
2. **Add FMA-specific experiment YAML** for Design of Experiments (DoE) treatments:
   replica count, sleep mode on/off, sleeper limit, and model variant combinations.
3. **Add actuation-specific metrics collection** in the nop harness: T_actuation, T_wake,
   and Hit_rate parsed from FMA pod events and DPC logs.
4. **Consider Grafana integration** for visual actuation metrics (scale-up latency
   dashboards), following the pattern in [WVA PR #900](https://github.com/llm-d/llm-d-workload-variant-autoscaler/pull/900).
5. **Maintain a framework-agnostic interface**: the FMA benchmark lifecycle (deploy, measure,
   teardown) should remain pluggable into benchmarking frameworks beyond llm-d-benchmark.
## Legacy Benchmark Tooling

See [benchmark_legacy.md](benchmark_legacy.md) for documentation on the original `benchmark_base.py` tool.