`inference_server/benchmark/benchmark.md`
is *high predictability*, which is defined as achieving close to 100% hit rate on
available, sleeping pods on cluster GPUs as a function of total inference server
requests for common user scenarios.

## Baseline Startup Latency

**Objective:**
Measure the time from **deployment (server-request submission)** to **dual-pod readiness**.

### Inputs

| Parameter | Type | Required | Default | Description |
| ------------------ | ------ | -------- | --------------------------------------- | ------------------------------------------------------------------------------------------------------ |
| `--namespace` | `str` | **Yes** | — | OpenShift namespace in which to run the benchmark |
| `--yaml` | `str` | **Yes** | — | Path to the server-requesting YAML template file |
| `--image` | `str` | **Yes*** | — | Image repository for the requester pod. Required *only if* `CONTAINER_IMG_REG` env var is **not** set |
| `--tag` | `str` | **Yes*** | — | Image tag for the requester pod. Required *only if* `CONTAINER_IMG_VERSION` env var is **not** set |
| `--cleanup` | `bool` | No | `True` | Whether to clean up created resources after the benchmark |
| `--iterations` | `int` | No | `1` | Number of times to run each benchmark scenario |
| `--cluster-domain` | `str` | No | `fmaas-platform-eval.fmaas.res.ibm.com` | Cluster domain for Prometheus GPU metrics query |
| `--model-path` | `str` | No | `None` | Path to a JSON file containing models (used only in the `new_variant` scenario). |
| `--scenario` | `str` | No | `"scaling"` | Benchmark scenario to run: `baseline`, `scaling`, or `new_variant`. |


### Outputs

| Output | Description |
| ---------------------- | -------------------------------------------------------------------------- |
| `startup_time` | Total time from deployment to readiness |
| `availability_mode` | Indicates whether the vLLM instance was started cold or resumed from sleep |

**Example Usage**
```bash
python3 inference_server/benchmark/benchmark_base.py \
  --namespace <str> --yaml <str> --image <str> --tag <str> \
  --cleanup <bool, default:True> --iterations <int, default:1> \
  --cluster-domain <str> --model-path <str> --scenario <str, default:scaling>
```

**Output Example (Subject to Change)**

```
2025-12-01 13:59:52,031 - INFO - scale-request-3-1764615426-4pztx-dual-lhv7s:scale-request-3-1764615426-v9jkh bound through a HIT.
2025-12-01 13:59:52,053 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 13:59:52,496 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 13:59:53,930 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 13:59:53,962 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 13:59:53,972 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 13:59:55,900 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 14:00:03,738 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 14:00:33,850 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 14:01:03,904 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 14:01:22,404 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 14:01:22,405 - INFO - Requester Pod:scale-request-3-1764615426-ktcqd ready after 108s on node:fmaas-vllm-d-wv25b-worker-h100-3-hmtwn using GPU:GPU-ca8ae222-50e0-b69e-16f2-e49dac1afe28
2025-12-01 14:01:22,405 - INFO - scale-request-3-1764615426-ktcqd-dual-pptzq:scale-request-3-1764615426-ktcqd bound through a COLD START.
2025-12-01 14:01:22,405 - INFO - ✅ All pods {'scale-request-3-1764615426-hvxjg', 'scale-request-3-1764615426-v9jkh', 'scale-request-3-1764615426-ktcqd'} Ready after 108.97s
replicaset.apps "scale-request-3-1764615426" deleted
pod "scale-request-3-1764615426-9hlb2-dual-dgcg2" deleted
pod "scale-request-3-1764615426-hvxjg-dual-59hc8" deleted
pod "scale-request-3-1764615426-4pztx-dual-lhv7s" deleted
pod "scale-request-3-1764615426-ktcqd-dual-pptzq" deleted
2025-12-01 14:01:32,868 - INFO - ---------------------------------------------------------------------

Total Runs: 15
Successful Runs: 15
Failed Runs: 0
Requester Pods
Min: 9s,
Max: 318s
Average: 125.4s
Median: 115s
Hits: 3/6 (50%)
Hit Wake-up Times
Min: 9s,
Max: 18s
Average: 13.0s
```
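For post-processing a run, the `HIT`/`COLD START` and readiness lines in output like the above can be scraped with a few regular expressions. The patterns below are assumptions inferred from the example log, not a published format:

```python
import re

# Assumed log line shapes, inferred from the example output above.
HIT_RE = re.compile(r"bound through a (HIT|COLD START)")
READY_RE = re.compile(r"ready after (\d+)s")

def summarize(log_lines):
    """Count GPU hits vs. cold starts and collect readiness times."""
    hits = misses = 0
    ready_seconds = []
    for line in log_lines:
        m = HIT_RE.search(line)
        if m:
            if m.group(1) == "HIT":
                hits += 1
            else:
                misses += 1
        m = READY_RE.search(line)
        if m:
            ready_seconds.append(int(m.group(1)))
    return {"hits": hits, "misses": misses, "ready_seconds": ready_seconds}

lines = [
    "... pod-a-dual-x:pod-a bound through a HIT.",
    "... Requester Pod:pod-b ready after 108s on node:n1 using GPU:g1",
    "... pod-b-dual-y:pod-b bound through a COLD START.",
]
print(summarize(lines))  # {'hits': 1, 'misses': 1, 'ready_seconds': [108]}
```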

## Measurement Layers

FMA benchmarking uses a layered measurement model. Layer 1 is what FMA benchmarks own
directly. Layer 2 bridges actuation to inference readiness. Layer 3 is out of FMA's
direct scope but is referenced for completeness and handoff to other frameworks.

| Layer | Focus | Metrics | Measured By |
| ----- | ----- | ------- | ----------- |
| **L1: Actuation** | Requester pod readiness | T_actuation (requester creation to readiness), T_wake (launcher wakes sleeping vLLM instance), Hit_rate (% GPU hits), T_launcher (launcher creates new vLLM instance) | llm-d-benchmark new harness |
| **L2: Inference Readiness** | First inference response | T_first_token (requester ready to first inference response), T_e2e (requester creation to first inference response) | FMA + llm-d-benchmark nop/inference-perf harness |
| **L3: Steady-State** | Throughput/latency | TPOT (time per output token), throughput, queue depth, KV cache usage, replica stability | llm-d-benchmark / WVA |

**Metric definitions:**

- **T_actuation**: Time from requester pod creation (ReplicaSet scale-up) to requester pod readiness (`/ready` probe passes), which implies the DPC has bound the requester to a server-providing pod and the vLLM instance is serving.
- **T_wake**: Time from the DPC instructing the launcher to wake a sleeping vLLM instance on the server-providing pod to that instance reporting ready to serve. A subset of T_actuation when a GPU hit occurs.
- **Hit_rate**: Percentage of scale-up events where the DPC binds a requester to an existing sleeping pod on the correct GPU (hit) vs. requiring a cold start (miss).
- **T_launcher**: Time from the launcher receiving a create request to the new vLLM instance reporting healthy. Includes the benefit of vLLM module preloading.
- **T_first_token**: Time from requester pod readiness to first successful inference response received through the server-providing pod's vLLM instance (time-to-first-token, post-actuation).
- **T_e2e**: Total time from requester pod creation to first successful inference response. Spans the full path: requester scheduling, DPC binding, launcher wake/create, vLLM ready, first inference (T_actuation + T_first_token).
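The relationships among these metrics can be sketched in a few lines (the class and field names are illustrative, not the benchmark's actual schema):

```python
from dataclasses import dataclass

@dataclass
class ActuationSample:
    """One scale-up event; durations in seconds."""
    t_actuation: float    # requester creation -> readiness (L1)
    t_first_token: float  # readiness -> first inference response (L2)
    gpu_hit: bool         # bound to a sleeping instance on the right GPU?

    @property
    def t_e2e(self) -> float:
        # By definition, T_e2e spans both layers.
        return self.t_actuation + self.t_first_token

def hit_rate(samples):
    """Fraction of scale-up events that were GPU hits."""
    return sum(s.gpu_hit for s in samples) / len(samples)

samples = [ActuationSample(9.0, 1.2, True), ActuationSample(108.0, 1.5, False)]
print(hit_rate(samples))   # 0.5
print(samples[0].t_e2e)
```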
> **@MikeSpreitzer** (Mar 27, 2026): "launcher wake/create" is not clear about the fact that there are three cases:
>
> 1. wake a sleeping vLLM instance
> 2. create a new vLLM instance in an existing launcher
> 3. create a launcher and then create a new vLLM instance in that launcher
>
> All three may also include deletion of sleeping vLLM instance(s) to free up GPU memory (but doing it for case 3 is not designed yet, and the design for the first two cases is incomplete).

> **Author:** Good point. How about replacing "launcher wake/create" with the explicit path "requester scheduling, DPC binding, instance wake-up or launcher instance creation, vLLM ready, first inference"? That avoids collapsing the three cases into ambiguous shorthand.
>
> On deletion of sleeping instances: from what I learned by prompting Claude about the current code paths, sleeper deletion happens in order to respect the sleeperLimit during DPC reconciliation (inference-server.go:758-796), not as part of the wake path itself. But it is good to know where the design is headed, and the benchmark should leave room for these future paths.

> **@MikeSpreitzer** (Apr 3, 2026): "instance wake-up or launcher instance creation" is better but still has the problem that it covers only two of the three actuation paths: it omits the path of creating a launcher and then creating a vLLM instance in that launcher.
>
> That omission is pervasive in this benchmarking design. I actually think that this path should be measured. While in a simple steady-state situation we might expect that no more launchers are needed than are created by the LPC, it is less clear in a dynamic situation. Consider what must be done when changing a LauncherConfig's definition. Do we want to suddenly double the main memory consumption (full population of the old LauncherConfig plus full population of the new one)? I suspect that will not be tenable. Instead, there may have to be a roll-out, timed/scheduled to coincide with server-requesting Pod deletions and replacements (since we are going with A3 to Q3 in #201 (comment)). That level of coordination may be difficult to achieve perfectly. So the question becomes relevant: what is the performance of the path where the dual-pods controller creates a launcher? There are other conceivable scenarios where the LPC cannot be expected to pre-create exactly as many launchers as are needed, such as a mix of 1-GPU models and 2-GPU models under dynamic load (e.g., WVA active) where main memory pressure prevents creating as many launchers as there are GPUs.
>
> Regarding deletion of vLLM instances, that answer from Claude is confused. "Reconciliation" is where the DPC does all its processing, even in the wake-up actuation path. While it is true that today the call at `err, retry := ctl.enforceSleeperBudget(ctx, serverDat, requestingPod, int(lc.Spec.MaxSleepingInstances))` does nothing (`serverDat.GPUIndices` is empty in the launcher-based case), this is a recognized shortcoming that needs to be fixed for milestone 3 (`// TODO(waltforme): enforceSleeperBudget should be revised for launcher-based server-providing Pods`).

> **Author:** Just to clarify, you are effectively arguing for a fourth column (call it "lukewarm start" for now), right? I want to confirm because it would require more rethinking of the matrix.
>
> Re: instance deletion, thanks for the clarification.


## Benchmarking Scenarios

| Scenario | Description |
| ---------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| **Fast Replica Scale Up** | As a ModelService Owner, I can scale up the number of active replicas for a variant from 0 to 1 or 1 to N with minimal latency |
| **Introducing New Variant** | As a ModelService Owner, I can deploy a newly released variant in anticipation of user requests |
| **Free Up Cluster Resources** | As a Cluster Operator, I can reduce/deactivate resource intensive variants to make space for numerous smaller model variants |
| **Resource Request Justification** | As a Workload Owner, I can stress-test my namespace's resources to justify more resource requests (routes, gateways, GPUs) from cluster operator |
| **Maintenance Planning** | As a Cluster Operator, I can validate the cluster performance is similar or better after node maintenance schedules and upgrades |


## Benchmarking Matrix

Cell annotations indicate which measurement layers apply:
- **L1** -- Layer 1 actuation metrics (T_actuation, T_wake, Hit_rate, T_launcher)
- **L1+L2** -- Actuation metrics plus inference readiness (T_first_token, T_e2e)
- **P** -- Planned but not yet implemented
- **--** -- Not applicable to this combination

| Scenario | Cold Start (Standalone) | Cold Start (Launcher) | Wake from Sleep (GPU Hit) | Wake from Sleep (GPU Miss) | Model Swap (Launcher) | Cached Model (PVC) | Simulated (Mock GPUs) |
| ---------------------------------- | :---------------------: | :-------------------: | :-----------------------: | :------------------------: | :-------------------: | :-----------------: | :-------------------: |
| **Fast Replica Scale Up** | L1 | L1+L2 | L1+L2 | L1 | -- | L1+L2 | L1 |
| **Introducing New Variant** | L1 | L1+L2 | -- | -- | L1+L2 | L1+L2 | L1 |
| **Free Up Cluster Resources** | -- | P | P | -- | P | -- | P |
| **Resource Request Justification** | L1 | L1 | L1 | L1 | P | L1 | L1 |
| **Maintenance Planning** | L1 | L1+L2 | L1+L2 | L1 | P | L1+L2 | -- |

### Scenario Rationale

| FMA Scenario | Why Included | llm-d-benchmark Applicability |
| ---------------------------------- | ------------ | ----------------------------- |
| **Fast Replica Scale Up** | Core FMA value proposition: how fast can the DPC bring replicas online? Directly measures the benefit of sleep/wake and launcher preloading over cold starts. | `fma.sh` scenario with varying `LLMDBENCH_VLLM_COMMON_REPLICAS`; `inference-scheduling` guide with `-t fma` |
| **Introducing New Variant** | Tests the full FMA deployment path: InferenceServerConfig + LauncherConfig + LauncherPopulationPolicy creation, followed by requester ReplicaSet. Captures model download, launcher instance creation, and dual-pod binding. | `fma.sh` with `LLMDBENCH_DEPLOY_MODEL_LIST` variations; nop harness for pure actuation timing |
| **Free Up Cluster Resources** | Validates the reverse path: sleeping/deactivating variants. Important for cluster operators managing GPU capacity across tenants. Measures GPU release latency. | FMA-specific: scale down + verify GPU release via Prometheus. Not yet in llm-d-benchmark |
| **Resource Request Justification** | Stress-tests namespace resources across multiple concurrent models/variants to produce data for capacity planning and resource justification to cluster operators. | `fma.sh` with multi-model list; DoE experiment with replica/model treatments |
| **Maintenance Planning** | Regression baseline: run the same scenarios before and after node maintenance or upgrades. Detects performance regressions in actuation latency. | Any guide scenario as regression baseline with `-t fma`; compare pre/post results |

### Actuation Condition Rationale

| Condition | What It Tests | llm-d-benchmark Config |
| ------------------------------ | ------------- | ---------------------- |
| **Cold Start (Standalone)** | No launcher, no sleeping pods. Raw Kubernetes deploy-to-ready latency. Serves as the baseline for all other conditions. | `-t standalone` comparison baseline |
| **Cold Start (w/ Launcher)** | Launcher is present and pre-loads vLLM Python modules. Measures the launcher's contribution to reducing cold start latency vs. standalone. | `-t fma` with default `LLMDBENCH_FMA_LAUNCHER_*` env vars |
| **Wake from Sleep (GPU Hit)** | A sleeping vLLM instance exists on the correct GPU. Measures the sleep-to-wake latency, which is the best-case actuation path. | `-t fma` with `LLMDBENCH_VLLM_COMMON_ENABLE_SLEEP_MODE=true`, `LLMDBENCH_FMA_DUAL_POD_SLEEPER_LIMIT>=1` |
| **Wake from Sleep (GPU Miss)** | A sleeping pod exists but on the wrong GPU or node. Requires cold start despite available sleeping capacity. Measures the miss penalty and validates DPC scheduling decisions. | `-t fma`, requires multi-node cluster with asymmetric GPU placement |
| **Model Swap (Launcher)** | The launcher swaps the model on an existing vLLM instance without restarting the process. Tests the launcher's dynamic model management capability. | `-t fma`, sequential model deployment via `07_deploy_fma_models.py` |
| **Cached Model (PVC)** | Model weights are pre-cached on a PersistentVolumeClaim, eliminating download time. Isolates the non-download portion of actuation latency. | `-t fma` with `LLMDBENCH_VLLM_COMMON_EXTRA_PVC_NAME` + `LLMDBENCH_VLLM_COMMON_VLLM_CACHE_ROOT` (configured in `fma.sh`) |
| **Simulated (Mock GPUs)** | Uses `llm-d-inference-sim` image or launcher GPU mock mode. No real GPUs required. For CI pipelines and scenario prototyping. | `simulated-accelerators` guide or launcher `--mock-mode` |


## Integration Strategy

### Current State

FMA is being integrated as a third deploy method (`-t fma`) in [llm-d-benchmark](https://github.com/llm-d/llm-d-benchmark),
alongside `standalone` and `modelservice`. This work is tracked on the
[`fma` branch](https://github.com/manoelmarques/llm-d-benchmark/tree/fma) and includes:

- **`scenarios/examples/fma.sh`** -- Scenario configuration with sleep mode enabled, model caching PVC, and FMA image references.
- **`setup/steps/07_deploy_fma_models.py`** -- Standup step that deploys InferenceServerConfig, LauncherConfig, LauncherPopulationPolicy, and requester ReplicaSet CRs. Installs FMA CRDs and the `fma-controllers` Helm chart. Waits for dual-pod controller and launcher-populator readiness.
- **`setup/env.sh`** -- 35+ new `LLMDBENCH_FMA_*` environment variables covering chart version, image registry/tags, dual-pod configuration, launcher configuration, and requester resource limits.
- **`setup/run.sh`** -- FMA endpoint discovery via Kubernetes service labels (`stood-up-via=fma`).
- **`setup/teardown.sh`** -- Ordered teardown: FMA custom resources first, then wait for the dual-pods controller to remove finalizers, then uninstall the FMA Helm release.

The existing llm-d-benchmark harnesses (nop, inference-perf, vllm-benchmark) can run after
FMA standup to measure L2 and L3 metrics.
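The label-based endpoint discovery in `setup/run.sh` boils down to a label-selector match. A minimal sketch of the same filter over stand-in service data (a pure-Python illustration, not a live cluster query):

```python
def fma_endpoints(services, selector=("stood-up-via", "fma")):
    """Return names of services carrying the FMA stand-up label,
    mirroring a `kubectl get svc -l stood-up-via=fma` query."""
    key, value = selector
    return [s["name"] for s in services
            if s.get("labels", {}).get(key) == value]

services = [
    {"name": "fma-variant-a", "labels": {"stood-up-via": "fma"}},
    {"name": "legacy-svc", "labels": {"stood-up-via": "modelservice"}},
    {"name": "unlabeled-svc", "labels": {}},
]
print(fma_endpoints(services))  # ['fma-variant-a']
```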

### Next Steps

1. **Upstream the `fma` branch** into llm-d-benchmark, aligning with the declarative
Python architecture in [PR #848](https://github.com/llm-d/llm-d-benchmark/pull/848).
2. **Add FMA-specific experiment YAML** for Design of Experiments (DoE) treatments:
replica count, sleep mode on/off, sleeper limit, model variant combinations.
3. **Add actuation-specific metrics collection** in the nop harness: T_actuation, T_wake,
Hit_rate parsed from FMA pod events and DPC logs.
4. **Consider Grafana integration** for visual actuation metrics (scale-up latency
dashboards), following the pattern in [WVA PR #900](https://github.com/llm-d/llm-d-workload-variant-autoscaler/pull/900).
5. **Maintain framework-agnostic interface**: the FMA benchmark lifecycle (deploy, measure,
teardown) should remain pluggable into other benchmarking frameworks beyond llm-d-benchmark.
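As a sketch of step 2, the DoE treatments can be enumerated as a full factorial over the listed factors. The factor names and levels here are illustrative, not the experiment YAML's actual schema:

```python
from itertools import product

# Hypothetical factor levels for a full-factorial FMA experiment.
factors = {
    "replicas": [1, 2, 4],
    "sleep_mode": [True, False],
    "sleeper_limit": [0, 1],
}

def treatments(factors):
    """Expand a dict of factor levels into one dict per treatment combination."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]

runs = treatments(factors)
print(len(runs))  # 3 * 2 * 2 = 12
print(runs[0])    # {'replicas': 1, 'sleep_mode': True, 'sleeper_limit': 0}
```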


## Legacy Benchmark Tooling

See [benchmark_legacy.md](benchmark_legacy.md) for documentation on the original `benchmark_base.py` tool.