# docs: Update benchmark scenarios matrix and integration strategy (#381)
## Purpose

The goal is to quantify and compare how quickly a model-serving duo (server-requesting
and server-providing pods), when integrated with the Workload Variant Autoscaler (WVA),
becomes available under three actuation conditions, listed in order of decreasing
latency:
- **Cold start**: creating a new vLLM instance without using a launcher
- **Warm start**: creating a new vLLM instance in an existing launcher pod
- **Hot start**: waking a sleeping vLLM instance on an existing launcher pod

These metrics will guide future optimizations for the **Dual-Pods Controller (DPC)**.
Ultimately, the goal is *high predictability*, defined as achieving close to a 100% hit
rate of awakening available, sleeping pods on cluster GPUs, as a function of total
inference server requests, for common user scenarios.
## Baseline Startup Latency

**Objective:**
Measure the time from **deployment (server-request submission)** to **dual-pod readiness**.

### Inputs
| Parameter | Type | Required | Default | Description |
| ------------------ | ------ | -------- | --------------------------------------- | ------------------------------------------------------------------------------------------------------ |
| `--namespace` | `str` | **Yes** | — | OpenShift namespace in which to run the benchmark |
| `--yaml` | `str` | **Yes** | — | Path to the server-requesting YAML template file |
| `--image` | `str` | **Yes**\* | — | Image repository for the requester pod. Required *only if* the `CONTAINER_IMG_REG` env var is **not** set |
| `--tag` | `str` | **Yes**\* | — | Image tag for the requester pod. Required *only if* the `CONTAINER_IMG_VERSION` env var is **not** set |
| `--cleanup` | `bool` | No | `True` | Whether to clean up created resources after the benchmark |
| `--iterations` | `int` | No | `1` | Number of times to run each benchmark scenario |
| `--cluster-domain` | `str` | No | `fmaas-platform-eval.fmaas.res.ibm.com` | Cluster domain for the Prometheus GPU metrics query |
| `--model-path` | `str` | No | `None` | Path to a JSON file containing models for the scenario (used only in the `new_variant` scenario) |
| `--scenario` | `str` | No | `"scaling"` | Benchmark scenario to run: `baseline`, `scaling`, or `new_variant` |
### Outputs

| Output | Description |
| ------------------- | -------------------------------------------------------------------------- |
| `startup_time` | Total time from deployment to readiness |
| `availability_mode` | Indicates whether the vLLM instance was started cold or resumed from sleep |
**Example Usage**

```bash
python3 inference_server/benchmark/benchmark_base.py \
  --namespace <str> \
  --yaml <str> \
  --cleanup <bool, default: True> \
  --iterations <int, default: 1> \
  --cluster-domain <str> \
  --model-path <str> \
  --scenario <str, default: scaling> \
  --image <str> \
  --tag <str>
```
**Output Example (Subject to Change)**

```
2025-12-01 13:59:52,031 - INFO - scale-request-3-1764615426-4pztx-dual-lhv7s:scale-request-3-1764615426-v9jkh bound through a HIT.
2025-12-01 13:59:52,053 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 13:59:52,496 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 13:59:53,930 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 13:59:53,962 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 13:59:53,972 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 13:59:55,900 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 14:00:03,738 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 14:00:33,850 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 14:01:03,904 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 14:01:22,404 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 14:01:22,405 - INFO - Requester Pod:scale-request-3-1764615426-ktcqd ready after 108s on node:fmaas-vllm-d-wv25b-worker-h100-3-hmtwn using GPU:GPU-ca8ae222-50e0-b69e-16f2-e49dac1afe28
2025-12-01 14:01:22,405 - INFO - scale-request-3-1764615426-ktcqd-dual-pptzq:scale-request-3-1764615426-ktcqd bound through a COLD START.
2025-12-01 14:01:22,405 - INFO - ✅ All pods {'scale-request-3-1764615426-hvxjg', 'scale-request-3-1764615426-v9jkh', 'scale-request-3-1764615426-ktcqd'} Ready after 108.97s
replicaset.apps "scale-request-3-1764615426" deleted
pod "scale-request-3-1764615426-9hlb2-dual-dgcg2" deleted
pod "scale-request-3-1764615426-hvxjg-dual-59hc8" deleted
pod "scale-request-3-1764615426-4pztx-dual-lhv7s" deleted
pod "scale-request-3-1764615426-ktcqd-dual-pptzq" deleted
2025-12-01 14:01:32,868 - INFO - ---------------------------------------------------------------------

Total Runs: 15
Successful Runs: 15
Failed Runs: 0
Requester Pods
  Min: 9s,
  Max: 318s
  Average: 125.4s
  Median: 115s
Hits: 3/6 (50%)
Hit Wake-up Times
  Min: 9s,
  Max: 18s
  Average: 13.0s
```
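The summary statistics at the bottom of the log can be reproduced with a few lines of standard-library Python. This is an illustrative sketch, not the tool's actual implementation; the run-record field names (`startup_s`, `hit`, `success`) are hypothetical.

```python
from statistics import mean, median

def summarize(runs):
    """Aggregate per-run benchmark results into a summary like the one above.

    `runs` is a list of dicts with hypothetical keys:
      - "startup_s": requester-pod startup time in seconds
      - "hit": True if the requester bound to a sleeping instance (wake-up)
      - "success": True if the run completed
    """
    times = [r["startup_s"] for r in runs if r["success"]]
    hits = [r for r in runs if r["success"] and r["hit"]]
    return {
        "total": len(runs),
        "successful": sum(r["success"] for r in runs),
        "min_s": min(times),
        "max_s": max(times),
        "avg_s": round(mean(times), 1),
        "median_s": median(times),
        "hit_rate": f"{len(hits)}/{len(times)}",
    }

runs = [
    {"startup_s": 9, "hit": True, "success": True},
    {"startup_s": 108, "hit": False, "success": True},
    {"startup_s": 18, "hit": True, "success": True},
]
print(summarize(runs))
```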
## Measurement Layers
FMA benchmarking uses a layered measurement model. Layer 1 is what FMA benchmarks own
directly. Layer 2 bridges actuation to inference readiness (i.e., the latency for inference
requests to get responses back in an FMA-enabled context). Layer 3 is out of FMA's
direct scope but is referenced for completeness and for handoff to other frameworks.

| Layer | Focus | Metrics | Measured By |
| ----- | ----- | ------- | ----------- |
| **L1: Actuation** | Requester pod readiness | T_actuation (requester creation to readiness), T_wake (DPC wakes a sleeping vLLM instance), Hit_rate (GPU hits), T_launcher (launcher creates a new vLLM instance) | llm-d-benchmark new harness |
| **L2: Inference Readiness** | First inference response | T_e2e (requester creation to first inference response), T_first_token (requester ready to first inference response) | llm-d-benchmark nop/inference-perf harness |
| **L3: Steady-State** | Throughput/latency | T_actuation (requester creation to readiness), TPOT (time per output token), throughput, queue depth, KV cache usage, replica stability | llm-d-benchmark / WVA |
> **Review discussion:**
>
> **Collaborator:** I find L3 surprising and confusing. The difference from L1 to L2 is not like the difference from L2 to L3. The difference from L1 to L2 is more of the same thing: more latency (picking a later event that stops the clock). I expected L3 to differ in the same way, perhaps by starting the clock earlier (e.g., when an inference client sends a request). But mostly L3 is about different things, not latency. I also do not see how WVA plays a role in any of the items listed for L3 (nor L2 nor L1), which is a surprise; I expected to see WVA involved somewhere.
>
> **Author:** L3 is intentionally different in kind from L1/L2. L1 and L2 are stop-the-clock latency measurements that FMA owns. L3 is about steady-state performance metrics (TPOT, throughput, queue depth, KV cache usage, replica stability) that become relevant when FMA is integrated with WVA and the broader llm-d stack. The inclusion of T_actuation in L3 is the bridge: it lets WVA and llm-d-benchmark consumers see how FMA actuation latency affects their steady-state metrics. The pitch is: look at how your metrics perform with FMA included. I had Claude double check that these metrics are currently being captured: TPOT, throughput, KV cache usage, and queue depth are in llm-d-benchmark's schema v0.2 and […]
>
> **Collaborator:** Why is there one and only one metric that appears in multiple layers?
>
> **Author:** As stated in my previous comment, T_actuation is the bridge that we care most about. It's the one metric that FMA uniquely contributes to the L3 picture, letting WVA and llm-d-benchmark consumers see how actuation latency affects their steady-state metrics. The other L1/L2 metrics are either FMA-internal (T_wake, Hit_rate, T_launcher) or already captured by L3 tools in their own terms (TTFT maps to T_first_token).
**Metric definitions:**

- **T_actuation**: Time from requester pod creation (ReplicaSet scale-up) to requester pod readiness (the `/ready` probe passes), which implies the DPC has bound the requester to a server-providing pod and the vLLM instance is serving.
- **T_wake**: Request-response time for the DPC's `/wake_up` call to a sleeping vLLM instance on the server-providing pod. A part of T_actuation when a hot start occurs.
- **Hit_rate**: Fraction of requesters that get bound to an existing sleeping pod on the correct GPU (a hit) vs. requiring a cold start (i.e., a new vLLM instance in an existing launcher pod, or a new launcher pod plus a new vLLM instance).
- **T_launcher**: Time from the launcher receiving a create request to the new vLLM instance reporting healthy. Includes the benefit of vLLM module preloading.
- **T_e2e**: Total time from requester pod creation to first successful inference response. Spans the full path: requester scheduling, DPC binding, instance wake-up or launcher instance creation, vLLM ready, first inference (T_actuation + T_first_token).
- **T_first_token**: Time from requester pod readiness to the first successful inference response received through the server-providing pod's vLLM instance (time-to-first-token, post-actuation).
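The relationships among these latency metrics can be made concrete with a small sketch. This is illustrative only (the timeline field names are assumptions, not the harness's actual schema); it encodes the identity T_e2e = T_actuation + T_first_token from the definitions above.

```python
from dataclasses import dataclass

@dataclass
class ActuationTimeline:
    """Event timestamps (seconds since an arbitrary epoch) for one requester.

    Field names are hypothetical, chosen to mirror the metric definitions.
    """
    pod_created: float     # ReplicaSet scale-up creates the requester pod
    pod_ready: float       # /ready probe passes (DPC bound, vLLM serving)
    first_response: float  # first successful inference response

    @property
    def t_actuation(self) -> float:
        return self.pod_ready - self.pod_created

    @property
    def t_first_token(self) -> float:
        return self.first_response - self.pod_ready

    @property
    def t_e2e(self) -> float:
        # By construction, T_e2e = T_actuation + T_first_token.
        return self.first_response - self.pod_created

tl = ActuationTimeline(pod_created=0.0, pod_ready=12.0, first_response=14.5)
assert tl.t_e2e == tl.t_actuation + tl.t_first_token
```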
## Benchmarking Scenarios

| Scenario | Description |
| ---------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| **Fast Replica Scale Up** | As a ModelService Owner, I can scale up the number of active replicas for a variant from 0 to 1 or 1 to N with minimal latency |
| **Introducing New Variant** | As a ModelService Owner, I can deploy a newly released variant in anticipation of user requests |
| **Resource Request Justification** | As a Workload Owner, I can stress-test my namespace's resources to justify more resource requests (routes, gateways, GPUs) from the cluster operator |
| **Maintenance Planning** | As a Cluster Operator, I can validate that cluster performance is similar or better after node maintenance and upgrades |
## Benchmarking Matrix

### Actuation Paths (columns)

The columns represent the different paths FMA can take to satisfy a new server-requesting pod,
using the team's established terminology:
| Actuation Path | What It Measures | llm-d-benchmark Config |
| -------------- | ---------------- | ---------------------- |
| **Cold Start** | No launcher, no sleeping pods. Raw Kubernetes deploy-to-ready latency (non-FMA baseline, or FMA milestone 2 without a launcher). | `-t standalone` comparison baseline |
| **Warm Start** | A launcher pod exists (pre-created by the LPC) but no sleeping instance is available on the assigned GPU. The launcher creates a new vLLM instance with the benefit of module preloading. | `-t fma` with default `LLMDBENCH_FMA_LAUNCHER_*` env vars |
| **Hot Start** | A sleeping vLLM instance exists on the correct GPU. The DPC sends `/wake_up`. Best-case actuation path. | `-t fma` with `LLMDBENCH_VLLM_COMMON_ENABLE_SLEEP_MODE=true`, `LLMDBENCH_FMA_DUAL_POD_SLEEPER_LIMIT>=1` |
> **Note on simulation:** Any of the above paths can be exercised with mock GPUs
> (the `llm-d-inference-sim` image or launcher `--mock-mode`) for CI pipelines and scenario
> prototyping. Simulation is an orthogonal testing mode, not a separate actuation path.
>
> **Note on caching:** Model tensor caching (via PVC) and CUDA graph compilation caching
> are orthogonal to the actuation paths above. Either or both can be enabled for any path
> besides Hot Start (where the instance is already loaded). Caching configuration is
> controlled via `LLMDBENCH_VLLM_COMMON_EXTRA_PVC_NAME` and `LLMDBENCH_VLLM_COMMON_VLLM_CACHE_ROOT`
> in the `fma.sh` scenario.
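As a concrete illustration, a Hot Start run might be configured with the environment variables named above. The variable names come from this document; the values are hypothetical, a sketch under assumptions rather than a verified working configuration.

```shell
# Hypothetical Hot Start configuration for an fma.sh-style scenario.
# Variable names are taken from the table and notes above; values are illustrative.

# Enable vLLM sleep mode so instances can be put to sleep and woken.
export LLMDBENCH_VLLM_COMMON_ENABLE_SLEEP_MODE=true

# Allow at least one sleeping instance for the DPC to wake.
export LLMDBENCH_FMA_DUAL_POD_SLEEPER_LIMIT=1

# Optional, orthogonal caching knobs (not needed for Hot Start itself,
# since the instance is already loaded):
export LLMDBENCH_VLLM_COMMON_EXTRA_PVC_NAME=model-cache-pvc
export LLMDBENCH_VLLM_COMMON_VLLM_CACHE_ROOT=/cache/vllm
```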
### Matrix

Cell annotations indicate which measurement layers apply:

- **L1** -- Layer 1 actuation metrics (T_actuation, T_wake, Hit_rate, T_launcher)
- **L1+L2** -- Actuation metrics plus inference readiness (T_first_token, T_e2e)
- **P** -- Planned but not yet implemented
- **--** -- Not applicable to this combination

| Scenario | Cold Start | Warm Start | Hot Start |
| ---------------------------------- | :--------: | :--------: | :-------: |
| **Fast Replica Scale Up** | L1+L2 | L1+L2 | L1+L2 |
| **Introducing New Variant** | L1+L2 | L1+L2 | -- |
| **Resource Request Justification** | L1 | L1 | L1 |
| **Maintenance Planning** | L1+L2 | L1+L2 | L1+L2 |
### Scenario Rationale

| FMA Scenario | Why Included | llm-d-benchmark Applicability |
| ---------------------------------- | ------------ | ----------------------------- |
| **Fast Replica Scale Up** | Core FMA value proposition: how fast can the DPC bring replicas online? Directly measures the benefit of sleep/wake and launcher preloading over cold starts. | `fma.sh` scenario with varying `LLMDBENCH_VLLM_COMMON_REPLICAS`; `inference-scheduling` guide with `-t fma` |
| **Introducing New Variant** | Measures actuation latency for a previously unseen model (cold cache, no sleeping instances for this model). Assumes the LPC has already created the needed launchers. Captures the "day 1" deployment experience. | `fma.sh` with `LLMDBENCH_DEPLOY_MODEL_LIST` variations; nop harness for pure actuation timing |
| **Resource Request Justification** | Stress-tests namespace resources across multiple concurrent models/variants to produce data for capacity planning and resource justification to cluster operators. | `fma.sh` with a multi-model list; DoE experiment with replica/model treatments |
| **Maintenance Planning** | Regression baseline: run the same scenarios before and after node maintenance or upgrades. Detects performance regressions in actuation latency. | Any guide scenario as a regression baseline with `-t fma`; compare pre/post results |
### Actuation Path Rationale

| Actuation Path | Why Included |
| -------------- | ------------ |
| **Cold Start** | Baseline without FMA (or FMA milestone 2 without a launcher). Establishes the raw Kubernetes deploy-to-ready latency that all FMA paths should improve upon. |
| **Warm Start** | Measures the launcher's contribution when no sleeping instance is available. The LPC has pre-created launcher pods, and the launcher creates a new vLLM instance with the module preloading benefit. |
| **Hot Start** | Best-case FMA path. The DPC sends `/wake_up` to a sleeping vLLM instance on the correct GPU. Measures sleep-to-wake latency. |
## Integration Strategy

### Current State

FMA is being integrated as a third deploy method (`-t fma`) in [llm-d-benchmark](https://github.com/llm-d/llm-d-benchmark),
alongside `standalone` and `modelservice`. This work is tracked on the
[`fma` branch](https://github.com/manoelmarques/llm-d-benchmark/tree/fma) and includes:
- **`scenarios/examples/fma.sh`** -- Scenario configuration with sleep mode enabled, a model caching PVC, and FMA image references.
- **`setup/steps/07_deploy_fma_models.py`** -- Standup step that deploys InferenceServerConfig, LauncherConfig, LauncherPopulationPolicy, and requester ReplicaSet CRs. Installs FMA CRDs and the `fma-controllers` Helm chart. Waits for dual-pod controller and launcher-populator readiness.
- **`setup/env.sh`** -- 35+ new `LLMDBENCH_FMA_*` environment variables covering chart version, image registry/tags, dual-pod configuration, launcher configuration, and requester resource limits.
- **`setup/run.sh`** -- FMA endpoint discovery via Kubernetes service labels (`stood-up-via=fma`).
- **`setup/teardown.sh`** -- Ordered teardown: FMA custom resources first, then wait for the dual-pods controller to remove finalizers, then uninstall the FMA Helm release.

The existing llm-d-benchmark harnesses (nop, inference-perf, vllm-benchmark) can run after
FMA standup to measure L2 and L3 metrics.
### Next Steps

1. **Upstream the `fma` branch** into llm-d-benchmark, aligning with the declarative
   Python architecture in [PR #848](https://github.com/llm-d/llm-d-benchmark/pull/848).
2. **Add FMA-specific experiment YAML** for Design of Experiments (DoE) treatments:
   replica count, sleep mode on/off, sleeper limit, and model variant combinations.
3. **Add actuation-specific metrics collection** in the nop harness: T_actuation, T_wake,
   and Hit_rate parsed from FMA pod events and DPC logs.
4. **Consider Grafana integration** for visual actuation metrics (scale-up latency
   dashboards), following the pattern in [WVA PR #900](https://github.com/llm-d/llm-d-workload-variant-autoscaler/pull/900).
5. **Maintain a framework-agnostic interface**: the FMA benchmark lifecycle (deploy, measure,
   teardown) should remain pluggable into benchmarking frameworks beyond llm-d-benchmark.
## Legacy Benchmark Tooling

See [benchmark_legacy.md](benchmark_legacy.md) for documentation on the original `benchmark_base.py` tool.