The Dual-Pods Benchmarking Tool measures and reports the startup and readiness latency of model-serving pods within the LLM-D Fast Model Actuation workflow.
The goal is to quantify and compare how quickly a model-serving duo (server-requesting and server-providing pods) becomes available under different actuation conditions: cold starts, wake-ups from a sleeping state, prewarmed pods, and so on. These metrics will guide future optimizations for the Dual-Pods Controller (DPC). Ultimately, the goal is high predictability, defined as achieving a near-100% hit rate of waking available sleeping pods on cluster GPUs, as a function of total inference server requests, for common user scenarios.
FMA benchmarking uses a layered measurement model. Layer 1 is what FMA benchmarks own directly. Layer 2 bridges actuation to inference readiness. Layer 3 is out of FMA's direct scope but is referenced for completeness and handoff to other frameworks.
| Layer | Focus | Metrics | Measured By |
|---|---|---|---|
| L1: Actuation | Requester pod readiness | T_actuation (requester creation to readiness), T_wake (launcher wakes sleeping vLLM instance), Hit_rate (% GPU hits), T_launcher (launcher creates new vLLM instance) | llm-d-benchmark new harness |
| L2: Inference Readiness | First inference response | T_first_token (requester ready to first inference response), T_e2e (requester creation to first inference response) | FMA + llm-d-benchmark nop/inference-perf harness |
| L3: Steady-State | Throughput/latency | TPOT (time per output token), throughput, queue depth, KV cache usage, replica stability | llm-d-benchmark / WVA |
Metric definitions:
- T_actuation: Time from requester pod creation (ReplicaSet scale-up) to requester pod readiness (the readiness probe passes), which implies the DPC has bound the requester to a server-providing pod and the vLLM instance is serving.
- T_wake: Time from the DPC instructing the launcher to wake a sleeping vLLM instance on the server-providing pod to that instance reporting ready to serve. A subset of T_actuation when a GPU hit occurs.
- Hit_rate: Percentage of scale-up events where the DPC binds a requester to an existing sleeping pod on the correct GPU (hit) vs. requiring a cold start (miss).
- T_launcher: Time from the launcher receiving a create request to the new vLLM instance reporting healthy. Includes the benefit of vLLM module preloading.
- T_first_token: Time from requester pod readiness to first successful inference response received through the server-providing pod's vLLM instance (time-to-first-token, post-actuation).
- T_e2e: Total time from requester pod creation to first successful inference response. Spans the full path: requester scheduling, DPC binding, launcher wake/create, vLLM ready, first inference (T_actuation + T_first_token).
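As a concrete illustration of how these metrics compose, the sketch below derives them from per-event timestamps. The event record layout (field names, units) is a hypothetical schema for illustration, not the benchmark tool's actual data model:

```python
from dataclasses import dataclass

@dataclass
class ActuationEvent:
    """One scale-up event; timestamps in seconds (hypothetical schema)."""
    pod_created: float     # requester pod creation (ReplicaSet scale-up)
    pod_ready: float       # requester readiness probe passes
    first_response: float  # first successful inference response
    gpu_hit: bool          # DPC bound the requester to a sleeping pod on the correct GPU

def t_actuation(e: ActuationEvent) -> float:
    return e.pod_ready - e.pod_created

def t_first_token(e: ActuationEvent) -> float:
    return e.first_response - e.pod_ready

def t_e2e(e: ActuationEvent) -> float:
    # By construction, T_e2e = T_actuation + T_first_token.
    return e.first_response - e.pod_created

def hit_rate(events: list[ActuationEvent]) -> float:
    """Percentage of scale-up events that were GPU hits."""
    return 100.0 * sum(e.gpu_hit for e in events) / len(events)
```

Note that T_wake is omitted here because it is measured between launcher-side events on the server-providing pod, not from the requester-side timestamps above.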
| Scenario | Description |
|---|---|
| Fast Replica Scale Up | As a ModelService Owner, I can scale up the number of active replicas for a variant from 0 to 1 or 1 to N with minimal latency |
| Introducing New Variant | As a ModelService Owner, I can deploy a newly released variant in anticipation of user requests |
| Free Up Cluster Resources | As a Cluster Operator, I can reduce/deactivate resource-intensive variants to make space for numerous smaller model variants |
| Resource Request Justification | As a Workload Owner, I can stress-test my namespace's resources to justify more resource requests (routes, gateways, GPUs) from the cluster operator |
| Maintenance Planning | As a Cluster Operator, I can validate that cluster performance is similar or better after node maintenance schedules and upgrades |
Cell annotations indicate which measurement layers apply:
- L1 -- Layer 1 actuation metrics (T_actuation, T_wake, Hit_rate, T_launcher)
- L1+L2 -- Actuation metrics plus inference readiness (T_first_token, T_e2e)
- P -- Planned but not yet implemented
- `--` -- Not applicable to this combination
| Scenario | Cold Start (Standalone) | Cold Start (Launcher) | Wake from Sleep (GPU Hit) | Wake from Sleep (GPU Miss) | Model Swap (Launcher) | Cached Model (PVC) | Simulated (Mock GPUs) |
|---|---|---|---|---|---|---|---|
| Fast Replica Scale Up | L1 | L1+L2 | L1+L2 | L1 | -- | L1+L2 | L1 |
| Introducing New Variant | L1 | L1+L2 | -- | -- | L1+L2 | L1+L2 | L1 |
| Free Up Cluster Resources | -- | P | P | -- | P | -- | P |
| Resource Request Justification | L1 | L1 | L1 | L1 | P | L1 | L1 |
| Maintenance Planning | L1 | L1+L2 | L1+L2 | L1 | P | L1+L2 | -- |
| FMA Scenario | Why Included | llm-d-benchmark Applicability |
|---|---|---|
| Fast Replica Scale Up | Core FMA value proposition: how fast can the DPC bring replicas online? Directly measures the benefit of sleep/wake and launcher preloading over cold starts. | fma.sh scenario with varying LLMDBENCH_VLLM_COMMON_REPLICAS; inference-scheduling guide with -t fma |
| Introducing New Variant | Tests the full FMA deployment path: InferenceServerConfig + LauncherConfig + LauncherPopulationPolicy creation, followed by requester ReplicaSet. Captures model download, launcher instance creation, and dual-pod binding. | fma.sh with LLMDBENCH_DEPLOY_MODEL_LIST variations; nop harness for pure actuation timing |
| Free Up Cluster Resources | Validates the reverse path: sleeping/deactivating variants. Important for cluster operators managing GPU capacity across tenants. Measures GPU release latency. | FMA-specific: scale down + verify GPU release via Prometheus. Not yet in llm-d-benchmark |
| Resource Request Justification | Stress-tests namespace resources across multiple concurrent models/variants to produce data for capacity planning and resource justification to cluster operators. | fma.sh with multi-model list; DoE experiment with replica/model treatments |
| Maintenance Planning | Regression baseline: run the same scenarios before and after node maintenance or upgrades. Detects performance regressions in actuation latency. | Any guide scenario as regression baseline with -t fma; compare pre/post results |
| Condition | What It Tests | llm-d-benchmark Config |
|---|---|---|
| Cold Start (Standalone) | No launcher, no sleeping pods. Raw Kubernetes deploy-to-ready latency. Serves as the baseline for all other conditions. | -t standalone comparison baseline |
| Cold Start (w/ Launcher) | Launcher is present and pre-loads vLLM Python modules. Measures the launcher's contribution to reducing cold start latency vs. standalone. | -t fma with default LLMDBENCH_FMA_LAUNCHER_* env vars |
| Wake from Sleep (GPU Hit) | A sleeping vLLM instance exists on the correct GPU. Measures the sleep-to-wake latency, which is the best-case actuation path. | -t fma with LLMDBENCH_VLLM_COMMON_ENABLE_SLEEP_MODE=true, LLMDBENCH_FMA_DUAL_POD_SLEEPER_LIMIT>=1 |
| Wake from Sleep (GPU Miss) | A sleeping pod exists but on the wrong GPU or node. Requires cold start despite available sleeping capacity. Measures the miss penalty and validates DPC scheduling decisions. | -t fma, requires multi-node cluster with asymmetric GPU placement |
| Model Swap (Launcher) | The launcher swaps the model on an existing vLLM instance without restarting the process. Tests the launcher's dynamic model management capability. | -t fma, sequential model deployment via 07_deploy_fma_models.py |
| Cached Model (PVC) | Model weights are pre-cached on a PersistentVolumeClaim, eliminating download time. Isolates the non-download portion of actuation latency. | -t fma with LLMDBENCH_VLLM_COMMON_EXTRA_PVC_NAME + LLMDBENCH_VLLM_COMMON_VLLM_CACHE_ROOT (configured in fma.sh) |
| Simulated (Mock GPUs) | Uses llm-d-inference-sim image or launcher GPU mock mode. No real GPUs required. For CI pipelines and scenario prototyping. | simulated-accelerators guide or launcher --mock-mode |
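To make the per-condition configuration concrete, here is a sketch mapping a few of the conditions above to their deploy method and env-var overrides. The env-var names come from the table; the dictionary layout and the PVC name are illustrative assumptions, not the benchmark's real configuration format:

```python
# Map actuation conditions to llm-d-benchmark deploy method and env overrides.
# Env-var names are from the conditions table; values and the PVC name are
# illustrative assumptions only.
CONDITION_CONFIG = {
    "cold_start_standalone": {"deploy_method": "standalone", "env": {}},
    "cold_start_launcher": {"deploy_method": "fma", "env": {}},  # default LLMDBENCH_FMA_LAUNCHER_* values
    "wake_from_sleep_gpu_hit": {
        "deploy_method": "fma",
        "env": {
            "LLMDBENCH_VLLM_COMMON_ENABLE_SLEEP_MODE": "true",
            "LLMDBENCH_FMA_DUAL_POD_SLEEPER_LIMIT": "1",
        },
    },
    "cached_model_pvc": {
        "deploy_method": "fma",
        "env": {
            "LLMDBENCH_VLLM_COMMON_EXTRA_PVC_NAME": "model-cache-pvc",  # hypothetical PVC name
        },
    },
}

def deploy_flag(condition: str) -> list[str]:
    """Return the -t deploy-method flag for a condition."""
    return ["-t", CONDITION_CONFIG[condition]["deploy_method"]]
```

Keeping the condition-to-config mapping in one table like this makes it easy for a DoE driver to sweep conditions without duplicating env-var plumbing.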
FMA is being integrated as a third deploy method (`-t fma`) in llm-d-benchmark, alongside `standalone` and `modelservice`. This work is tracked on the `fma` branch and includes:

- `scenarios/examples/fma.sh` -- Scenario configuration with sleep mode enabled, model caching PVC, and FMA image references.
- `setup/steps/07_deploy_fma_models.py` -- Standup step that deploys InferenceServerConfig, LauncherConfig, LauncherPopulationPolicy, and requester ReplicaSet CRs. Installs FMA CRDs and the `fma-controllers` Helm chart. Waits for dual-pod controller and launcher-populator readiness.
- `setup/env.sh` -- 35+ new `LLMDBENCH_FMA_*` environment variables covering chart version, image registry/tags, dual-pod configuration, launcher configuration, and requester resource limits.
- `setup/run.sh` -- FMA endpoint discovery via Kubernetes service labels (`stood-up-via=fma`).
- `setup/teardown.sh` -- Ordered teardown: FMA custom resources first, then wait for the dual-pods controller to remove finalizers, then uninstall the FMA Helm release.
The existing llm-d-benchmark harnesses (nop, inference-perf, vllm-benchmark) can run after FMA standup to measure L2 and L3 metrics.
- Upstream the `fma` branch into llm-d-benchmark, aligning with the declarative Python architecture in PR #848.
- Add FMA-specific experiment YAML for Design of Experiments (DoE) treatments: replica count, sleep mode on/off, sleeper limit, model variant combinations.
- Add actuation-specific metrics collection in the nop harness: T_actuation, T_wake, Hit_rate parsed from FMA pod events and DPC logs.
- Consider Grafana integration for visual actuation metrics (scale-up latency dashboards), following the pattern in WVA PR #900.
- Maintain framework-agnostic interface: the FMA benchmark lifecycle (deploy, measure, teardown) should remain pluggable into other benchmarking frameworks beyond llm-d-benchmark.
See benchmark_legacy.md for documentation on the original benchmark_base.py tool.