The Dual-Pods Benchmarking Tool measures and reports the startup and readiness latency of model-serving pods within the LLM-D Fast Model Actuation workflow.
The goal is to quantify and compare how quickly a model-serving duo (server-requesting and server-providing pods) becomes available under different actuation conditions: cold starts, wake-ups from a sleeping state, prewarmed pods, and so on. These metrics will guide future optimizations for the Dual-Pods Controller (DPC). Ultimately, the goal is high predictability, defined as achieving a near-100% hit rate of waking available sleeping pods on cluster GPUs, as a function of total inference server requests, for common user scenarios.
FMA benchmarking uses a layered measurement model. Layer 1 is what FMA benchmarks own directly. Layer 2 bridges actuation to inference readiness. Layer 3 is out of FMA's direct scope but is referenced for completeness and handoff to other frameworks.
| Layer | Focus | Metrics | Measured By |
|---|---|---|---|
| L1: Actuation | Requester pod readiness | T_actuation (requester creation to readiness), T_wake (launcher wakes sleeping vLLM instance), Hit_rate (% GPU hits), T_launcher (launcher creates new vLLM instance) | llm-d-benchmark new harness |
| L2: Inference Readiness | First inference response | T_first_token (requester ready to first inference response), T_e2e (requester creation to first inference response) | FMA + llm-d-benchmark nop/inference-perf harness |
| L3: Steady-State | Throughput/latency | TPOT (time per output token), throughput, queue depth, KV cache usage, replica stability | llm-d-benchmark / WVA |
Metric definitions:
- T_actuation: Time from requester pod creation (ReplicaSet scale-up) to requester pod readiness (the readiness probe passes), which implies the DPC has bound the requester to a server-providing pod and the vLLM instance is serving.
- T_wake: Time from the DPC instructing the launcher to wake a sleeping vLLM instance on the server-providing pod to that instance reporting ready to serve. A subset of T_actuation when a GPU hit occurs.
- Hit_rate: Percentage of scale-up events where the DPC binds a requester to an existing sleeping pod on the correct GPU (hit) vs. requiring a cold start (miss).
- T_launcher: Time from the launcher receiving a create request to the new vLLM instance reporting healthy. Includes the benefit of vLLM module preloading.
- T_first_token: Time from requester pod readiness to first successful inference response received through the server-providing pod's vLLM instance (time-to-first-token, post-actuation).
- T_e2e: Total time from requester pod creation to first successful inference response. Spans the full path: requester scheduling, DPC binding, launcher wake/create, vLLM ready, first inference (T_actuation + T_first_token).
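As a concrete illustration of how these metrics compose, the sketch below derives them from per-event timestamps. The event record layout (field names, units) is a hypothetical schema for illustration, not the benchmark tool's actual data model:

```python
from dataclasses import dataclass

@dataclass
class ActuationEvent:
    """One scale-up event; timestamps in seconds (hypothetical schema)."""
    pod_created: float     # requester pod creation (ReplicaSet scale-up)
    pod_ready: float       # requester readiness probe passes
    first_response: float  # first successful inference response
    gpu_hit: bool          # DPC bound the requester to a sleeping pod on the correct GPU

def t_actuation(e: ActuationEvent) -> float:
    return e.pod_ready - e.pod_created

def t_first_token(e: ActuationEvent) -> float:
    return e.first_response - e.pod_ready

def t_e2e(e: ActuationEvent) -> float:
    # By construction, T_e2e = T_actuation + T_first_token.
    return e.first_response - e.pod_created

def hit_rate(events: list[ActuationEvent]) -> float:
    """Percentage of scale-up events that were GPU hits."""
    return 100.0 * sum(e.gpu_hit for e in events) / len(events)
```

Note that T_wake is omitted here because it is measured between launcher-side events on the server-providing pod, not from the requester-side timestamps above.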
| Scenario | Description |
|---|---|
| Fast Replica Scale Up | As a ModelService Owner, I can scale up the number of active replicas for a variant from 0 to 1 or 1 to N with minimal latency |
| Introducing New Variant | As a ModelService Owner, I can deploy a newly released variant in anticipation of user requests |
| Free Up Cluster Resources | As a Cluster Operator, I can reduce/deactivate resource-intensive variants to make space for numerous smaller model variants |
| Resource Request Justification | As a Workload Owner, I can stress-test my namespace's resources to justify more resource requests (routes, gateways, GPUs) from the cluster operator |
| Maintenance Planning | As a Cluster Operator, I can validate that cluster performance is similar or better after node maintenance schedules and upgrades |
Cell annotations indicate which measurement layers apply:
- L1 -- Layer 1 actuation metrics (T_actuation, T_wake, Hit_rate, T_launcher)
- L1+L2 -- Actuation metrics plus inference readiness (T_first_token, T_e2e)
- P -- Planned but not yet implemented
- `--` -- Not applicable to this combination
| Scenario | Cold Start (Standalone) | Cold Start (Launcher) | Wake from Sleep (GPU Hit) | Wake from Sleep (GPU Miss) | Model Swap (Launcher) | Cached Model (PVC) | Simulated (Mock GPUs) |
|---|---|---|---|---|---|---|---|
| Fast Replica Scale Up | L1 | L1+L2 | L1+L2 | L1 | -- | L1+L2 | L1 |
| Introducing New Variant | L1 | L1+L2 | -- | -- | L1+L2 | L1+L2 | L1 |
| Free Up Cluster Resources | -- | P | P | -- | P | -- | P |
| Resource Request Justification | L1 | L1 | L1 | L1 | P | L1 | L1 |
| Maintenance Planning | L1 | L1+L2 | L1+L2 | L1 | P | L1+L2 | -- |
| FMA Scenario | Why Included | llm-d-benchmark Applicability |
|---|---|---|
| Fast Replica Scale Up | Core FMA value proposition: how fast can the DPC bring replicas online? Directly measures the benefit of sleep/wake and launcher preloading over cold starts. | fma.sh scenario with varying LLMDBENCH_VLLM_COMMON_REPLICAS; inference-scheduling guide with -t fma |
| Introducing New Variant | Tests the full FMA deployment path: InferenceServerConfig + LauncherConfig + LauncherPopulationPolicy creation, followed by requester ReplicaSet. Captures model download, launcher instance creation, and dual-pod binding. | fma.sh with LLMDBENCH_DEPLOY_MODEL_LIST variations; nop harness for pure actuation timing |
| Free Up Cluster Resources | Validates the reverse path: sleeping/deactivating variants. Important for cluster operators managing GPU capacity across tenants. Measures GPU release latency. | FMA-specific: scale down + verify GPU release via Prometheus. Not yet in llm-d-benchmark |
| Resource Request Justification | Stress-tests namespace resources across multiple concurrent models/variants to produce data for capacity planning and resource justification to cluster operators. | fma.sh with multi-model list; DoE experiment with replica/model treatments |
| Maintenance Planning | Regression baseline: run the same scenarios before and after node maintenance or upgrades. Detects performance regressions in actuation latency. | Any guide scenario as regression baseline with -t fma; compare pre/post results |
| Condition | What It Tests | llm-d-benchmark Config |
|---|---|---|
| Cold Start (Standalone) | No launcher, no sleeping pods. Raw Kubernetes deploy-to-ready latency. Serves as the baseline for all other conditions. | -t standalone comparison baseline |
| Cold Start (w/ Launcher) | Launcher is present and pre-loads vLLM Python modules. Measures the launcher's contribution to reducing cold start latency vs. standalone. | -t fma with default LLMDBENCH_FMA_LAUNCHER_* env vars |
| Wake from Sleep (GPU Hit) | A sleeping vLLM instance exists on the correct GPU. Measures the sleep-to-wake latency, which is the best-case actuation path. | -t fma with LLMDBENCH_VLLM_COMMON_ENABLE_SLEEP_MODE=true, LLMDBENCH_FMA_DUAL_POD_SLEEPER_LIMIT>=1 |
| Wake from Sleep (GPU Miss) | A sleeping pod exists but on the wrong GPU or node. Requires cold start despite available sleeping capacity. Measures the miss penalty and validates DPC scheduling decisions. | -t fma, requires multi-node cluster with asymmetric GPU placement |
| Model Swap (Launcher) | The launcher swaps the model on an existing vLLM instance without restarting the process. Tests the launcher's dynamic model management capability. | -t fma, sequential model deployment via 07_deploy_fma_models.py |
| Cached Model (PVC) | Model weights are pre-cached on a PersistentVolumeClaim, eliminating download time. Isolates the non-download portion of actuation latency. | -t fma with LLMDBENCH_VLLM_COMMON_EXTRA_PVC_NAME + LLMDBENCH_VLLM_COMMON_VLLM_CACHE_ROOT (configured in fma.sh) |
| Simulated (Mock GPUs) | Uses llm-d-inference-sim image or launcher GPU mock mode. No real GPUs required. For CI pipelines and scenario prototyping. | simulated-accelerators guide or launcher --mock-mode |
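To make the per-condition configuration concrete, here is a sketch mapping a few of the conditions above to their deploy method and env-var overrides. The env-var names come from the table; the dictionary layout and the PVC name are illustrative assumptions, not the benchmark's real configuration format:

```python
# Map actuation conditions to llm-d-benchmark deploy method and env overrides.
# Env-var names are from the conditions table; values and the PVC name are
# illustrative assumptions only.
CONDITION_CONFIG = {
    "cold_start_standalone": {"deploy_method": "standalone", "env": {}},
    "cold_start_launcher": {"deploy_method": "fma", "env": {}},  # default LLMDBENCH_FMA_LAUNCHER_* values
    "wake_from_sleep_gpu_hit": {
        "deploy_method": "fma",
        "env": {
            "LLMDBENCH_VLLM_COMMON_ENABLE_SLEEP_MODE": "true",
            "LLMDBENCH_FMA_DUAL_POD_SLEEPER_LIMIT": "1",
        },
    },
    "cached_model_pvc": {
        "deploy_method": "fma",
        "env": {
            "LLMDBENCH_VLLM_COMMON_EXTRA_PVC_NAME": "model-cache-pvc",  # hypothetical PVC name
        },
    },
}

def deploy_flag(condition: str) -> list[str]:
    """Return the -t deploy-method flag for a condition."""
    return ["-t", CONDITION_CONFIG[condition]["deploy_method"]]
```

Keeping the condition-to-config mapping in one table like this makes it easy for a DoE driver to sweep conditions without duplicating env-var plumbing.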
FMA is being integrated as a third deploy method (`-t fma`) in llm-d-benchmark, alongside `standalone` and `modelservice`. This work is tracked on the `fma` branch and includes:

- `scenarios/examples/fma.sh` -- Scenario configuration with sleep mode enabled, model caching PVC, and FMA image references.
- `setup/steps/07_deploy_fma_models.py` -- Standup step that deploys InferenceServerConfig, LauncherConfig, LauncherPopulationPolicy, and requester ReplicaSet CRs. Installs FMA CRDs and the `fma-controllers` Helm chart. Waits for dual-pod controller and launcher-populator readiness.
- `setup/env.sh` -- 35+ new `LLMDBENCH_FMA_*` environment variables covering chart version, image registry/tags, dual-pod configuration, launcher configuration, and requester resource limits.
- `setup/run.sh` -- FMA endpoint discovery via Kubernetes service labels (`stood-up-via=fma`).
- `setup/teardown.sh` -- Ordered teardown: FMA custom resources first, then wait for the dual-pods controller to remove finalizers, then uninstall the FMA Helm release.
The existing llm-d-benchmark harnesses (nop, inference-perf, vllm-benchmark) can run after FMA standup to measure L2 and L3 metrics.
- Upstream the `fma` branch into llm-d-benchmark, aligning with the declarative Python architecture in PR #848.
- Add FMA-specific experiment YAML for Design of Experiments (DoE) treatments: replica count, sleep mode on/off, sleeper limit, model variant combinations.
- Add actuation-specific metrics collection in the nop harness: T_actuation, T_wake, Hit_rate parsed from FMA pod events and DPC logs.
- Consider Grafana integration for visual actuation metrics (scale-up latency dashboards), following the pattern in WVA PR #900.
- Maintain framework-agnostic interface: the FMA benchmark lifecycle (deploy, measure, teardown) should remain pluggable into other benchmarking frameworks beyond llm-d-benchmark.
See benchmark_legacy.md for documentation on the original benchmark_base.py tool.