
docs: Update benchmark scenarios matrix and integration strategy #381

Open

rubambiza wants to merge 9 commits into llm-d-incubation:main from rubambiza:docs/update-benchmark-scenarios

Conversation

@rubambiza
Collaborator

@rubambiza rubambiza commented Mar 25, 2026

Summary

  • Add layered measurement model (L1 actuation, L2 inference readiness, L3 steady-state) with metric definitions using precise FMA terminology (requester, launcher, DPC)
  • Populate 4x3 benchmarking matrix: 4 user-story scenarios (Fast Replica Scale Up, Introducing New Variant, Resource Request Justification, Maintenance Planning) x 3 actuation paths (Cold Start, Warm Start, Hot Start)
  • Adopt the team's hot/warm/cold start taxonomy for actuation paths
  • Add scenario rationale and actuation path rationale tables for team discussion
  • Note caching and simulation as orthogonal dimensions rather than matrix columns
  • Document llm-d-benchmark integration strategy via the fma branch and reference WVA PR #900
  • Mention WVA integration in the Purpose section
  • Extract legacy benchmark_base.py docs to benchmark_legacy.md

Test plan

  • Verify markdown renders correctly on GitHub (tables, links, collapsible section)
  • Cross-check llm-d-benchmark references against the fma branch
  • Review matrix cell annotations (L1, L1+L2, P, --) for accuracy

🤖 Generated with Claude Code

Add layered measurement model (L1 actuation, L2 inference readiness,
L3 steady-state), populate the 5x7 benchmarking matrix with layer
annotations, add scenario and actuation condition rationale tables,
document llm-d-benchmark integration via the fma branch, and move
legacy benchmark_base.py docs to a reference section.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Gloire Rubambiza <gloire@ibm.com>
Move benchmark_base.py CLI docs, inputs/outputs tables, and output
examples to benchmark_legacy.md. Replace with a cross-reference link
in benchmark.md to keep the main doc focused on the forward-looking
scenarios matrix and integration strategy.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Gloire Rubambiza <gloire@ibm.com>
@rubambiza rubambiza requested a review from aavarghese March 25, 2026 14:18
Signed-off-by: Gloire Rubambiza <gloire@ibm.com>
Clarify T_actuation, T_wake, T_launcher, T_first_token, and T_e2e
definitions using requester/launcher/server-providing pod terminology
instead of generic "pod readiness" and "server-request submission".

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Gloire Rubambiza <gloire@ibm.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Gloire Rubambiza <gloire@ibm.com>
- **Hit_rate**: Percentage of scale-up events where the DPC binds a requester to an existing sleeping pod on the correct GPU (hit) vs. requiring a cold start (miss).
- **T_launcher**: Time from the launcher receiving a create request to the new vLLM instance reporting healthy. Includes the benefit of vLLM module preloading.
- **T_first_token**: Time from requester pod readiness to first successful inference response received through the server-providing pod's vLLM instance (time-to-first-token, post-actuation).
- **T_e2e**: Total time from requester pod creation to first successful inference response. Spans the full path: requester scheduling, DPC binding, launcher wake/create, vLLM ready, first inference (T_actuation + T_first_token).
Collaborator

@MikeSpreitzer MikeSpreitzer Mar 27, 2026


"launcher wake/create" is not clear about the fact that there are three cases:

  1. wake sleeping vllm instance
  2. create new vllm instance in existing launcher
  3. create launcher and then create new vllm instance in that launcher

All three may also include deletion of sleeping vllm instance(s) to free up GPU memory (but doing it for case 3 is not designed yet, and the design for the first two cases is incomplete).

Collaborator Author


Good point. How about replacing "launcher wake/create" with the explicit path: "requester scheduling, DPC binding, instance wake-up or launcher instance creation, vLLM ready, first inference"? That avoids collapsing the three cases into ambiguous shorthand.

On deletion of sleeping instances: from what I prompted Claude about the current paths in the code, sleeper deletion happens to respect the sleeperLimit during DPC reconciliation (inference-server.go:758-796), not as part of the wake path itself. But it's good to know where the design is headed, and the benchmark should leave room for these future paths.
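(Illustrative aside, not part of the PR diff: the additive relationship among T_actuation, T_first_token, and T_e2e defined above can be sketched directly. The event names below are hypothetical, since the document does not define a concrete timestamp schema.)

```python
# Hypothetical event timestamps (seconds) for one requester pod; the names
# are illustrative only -- the PR does not prescribe an event schema.
events = {
    "requester_created": 100.0,  # ReplicaSet scale-up creates the requester pod
    "requester_ready": 112.5,    # /ready probe passes (DPC bound, vLLM serving)
    "first_token": 113.1,        # first inference response received
}

t_actuation = events["requester_ready"] - events["requester_created"]
t_first_token = events["first_token"] - events["requester_ready"]
t_e2e = events["first_token"] - events["requester_created"]

# T_e2e spans the full path, so it decomposes as T_actuation + T_first_token.
assert abs(t_e2e - (t_actuation + t_first_token)) < 1e-9
```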

- Fix T_wake: DPC sends /wake_up directly, not via launcher
- Change "subset" to "part" for T_wake relation to T_actuation
- Hit_rate: use "fraction of requesters" instead of "scale-up events"
- T_e2e: spell out three satisfaction paths explicitly
- L2: remove "FMA +" from Measured By column
- L3: add T_actuation for WVA integration
- Reframe matrix columns around FMA's three satisfaction paths
- Replace "GPU Hit/Miss" with "Wake Sleeping Instance" and
  "Create Instance (Existing Launcher)"
- Pull "Simulated" out as orthogonal testing mode
- Move actuation path definitions above the matrix table
- Fix bechmark_base typo in benchmark_legacy.md

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Gloire Rubambiza <gloire@ibm.com>
- **T_wake**: Time from the DPC sending `/wake_up` to a sleeping vLLM instance on the server-providing pod to that instance reporting ready to serve. A part of T_actuation when a GPU hit occurs.
- **Hit_rate**: Fraction of requesters that get bound to an existing sleeping pod on the correct GPU (hit) vs. requiring a cold start (miss).
- **T_launcher**: Time from the launcher receiving a create request to the new vLLM instance reporting healthy. Includes the benefit of vLLM module preloading.
- **T_first_token**: Time from requester pod readiness to first successful inference response received through the server-providing pod's vLLM instance (time-to-first-token, post-actuation).
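(Illustrative aside: the Hit_rate definition above reduces to a simple fraction. The outcome labels in this sketch are hypothetical, not an encoding the PR prescribes.)

```python
def hit_rate(outcomes):
    """Fraction of requesters satisfied by a GPU hit.

    `outcomes` holds one label per requester: "hit" when the DPC bound the
    requester to an existing sleeping instance on the correct GPU, anything
    else (e.g. "miss") when a new instance had to be created. The labels
    are illustrative only.
    """
    if not outcomes:
        return 0.0
    return sum(1 for o in outcomes if o == "hit") / len(outcomes)
```

So `hit_rate(["hit", "miss", "hit", "hit"])` reports 0.75.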
Collaborator


Are these inference requests that stream the response back?

Collaborator


Yes.

Collaborator


It would be helpful to explicitly say this somewhere early.

Collaborator Author


Fair point. I've updated the Measurement Layers intro to clarify: "Layer 2 bridges actuation to inference readiness (i.e., latency for inference requests to get responses back in an FMA-enabled context)."

Collaborator

@MikeSpreitzer MikeSpreitzer Apr 2, 2026


Those words still do not distinguish between inference requests that get the whole response back at once vs. inference requests that get the response back as a stream of chunks. The time-to-first-token cannot be measured if the response comes back all at once. If streaming is used then this document must clearly say so, early.

BTW, is time to first chunk the same as time to first token? If not, then isn't a different approach to measuring time-to-first-token required? Or a redefinition of the metric to be time-to-first-chunk?

Collaborator Author


Fair point. I had noted the streaming: true somewhere in the benchmarking code, so I had Claude verify that llm-d-benchmark's inference-perf harness uses streaming responses (the profile configs explicitly set streaming: true), so TTFT measurement is valid. I've updated the L2 description to say "streaming inference requests" and the T_first_token definition to say "first streamed token" with an explicit note that it requires streaming inference requests.

On time-to-first-chunk vs time-to-first-token: in practice with vLLM's streaming API, each SSE chunk contains one token, so they are equivalent. But the distinction is worth being aware of if other serving frameworks batch tokens into chunks.
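(Illustrative aside on the point being discussed: time-to-first-chunk can be measured without committing to a particular client library by timing any chunk iterator. The function name and iterator protocol below are our own, not from the PR or llm-d-benchmark.)

```python
import time

def time_to_first_chunk(chunks, clock=time.monotonic):
    """Seconds from call until the first streamed chunk arrives.

    `chunks` is any iterator yielding response chunks (e.g. SSE events from
    a streaming completion). When the server emits one token per chunk, as
    vLLM's streaming API does in practice, this approximates
    time-to-first-token; a server that batches tokens per chunk makes the
    two metrics diverge.
    """
    start = clock()
    for _ in chunks:  # consume only the first chunk
        return clock() - start
    raise RuntimeError("stream ended before any chunk arrived")
```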

Comment on lines 2 to 3
The **Dual-Pods Benchmarking Tool** measures and reports the startup and readiness
latency of model-serving pods within the LLM-D Fast Model Actuation workflow.
Collaborator


If I understand correctly, WVA is intended to be in this picture. If so then it should be mentioned here.

Collaborator Author


Fixed. The Purpose section now mentions WVA integration.

| ----- | ----- | ------- | ----------- |
| **L1: Actuation** | Requester pod readiness | T_actuation (requester creation to readiness), T_wake (DPC wakes sleeping vLLM instance), Hit_rate (GPU hits), T_launcher (launcher creates new vLLM instance) | llm-d-benchmark new harness |
| **L2: Inference Readiness** | First inference response | T_first_token (requester ready to first inference response), T_e2e (requester creation to first inference response) | llm-d-benchmark nop/inference-perf harness |
| **L3: Steady-State** | Throughput/latency | T_actuation (requester creation to readiness), TPOT (time per output token), throughput, queue depth, KV cache usage, replica stability | llm-d-benchmark / WVA |
Collaborator

@MikeSpreitzer MikeSpreitzer Apr 1, 2026


I find L3 surprising and confusing. The difference from L1 to L2 is not like the difference from L2 to L3. The difference from L1 to L2 is more of the same thing: more latency (picking a later event that stops the clock). I expected L3 to differ in the same way; perhaps start the clock earlier (e.g., when an inference client sends a request). But mostly L3 is about different stuff, not latency. I do not see how WVA plays a role in any of the things listed for L3 (nor L2 nor L1), which is also a surprise. I expected to see WVA involved somewhere.

Collaborator Author


L3 is intentionally different in kind from L1/L2. L1 and L2 are stop-the-clock latency measurements that FMA owns. L3 is about steady-state performance metrics (TPOT, throughput, queue depth, KV cache usage, replica stability) that become relevant when FMA is integrated with WVA and the broader llm-d stack. The inclusion of T_actuation in L3 is the bridge: it lets WVA and llm-d-benchmark consumers see how FMA actuation latency affects their steady-state metrics. The pitch is: look at how your metrics perform with FMA included.

I had Claude double check that these metrics are currently being captured: TPOT, throughput, KV cache usage, and queue depth are in llm-d-benchmark's schema v0.2 and process_metrics.py; replica stability, KV cache, and queue depth are captured via Prometheus in WVA PR #900. Integration should be possible.


| Scenario | Cold Start (Standalone) | Cold Start (Launcher) | Wake Sleeping Instance | Create Instance (Existing Launcher) | Model Swap (Launcher) | Cached Model (PVC) |
| ---------------------------------- | :---------------------: | :-------------------: | :--------------------: | :---------------------------------: | :-------------------: | :-----------------: |
| **Fast Replica Scale Up** | L1 | L1+L2 | L1+L2 | L1+L2 | -- | L1+L2 |
Collaborator


Why does L2 not apply in the "Cold Start (Standalone)" column?

Collaborator Author


Originally excluded because standalone mode (-t standalone) deploys raw vLLM pods without the FMA stack. But we can still send an inference request after a standalone pod is ready and measure TTFT as a comparison baseline. Changed Cold Start to L1+L2.

| **Fast Replica Scale Up** | L1 | L1+L2 | L1+L2 | L1+L2 | -- | L1+L2 |
| **Introducing New Variant** | L1 | L1+L2 | -- | L1+L2 | L1+L2 | L1+L2 |
| **Free Up Cluster Resources** | -- | P | P | -- | P | -- |
| **Resource Request Justification** | L1 | L1 | L1 | L1 | P | L1 |
Collaborator


Why does L1 apply but not L2 in this row?

Collaborator Author


This scenario is about stress-testing resource capacity at scale (many models, many requesters), where the primary concern is whether all requesters come up and how fast. L2 (sending inference requests to each) is possible but adds complexity for a scenario focused on capacity planning. Keeping it at L1 is a judgment call, but we could add L2 in a follow-up if it proves valuable.

Collaborator


This scenario is about stress-testing resource capacity at scale (many models, many requesters), where the primary concern is whether all requesters come up and how fast.

That is not the definition given in this document. If what you said is accurate then please update the definition.

Collaborator Author


Fixed. Updated the scenario description to: "stress-test my namespace's resource capacity at scale (many models, many requesters) to produce data justifying more resource requests."

Signed-off-by: Gloire Rubambiza <gloire@ibm.com>

- **T_actuation**: Time from requester pod creation (ReplicaSet scale-up) to requester pod readiness (`/ready` probe passes), which implies the DPC has bound the requester to a server-providing pod and the vLLM instance is serving.
- **T_wake**: Time from the DPC sending `/wake_up` to a sleeping vLLM instance on the server-providing pod to that instance reporting ready to serve. A part of T_actuation when a GPU hit occurs.
- **Hit_rate**: Fraction of requesters that get bound to an existing sleeping pod on the correct GPU (hit) vs. requiring a cold start (i.e., new vLLM instance in existing launcher pod or new launcher pod + new vLLM instance).
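(Illustrative aside: the T_actuation definition above maps onto fields of the Kubernetes Pod API, namely the creation timestamp and the Ready condition's transition time. The sketch below assumes a plain dict mirroring that shape; a real benchmark would obtain it from a client or watch stream.)

```python
from datetime import datetime

def t_actuation_seconds(pod):
    """T_actuation in seconds: requester pod creation to Ready transition.

    `pod` is a dict shaped like the Kubernetes Pod API object (hypothetical
    fixture here). Returns None if the pod never reported Ready.
    """
    created = datetime.fromisoformat(
        pod["metadata"]["creationTimestamp"].replace("Z", "+00:00"))
    for cond in pod["status"]["conditions"]:
        if cond["type"] == "Ready" and cond["status"] == "True":
            ready = datetime.fromisoformat(
                cond["lastTransitionTime"].replace("Z", "+00:00"))
            return (ready - created).total_seconds()
    return None
```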
Collaborator

@MikeSpreitzer MikeSpreitzer Apr 1, 2026


This is defining the term "cold start" differently than I think we have been using it. I think that we have been using a taxonomy like the following.

  • hot start: wake up an existing sleeping vllm instance
  • warm start: create a new vllm instance in an existing launcher
  • cold start: create a new vllm instance without using the launcher (either using milestone 2 of FMA, or not using FMA)
  • no defined name: create a new launcher and then a vllm instance in that launcher

I think that you can define Hit_rate without worrying about the subtleties of those terms. All you need to say is that Hit_rate is the fraction of server-requesting Pods that are satisfied by waking up an existing sleeping vllm instance.

Collaborator Author


Good taxonomy. I've adopted it: "Wake Sleeping Instance" is now Hot Start, "Create Instance (Existing Launcher)" is now Warm Start, and the non-launcher baseline is Cold Start. Hit_rate now simply says "fraction of server-requesting Pods that get satisfied by waking a sleeping vllm instance."

Collaborator

@MikeSpreitzer MikeSpreitzer Apr 2, 2026


@rubambiza: did you push that change? I do not see it here. I still see "... (hit) vs. requiring a cold start (i.e., new vLLM instance in existing launcher pod or new launcher pod + new vLLM instance)". In rubambiza/llm-d-fast-model-actuation, branch docs/update-benchmark-scenarios points to commit 420cf6b .

Collaborator Author


Thanks for catching it. Oversight between editing in vim and with Claude. Fixed -- Hit_rate now reads: "Fraction of server-requesting Pods that get satisfied by waking a sleeping vLLM instance."


| Scenario | Cold Start (Standalone) | Cold Start (Launcher) | Wake Sleeping Instance | Create Instance (Existing Launcher) | Model Swap (Launcher) | Cached Model (PVC) |
| ---------------------------------- | :---------------------: | :-------------------: | :--------------------: | :---------------------------------: | :-------------------: | :-----------------: |
| **Fast Replica Scale Up** | L1 | L1+L2 | L1+L2 | L1+L2 | -- | L1+L2 |
Collaborator

@MikeSpreitzer MikeSpreitzer Apr 1, 2026


Why is throughput not applicable everywhere?

Collaborator Author


L3 steady-state metrics describe ongoing serving performance under load. They don't all apply to one-shot operations like Cold Start in the Resource Request Justification scenario, where the goal is capacity validation rather than sustained inference.

Collaborator Author


After more thought, I've removed "Free Up Cluster Resources" from the matrix. Deactivating/sleeping variants is a scaling decision that belongs to WVA, not FMA. The matrix is now 4 scenarios x 3 actuation paths. We can revisit when the FMA/WVA integration boundary is clearer.

Collaborator


I see that you envision a scaling test scenario (oddly called "Resource Request Justification") and this one is not envisioned to include running enough requests to measure throughput. But I do not see why the other scenarios would not involve throughput.

Collaborator Author


To be frank, this is a judgment call to start small, especially with the overhauls happening in llm-d benchmarking and WVA. We can expand the surface once the abstractions start to become clearer.

- Adopt hot/warm/cold start taxonomy for actuation paths
- Remove Model Swap and Cached Model columns (orthogonal)
- Remove Free Up Cluster Resources scenario (WVA scope)
- Update Purpose with WVA mention and bullet-pointed paths
- Fix T_wake to use /wake_up request-response time
- Add L2 intro clarification for streaming responses
- Update matrix to 4 scenarios x 3 actuation paths
- Fix Introducing New Variant rationale

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Gloire Rubambiza <gloire@ibm.com>
and server-providing pods) becomes available under different actuation conditions such
as cold starts, wake-ups from a sleeping state, using prewarmed pods, etc. These metrics
will guide future optimizations for the **Dual-Pods Controller (DPC)**. Ultimately, the goal
and server-providing pods), when integrated with the Workload Variant Autoscaler (WVA),
Collaborator

@MikeSpreitzer MikeSpreitzer Apr 2, 2026


I do not think that this is where WVA comes in. The L1, L2, and L3 metrics do not involve WVA. I think that the reasons to talk about WVA are:

  • some Benchmarking Scenario(s) might involve WVA (I am not actually sure about this),
  • we want to demonstrate FMA working with WVA, and
  • we view WVA as pioneering the 2nd generation of benchmarking and think that is where FMA benchmarking should appear (because the 1st generation benchmark framework is unable to conceive of FMA).

Collaborator Author


Good point. I've decoupled WVA from the measurement description itself. The Purpose now describes FMA's three actuation paths without mentioning WVA, then adds a separate paragraph explaining the WVA relationship: some scenarios may involve WVA-triggered scaling, and we want to demonstrate FMA working with WVA as an integrated system.

- Simplify Hit_rate definition per Mike's suggestion
- Add explicit streaming requirement for L2/T_first_token
- Decouple WVA from measurement model in Purpose section
- Update Resource Request Justification description to match
  actual intent (stress-test capacity at scale)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Gloire Rubambiza <gloire@ibm.com>