
docs: Update benchmark scenarios matrix and integration strategy #381

Open

rubambiza wants to merge 9 commits into llm-d-incubation:main from rubambiza:docs/update-benchmark-scenarios

Conversation

@rubambiza
Collaborator

@rubambiza rubambiza commented Mar 25, 2026

Summary

  • Add layered measurement model (L1 actuation, L2 inference readiness, L3 steady-state) with metric definitions using precise FMA terminology (requester, launcher, DPC)
  • Populate 4x3 benchmarking matrix: 4 user-story scenarios (Fast Replica Scale Up, Introducing New Variant, Resource Request Justification, Maintenance Planning) x 3 actuation paths (Cold Start, Warm Start, Hot Start)
  • Adopt the team's hot/warm/cold start taxonomy for actuation paths
  • Add scenario rationale and actuation path rationale tables for team discussion
  • Note caching and simulation as orthogonal dimensions rather than matrix columns
  • Document llm-d-benchmark integration strategy via the fma branch and reference WVA PR #900
  • Mention WVA integration in the Purpose section
  • Extract legacy benchmark_base.py docs to benchmark_legacy.md

Test plan

  • Verify markdown renders correctly on GitHub (tables, links, collapsible section)
  • Cross-check llm-d-benchmark references against the fma branch
  • Review matrix cell annotations (L1, L1+L2, P, --) for accuracy

🤖 Generated with Claude Code

Add layered measurement model (L1 actuation, L2 inference readiness,
L3 steady-state), populate the 5x7 benchmarking matrix with layer
annotations, add scenario and actuation condition rationale tables,
document llm-d-benchmark integration via the fma branch, and move
legacy benchmark_base.py docs to a reference section.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Gloire Rubambiza <gloire@ibm.com>
Move benchmark_base.py CLI docs, inputs/outputs tables, and output
examples to benchmark_legacy.md. Replace with a cross-reference link
in benchmark.md to keep the main doc focused on the forward-looking
scenarios matrix and integration strategy.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Gloire Rubambiza <gloire@ibm.com>
@rubambiza rubambiza requested a review from aavarghese March 25, 2026 14:18
Signed-off-by: Gloire Rubambiza <gloire@ibm.com>
Clarify T_actuation, T_wake, T_launcher, T_first_token, and T_e2e
definitions using requester/launcher/server-providing pod terminology
instead of generic "pod readiness" and "server-request submission".

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Gloire Rubambiza <gloire@ibm.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Gloire Rubambiza <gloire@ibm.com>
- **Hit_rate**: Percentage of scale-up events where the DPC binds a requester to an existing sleeping pod on the correct GPU (hit) vs. requiring a cold start (miss).
- **T_launcher**: Time from the launcher receiving a create request to the new vLLM instance reporting healthy. Includes the benefit of vLLM module preloading.
- **T_first_token**: Time from requester pod readiness to first successful inference response received through the server-providing pod's vLLM instance (time-to-first-token, post-actuation).
- **T_e2e**: Total time from requester pod creation to first successful inference response. Spans the full path: requester scheduling, DPC binding, launcher wake/create, vLLM ready, first inference (T_actuation + T_first_token).
Collaborator

@MikeSpreitzer MikeSpreitzer Mar 27, 2026


"launcher wake/create" is not clear about the fact that there are three cases:

  1. wake sleeping vllm instance
  2. create new vllm instance in existing launcher
  3. create launcher and then create new vllm instance in that launcher

All three may also include deletion of sleeping vllm instance(s) to free up GPU memory (but doing it for case 3 is not designed yet, and the design for the first two cases is incomplete).

Collaborator Author


Good point. How about replacing "launcher wake/create" with the explicit path: "requester scheduling, DPC binding, instance wake-up or launcher instance creation, vLLM ready, first inference"? That avoids collapsing the three cases into ambiguous shorthand.

On deletion of sleeping instances: from what I prompted Claude about the current paths in the code, sleeper deletion happens to respect the sleeperLimit during DPC reconciliation (inference-server.go:758-796), not as part of the wake path itself. But it's good to know where the design is headed, and the benchmark should leave room for these future paths.
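(Illustrative aside, not part of the PR diff: the additive relationship among T_actuation, T_first_token, and T_e2e defined above can be sketched directly. The event names below are hypothetical, since the document does not define a concrete timestamp schema.)

```python
# Hypothetical event timestamps (seconds) for one requester pod; the names
# are illustrative only -- the PR does not prescribe an event schema.
events = {
    "requester_created": 100.0,  # ReplicaSet scale-up creates the requester pod
    "requester_ready": 112.5,    # /ready probe passes (DPC bound, vLLM serving)
    "first_token": 113.1,        # first inference response received
}

t_actuation = events["requester_ready"] - events["requester_created"]
t_first_token = events["first_token"] - events["requester_ready"]
t_e2e = events["first_token"] - events["requester_created"]

# T_e2e spans the full path, so it decomposes as T_actuation + T_first_token.
assert abs(t_e2e - (t_actuation + t_first_token)) < 1e-9
```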

- Fix T_wake: DPC sends /wake_up directly, not via launcher
- Change "subset" to "part" for T_wake relation to T_actuation
- Hit_rate: use "fraction of requesters" instead of "scale-up events"
- T_e2e: spell out three satisfaction paths explicitly
- L2: remove "FMA +" from Measured By column
- L3: add T_actuation for WVA integration
- Reframe matrix columns around FMA's three satisfaction paths
- Replace "GPU Hit/Miss" with "Wake Sleeping Instance" and
  "Create Instance (Existing Launcher)"
- Pull "Simulated" out as orthogonal testing mode
- Move actuation path definitions above the matrix table
- Fix bechmark_base typo in benchmark_legacy.md

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Gloire Rubambiza <gloire@ibm.com>
- **T_wake**: Time from the DPC sending `/wake_up` to a sleeping vLLM instance on the server-providing pod to that instance reporting ready to serve. A part of T_actuation when a GPU hit occurs.
- **Hit_rate**: Fraction of requesters that get bound to an existing sleeping pod on the correct GPU (hit) vs. requiring a cold start (miss).
- **T_launcher**: Time from the launcher receiving a create request to the new vLLM instance reporting healthy. Includes the benefit of vLLM module preloading.
- **T_first_token**: Time from requester pod readiness to first successful inference response received through the server-providing pod's vLLM instance (time-to-first-token, post-actuation).
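(Illustrative aside: the Hit_rate definition above reduces to a simple fraction. The outcome labels in this sketch are hypothetical, not an encoding the PR prescribes.)

```python
def hit_rate(outcomes):
    """Fraction of requesters satisfied by a GPU hit.

    `outcomes` holds one label per requester: "hit" when the DPC bound the
    requester to an existing sleeping instance on the correct GPU, anything
    else (e.g. "miss") when a new instance had to be created. The labels
    are illustrative only.
    """
    if not outcomes:
        return 0.0
    return sum(1 for o in outcomes if o == "hit") / len(outcomes)
```

So `hit_rate(["hit", "miss", "hit", "hit"])` reports 0.75.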
Collaborator


Are these inference requests that stream the response back?

Collaborator


Yes.

Collaborator


It would be helpful to explicitly say this somewhere early.

Collaborator Author


Fair point. I've updated the Measurement Layers intro to clarify: "Layer 2 bridges actuation to inference readiness (i.e., latency for inference requests to get responses back in an FMA-enabled context)."

Collaborator

@MikeSpreitzer MikeSpreitzer Apr 2, 2026


Those words still do not distinguish between inference requests that get the whole response back at once vs. inference requests that get the response back as a stream of chunks. The time-to-first-token cannot be measured if the response comes back all at once. If streaming is used then this document must clearly say so, early.

BTW, is time to first chunk the same as time to first token? If not, then isn't a different approach to measuring time-to-first-token required? Or a redefinition of the metric to be time-to-first-chunk?

Collaborator Author


Fair point. I had noted the streaming: true somewhere in the benchmarking code, so I had Claude verify that llm-d-benchmark's inference-perf harness uses streaming responses (the profile configs explicitly set streaming: true), so TTFT measurement is valid. I've updated the L2 description to say "streaming inference requests" and the T_first_token definition to say "first streamed token" with an explicit note that it requires streaming inference requests.

On time-to-first-chunk vs time-to-first-token: in practice with vLLM's streaming API, each SSE chunk contains one token, so they are equivalent. But the distinction is worth being aware of if other serving frameworks batch tokens into chunks.
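(Illustrative aside on the point being discussed: time-to-first-chunk can be measured without committing to a particular client library by timing any chunk iterator. The function name and iterator protocol below are our own, not from the PR or llm-d-benchmark.)

```python
import time

def time_to_first_chunk(chunks, clock=time.monotonic):
    """Seconds from call until the first streamed chunk arrives.

    `chunks` is any iterator yielding response chunks (e.g. SSE events from
    a streaming completion). When the server emits one token per chunk, as
    vLLM's streaming API does in practice, this approximates
    time-to-first-token; a server that batches tokens per chunk makes the
    two metrics diverge.
    """
    start = clock()
    for _ in chunks:  # consume only the first chunk
        return clock() - start
    raise RuntimeError("stream ended before any chunk arrived")
```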

Comment on lines 2 to 3
The **Dual-Pods Benchmarking Tool** measures and reports the startup and readiness
latency of model-serving pods within the LLM-D Fast Model Actuation workflow.
Collaborator


If I understand correctly, WVA is intended to be in this picture. If so then it should be mentioned here.

Collaborator Author


Fixed. The Purpose section now mentions WVA integration.

| ----- | ----- | ------- | ----------- |
| **L1: Actuation** | Requester pod readiness | T_actuation (requester creation to readiness), T_wake (DPC wakes sleeping vLLM instance), Hit_rate (GPU hits), T_launcher (launcher creates new vLLM instance) | llm-d-benchmark new harness |
| **L2: Inference Readiness** | First inference response | T_first_token (requester ready to first inference response), T_e2e (requester creation to first inference response) | llm-d-benchmark nop/inference-perf harness |
| **L3: Steady-State** | Throughput/latency | T_actuation (requester creation to readiness), TPOT (time per output token), throughput, queue depth, KV cache usage, replica stability | llm-d-benchmark / WVA |
Collaborator

@MikeSpreitzer MikeSpreitzer Apr 1, 2026


I find L3 surprising and confusing. The difference from L1 to L2 is not like the difference from L2 to L3. The difference from L1 to L2 is more of the same thing: more latency (picking a later event that stops the clock). I expected L3 to differ in the same way; perhaps start the clock earlier (e.g., when an inference client sends a request). But mostly L3 is about different stuff, not latency. I do not see how WVA plays a role in any of the things listed for L3 (nor L2 nor L1), which is also a surprise. I expected to see WVA involved somewhere.

Collaborator Author


L3 is intentionally different in kind from L1/L2. L1 and L2 are stop-the-clock latency measurements that FMA owns. L3 is about steady-state performance metrics (TPOT, throughput, queue depth, KV cache usage, replica stability) that become relevant when FMA is integrated with WVA and the broader llm-d stack. The inclusion of T_actuation in L3 is the bridge: it lets WVA and llm-d-benchmark consumers see how FMA actuation latency affects their steady-state metrics. The pitch is: look at how your metrics perform with FMA included.

I had Claude double check that these metrics are currently being captured: TPOT, throughput, KV cache usage, and queue depth are in llm-d-benchmark's schema v0.2 and process_metrics.py; replica stability, KV cache, and queue depth are captured via Prometheus in WVA PR #900. Integration should be possible.


| Scenario | Cold Start (Standalone) | Cold Start (Launcher) | Wake Sleeping Instance | Create Instance (Existing Launcher) | Model Swap (Launcher) | Cached Model (PVC) |
| ---------------------------------- | :---------------------: | :-------------------: | :--------------------: | :---------------------------------: | :-------------------: | :-----------------: |
| **Fast Replica Scale Up** | L1 | L1+L2 | L1+L2 | L1+L2 | -- | L1+L2 |
Collaborator


Why does L2 not apply in the "Cold Start (Standalone)" column?

Collaborator Author


Originally excluded because standalone mode (-t standalone) deploys raw vLLM pods without the FMA stack. But we can still send an inference request after a standalone pod is ready and measure TTFT as a comparison baseline. Changed Cold Start to L1+L2.

| **Fast Replica Scale Up** | L1 | L1+L2 | L1+L2 | L1+L2 | -- | L1+L2 |
| **Introducing New Variant** | L1 | L1+L2 | -- | L1+L2 | L1+L2 | L1+L2 |
| **Free Up Cluster Resources** | -- | P | P | -- | P | -- |
| **Resource Request Justification** | L1 | L1 | L1 | L1 | P | L1 |
Collaborator


Why does L1 apply but not L2 in this row?

Collaborator Author


This scenario is about stress-testing resource capacity at scale (many models, many requesters), where the primary concern is whether all requesters come up and how fast. L2 (sending inference requests to each) is possible but adds complexity for a scenario focused on capacity planning. Keeping it at L1 is a judgment call, but we could add L2 in a follow-up if it proves valuable.

Collaborator


This scenario is about stress-testing resource capacity at scale (many models, many requesters), where the primary concern is whether all requesters come up and how fast.

That is not the definition given in this document. If what you said is accurate then please update the definition.

Collaborator Author


Fixed. Updated the scenario description to: "stress-test my namespace's resource capacity at scale (many models, many requesters) to produce data justifying more resource requests."

Signed-off-by: Gloire Rubambiza <gloire@ibm.com>

- **T_actuation**: Time from requester pod creation (ReplicaSet scale-up) to requester pod readiness (`/ready` probe passes), which implies the DPC has bound the requester to a server-providing pod and the vLLM instance is serving.
- **T_wake**: Time from the DPC sending `/wake_up` to a sleeping vLLM instance on the server-providing pod to that instance reporting ready to serve. A part of T_actuation when a GPU hit occurs.
- **Hit_rate**: Fraction of requesters that get bound to an existing sleeping pod on the correct GPU (hit) vs. requiring a cold start (i.e., new vLLM instance in existing launcher pod or new launcher pod + new vLLM instance).
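(Illustrative aside: the T_actuation definition above maps onto fields of the Kubernetes Pod API, namely the creation timestamp and the Ready condition's transition time. The sketch below assumes a plain dict mirroring that shape; a real benchmark would obtain it from a client or watch stream.)

```python
from datetime import datetime

def t_actuation_seconds(pod):
    """T_actuation in seconds: requester pod creation to Ready transition.

    `pod` is a dict shaped like the Kubernetes Pod API object (hypothetical
    fixture here). Returns None if the pod never reported Ready.
    """
    created = datetime.fromisoformat(
        pod["metadata"]["creationTimestamp"].replace("Z", "+00:00"))
    for cond in pod["status"]["conditions"]:
        if cond["type"] == "Ready" and cond["status"] == "True":
            ready = datetime.fromisoformat(
                cond["lastTransitionTime"].replace("Z", "+00:00"))
            return (ready - created).total_seconds()
    return None
```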
Collaborator

@MikeSpreitzer MikeSpreitzer Apr 1, 2026


This is defining the term "cold start" differently than I think we have been using it. I think that we have been using a taxonomy like the following.

  • hot start: wake up an existing sleeping vllm instance
  • warm start: create a new vllm instance in an existing launcher
  • cold start: create a new vllm instance without using the launcher (either using milestone 2 of FMA, or not using FMA)
  • no defined name: create a new launcher and then a vllm instance in that launcher

I think that you can define Hit_rate without worrying about the subtleties of those terms. All you need to say is that Hit_rate is the fraction of server-requesting Pods that are satisfied by waking up an existing sleeping vllm instance.

Collaborator Author


Good taxonomy. I've adopted it: "Wake Sleeping Instance" is now Hot Start, "Create Instance (Existing Launcher)" is now Warm Start, and the non-launcher baseline is Cold Start. Hit_rate now simply says "fraction of server-requesting Pods that get satisfied by waking a sleeping vllm instance."

Collaborator

@MikeSpreitzer MikeSpreitzer Apr 2, 2026


@rubambiza: did you push that change? I do not see it here. I still see "... (hit) vs. requiring a cold start (i.e., new vLLM instance in existing launcher pod or new launcher pod + new vLLM instance)". In rubambiza/llm-d-fast-model-actuation, branch docs/update-benchmark-scenarios points to commit 420cf6b .

Collaborator Author


Thanks for catching it. Oversight between editing in vim and with Claude. Fixed -- Hit_rate now reads: "Fraction of server-requesting Pods that get satisfied by waking a sleeping vLLM instance."


| Scenario | Cold Start (Standalone) | Cold Start (Launcher) | Wake Sleeping Instance | Create Instance (Existing Launcher) | Model Swap (Launcher) | Cached Model (PVC) |
| ---------------------------------- | :---------------------: | :-------------------: | :--------------------: | :---------------------------------: | :-------------------: | :-----------------: |
| **Fast Replica Scale Up** | L1 | L1+L2 | L1+L2 | L1+L2 | -- | L1+L2 |
Collaborator

@MikeSpreitzer MikeSpreitzer Apr 1, 2026


Why is throughput not applicable everywhere?

Collaborator Author


L3 steady-state metrics describe ongoing serving performance under load. They don't all apply to one-shot operations like Cold Start in the Resource Request Justification scenario, where the goal is capacity validation rather than sustained inference.

Collaborator Author


After more thought, I've removed "Free Up Cluster Resources" from the matrix. Deactivating/sleeping variants is a scaling decision that belongs to WVA, not FMA. The matrix is now 4 scenarios x 3 actuation paths. We can revisit when the FMA/WVA integration boundary is clearer.

Collaborator


I see that you envision a scaling test scenario (oddly called "Resource Request Justification") and this one is not envisioned to include running enough requests to measure throughput. But I do not see why the other scenarios would not involve throughput.

Collaborator Author


To be frank, this is a judgment call to start small, especially with the overhauls happening in llm-d benchmarking and WVA. We can expand the surface once the abstractions start to become clearer.

- Adopt hot/warm/cold start taxonomy for actuation paths
- Remove Model Swap and Cached Model columns (orthogonal)
- Remove Free Up Cluster Resources scenario (WVA scope)
- Update Purpose with WVA mention and bullet-pointed paths
- Fix T_wake to use /wake_up request-response time
- Add L2 intro clarification for streaming responses
- Update matrix to 4 scenarios x 3 actuation paths
- Fix Introducing New Variant rationale

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Gloire Rubambiza <gloire@ibm.com>
and server-providing pods) becomes available under different actuation conditions such
as cold starts, wake-ups from a sleeping state, using prewarmed pods, etc. These metrics
will guide future optimizations for the **Dual-Pods Controller (DPC)**. Ultimately, the goal
and server-providing pods), when integrated with the Workload Variant Autoscaler (WVA),
Collaborator

@MikeSpreitzer MikeSpreitzer Apr 2, 2026


I do not think that this is where WVA comes in. The L1, L2, and L3 metrics do not involve WVA. I think that the reasons to talk about WVA are:

  • some Benchmarking Scenario(s) might involve WVA (I am not actually sure about this),
  • we want to demonstrate FMA working with WVA, and
  • we view WVA as pioneering the 2nd generation of benchmarking and think that is where FMA benchmarking should appear (because the 1st generation benchmark framework is unable to conceive of FMA).

Collaborator Author


Good point. I've decoupled WVA from the measurement description itself. The Purpose now describes FMA's three actuation paths without mentioning WVA, then adds a separate paragraph explaining the WVA relationship: some scenarios may involve WVA-triggered scaling, and we want to demonstrate FMA working with WVA as an integrated system.

- Simplify Hit_rate definition per Mike's suggestion
- Add explicit streaming requirement for L2/T_first_token
- Decouple WVA from measurement model in Purpose section
- Update Resource Request Justification description to match
  actual intent (stress-test capacity at scale)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Gloire Rubambiza <gloire@ibm.com>