# docs: Update benchmark scenarios matrix and integration strategy #381
…is *high predictability*, which is defined as achieving close to 100% hit rate on available, sleeping pods on cluster GPUs as a function of total inference server requests for common user scenarios.

## Baseline Startup Latency

**Objective:**
Measure the time from **deployment (server-request submission)** to **dual-pod readiness**.
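The deployment-to-readiness interval can be measured with a simple polling loop. A minimal sketch of the idea, where `is_ready` is a stand-in for the real readiness check (e.g. a wrapper around the pod's `/ready` probe) rather than part of the benchmark's actual API:

```python
import time


def measure_startup(is_ready, poll_interval=0.5, timeout=600.0):
    """Poll `is_ready()` until it returns True and return the elapsed seconds.

    `is_ready` is any zero-argument callable (e.g. a Kubernetes pod
    readiness check). Raises TimeoutError if readiness never occurs.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if is_ready():
            return time.monotonic() - start
        time.sleep(poll_interval)
    raise TimeoutError("pod did not become ready in time")
```

The returned elapsed time corresponds to the `startup_time` output described below.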
### Inputs

| Parameter | Type | Required | Default | Description |
| ------------------ | ------ | -------- | --------------------------------------- | ------------------------------------------------------------------------------------------------------ |
| `--namespace` | `str` | **Yes** | — | OpenShift namespace in which to run the benchmark |
| `--yaml` | `str` | **Yes** | — | Path to the server-requesting YAML template file |
| `--image` | `str` | **Yes**\* | — | Image repository for the requester pod. Required *only if* the `CONTAINER_IMG_REG` env var is **not** set |
| `--tag` | `str` | **Yes**\* | — | Image tag for the requester pod. Required *only if* the `CONTAINER_IMG_VERSION` env var is **not** set |
| `--cleanup` | `bool` | No | `True` | Whether to clean up created resources after the benchmark |
| `--iterations` | `int` | No | `1` | Number of times to run each benchmark scenario |
| `--cluster-domain` | `str` | No | `fmaas-platform-eval.fmaas.res.ibm.com` | Cluster domain for the Prometheus GPU metrics query |
| `--model-path` | `str` | No | `None` | Path to a JSON file containing models for the scenario (used only in the `new_variant` scenario) |
| `--scenario` | `str` | No | `"scaling"` | Benchmark scenario to run: `baseline`, `scaling`, or `new_variant` |
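The flags above map naturally onto `argparse`. A hedged sketch of such a parser (defaults taken from the table; the required-unless-env-var behavior of `--image`/`--tag` is made explicit; this is an illustration, not the benchmark's actual parser):

```python
import argparse
import os


def build_parser(env=os.environ):
    """Build an argument parser mirroring the inputs table (sketch)."""
    p = argparse.ArgumentParser(description="FMA benchmark runner (sketch)")
    p.add_argument("--namespace", required=True, help="OpenShift namespace")
    p.add_argument("--yaml", required=True, help="server-requesting YAML template")
    # --image/--tag are required only when the corresponding env vars are unset
    p.add_argument("--image", required="CONTAINER_IMG_REG" not in env,
                   default=env.get("CONTAINER_IMG_REG"))
    p.add_argument("--tag", required="CONTAINER_IMG_VERSION" not in env,
                   default=env.get("CONTAINER_IMG_VERSION"))
    p.add_argument("--cleanup", type=lambda s: s.lower() != "false", default=True)
    p.add_argument("--iterations", type=int, default=1)
    p.add_argument("--cluster-domain",
                   default="fmaas-platform-eval.fmaas.res.ibm.com")
    p.add_argument("--model-path", default=None)
    p.add_argument("--scenario",
                   choices=["baseline", "scaling", "new_variant"],
                   default="scaling")
    return p
```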
### Outputs

| Output | Description |
| ---------------------- | -------------------------------------------------------------------------- |
| `startup_time` | Total time from deployment to readiness |
| `availability_mode` | Indicates whether the vLLM instance was started cold or resumed from sleep |
**Example Usage**
```bash
python3 inference_server/benchmark/benchmark_base.py \
  --namespace <str> --yaml <str> --image <str> --tag <str> \
  --cleanup <bool, default: True> --iterations <int, default: 1> \
  --cluster-domain <str> --model-path <str> --scenario <str, default: scaling>
```
**Output Example (Subject to Change)**

```
2025-12-01 13:59:52,031 - INFO - scale-request-3-1764615426-4pztx-dual-lhv7s:scale-request-3-1764615426-v9jkh bound through a HIT.
2025-12-01 13:59:52,053 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 13:59:52,496 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 13:59:53,930 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 13:59:53,962 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 13:59:53,972 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 13:59:55,900 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 14:00:03,738 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 14:00:33,850 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 14:01:03,904 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 14:01:22,404 - INFO - Checking readiness of Requester Pod:scale-request-3-1764615426-ktcqd
2025-12-01 14:01:22,405 - INFO - Requester Pod:scale-request-3-1764615426-ktcqd ready after 108s on node:fmaas-vllm-d-wv25b-worker-h100-3-hmtwn using GPU:GPU-ca8ae222-50e0-b69e-16f2-e49dac1afe28
2025-12-01 14:01:22,405 - INFO - scale-request-3-1764615426-ktcqd-dual-pptzq:scale-request-3-1764615426-ktcqd bound through a COLD START.
2025-12-01 14:01:22,405 - INFO - ✅ All pods {'scale-request-3-1764615426-hvxjg', 'scale-request-3-1764615426-v9jkh', 'scale-request-3-1764615426-ktcqd'} Ready after 108.97s
replicaset.apps "scale-request-3-1764615426" deleted
pod "scale-request-3-1764615426-9hlb2-dual-dgcg2" deleted
pod "scale-request-3-1764615426-hvxjg-dual-59hc8" deleted
pod "scale-request-3-1764615426-4pztx-dual-lhv7s" deleted
pod "scale-request-3-1764615426-ktcqd-dual-pptzq" deleted
2025-12-01 14:01:32,868 - INFO - ---------------------------------------------------------------------

Total Runs: 15
Successful Runs: 15
Failed Runs: 0
Requester Pods
  Min: 9s,
  Max: 318s
  Average: 125.4s
  Median: 115s
Hits: 3/6 (50%)
Hit Wake-up Times
  Min: 9s,
  Max: 18s
  Average: 13.0s
```
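The hit/cold-start counts and readiness-time statistics in the summary can be recovered from the log lines themselves (`bound through a HIT.` / `bound through a COLD START.` and `ready after <N>s`). A small illustrative sketch; the regexes are assumptions based on the example output above, not part of the benchmark code:

```python
import re
from statistics import mean, median

# Patterns matching the binding and readiness log lines shown above
BOUND_RE = re.compile(r"bound through a (HIT|COLD START)\.")
READY_RE = re.compile(r"ready after (\d+)s")


def summarize(log_lines):
    """Derive hit counts and readiness-time statistics from benchmark logs."""
    modes = [m.group(1) for line in log_lines if (m := BOUND_RE.search(line))]
    times = [int(m.group(1)) for line in log_lines if (m := READY_RE.search(line))]
    stats = {"hits": modes.count("HIT"), "events": len(modes)}
    if times:
        stats.update(min=min(times), max=max(times),
                     avg=round(mean(times), 1), median=median(times))
    return stats
```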
## Benchmarking Scenarios (WIP)

| Scenario | Description |
| ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Fast Replica Scale Up** | As a ModelService Owner, I can scale up the number of active replicas for a variant from 0 to 1 or from 1 to 2 with minimal latency |
| **Introducing New Variant** | As a ModelService Owner, I can deploy a newly released variant in anticipation of user requests |
| **Free Up Cluster Resources** | As a Cluster Operator, I can reduce/deactivate resource-intensive variants to make space for numerous smaller model variants |

## Measurement Layers
FMA benchmarking uses a layered measurement model. Layer 1 is what FMA benchmarks own
directly. Layer 2 bridges actuation to inference readiness. Layer 3 is out of FMA's
direct scope but is referenced for completeness and handoff to other frameworks.

| Layer | Focus | Metrics | Measured By |
| ----- | ----- | ------- | ----------- |
| **L1: Actuation** | Requester pod readiness | T_actuation (requester creation to readiness), T_wake (launcher wakes a sleeping vLLM instance), Hit_rate (% GPU hits), T_launcher (launcher creates a new vLLM instance) | llm-d-benchmark new harness |
| **L2: Inference Readiness** | First inference response | T_first_token (requester ready to first inference response), T_e2e (requester creation to first inference response) | FMA + llm-d-benchmark nop/inference-perf harness |
| **L3: Steady-State** | Throughput/latency | TPOT (time per output token), throughput, queue depth, KV cache usage, replica stability | llm-d-benchmark / WVA |
**Metric definitions:**
- **T_actuation**: Time from requester pod creation (ReplicaSet scale-up) to requester pod readiness (`/ready` probe passes), which implies the DPC has bound the requester to a server-providing pod and the vLLM instance is serving.
- **T_wake**: Time from the DPC instructing the launcher to wake a sleeping vLLM instance on the server-providing pod to that instance reporting ready to serve. A subset of T_actuation when a GPU hit occurs.
- **Hit_rate**: Percentage of scale-up events where the DPC binds a requester to an existing sleeping pod on the correct GPU (hit) vs. requiring a cold start (miss).
- **T_launcher**: Time from the launcher receiving a create request to the new vLLM instance reporting healthy. Includes the benefit of vLLM module preloading.
- **T_first_token**: Time from requester pod readiness to the first successful inference response received through the server-providing pod's vLLM instance (time-to-first-token, post-actuation).
- **T_e2e**: Total time from requester pod creation to the first successful inference response. Spans the full path: requester scheduling, DPC binding, launcher wake/create, vLLM ready, first inference (T_actuation + T_first_token).
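By construction, T_e2e is T_actuation plus T_first_token. A minimal sketch of the decomposition, assuming the three timestamps involved (pod creation, pod readiness, first response) are captured as epoch seconds; the function and argument names are illustrative, not the benchmark's actual API:

```python
def decompose_latency(created_at, ready_at, first_response_at):
    """Split end-to-end latency into its actuation and first-token phases."""
    t_actuation = ready_at - created_at            # L1: creation -> readiness
    t_first_token = first_response_at - ready_at   # L2: readiness -> first response
    return {
        "T_actuation": t_actuation,
        "T_first_token": t_first_token,
        "T_e2e": t_actuation + t_first_token,      # == first_response_at - created_at
    }
```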
---

Review discussion:

> `err, retry := ctl.enforceSleeperBudget(ctx, serverDat, requestingPod, int(lc.Spec.MaxSleepingInstances))` — `serverDat.GPUIndices` is empty in the launcher-based case; this is a recognized shortcoming that needs to be fixed for milestone 3 (`// TODO(waltforme): enforceSleeperBudget should be revised for launcher-based server-providing Pods`).

> Just to clarify, you are effectively arguing for a fourth column (let's call it "lukewarm start" for now), right? I want to confirm because it would require more rethinking for the matrix.
>
> Re: instance deletion, thanks for the clarification.