Commit aea61ad

docs: Rename benchmark guide and add single-model benchmark section (#1048)

* docs: Rename benchmark guide and add single-model section
* docs: Rename title to Running WVA Scaling Benchmarks
* docs: Show tested commands with expected output for single-model benchmark
* docs: Replace all kubectl references with oc

---

# Running WVA Scaling Benchmarks

Step-by-step guide for deploying and running WVA scaling benchmarks on an OpenShift cluster. This covers both **single-model** and **multi-model** benchmarks, from cluster access to running the tests and interpreting results.

## Prerequisites

### Required Tools

Verify the following tools are installed on your machine:

```bash
oc version --client   # oc includes kubectl functionality
helm version --short
yq --version
jq --version
go version
```

If any are missing, install them via Homebrew: `brew install openshift-cli helm yq jq go`

### Required Access

- OpenShift cluster credentials (API URL + token)
- HuggingFace token with access to the models you want to deploy

---

## Step 1: Log In to the OpenShift Cluster

Get your login token from the OpenShift web console:

1. Open the OpenShift console in your browser
2. Click your username (top right) → **Copy login command**
3. Click **Display Token**
4. Copy the `oc login` command and run it:

```bash
oc login --token=sha256~XXXXXXXXXXXXXXXXXXXX --server=https://api.your-cluster.example.com:6443
```

Verify access and confirm which cluster you're connected to:

```bash
oc whoami
oc whoami --show-console
oc whoami --show-server
```

Check which nodes have allocatable GPUs and what GPU product they carry:

```bash
oc get nodes -o jsonpath='{range .items[?(@.status.allocatable.nvidia\.com/gpu)]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/gpu\.product}{"\n"}{end}'
```
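
If you also want the allocatable GPU count per node, a `custom-columns` query is a convenient variant of the command above (a sketch; adjust the resource name if your cluster uses a different GPU device plugin):

```bash
# Allocatable GPU count per node; <none> means the node has no GPUs.
oc get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```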

---

## Step 2: Set Up Your Namespace

First, check which namespaces you already have access to:

```bash
oc projects
```

If you already have a usable namespace, use it as `<your-namespace>` in the commands below.

If you have cluster-admin access, create a fresh namespace:

```bash
oc new-project <your-namespace>
```

> **Note**: If you get a `Forbidden` error, you don't have permission to create namespaces. Contact the cluster admin to get admin access or have a namespace created for you.

Label the namespace for OpenShift user-workload monitoring, so Prometheus can scrape metrics from it:

```bash
oc label namespace <your-namespace> openshift.io/user-monitoring=true --overwrite
```
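
To confirm the label was applied:

```bash
oc get namespace <your-namespace> --show-labels
```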

---

## Step 3: Export Your HuggingFace Token

The only environment variable you need to export is the HuggingFace token (required for model downloads):

```bash
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxx"
```

All other configuration is passed directly to the deploy/test commands in later steps.
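
If you want to sanity-check the token before deploying, an optional query against the public HuggingFace API works:

```bash
# Should print your HF account info; a 401 error means the token is invalid.
curl -s -H "Authorization: Bearer ${HF_TOKEN}" https://huggingface.co/api/whoami-v2
```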

---

## Step 4: Clone the Repository

If you haven't already:

```bash
git clone https://github.com/llm-d/llm-d-workload-variant-autoscaler.git
cd llm-d-workload-variant-autoscaler
```

Make sure you're on the correct branch:

```bash
git checkout main
# Or check out a specific PR branch:
# gh pr checkout <pr-number>
```

---

## Step 5a: Run the Single-Model Benchmark

The single-model benchmark tests WVA scaling behavior with one model under different workload patterns. Scenario configurations are defined in `test/benchmark/scenarios/`.

| Scenario | Prompt Tokens | Output Tokens | Rate | What it tests |
|----------|---------------|---------------|------|---------------|
| `prefill_heavy` | 4000 | 1000 | 20 RPS | Prefill (prompt processing): long input, short output |
| `decode_heavy` | 1000 | 4000 | 20 RPS | Decode (token generation): short input, long output |
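
You can inspect the exact scenario parameters in that directory (the placeholder below is hypothetical; check the listing for the real file names):

```bash
# List the available scenario definitions:
ls test/benchmark/scenarios/
# Then view the one you plan to run:
cat test/benchmark/scenarios/<scenario-file>
```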

### 1. Deploy Single-Model Infrastructure

```bash
make deploy-e2e-infra \
  ENVIRONMENT=openshift \
  WVA_NS=<your-namespace> LLMD_NS=<your-namespace> \
  E2E_EMULATED_LLMD_NAMESPACE=<your-namespace> \
  NAMESPACE_SCOPED=true SKIP_BUILD=true \
  DECODE_REPLICAS=1 IMG_TAG=v0.6.0 LLM_D_RELEASE=v0.6.0 \
  DEPLOY_PROMETHEUS_ADAPTER=false
```

Wait for all pods to be ready:

```bash
oc get pods -n <your-namespace>
```

Expected output, with the vLLM decode pod, EPP, gateway, and WVA controller all `Running`:

```
NAME                                                     READY   STATUS    RESTARTS   AGE
gaie-inference-scheduling-epp-...                        1/1     Running   0          4m
infra-inference-scheduling-inference-gateway-istio-...   1/1     Running   0          4m
ms-inference-scheduling-llm-d-modelservice-decode-...    1/1     Running   0          4m
workload-variant-autoscaler-controller-manager-...       1/1     Running   0          2m
workload-variant-autoscaler-controller-manager-...       1/1     Running   0          2m
```
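
Instead of polling by hand, you can block until everything is ready (an optional convenience; extend the timeout if image pulls are slow):

```bash
# Waits for every pod in the namespace to report Ready, up to 10 minutes.
oc wait --for=condition=Ready pods --all -n <your-namespace> --timeout=600s
```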

### 2. Run the Prefill Heavy Benchmark

```bash
make test-benchmark \
  ENVIRONMENT=openshift \
  E2E_EMULATED_LLMD_NAMESPACE=<your-namespace> \
  BENCHMARK_SCENARIO=prefill_heavy
```

### 3. Run the Decode Heavy Benchmark

```bash
make test-benchmark \
  ENVIRONMENT=openshift \
  E2E_EMULATED_LLMD_NAMESPACE=<your-namespace> \
  BENCHMARK_SCENARIO=decode_heavy
```

Each benchmark run takes approximately 15–20 minutes (30s warmup + 600s load generation + monitoring overhead).
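
While a run is in flight, you can watch the scaling behavior from a second terminal, using the same commands the multi-model section uses below:

```bash
watch oc get hpa -n <your-namespace>
watch oc get variantautoscaling -n <your-namespace>
```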

### Expected Output

On success, the test prints a results summary and exits with code 0:

```
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 6 Skipped
--- PASS: TestBenchmark
PASS
```

The results summary includes:

- TTFT (time to first token) and ITL (inter-token latency) percentiles (p50, p90, p99)
- Avg/max replicas and replica timeline
- KV cache utilization, vLLM queue depth, and EPP queue depth
- Achieved RPS, error count, and incomplete request count

### What the Benchmark Does

1. Finds the Helm-deployed decode deployment in the namespace
2. Creates a VariantAutoscaling (VA) resource (min=1, max=10, cost=10)
3. Creates an HPA driven by the external metric `wva_desired_replicas` (you can inspect both resources with the commands below)
4. Patches the EPP ConfigMap with flow control and scorer weights
5. Launches a GuideLLM load generation job with the scenario parameters
6. Monitors replicas, KV cache utilization, and queue depth every 15s
7. Extracts and reports TTFT, ITL, throughput, and error metrics
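
The resource names are generated by the test, so list them rather than guessing:

```bash
# The VA resource from step 2 and the HPA from step 3 of the list above:
oc get variantautoscaling -n <your-namespace> -o yaml
oc get hpa -n <your-namespace>
```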

### 4. Cleanup

Deleting the project removes the namespace and everything in it. If you plan to run the multi-model benchmark (Step 5b) in the same namespace, skip this step; Step 5b starts with its own undeploy.

```bash
oc delete project <your-namespace>
```

---

## Step 5b: Run the Multi-Model Benchmark

The multi-model benchmark tests WVA scaling across multiple models sharing the same infrastructure.

Replace `<your-namespace>` with your namespace:

```bash
# 1. Undeploy any previous run (clean slate)
make undeploy-multi-model-infra \
  ENVIRONMENT=openshift \
  WVA_NS=<your-namespace> LLMD_NS=<your-namespace> \
  MODELS="Qwen/Qwen3-0.6B,unsloth/Meta-Llama-3.1-8B"

# 2. Deploy multi-model infrastructure
make deploy-multi-model-infra \
  ENVIRONMENT=openshift \
  WVA_NS=<your-namespace> LLMD_NS=<your-namespace> \
  NAMESPACE_SCOPED=true SKIP_BUILD=true \
  DECODE_REPLICAS=1 IMG_TAG=v0.6.0 LLM_D_RELEASE=v0.6.0 \
  DEPLOY_PROMETHEUS_ADAPTER=false \
  MODELS="Qwen/Qwen3-0.6B,unsloth/Meta-Llama-3.1-8B"

# 3. Run the benchmark
make test-multi-model-scaling \
  ENVIRONMENT=openshift \
  LLMD_NS=<your-namespace> \
  MODELS="Qwen/Qwen3-0.6B,unsloth/Meta-Llama-3.1-8B"
```

Expected result: `make test-multi-model-scaling` passes with exit code 0.
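
After the deploy step, it's worth confirming a decode pod is running for each model before kicking off the test (the same check as in Step 5a):

```bash
oc get pods -n <your-namespace>
```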

### Monitor During the Benchmark

In a separate terminal, watch the scaling behavior:

```bash
watch oc get hpa -n <your-namespace>
watch oc get variantautoscaling -n <your-namespace>
```
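
To see why WVA makes the scaling decisions it does, you can also tail the controller logs; this assumes the controller Deployment is named `workload-variant-autoscaler-controller-manager`, matching the pod names shown in Step 5a:

```bash
# Assumed deployment name, inferred from the pod listing in Step 5a.
oc logs -f deploy/workload-variant-autoscaler-controller-manager -n <your-namespace>
```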

### Cleanup

```bash
oc delete project <your-namespace>
```

---
