# Running WVA Scaling Benchmarks

Step-by-step guide for deploying and running WVA scaling benchmarks on an OpenShift cluster. This covers both **single-model** and **multi-model** benchmarks, from cluster access to running the tests and interpreting results.

## Prerequisites

### Required Tools

Verify the following tools are installed on your machine:

```bash
oc version --client   # oc includes kubectl functionality
helm version --short
yq --version
jq --version
go version
```

If any are missing, install via Homebrew: `brew install openshift-cli helm yq jq go`

### Required Access

- OpenShift cluster credentials (API URL + token)
- HuggingFace token with access to the models you want to deploy

---

## Step 1: Log In to the OpenShift Cluster

Get your login token from the OpenShift web console:

1. Open the OpenShift console in your browser
2. Click your username (top right) → **Copy login command**
3. Click **Display Token**
4. Copy the `oc login` command and run it:

```bash
oc login --token=sha256~XXXXXXXXXXXXXXXXXXXX --server=https://api.your-cluster.example.com:6443
```

Verify access and confirm which cluster you're connected to:

```bash
oc whoami
oc whoami --show-console
oc whoami --show-server
```

Check available GPUs on the cluster:

```bash
oc get nodes -o jsonpath='{range .items[?(@.status.allocatable.nvidia\.com/gpu)]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/gpu\.product}{"\n"}{end}'
```
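
If you prefer a tabular view, the same information is available via custom columns (an optional alternative; the `nvidia.com/gpu` resource name assumes NVIDIA GPUs exposed by the GPU operator):

```bash
# Optional alternative: allocatable GPU count per node
oc get nodes -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```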

---

## Step 2: Set Up Your Namespace

First, check which namespaces you already have access to:

```bash
oc projects
```

If you already have access to a suitable namespace, use it as `<your-namespace>` in the commands below.

If you have cluster-admin access, create a fresh namespace:

```bash
oc new-project <your-namespace>
```

> **Note**: If you get a `Forbidden` error, you don't have permission to create namespaces. Contact the cluster admin to get admin access or have a namespace created for you.

Label the namespace for OpenShift user-workload monitoring (so Prometheus can scrape metrics):

```bash
oc label namespace <your-namespace> openshift.io/user-monitoring=true --overwrite
```
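
To confirm the label landed, list the namespace's labels (a standard `oc` call, nothing WVA-specific):

```bash
# The output should include openshift.io/user-monitoring=true
oc get namespace <your-namespace> --show-labels
```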

---

## Step 3: Export Your HuggingFace Token

The only environment variable you need to export is the HuggingFace token (required for model downloads):

```bash
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxx"
```
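
To verify the token before deploying anything, you can hit HuggingFace's `whoami` endpoint (optional; assumes `curl` and outbound internet access from your machine):

```bash
# A valid token prints your HuggingFace account name
curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2 | jq -r .name
```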

All other configuration is passed directly to the deploy/test commands in later steps.

---

## Step 4: Clone the Repository

If you haven't already:

```bash
git clone https://github.com/llm-d/llm-d-workload-variant-autoscaler.git
cd llm-d-workload-variant-autoscaler
```

Make sure you're on the correct branch:

```bash
git checkout main
# Or check out a specific PR branch:
# gh pr checkout <pr-number>
```

---

## Step 5a: Run the Single-Model Benchmark

The single-model benchmark tests WVA scaling behavior with one model under different workload patterns. Scenario configurations are defined in `test/benchmark/scenarios/`.

| Scenario | Prompt Tokens | Output Tokens | Rate | What it tests |
|----------|---------------|---------------|--------|---------------|
| `prefill_heavy` | 4000 | 1000 | 20 RPS | Prefill (prompt processing) — long input, short output |
| `decode_heavy` | 1000 | 4000 | 20 RPS | Decode (token generation) — short input, long output |

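To see exactly what each scenario configures (request rates, token counts, durations), list the scenario directory referenced above:

```bash
# Scenario definitions ship with the repo
ls test/benchmark/scenarios/
```
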
### 1. Deploy Single-Model Infrastructure

```bash
make deploy-e2e-infra \
  ENVIRONMENT=openshift \
  WVA_NS=<your-namespace> LLMD_NS=<your-namespace> \
  E2E_EMULATED_LLMD_NAMESPACE=<your-namespace> \
  NAMESPACE_SCOPED=true SKIP_BUILD=true \
  DECODE_REPLICAS=1 IMG_TAG=v0.6.0 LLM_D_RELEASE=v0.6.0 \
  DEPLOY_PROMETHEUS_ADAPTER=false
```

Wait for all pods to be ready:

```bash
oc get pods -n <your-namespace>
```

Expected output — vLLM decode pod, EPP, gateway, and WVA controller all `Running`:

```
NAME                                                     READY   STATUS    RESTARTS   AGE
gaie-inference-scheduling-epp-...                        1/1     Running   0          4m
infra-inference-scheduling-inference-gateway-istio-...   1/1     Running   0          4m
ms-inference-scheduling-llm-d-modelservice-decode-...    1/1     Running   0          4m
workload-variant-autoscaler-controller-manager-...       1/1     Running   0          2m
workload-variant-autoscaler-controller-manager-...       1/1     Running   0          2m
```
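
Rather than polling `oc get pods`, you can block until everything is Ready (a generic `oc wait`, not a repo-provided target):

```bash
# Wait up to 10 minutes for every pod in the namespace to become Ready
oc wait pods --all --for=condition=Ready -n <your-namespace> --timeout=10m
```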

### 2. Run the Prefill Heavy Benchmark

```bash
make test-benchmark \
  ENVIRONMENT=openshift \
  E2E_EMULATED_LLMD_NAMESPACE=<your-namespace> \
  BENCHMARK_SCENARIO=prefill_heavy
```

### 3. Run the Decode Heavy Benchmark

```bash
make test-benchmark \
  ENVIRONMENT=openshift \
  E2E_EMULATED_LLMD_NAMESPACE=<your-namespace> \
  BENCHMARK_SCENARIO=decode_heavy
```

Each benchmark run takes approximately 15–20 minutes (30s warmup + 600s load generation + monitoring overhead).
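
While a run is in flight, it helps to watch the autoscaler react from a second terminal (the comma-separated form is a compact variant of the multi-model monitoring commands later in this guide):

```bash
# Refreshes HPA targets and VariantAutoscaling status every 2s
watch oc get hpa,variantautoscaling -n <your-namespace>
```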

### Expected Output

On success, the test prints a results summary and exits with code 0:

```
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 6 Skipped
--- PASS: TestBenchmark
PASS
```

The results summary includes:

- TTFT and ITL latency percentiles (p50, p90, p99)
- Avg/max replicas and the replica timeline
- KV cache utilization, vLLM queue depth, and EPP queue depth
- Achieved RPS, error count, and incomplete request count

### What the Benchmark Does

1. Finds the Helm-deployed decode deployment in the namespace
2. Creates a VariantAutoscaling (VA) resource (min=1, max=10, cost=10)
3. Creates an HPA driven by the external metric `wva_desired_replicas`
4. Patches the EPP ConfigMap with flow control and scorer weights
5. Launches a GuideLLM load generation job with the scenario parameters
6. Monitors replicas, KV cache utilization, and queue depth every 15s
7. Extracts and reports TTFT, ITL, throughput, and error metrics
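
Once a run has started, you can inspect the resources the test created to see how these pieces connect (the `yq` filter is just one convenient way to slice the output):

```bash
# The VariantAutoscaling spec created in step 2 above
oc get variantautoscaling -n <your-namespace> -o yaml

# The external metric the HPA from step 3 is driven by
oc get hpa -n <your-namespace> -o yaml | yq '.items[].spec.metrics'
```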

### 4. Cleanup

```bash
oc delete project <your-namespace>
```

---

## Step 5b: Run the Multi-Model Benchmark

The multi-model benchmark tests WVA scaling across multiple models sharing the same infrastructure.

Replace `<your-namespace>` with your namespace:

```bash
# 1. Undeploy previous run (clean slate)
make undeploy-multi-model-infra \
  ENVIRONMENT=openshift \
  WVA_NS=<your-namespace> LLMD_NS=<your-namespace> \
  MODELS="Qwen/Qwen3-0.6B,unsloth/Meta-Llama-3.1-8B"

# 2. Deploy multi-model infrastructure
make deploy-multi-model-infra \
  ENVIRONMENT=openshift \
  WVA_NS=<your-namespace> LLMD_NS=<your-namespace> \
  NAMESPACE_SCOPED=true SKIP_BUILD=true \
  DECODE_REPLICAS=1 IMG_TAG=v0.6.0 LLM_D_RELEASE=v0.6.0 \
  DEPLOY_PROMETHEUS_ADAPTER=false \
  MODELS="Qwen/Qwen3-0.6B,unsloth/Meta-Llama-3.1-8B"

# 3. Run the benchmark
make test-multi-model-scaling \
  ENVIRONMENT=openshift \
  LLMD_NS=<your-namespace> \
  MODELS="Qwen/Qwen3-0.6B,unsloth/Meta-Llama-3.1-8B"
```

Expected result: `make test-multi-model-scaling` passes with exit code 0.

### Monitor During the Benchmark

Watch the scaling behavior while the test runs. Each `watch` command blocks, so give each one its own terminal:

```bash
watch oc get hpa -n <your-namespace>
watch oc get variantautoscaling -n <your-namespace>
```
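
The WVA controller logs are another useful signal during the run; the deployment name here is inferred from the pod listing in Step 5a, so adjust it if it differs on your cluster:

```bash
# Follow the autoscaler's scaling decisions as they happen
oc logs -n <your-namespace> deploy/workload-variant-autoscaler-controller-manager -f
```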

### Cleanup

```bash
oc delete project <your-namespace>
```

---