Commit 52c12e2
Benchmark Phase 1: Enable framework and chatops action for benchmarking (#900)
* Feat: Add benchmark framework with scale-up-latency scenario and /benchmark kind ChatOps trigger
Add test/benchmark/ package with a Ginkgo-based scale-up-latency benchmark
that measures autoscaler performance through 4 phases: baseline, spike,
sustained, and cooldown. Collects scale-up/down latency, max replicas,
KV cache usage, queue depth, and replica oscillation via Prometheus range
queries.
- test/benchmark/suite_test.go: test suite with Prometheus port-forward setup
- test/benchmark/config.go: env-based config with tunable phase durations
- test/benchmark/benchmark_test.go: 4-phase ordered scenario using e2e fixtures
- test/benchmark/prometheus.go: QueryRangeAvg helper for Prometheus range queries
- test/utils/e2eutils.go: export PrometheusClient.API() for range queries
- .github/workflows/ci-benchmark.yaml: /benchmark kind ChatOps workflow
- Makefile: add test-benchmark and test-benchmark-with-setup targets,
exclude benchmark from unit tests
* Feat: Add ephemeral Grafana snapshot capture to benchmark framework
Deploy an in-cluster Grafana instance during benchmarks via helm upgrade
on the existing kube-prometheus-stack, create a 5-panel dashboard
(replicas, desired replicas, KV cache, queue depth, saturation metrics),
and capture a snapshot covering the full benchmark time range.
- deploy/grafana/: Helm values (anonymous admin, Prometheus datasource,
sidecar dashboard provisioning) and dashboard JSON
- test/benchmark/grafana.go: DeployGrafana, NewGrafanaClient,
CreateSnapshot, RenderPanel
- test/benchmark/config.go: GrafanaEnabled, GrafanaSnapshotFile fields
- test/benchmark/suite_test.go: wire Grafana deploy/teardown into
BeforeSuite/AfterSuite
- test/benchmark/benchmark_test.go: capture snapshot in AfterAll,
include URL in results JSON
- ci-benchmark.yaml: enable Grafana, upload snapshot, link in PR comment
* Feat: Export Grafana snapshot JSON and render panels to PNG
Persist benchmark Grafana data to GitHub Actions artifacts so results
survive the ephemeral Kind cluster:
1. Snapshot JSON export: fetch full snapshot via GET /api/snapshots/:key
and save as re-importable JSON (POST to any Grafana to restore).
2. Panel PNG rendering: enable grafana-image-renderer plugin and render
all 5 dashboard panels to individual PNG files.
- grafana.go: CreateSnapshot now returns SnapshotResult (key+URL),
add ExportSnapshotJSON and RenderAllPanels methods
- config.go: add GrafanaSnapshotJSONFile and GrafanaPanelDir fields
- benchmark-dashboard.json: add explicit panel IDs for stable rendering
- benchmark-grafana-values.yaml: enable imageRenderer with resource limits
- ci-benchmark.yaml: pass new env vars, upload JSON + PNGs as artifacts
* Fix wrong Prometheus metric name in benchmark (gpu_cache → kv_cache)
The benchmark queried vllm:gpu_cache_usage_perc which doesn't exist.
The actual metric emitted by the vLLM simulator is vllm:kv_cache_usage_perc
as defined in internal/constants/metrics.go.
* Fix staticcheck SA4004: remove unconditionally terminated loop
Use direct index access for the single Grafana pod instead of
iterating with a for-range that always returns on first iteration.
* Fix Grafana image pull: use docker.io with Kind pre-load
Grafana only publishes images to Docker Hub, not quay.io.
Pre-load the image into Kind before running the benchmark
and set imagePullPolicy: IfNotPresent to avoid runtime pulls.
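The pre-load flow is roughly the following (cluster name `kind` is the default and is illustrative; the image tags match the versions mentioned later in this PR):

```shell
# Pull from Docker Hub on the host, then load into the Kind node's
# containerd store so no registry pull happens at runtime.
docker pull docker.io/grafana/grafana:11.4.0
docker pull docker.io/grafana/grafana-image-renderer:3.11.6
kind load docker-image docker.io/grafana/grafana:11.4.0 --name kind
kind load docker-image docker.io/grafana/grafana-image-renderer:3.11.6 --name kind
# With imagePullPolicy: IfNotPresent in the Helm values, the kubelet
# uses the pre-loaded image instead of pulling.
```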
* Match benchmark spike phase to e2e parallel load test flow
- Add 30s ramp-up wait after load generation starts (like e2e)
- Monitor VA status for scale-up intent before checking deployment
- Monitor HPA for scale-up confirmation (separate stage)
- Monitor deployment for actual replica changes (10m timeout)
- Add detailed diagnostics every 30s: VA status, HPA conditions,
HPA metrics, load pod phases, Prometheus metric values
- Clean up existing jobs before creating new ones (like e2e)
- Log service endpoint count during readiness check
* Fix benchmark load generation: reduce workers to 1, add failed pod log collection
Match e2e's maxSingleReplicaWorkers=1 for single-replica deployments to avoid
overwhelming the simulator's max-num-seqs queue. Also collect pod logs when load
pods fail to aid diagnosis of connectivity/runtime issues.
* Add in-cluster connectivity probe to diagnose load pod failures
The load pod cannot connect to the service (24 attempts, all fail).
Adding a diagnostic probe pod that runs curl -v and DNS resolution
to determine if the issue is DNS, routing, or HTTP status code mismatch.
* Add Grafana image renderer sidecar for panel PNG export
The base grafana:11.4.0 image has no image renderer installed, causing
"no image renderer available/installed" when rendering panel PNGs.
Adding grafana-image-renderer:3.11.6 as a sidecar container and
pre-loading it into Kind in CI.
* Fix Grafana dashboard to use actual WVA metric names
- Panel 1: Replace kube_deployment_spec_replicas (requires kube-state-metrics)
with wva_desired_replicas and wva_current_replicas
- Panel 2: Replace wva_desired_replicas with wva_desired_ratio (more useful)
- Panel 5: Replace non-existent wva_saturation_score/wva_capacity_score
with wva_desired/current_replicas and scaling rate
- Fix label references: variant_name not variant
* Embed Grafana panel images in PR comment via release assets
Upload rendered panel PNGs as prerelease assets and embed them
directly in the PR comment under a collapsible details section.
Also fix dashboard queries to use actual WVA metric names.
* Fix Grafana dashboard datasource for file-provisioned mode
File-provisioned dashboards do not resolve ${DS_PROMETHEUS} template
variables. Remove uid from all panel datasource references so Grafana
auto-selects the default Prometheus datasource.
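Concretely, a panel's datasource reference in the dashboard JSON goes from the template-variable form, which file provisioning cannot resolve:

```json
{ "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" } }
```

to a uid-less form, which makes Grafana fall back to the default Prometheus datasource:

```json
{ "datasource": { "type": "prometheus" } }
```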
* Fix WVA metric queries and CI permissions for panel images
WVA metrics get namespace="workload-variant-autoscaler-system" from
Prometheus scraping (not the VA namespace), so remove the
namespace=~"llm-d.*" filter from WVA metric queries in the dashboard.
vLLM metrics keep the filter since they are scraped from llm-d-sim.
Change benchmark-kind job permissions to contents:write so the workflow
can create GitHub releases to host rendered panel PNG images.
* Route benchmark load through Gateway/EPP (full llm-d stack)
Change benchmark to send load through the Gateway service instead of
directly to the model service. Traffic now flows through the full
llm-d stack: Gateway → HTTPRoute → InferencePool → EPP → model pods.
The benchmark model service pods already have the
llm-d.ai/inferenceServing label, so the InferencePool discovers
them automatically.
Add GatewayServiceName/GatewayServicePort config fields (env:
GATEWAY_SERVICE_NAME, GATEWAY_SERVICE_PORT) and EPP/Gateway
readiness checks in BeforeSuite.
* Fix EPP pod label selector to match inferencepool chart
The inferencepool chart labels EPP pods with
inferencepool=<epp-service-name>, not
app.kubernetes.io/name=inferencepool. Use the same label selector
as the e2e scale-from-zero test.
* Address PR #900 review comments: dedup config, extract setup, streamline suite
- Extract shared test config to test/testconfig/config.go; both E2EConfig
and BenchmarkConfig now embed testconfig.SharedConfig (comment 5)
- Move BenchmarkResults to results.go for reuse across scenarios (comment 6)
- Rename benchmark_test.go to scale_up_latency_benchmark_test.go (comment 7)
- Add scenario description comment (comment 8)
- Extract common setup to setup_test.go with SetupBenchmarkScenario(),
CaptureResultsAndGrafana(), GatewayTargetURL() (comment 9)
- Remove redundant infra verification from BeforeSuite — install.sh
already verifies WVA, Gateway, EPP, Prometheus (comment 12)
- Move Grafana deployment to install.sh via INSTALL_GRAFANA env var;
remove DeployGrafana() call from suite_test.go (comments 2, 13)
- Remove "(matches e2e flow)" comments (comment 10)
- Fix vaAttempt % 3 spacing (comment 11)
* Remove assertions from benchmark phases — observe, don't test
Benchmark phases should only observe and record metrics, not assert
on replica counts. Prometheus already monitors replicas for all
deployments.
- Phase 2: Replace VA/HPA/deployment Eventually blocks with simple
observation loop that logs and records scale-up time + max replicas
- Phase 3: Replace Expect on deployment Get with warning log
- Phase 4: Replace Expect/IsNotFound on deployment Get with warning log
- Remove DeployGrafana and dumpGrafanaDiagnostics from grafana.go
(Grafana deployed via install.sh INSTALL_GRAFANA env var)
- Remove unused imports: fmt, corev1, errors
* Pre-check Grafana service existence before SetUpPortForward
SetUpPortForward uses Expect() internally which fatally fails the
entire suite if the service is not found. Add a pre-check in
NewGrafanaClient that returns an error gracefully, allowing the
suite to continue without Grafana when the service is missing.
* Create benchmark-dashboard ConfigMap before Grafana deployment
The Grafana deployment references a benchmark-dashboard ConfigMap
volume but deploy_benchmark_grafana() never created it, causing
the pod to fail with CreateContainerConfigError.
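A sketch of the missing step (ConfigMap name matches the commit; the namespace and file path are illustrative, with the dashboard JSON assumed to live under deploy/grafana/):

```shell
# Create the dashboard ConfigMap before deploying Grafana so the pod's
# ConfigMap volume mount resolves instead of hitting
# CreateContainerConfigError.
kubectl create configmap benchmark-dashboard \
  --namespace monitoring \
  --from-file=deploy/grafana/benchmark-dashboard.json \
  --dry-run=client -o yaml | kubectl apply -f -
```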
* Address review: make setup helpers private and add GinkgoHelper
- Rename SetupBenchmarkScenario → setupBenchmarkScenario (private)
- Rename CaptureResultsAndGrafana → captureResultsAndGrafana (private)
- Rename GatewayTargetURL → gatewayTargetURL (private)
- Add GinkgoHelper() to setup/capture functions so failures
report at the caller's line instead of inside the helper
- Update doc comment to clarify fresh resource creation

1 parent 68315a7 · commit 52c12e2
16 files changed: 1,998 additions and 144 deletions
File tree
- .github/workflows
- deploy
  - grafana
- test
  - benchmark
  - e2e
    - fixtures
  - testconfig
  - utils