
Commit 52c12e2

Benchmark Phase 1: Enable framework and chatops action for benchmarking (#900)
* Feat: Add benchmark framework with scale-up-latency scenario and /benchmark kind ChatOps trigger

  Add test/benchmark/ package with a Ginkgo-based scale-up-latency benchmark that measures autoscaler performance through 4 phases: baseline, spike, sustained, and cooldown. Collects scale-up/down latency, max replicas, KV cache usage, queue depth, and replica oscillation via Prometheus range queries.

  - test/benchmark/suite_test.go: test suite with Prometheus port-forward setup
  - test/benchmark/config.go: env-based config with tunable phase durations
  - test/benchmark/benchmark_test.go: 4-phase ordered scenario using e2e fixtures
  - test/benchmark/prometheus.go: QueryRangeAvg helper for Prometheus range queries
  - test/utils/e2eutils.go: export PrometheusClient.API() for range queries
  - .github/workflows/ci-benchmark.yaml: /benchmark kind ChatOps workflow
  - Makefile: add test-benchmark and test-benchmark-with-setup targets, exclude benchmark from unit tests

* Feat: Add ephemeral Grafana snapshot capture to benchmark framework

  Deploy an in-cluster Grafana instance during benchmarks via helm upgrade on the existing kube-prometheus-stack, create a 5-panel dashboard (replicas, desired replicas, KV cache, queue depth, saturation metrics), and capture a snapshot covering the full benchmark time range.
  - deploy/grafana/: Helm values (anonymous admin, Prometheus datasource, sidecar dashboard provisioning) and dashboard JSON
  - test/benchmark/grafana.go: DeployGrafana, NewGrafanaClient, CreateSnapshot, RenderPanel
  - test/benchmark/config.go: GrafanaEnabled, GrafanaSnapshotFile fields
  - test/benchmark/suite_test.go: wire Grafana deploy/teardown into BeforeSuite/AfterSuite
  - test/benchmark/benchmark_test.go: capture snapshot in AfterAll, include URL in results JSON
  - ci-benchmark.yaml: enable Grafana, upload snapshot, link in PR comment

* Feat: Export Grafana snapshot JSON and render panels to PNG

  Persist benchmark Grafana data to GitHub Actions artifacts so results survive the ephemeral Kind cluster:

  1. Snapshot JSON export: fetch the full snapshot via GET /api/snapshots/:key and save it as re-importable JSON (POST to any Grafana to restore).
  2. Panel PNG rendering: enable the grafana-image-renderer plugin and render all 5 dashboard panels to individual PNG files.

  - grafana.go: CreateSnapshot now returns SnapshotResult (key+URL); add ExportSnapshotJSON and RenderAllPanels methods
  - config.go: add GrafanaSnapshotJSONFile and GrafanaPanelDir fields
  - benchmark-dashboard.json: add explicit panel IDs for stable rendering
  - benchmark-grafana-values.yaml: enable imageRenderer with resource limits
  - ci-benchmark.yaml: pass new env vars, upload JSON + PNGs as artifacts

* Fix wrong Prometheus metric name in benchmark (gpu_cache → kv_cache)

  The benchmark queried vllm:gpu_cache_usage_perc, which doesn't exist. The actual metric emitted by the vLLM simulator is vllm:kv_cache_usage_perc, as defined in internal/constants/metrics.go.

* Fix staticcheck SA4004: remove unconditionally terminated loop

  Use direct index access for the single Grafana pod instead of iterating with a for-range that always returns on the first iteration.

* Fix Grafana image pull: use docker.io with Kind pre-load

  Grafana only publishes images to Docker Hub, not quay.io.
  Pre-load the image into Kind before running the benchmark and set imagePullPolicy: IfNotPresent to avoid runtime pulls.

* Match benchmark spike phase to e2e parallel load test flow

  - Add 30s ramp-up wait after load generation starts (like e2e)
  - Monitor VA status for scale-up intent before checking deployment
  - Monitor HPA for scale-up confirmation (separate stage)
  - Monitor deployment for actual replica changes (10m timeout)
  - Add detailed diagnostics every 30s: VA status, HPA conditions, HPA metrics, load pod phases, Prometheus metric values
  - Clean up existing jobs before creating new ones (like e2e)
  - Log service endpoint count during readiness check

* Fix benchmark load generation: reduce workers to 1, add failed pod log collection

  Match e2e's maxSingleReplicaWorkers=1 for single-replica deployments to avoid overwhelming the simulator's max-num-seqs queue. Also collect pod logs when load pods fail to aid diagnosis of connectivity/runtime issues.

* Add in-cluster connectivity probe to diagnose load pod failures

  The load pod cannot connect to the service (24 attempts, all fail). Add a diagnostic probe pod that runs curl -v and DNS resolution to determine whether the issue is DNS, routing, or an HTTP status code mismatch.

* Add Grafana image renderer sidecar for panel PNG export

  The base grafana:11.4.0 image has no image renderer installed, causing "no image renderer available/installed" when rendering panel PNGs. Add grafana-image-renderer:3.11.6 as a sidecar container and pre-load it into Kind in CI.
* Fix Grafana dashboard to use actual WVA metric names

  - Panel 1: Replace kube_deployment_spec_replicas (requires kube-state-metrics) with wva_desired_replicas and wva_current_replicas
  - Panel 2: Replace wva_desired_replicas with wva_desired_ratio (more useful)
  - Panel 5: Replace non-existent wva_saturation_score/wva_capacity_score with wva_desired/current_replicas and scaling rate
  - Fix label references: variant_name, not variant

* Embed Grafana panel images in PR comment via release assets

  Upload rendered panel PNGs as prerelease assets and embed them directly in the PR comment under a collapsible details section. Also fix dashboard queries to use actual WVA metric names.

* Fix Grafana dashboard datasource for file-provisioned mode

  File-provisioned dashboards do not resolve ${DS_PROMETHEUS} template variables. Remove uid from all panel datasource references so Grafana auto-selects the default Prometheus datasource.

* Fix WVA metric queries and CI permissions for panel images

  WVA metrics get namespace="workload-variant-autoscaler-system" from Prometheus scraping (not the VA namespace), so remove the namespace=~"llm-d.*" filter from WVA metric queries in the dashboard. vLLM metrics keep the filter since they are scraped from llm-d-sim.

  Change benchmark-kind job permissions to contents:write so the workflow can create GitHub releases to host rendered panel PNG images.

* Route benchmark load through Gateway/EPP (full llm-d stack)

  Change the benchmark to send load through the Gateway service instead of directly to the model service. Traffic now flows through the full llm-d stack: Gateway → HTTPRoute → InferencePool → EPP → model pods. The benchmark model service pods already have the llm-d.ai/inferenceServing label, so the InferencePool discovers them automatically.

  Add GatewayServiceName/GatewayServicePort config fields (env: GATEWAY_SERVICE_NAME, GATEWAY_SERVICE_PORT) and EPP/Gateway readiness checks in BeforeSuite.
* Fix EPP pod label selector to match inferencepool chart

  The inferencepool chart labels EPP pods with inferencepool=<epp-service-name>, not app.kubernetes.io/name=inferencepool. Use the same label selector as the e2e scale-from-zero test.

* Address PR #900 review comments: dedup config, extract setup, streamline suite

  - Extract shared test config to test/testconfig/config.go; both E2EConfig and BenchmarkConfig now embed testconfig.SharedConfig (comment 5)
  - Move BenchmarkResults to results.go for reuse across scenarios (comment 6)
  - Rename benchmark_test.go to scale_up_latency_benchmark_test.go (comment 7)
  - Add scenario description comment (comment 8)
  - Extract common setup to setup_test.go with SetupBenchmarkScenario(), CaptureResultsAndGrafana(), GatewayTargetURL() (comment 9)
  - Remove redundant infra verification from BeforeSuite; install.sh already verifies WVA, Gateway, EPP, Prometheus (comment 12)
  - Move Grafana deployment to install.sh via INSTALL_GRAFANA env var; remove DeployGrafana() call from suite_test.go (comments 2, 13)
  - Remove "(matches e2e flow)" comments (comment 10)
  - Fix vaAttempt % 3 spacing (comment 11)

* Remove assertions from benchmark phases: observe, don't test

  Benchmark phases should only observe and record metrics, not assert on replica counts. Prometheus already monitors replicas for all deployments.

  - Phase 2: Replace VA/HPA/deployment Eventually blocks with a simple observation loop that logs and records scale-up time + max replicas
  - Phase 3: Replace Expect on deployment Get with a warning log
  - Phase 4: Replace Expect/IsNotFound on deployment Get with a warning log
  - Remove DeployGrafana and dumpGrafanaDiagnostics from grafana.go (Grafana deployed via install.sh INSTALL_GRAFANA env var)
  - Remove unused imports: fmt, corev1, errors

* Pre-check Grafana service existence before SetUpPortForward

  SetUpPortForward uses Expect() internally, which fatally fails the entire suite if the service is not found.
  Add a pre-check in NewGrafanaClient that returns an error gracefully, allowing the suite to continue without Grafana when the service is missing.

* Create benchmark-dashboard ConfigMap before Grafana deployment

  The Grafana deployment references a benchmark-dashboard ConfigMap volume, but deploy_benchmark_grafana() never created it, causing the pod to fail with CreateContainerConfigError.

* Address review: make setup helpers private and add GinkgoHelper

  - Rename SetupBenchmarkScenario → setupBenchmarkScenario (private)
  - Rename CaptureResultsAndGrafana → captureResultsAndGrafana (private)
  - Rename GatewayTargetURL → gatewayTargetURL (private)
  - Add GinkgoHelper() to setup/capture functions so failures report at the caller's line instead of inside the helper
  - Update doc comment to clarify fresh resource creation
1 parent 68315a7 commit 52c12e2

16 files changed

Lines changed: 1998 additions & 144 deletions

.github/workflows/ci-benchmark.yaml

Lines changed: 410 additions & 0 deletions
Large diffs are not rendered by default.

Makefile

Lines changed: 27 additions & 2 deletions
```diff
@@ -91,7 +91,7 @@ vet: ## Run go vet against code.
 
 .PHONY: test
 test: manifests generate fmt vet setup-envtest helm ## Run tests.
-	KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(ENVTEST_K8S_VERSION) --bin-dir $(LOCALBIN) -p path)" PATH="$(LOCALBIN):$(PATH)" go test $$(go list ./... | grep -v /e2e) -coverprofile cover.out
+	KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(ENVTEST_K8S_VERSION) --bin-dir $(LOCALBIN) -p path)" PATH=$(LOCALBIN):$(PATH) go test $$(go list ./... | grep -v /e2e | grep -v /benchmark) -coverprofile cover.out
 
 # Creates a multi-node Kind cluster
 # Adds emulated GPU labels and capacities per node
@@ -269,7 +269,32 @@ test-e2e-smoke-with-setup: deploy-e2e-infra test-e2e-smoke
 # Convenience target that deploys infra + runs full test suite.
 # Set DELETE_CLUSTER=true to delete Kind cluster after tests (default: keep cluster for debugging).
 .PHONY: test-e2e-full-with-setup
 test-e2e-full-with-setup: deploy-e2e-infra test-e2e-full
+
+# Benchmark targets
+.PHONY: test-benchmark
+test-benchmark: manifests generate fmt vet ## Run benchmark tests (scale-up-latency scenario)
+	@echo "Running benchmark tests..."
+	KUBECONFIG=$(KUBECONFIG) \
+	ENVIRONMENT=$(ENVIRONMENT) \
+	WVA_NAMESPACE=$(CONTROLLER_NAMESPACE) \
+	LLMD_NAMESPACE=$(E2E_EMULATED_LLMD_NAMESPACE) \
+	MONITORING_NAMESPACE=$(E2E_MONITORING_NAMESPACE) \
+	USE_SIMULATOR=$(USE_SIMULATOR) \
+	SCALER_BACKEND=$(SCALER_BACKEND) \
+	MODEL_ID=$(MODEL_ID) \
+	go test ./test/benchmark/ -timeout 30m -v -ginkgo.v \
+		-ginkgo.label-filter="benchmark"; \
+	TEST_EXIT_CODE=$$?; \
+	echo ""; \
+	echo "=========================================="; \
+	echo "Benchmark execution completed. Exit code: $$TEST_EXIT_CODE"; \
+	echo "=========================================="; \
+	exit $$TEST_EXIT_CODE
+
+# Convenience target that deploys infra + runs benchmark tests.
+.PHONY: test-benchmark-with-setup
+test-benchmark-with-setup: deploy-e2e-infra test-benchmark
 
 .PHONY: lint
 lint: golangci-lint ## Run golangci-lint linter
```
Lines changed: 212 additions & 0 deletions
```json
{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": { "type": "grafana", "uid": "-- Grafana --" },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 1,
  "id": null,
  "links": [],
  "panels": [
    {
      "id": 1,
      "title": "Deployment Replicas",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "datasource": { "type": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "custom": {
            "drawStyle": "line",
            "lineWidth": 2,
            "fillOpacity": 10,
            "pointSize": 5,
            "showPoints": "auto",
            "spanNulls": true
          },
          "unit": "short",
          "min": 0
        },
        "overrides": []
      },
      "targets": [
        { "expr": "wva_desired_replicas", "legendFormat": "desired {{variant_name}}", "refId": "A" },
        { "expr": "wva_current_replicas", "legendFormat": "current {{variant_name}}", "refId": "B" }
      ],
      "options": { "legend": { "displayMode": "list", "placement": "bottom" } }
    },
    {
      "id": 2,
      "title": "WVA Desired Ratio",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
      "datasource": { "type": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "custom": { "drawStyle": "line", "lineWidth": 2, "fillOpacity": 10, "spanNulls": true },
          "unit": "short",
          "min": 0
        },
        "overrides": []
      },
      "targets": [
        { "expr": "wva_desired_ratio", "legendFormat": "ratio {{variant_name}}", "refId": "A" }
      ],
      "options": { "legend": { "displayMode": "list", "placement": "bottom" } }
    },
    {
      "id": 3,
      "title": "KV Cache Usage",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
      "datasource": { "type": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "custom": { "drawStyle": "line", "lineWidth": 2, "fillOpacity": 20, "spanNulls": true },
          "unit": "percentunit",
          "min": 0,
          "max": 1,
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 0.7 },
              { "color": "red", "value": 0.9 }
            ]
          }
        },
        "overrides": []
      },
      "targets": [
        { "expr": "vllm:kv_cache_usage_perc{namespace=~\"llm-d.*\"}", "legendFormat": "{{pod}}", "refId": "A" },
        { "expr": "avg(vllm:kv_cache_usage_perc{namespace=~\"llm-d.*\"})", "legendFormat": "avg", "refId": "B" }
      ],
      "options": { "legend": { "displayMode": "list", "placement": "bottom" } }
    },
    {
      "id": 4,
      "title": "Queue Depth (Requests Waiting)",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
      "datasource": { "type": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "custom": { "drawStyle": "line", "lineWidth": 2, "fillOpacity": 10, "spanNulls": true },
          "unit": "short",
          "min": 0
        },
        "overrides": []
      },
      "targets": [
        { "expr": "vllm:num_requests_waiting{namespace=~\"llm-d.*\"}", "legendFormat": "{{pod}} waiting", "refId": "A" },
        { "expr": "vllm:num_requests_running{namespace=~\"llm-d.*\"}", "legendFormat": "{{pod}} running", "refId": "B" }
      ],
      "options": { "legend": { "displayMode": "list", "placement": "bottom" } }
    },
    {
      "id": 5,
      "title": "Scaling Activity",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 24, "x": 0, "y": 16 },
      "datasource": { "type": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "custom": { "drawStyle": "line", "lineWidth": 2, "fillOpacity": 10, "spanNulls": true },
          "unit": "short",
          "min": 0
        },
        "overrides": []
      },
      "targets": [
        { "expr": "wva_desired_replicas", "legendFormat": "desired {{variant_name}}", "refId": "A" },
        { "expr": "wva_current_replicas", "legendFormat": "current {{variant_name}}", "refId": "B" },
        { "expr": "rate(wva_replica_scaling_total[2m])", "legendFormat": "scaling rate {{variant_name}} {{direction}}", "refId": "C" }
      ],
      "options": { "legend": { "displayMode": "list", "placement": "bottom" } }
    }
  ],
  "schemaVersion": 39,
  "tags": ["benchmark", "wva", "autoscaling"],
  "templating": { "list": [] },
  "time": { "from": "now-30m", "to": "now" },
  "timepicker": {},
  "timezone": "utc",
  "title": "WVA Benchmark: Scale-Up Latency",
  "uid": "wva-benchmark-scaleup",
  "version": 1
}
```
