
feat: add benchmark-openshift job for Phase 3 #947

Open

kahilam wants to merge 67 commits into main from feat/benchmark-phase3-openshift

Conversation


@kahilam kahilam commented Mar 27, 2026

No description provided.

@github-actions

Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.

@github-actions

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

| Resource | Total | Allocated | Available |
| --- | --- | --- | --- |
| GPUs | 50 | 25 | 25 |

| Cluster | Value |
| --- | --- |
| Nodes | 16 (7 with GPUs) |
| Total CPU | 993 cores |
| Total Memory | 10383 Gi |
| GPUs required | 4 (min) / 6 (recommended) |

kahilam force-pushed the feat/benchmark-phase3-openshift branch from f45fe9b to b806063 on March 27, 2026 at 20:36
@github-actions

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

| Resource | Total | Allocated | Available |
| --- | --- | --- | --- |
| GPUs | 50 | 25 | 25 |

| Cluster | Value |
| --- | --- |
| Nodes | 16 (7 with GPUs) |
| Total CPU | 993 cores |
| Total Memory | 10383 Gi |
| GPUs required | 4 (min) / 6 (recommended) |


kahilam commented Mar 27, 2026

/benchmark openshift

kahilam force-pushed the feat/benchmark-phase3-openshift branch from b806063 to 580a917 on March 27, 2026 at 21:01
@github-actions

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

| Resource | Total | Allocated | Available |
| --- | --- | --- | --- |
| GPUs | 50 | 32 | 18 |

| Cluster | Value |
| --- | --- |
| Nodes | 16 (7 with GPUs) |
| Total CPU | 993 cores |
| Total Memory | 10383 Gi |
| GPUs required | 4 (min) / 6 (recommended) |

@github-actions

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

| Resource | Total | Allocated | Available |
| --- | --- | --- | --- |
| GPUs | 50 | 28 | 22 |

| Cluster | Value |
| --- | --- |
| Nodes | 16 (7 with GPUs) |
| Total CPU | 993 cores |
| Total Memory | 10383 Gi |
| GPUs required | 4 (min) / 6 (recommended) |

@github-actions

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

| Resource | Total | Allocated | Available |
| --- | --- | --- | --- |
| GPUs | 50 | 28 | 22 |

| Cluster | Value |
| --- | --- |
| Nodes | 16 (7 with GPUs) |
| Total CPU | 993 cores |
| Total Memory | 10383 Gi |
| GPUs required | 4 (min) / 6 (recommended) |

@github-actions

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

| Resource | Total | Allocated | Available |
| --- | --- | --- | --- |
| GPUs | 50 | 29 | 21 |

| Cluster | Value |
| --- | --- |
| Nodes | 16 (7 with GPUs) |
| Total CPU | 993 cores |
| Total Memory | 10383 Gi |
| GPUs required | 4 (min) / 6 (recommended) |

@github-actions

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

| Resource | Total | Allocated | Available |
| --- | --- | --- | --- |
| GPUs | 50 | 28 | 22 |

| Cluster | Value |
| --- | --- |
| Nodes | 16 (7 with GPUs) |
| Total CPU | 993 cores |
| Total Memory | 10383 Gi |
| GPUs required | 4 (min) / 6 (recommended) |


kahilam commented Mar 27, 2026

/benchmark openshift

@github-actions

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

| Resource | Total | Allocated | Available |
| --- | --- | --- | --- |
| GPUs | 50 | 38 | 12 |

| Cluster | Value |
| --- | --- |
| Nodes | 16 (7 with GPUs) |
| Total CPU | 993 cores |
| Total Memory | 10383 Gi |
| GPUs required | 4 (min) / 6 (recommended) |

@github-actions

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

| Resource | Total | Allocated | Available |
| --- | --- | --- | --- |
| GPUs | 50 | 39 | 11 |

| Cluster | Value |
| --- | --- |
| Nodes | 16 (7 with GPUs) |
| Total CPU | 993 cores |
| Total Memory | 10383 Gi |
| GPUs required | 4 (min) / 6 (recommended) |

@github-actions

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

| Resource | Total | Allocated | Available |
| --- | --- | --- | --- |
| GPUs | 50 | 39 | 11 |

| Cluster | Value |
| --- | --- |
| Nodes | 16 (7 with GPUs) |
| Total CPU | 993 cores |
| Total Memory | 10383 Gi |
| GPUs required | 4 (min) / 6 (recommended) |

@github-actions

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

| Resource | Total | Allocated | Available |
| --- | --- | --- | --- |
| GPUs | 50 | 39 | 11 |

| Cluster | Value |
| --- | --- |
| Nodes | 16 (7 with GPUs) |
| Total CPU | 993 cores |
| Total Memory | 10383 Gi |
| GPUs required | 4 (min) / 6 (recommended) |

@github-actions

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

| Resource | Total | Allocated | Available |
| --- | --- | --- | --- |
| GPUs | 50 | 45 | 5 |

| Cluster | Value |
| --- | --- |
| Nodes | 16 (7 with GPUs) |
| Total CPU | 993 cores |
| Total Memory | 10383 Gi |
| GPUs required | 4 (min) / 6 (recommended) |

@github-actions

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

| Resource | Total | Allocated | Available |
| --- | --- | --- | --- |
| GPUs | 50 | 41 | 9 |

| Cluster | Value |
| --- | --- |
| Nodes | 16 (7 with GPUs) |
| Total CPU | 993 cores |
| Total Memory | 10383 Gi |
| GPUs required | 4 (min) / 6 (recommended) |


github-actions bot commented Apr 2, 2026

Benchmark: scale-up-latency (OpenShift)

| Metric | Value |
| --- | --- |
| Scale-up time | 0.0s |
| Scale-down time | 0.0s |
| Max replicas | 1 |
| Avg KV cache usage | 0.000 |
| Avg queue depth | 0.0 |
| Replica oscillation (σ) | 0.00 |
| Total duration | 601s |
Environment
  • Cluster: OpenShift (Real GPUs)
  • Model: unsloth/Meta-Llama-3.1-8B
  • Accelerator: H100
  • Commit: b78dda6
  • Scaler: prometheus-adapter
  • Workflow run

kahilam added 3 commits April 2, 2026 09:39
The deploy script already enables flow control via the
ENABLE_EXPERIMENTAL_FLOW_CONTROL_LAYER env var on the EPP. Patching
the EPP with --config-file on v0.5.0-rc.1 causes it to restart and
break Gateway routing (HTTP 500). The scale-up latency test proves
the Gateway works when the EPP is left untouched.

- Skip ensureEPPConfig() call so the EPP is not modified
- Restore direct vLLM fallback as safety net if Gateway still fails
- Keep EPP config helpers in codebase for future use
- All EPP queue metric sampling is retained

Made-with: Cursor
The deploy script sets ENABLE_EXPERIMENTAL_FLOW_CONTROL_LAYER env var
on the EPP. Using --config-file with featureGates: [flowControl]
caused a conflict on v0.5.0-rc.1 that broke Gateway routing (HTTP 500).

New approach:
- Use --config-text to pass EndpointPickerConfig inline (no volume mount)
- Remove the deprecated ENABLE_EXPERIMENTAL_FLOW_CONTROL_LAYER env var
  since config-text featureGates supersede it
- Wait for Gateway health after EPP rollout (5min timeout)
- Gateway is now a hard requirement (no fallback to direct vLLM)
- Scorer weights: queue=2, kv-cache=2, prefix-cache=3

Made-with: Cursor
The Helm chart already deploys the EPP with --config-file pointing to
its own ConfigMap (with scorer weights 2/2/3). Adding a second
--config-file or --config-text flag broke the EPP and caused Gateway
HTTP 500.

New approach:
- Find the EPP deployment's existing ConfigMap volume
- Update the ConfigMap data to add featureGates: [flowControl]
- Trigger a rollout restart via annotation (no arg/volume/env changes)
- Wait for Gateway health after EPP restart
- Gateway is a hard requirement — no fallback to direct vLLM

Made-with: Cursor
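For reference, a rough client-go sketch of the approach this commit describes: find the EPP deployment's existing ConfigMap volume, add the feature gate to its data, and restart via annotation. The deployment name, namespace, and ConfigMap key are assumptions, not the actual values from this branch, and the real code would merge the YAML rather than append to it.

```go
import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// Sketch only: "epp" and "config.yaml" are assumed names.
func enableFlowControl(ctx context.Context, cs kubernetes.Interface, ns string) error {
	dep, err := cs.AppsV1().Deployments(ns).Get(ctx, "epp", metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Locate the existing ConfigMap volume instead of adding new flags or volumes.
	var cmName string
	for _, v := range dep.Spec.Template.Spec.Volumes {
		if v.ConfigMap != nil {
			cmName = v.ConfigMap.Name
			break
		}
	}
	if cmName == "" {
		return fmt.Errorf("no ConfigMap volume on deployment %s", dep.Name)
	}
	cm, err := cs.CoreV1().ConfigMaps(ns).Get(ctx, cmName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Add the feature gate to the existing config document (assumes the key is absent).
	cm.Data["config.yaml"] += "\nfeatureGates:\n- flowControl\n"
	if _, err := cs.CoreV1().ConfigMaps(ns).Update(ctx, cm, metav1.UpdateOptions{}); err != nil {
		return err
	}
	// Same mechanism as `kubectl rollout restart`: bump a pod-template annotation.
	patch := fmt.Sprintf(`{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":%q}}}}}`,
		time.Now().Format(time.RFC3339))
	_, err = cs.AppsV1().Deployments(ns).Patch(ctx, dep.Name, types.StrategicMergePatchType,
		[]byte(patch), metav1.PatchOptions{})
	return err
}
```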

github-actions bot commented Apr 2, 2026

Benchmark: scale-up-latency (OpenShift)

| Metric | Value |
| --- | --- |
| Scale-up time | 0.0s |
| Scale-down time | 0.0s |
| Max replicas | 1 |
| Avg KV cache usage | 0.000 |
| Avg queue depth | 0.0 |
| Replica oscillation (σ) | 0.00 |
| Total duration | 601s |
Environment
  • Cluster: OpenShift (Real GPUs)
  • Model: unsloth/Meta-Llama-3.1-8B
  • Accelerator: H100
  • Commit: 2c2082c
  • Scaler: prometheus-adapter
  • Workflow run

The CI workflow sets E2E_TESTS_ENABLED=true, which causes install.sh
to already configure the EPP with:
- Image v0.5.0-rc.1
- ENABLE_EXPERIMENTAL_FLOW_CONTROL_LAYER=true env var
- ConfigMap with scorer weights queue=2, kv-cache=2, prefix-cache=3

Any modification to the EPP (adding featureGates to config, adding
--config-text, etc.) conflicts with the existing env var and breaks
Gateway routing (HTTP 500).

Replace ensureEPPConfig with verifyEPPConfig that only inspects and
logs the EPP state without modifying it. Gateway connectivity is
validated with a 5-minute retry before benchmark starts.

Made-with: Cursor
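A minimal sketch of the read-only gate this commit describes: poll the Gateway until it answers, failing after five minutes. The gateway URL and probe path are placeholders, not the actual test wiring.

```go
import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// Sketch only: gatewayURL and the probe path are assumptions.
func waitForGateway(ctx context.Context, gatewayURL string) error {
	deadline := time.Now().Add(5 * time.Minute)
	for time.Now().Before(deadline) {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, gatewayURL+"/v1/models", nil)
		if err != nil {
			return err
		}
		resp, err := http.DefaultClient.Do(req)
		if err == nil {
			resp.Body.Close()
			// Anything below 500 means routing works; HTTP 500 is the failure mode above.
			if resp.StatusCode < 500 {
				return nil
			}
		}
		time.Sleep(10 * time.Second)
	}
	return fmt.Errorf("gateway %s not healthy within 5m", gatewayURL)
}
```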

github-actions bot commented Apr 2, 2026

Benchmark: scale-up-latency (OpenShift)

| Metric | Value |
| --- | --- |
| Scale-up time | 0.0s |
| Scale-down time | 0.0s |
| Max replicas | 1 |
| Avg KV cache usage | 0.000 |
| Avg queue depth | 0.0 |
| Replica oscillation (σ) | 0.00 |
| Total duration | 601s |
Environment
  • Cluster: OpenShift (Real GPUs)
  • Model: unsloth/Meta-Llama-3.1-8B
  • Accelerator: H100
  • Commit: 7ff854c
  • Scaler: prometheus-adapter
  • Workflow run

The Gateway returns HTTP 500 with empty body even when EPP is correctly
configured (flow control enabled, weights 2/2/3, pod Running/Ready).
This is a pre-existing infrastructure issue — not caused by our EPP
modifications.

Add diagnostics to capture in a single CI run:
- EPP pod logs (last 50 lines) after failure
- Gateway/Istio pod logs (last 30 lines)
- All service ports (not just first port)
- InferencePool and InferenceModel resources (via unstructured client)
- Verbose curl output with response body and debug headers

Made-with: Cursor
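For the log-capture part, a small client-go sketch (an assumed helper, not the diagnostics code itself) showing how the last N lines of a pod's logs can be pulled in a single CI run:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// Sketch only: namespace and pod names come from the caller.
func tailPodLogs(ctx context.Context, cs kubernetes.Interface, ns, pod string, lines int64) (string, error) {
	req := cs.CoreV1().Pods(ns).GetLogs(pod, &corev1.PodLogOptions{TailLines: &lines})
	data, err := req.DoRaw(ctx)
	if err != nil {
		return "", err
	}
	return string(data), nil
}
```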

github-actions bot commented Apr 2, 2026

Benchmark: scale-up-latency (OpenShift)

⚠️ Benchmark results file not found or could not be parsed.

Environment
  • Cluster: OpenShift (Real GPUs)
  • Model: unsloth/Meta-Llama-3.1-8B
  • Accelerator: H100
  • Commit: 94dae8f
  • Scaler: prometheus-adapter
  • Workflow run


github-actions bot commented Apr 2, 2026

Benchmark: scale-up-latency (OpenShift)

| Metric | Value |
| --- | --- |
| Scale-up time | 0.0s |
| Scale-down time | 0.0s |
| Max replicas | 1 |
| Avg KV cache usage | 0.000 |
| Avg queue depth | 0.0 |
| Replica oscillation (σ) | 0.00 |
| Total duration | 601s |
Environment
  • Cluster: OpenShift (Real GPUs)
  • Model: unsloth/Meta-Llama-3.1-8B
  • Accelerator: H100
  • Commit: 94dae8f
  • Scaler: prometheus-adapter
  • Workflow run

The cluster's Istio 1.29 only watches inference.networking.k8s.io/v1
InferencePool resources, but the v0.3.0 llm-d guide creates them with
inference.networking.x-k8s.io/v1alpha2. This caused Istio to ignore the
InferencePool entirely, resulting in cluster_not_found errors from Envoy.

install.sh now auto-detects when the cluster supports the v1 CRD and
patches the gaie values before helmfile deploy. The HTTPRoute backendRef
group is also updated to match.

Made-with: Cursor
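The detection itself lives in install.sh; purely as an illustration, an equivalent check in Go against the discovery API might look like the sketch below (group and kind per the commit, everything else assumed):

```go
import "k8s.io/client-go/discovery"

// Sketch only: returns true when the cluster serves the v1 InferencePool API.
func supportsV1InferencePool(dc discovery.DiscoveryInterface) bool {
	list, err := dc.ServerResourcesForGroupVersion("inference.networking.k8s.io/v1")
	if err != nil {
		return false
	}
	for _, r := range list.APIResources {
		if r.Kind == "InferencePool" {
			return true
		}
	}
	return false
}
```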

github-actions bot commented Apr 2, 2026

Benchmark: scale-up-latency (OpenShift)

| Metric | Value |
| --- | --- |
| Scale-up time | 45.1s |
| Scale-down time | N/A |
| Max replicas | 2 |
| Avg KV cache usage | 0.000 |
| Avg queue depth | 0.0 |
| Replica oscillation (σ) | 0.00 |
| Total duration | 908s |

Benchmark: prefill-heavy-workload (OpenShift)

| Metric | HPA (Baseline) | WVA | Δ |
| --- | --- | --- | --- |
| Max Replicas | 1 | 6 | +500.0% ↑ |
| Avg Replicas | 1.00 | 3.28 | +228.0% ↑ |
| Avg vLLM Queue Depth | 172.8 | 26.8 | -84.5% ↓ |
| Avg EPP Queue Depth | 270.8 | 72.2 | -73.4% ↓ |
| Avg KV Cache | 0.040 | 0.039 | -2.7% ↓ |
| TTFT mean | 73.5ms | 20.2ms | -72.6% ↓ |
| TTFT p50 | 70.7s | 11.0s | |
| TTFT p99 | 118.2s | 72.0s | |
| ITL mean | 9.90ms | 11.31ms | +14.2% ↑ |
| Throughput mean | 490.3tok/s | 938.4tok/s | +91.4% ↑ |
| Throughput p50 | 388.6tok/s | 653.6tok/s | |
| Completed Requests | 296 | 563 | +90.2% ↑ |
| Duration | 670s | 720s | |
HPA Replica Timeline (44 snapshots)
Time (s) Spec Ready
15 1 1
30 1 1
45 1 1
60 1 1
75 1 1
90 1 1
105 1 1
120 1 1
135 1 1
150 1 1
165 1 1
180 1 1
195 1 1
210 1 1
225 1 1
240 1 1
255 1 1
270 1 1
285 1 1
300 1 1
315 1 1
330 1 1
345 1 1
360 1 1
375 1 1
390 1 1
405 1 1
420 1 1
435 1 1
450 1 1
465 1 1
480 1 1
495 1 1
510 1 1
525 1 1
540 1 1
555 1 1
570 1 1
585 1 1
600 1 1
615 1 1
630 1 1
645 1 1
660 1 1
WVA Replica Timeline (48 snapshots)
Time (s) Spec Ready
15 1 1
30 1 1
45 1 1
60 1 1
75 1 1
90 2 1
105 2 2
120 2 2
135 2 2
150 2 2
165 2 2
180 2 2
195 2 2
210 3 2
225 3 3
240 3 3
255 3 3
270 3 3
285 3 3
300 3 3
315 3 3
330 3 3
345 3 3
360 4 3
375 4 4
390 4 4
405 4 4
420 4 4
435 4 4
450 4 4
465 4 4
480 4 4
495 4 4
510 5 4
525 5 5
540 5 5
555 5 5
570 5 5
585 5 5
600 5 5
615 5 5
630 5 5
645 5 5
660 6 5
675 6 5
690 6 5
705 6 6
720 6 6
Dashboard Panels (4): prefill comparison, prefill metrics timeline, prefill percentiles, prefill replica timeline

📎 Download artifacts

Environment
  • Cluster: OpenShift (Real GPUs)
  • Model: unsloth/Meta-Llama-3.1-8B
  • Accelerator: H100
  • Commit: cbca491
  • Scaler: prometheus-adapter
  • Workflow run

- Add model_id workflow_dispatch input so benchmarks can be triggered
  with any HuggingFace model (default: unsloth/Meta-Llama-3.1-8B)
- Generate per-autoscaler PDF reports (3-page) matching the colleague's
  benchmark report format: config summary, time-series charts, and
  percentile distributions
- Show model name dynamically in run-name and PR comment

Made-with: Cursor

github-actions bot commented Apr 3, 2026

Benchmark: scale-up-latency (OpenShift)

| Metric | Value |
| --- | --- |
| Scale-up time | 75.1s |
| Scale-down time | N/A |
| Max replicas | 2 |
| Avg KV cache usage | 0.000 |
| Avg queue depth | 0.0 |
| Replica oscillation (σ) | 0.00 |
| Total duration | 907s |

Benchmark: prefill-heavy-workload (OpenShift)

| Metric | HPA (Baseline) | WVA | Δ |
| --- | --- | --- | --- |
| Max Replicas | 1 | 6 | +500.0% ↑ |
| Avg Replicas | 1.00 | 3.30 | +230.4% ↑ |
| Avg vLLM Queue Depth | 148.0 | 30.7 | -79.3% ↓ |
| Avg EPP Queue Depth | 298.7 | 109.6 | -63.3% ↓ |
| Avg KV Cache | 0.030 | 0.027 | -11.8% ↓ |
| TTFT mean | 75.0ms | 24.2ms | -67.7% ↓ |
| TTFT p50 | 74.0s | 20.4s | |
| TTFT p99 | 118.4s | 72.2s | |
| ITL mean | 9.28ms | 10.16ms | +9.4% ↑ |
| Throughput mean | 531.4tok/s | 1056.8tok/s | +98.9% ↑ |
| Throughput p50 | 429.1tok/s | 658.2tok/s | |
| Completed Requests | 321 | 634 | +97.5% ↑ |
| Duration | 680s | 675s | |
HPA Replica Timeline (45 snapshots)
Time (s) Spec Ready
15 1 1
30 1 1
45 1 1
60 1 1
75 1 1
90 1 1
105 1 1
120 1 1
135 1 1
150 1 1
165 1 1
180 1 1
195 1 1
210 1 1
225 1 1
240 1 1
255 1 1
270 1 1
285 1 1
300 1 1
315 1 1
330 1 1
345 1 1
360 1 1
375 1 1
390 1 1
405 1 1
420 1 1
435 1 1
450 1 1
465 1 1
480 1 1
495 1 1
510 1 1
525 1 1
540 1 1
555 1 1
570 1 1
585 1 1
600 1 1
615 1 1
630 1 1
645 1 1
660 1 1
675 1 1
WVA Replica Timeline (45 snapshots)
Time (s) Spec Ready
15 1 1
30 1 1
45 1 1
60 1 1
75 1 1
90 2 2
105 2 2
120 2 2
135 2 2
150 2 2
165 2 2
180 2 2
195 2 2
210 3 3
225 3 3
240 3 3
255 3 3
270 3 3
285 3 3
300 3 3
315 3 3
330 3 3
345 3 3
360 4 4
375 4 4
390 4 4
405 4 4
420 4 4
435 4 4
450 4 4
465 4 4
480 5 5
495 5 5
510 5 5
525 5 5
540 5 5
555 5 5
570 5 5
585 5 5
600 6 6
615 6 6
630 6 6
645 6 6
660 6 6
675 6 6
Dashboard Panels (4): prefill comparison, prefill metrics timeline, prefill percentiles, prefill replica timeline

📎 Download artifacts

Environment
  • Cluster: OpenShift (Real GPUs)
  • Model: Qwen/Qwen3-0.6B
  • Accelerator: H100
  • Commit: 8933a03
  • Scaler: prometheus-adapter
  • Workflow run

kahilam added 2 commits April 3, 2026 14:13
- HPA test now creates VA(min=1, max=2, cost=10) + HPA(min=1, max=10)
  to match colleague's setup instead of pure CPU-based HPA
- WVA test cost changed from 30.0 to 10.0 for consistency
- Added model_id, va_config, hpa_config, achieved_rps, error_count,
  incomplete_count fields to result JSON
- Enhanced PDF reports with detailed autoscaler configuration section,
  error/incomplete request tracking, and achieved RPS
- PR comment table now includes failed/incomplete requests and RPS rows

Made-with: Cursor
The external metrics API (prometheus-adapter) can be transiently
unavailable, causing all benchmarks to fail. Change the check from
a hard failure to a best-effort warning with diagnostics, so the
benchmark runs and collects data even when HPA cannot scale.

Made-with: Cursor
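A sketch of what "best-effort" can look like here; the probe path is the standard external metrics group/version, while the logging and control flow are assumptions rather than the actual check:

```go
import (
	"context"
	"log"

	"k8s.io/client-go/kubernetes"
)

// Sketch only: warn and keep going instead of aborting the benchmark.
func checkExternalMetrics(ctx context.Context, cs kubernetes.Interface) {
	_, err := cs.Discovery().RESTClient().
		Get().AbsPath("/apis/external.metrics.k8s.io/v1beta1").DoRaw(ctx)
	if err != nil {
		log.Printf("WARNING: external metrics API unavailable, HPA may not scale: %v", err)
		return
	}
	log.Printf("external metrics API reachable")
}
```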
@llm-d llm-d deleted a comment from github-actions bot Apr 3, 2026
KEDA on the OpenShift cluster continuously reclaims the
external.metrics.k8s.io APIService, preventing prometheus-adapter
from serving wva_desired_replicas. The existing guard only ran
during the deploy step and was dead by the time tests started.

Add a background guard loop that re-patches the APIService every
8 seconds during the actual benchmark run so HPA can scale.

Made-with: Cursor
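The guard loop itself is simple; a sketch of its shape, with the actual patch step passed in as a function since its details changed in later commits:

```go
import (
	"context"
	"log"
	"time"
)

// Sketch only: repatch is whatever re-applies the APIService spec.
func runAPIServiceGuard(ctx context.Context, repatch func(context.Context) error) {
	ticker := time.NewTicker(8 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := repatch(ctx); err != nil {
				log.Printf("APIService guard: re-patch failed: %v", err)
			}
		}
	}
}
```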
@llm-d llm-d deleted a comment from github-actions bot Apr 3, 2026
@llm-d llm-d deleted a comment from github-actions bot Apr 3, 2026
kahilam added 4 commits April 3, 2026 16:49
The cluster already has a working prometheus-adapter setup in
workload-variant-autoscaler-monitoring with wva_desired_replicas
rules configured. Using SCALER_BACKEND=prometheus-adapter was
deploying a second adapter and re-patching the APIService, which
then got reclaimed by KEDA, breaking all external metrics.

Switch to SCALER_BACKEND=none to preserve the existing working
external metrics API setup.

Made-with: Cursor
The throughput/ttft/itl fields use omitempty in Go — when GuideLLM
metric extraction fails, these keys are absent from the JSON results.
Add a safe accessor helper and use .get() throughout the plotting
code to handle missing fields gracefully.

Made-with: Cursor
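On the Go side the cause is simply omitempty on zero values; an illustrative struct (field and JSON key names are assumptions):

```go
// Illustrative only: when GuideLLM extraction fails these stay at their zero
// value, encoding/json drops the keys, and downstream readers must treat
// them as optional.
type BenchmarkResult struct {
	TTFTMean       float64 `json:"ttft_mean,omitempty"`
	ITLMean        float64 `json:"itl_mean,omitempty"`
	ThroughputMean float64 `json:"throughput_mean,omitempty"`
}
```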
The APIService guard patch was failing with:
  "spec.insecureSkipTLSVerify: Invalid value: true: may not be
   true if caBundle is present"

KEDA sets a caBundle when it reclaims the APIService, which is
mutually exclusive with insecureSkipTLSVerify=true. Adding
"caBundle": null to the merge patch clears it before setting
insecureSkipTLSVerify, matching the state that worked on April 2.

Also switches SCALER_BACKEND back to prometheus-adapter and
re-adds the APIService guard to the CI run step.

Made-with: Cursor
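A sketch of the corrected merge patch; the APIService name is the conventional one for the external metrics group, and the client wiring is assumed:

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	aggregator "k8s.io/kube-aggregator/pkg/client/clientset_generated/clientset"
)

// Sketch only: caBundle must be cleared in the same patch that sets insecureSkipTLSVerify.
func clearCABundle(ctx context.Context, agg aggregator.Interface) error {
	patch := []byte(`{"spec":{"caBundle":null,"insecureSkipTLSVerify":true}}`)
	_, err := agg.ApiregistrationV1().APIServices().Patch(ctx,
		"v1beta1.external.metrics.k8s.io", types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```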
- TTFT mean in PR comment showed "ms" but value was already divided
  by 1000 (should be "s")
- Achieved RPS was always 0.00 because GuideLLM may not expose
  rate.completed_rate; add fallback: completed_requests / duration

Made-with: Cursor
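The fallback is a one-liner; sketched here in Go with assumed names, while the real fix lives in the report generation layer:

```go
// Sketch only: prefer GuideLLM's completed rate, otherwise derive RPS from totals.
func achievedRPS(rateCompleted float64, completedRequests int, durationSeconds float64) float64 {
	if rateCompleted > 0 {
		return rateCompleted
	}
	if durationSeconds <= 0 {
		return 0
	}
	return float64(completedRequests) / durationSeconds
}
```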

github-actions bot commented Apr 4, 2026

Benchmark: scale-up-latency (OpenShift)

| Metric | Value |
| --- | --- |
| Scale-up time | 0.0s |
| Scale-down time | 0.0s |
| Max replicas | 1 |
| Avg KV cache usage | 0.000 |
| Avg queue depth | 0.0 |
| Replica oscillation (σ) | 0.00 |
| Total duration | 600s |

Benchmark: prefill-heavy-workload (OpenShift)

| Metric | HPA (Baseline) | WVA | Δ |
| --- | --- | --- | --- |
| Max Replicas | 2 | 7 | +250.0% ↑ |
| Avg Replicas | 1.78 | 3.87 | +117.1% ↑ |
| Avg vLLM Queue Depth | 123.4 | 22.3 | -81.9% ↓ |
| Avg EPP Queue Depth | 128.5 | 81.3 | -36.8% ↓ |
| Avg KV Cache | 0.029 | 0.021 | -25.8% ↓ |
| TTFT mean | 54.1s | 22.4s | -58.7% ↓ |
| TTFT p50 | 57.0s | 16.7s | |
| TTFT p99 | 101.7s | 65.2s | |
| ITL mean | 9.79ms | 10.66ms | +8.9% ↑ |
| Throughput mean | 861.2tok/s | 1155.9tok/s | +34.2% ↑ |
| Throughput p50 | 599.4tok/s | 735.8tok/s | |
| Completed Requests | 517 | 691 | +33.7% ↑ |
| Failed Requests | 4581 | 8910 | |
| Incomplete Requests | 511 | 459 | |
| Achieved RPS | 0.78 | 1.02 | |
| Duration | 665s | 675s | |
HPA Replica Timeline (44 snapshots)
Time (s) Spec Ready
15 1 1
30 1 1
45 1 1
60 1 1
75 1 1
90 1 1
105 2 1
120 2 2
135 2 2
150 2 2
165 2 2
180 2 2
195 2 2
210 2 2
225 2 2
240 2 2
255 2 2
270 2 2
285 2 2
300 2 2
315 2 2
330 2 2
345 2 2
360 2 2
375 2 2
390 2 2
405 2 2
420 2 2
435 2 2
450 2 2
465 2 2
480 2 2
495 2 2
510 2 2
525 2 2
540 2 2
555 2 2
570 2 2
585 2 2
600 2 2
615 2 2
630 2 2
645 2 2
660 2 2
WVA Replica Timeline (45 snapshots)
Time (s) Spec Ready
15 2 1
30 2 2
45 2 2
60 2 2
75 2 2
90 2 2
105 2 2
120 2 2
135 2 2
150 2 2
165 3 2
180 3 3
195 3 3
210 3 3
225 3 3
240 3 3
255 3 3
270 3 3
285 4 3
300 4 4
315 4 4
330 4 4
345 4 4
360 4 4
375 4 4
390 4 4
405 5 4
420 5 5
435 5 5
450 5 5
465 5 5
480 5 5
495 5 5
510 5 5
525 6 5
540 6 6
555 6 6
570 6 6
585 6 6
600 6 6
615 6 6
630 6 6
645 7 6
660 7 7
675 7 7
Dashboard Panels (4): prefill comparison, prefill metrics timeline, prefill percentiles, prefill replica timeline

📎 Download artifacts

Environment
  • Cluster: OpenShift (Real GPUs)
  • Model: Qwen/Qwen3-0.6B
  • Accelerator: H100
  • Commit: e64fe7a
  • Scaler: prometheus-adapter
  • Workflow run

@llm-d llm-d deleted a comment from github-actions bot Apr 4, 2026
@llm-d llm-d deleted a comment from github-actions bot Apr 6, 2026
…aults

The benchmark was deploying vLLM with --max-num-seqs=5 (only 5 concurrent
requests per pod), causing 2-3% KV cache utilization and ~1 RPS instead of
the expected 60-100% KV cache and ~9 RPS. Removing this allows vLLM to use
its default (256), matching the colleague's benchmark configuration.

Also aligns WVA saturation thresholds (kvSpareTrigger, queueSpareTrigger)
to chart defaults (0.1, 3) to match the colleague's setup.

Made-with: Cursor

github-actions bot commented Apr 6, 2026

Benchmark: scale-up-latency (OpenShift)

| Metric | Value |
| --- | --- |
| Scale-up time | 0.0s |
| Scale-down time | 0.0s |
| Max replicas | 1 |
| Avg KV cache usage | 0.000 |
| Avg queue depth | 0.0 |
| Replica oscillation (σ) | 0.00 |
| Total duration | 604s |

Benchmark: prefill-heavy-workload (OpenShift)

| Metric | HPA (Baseline) | WVA | Δ |
| --- | --- | --- | --- |
| Max Replicas | 2 | 3 | +50.0% ↑ |
| Avg Replicas | 1.79 | 2.57 | +43.2% ↑ |
| Avg vLLM Queue Depth | 62.4 | 31.1 | -50.2% ↓ |
| Avg EPP Queue Depth | 98.4 | 35.9 | -63.5% ↓ |
| Avg KV Cache | 0.729 | 0.608 | -16.6% ↓ |
| TTFT mean | 23.7s | 13.4s | -43.5% ↓ |
| TTFT p50 | 27.6s | 10.6s | |
| TTFT p99 | 61.4s | 47.1s | |
| ITL mean | 32.28ms | 34.12ms | +5.7% ↑ |
| Throughput mean | 7163.8tok/s | 6367.9tok/s | -11.1% ↓ |
| Throughput p50 | 5714.3tok/s | 4810.0tok/s | |
| Completed Requests | 4299 | 3795 | -11.7% ↓ |
| Failed Requests | 1729 | 5093 | |
| Incomplete Requests | 511 | 3 | |
| Achieved RPS | 6.23 | 5.54 | |
| Duration | 690s | 685s | |
HPA Replica Timeline (46 snapshots)
Time (s) Spec Ready
15 1 1
30 1 1
45 1 1
60 1 1
75 1 1
90 1 1
105 2 1
120 2 2
135 2 2
150 2 2
165 2 2
180 2 2
195 2 2
210 2 2
225 2 2
240 2 2
255 2 2
270 2 2
285 2 2
300 2 2
315 2 2
330 2 2
345 2 2
360 2 2
375 2 2
390 2 2
405 2 2
420 2 2
435 2 2
450 2 2
465 2 2
480 2 2
495 2 2
510 2 2
525 2 2
540 2 2
555 2 2
570 2 2
585 2 2
600 2 2
615 2 2
630 2 2
645 2 2
660 2 2
675 2 2
690 2 2
WVA Replica Timeline (45 snapshots)
Time (s) Spec Ready
15 2 2
30 2 2
45 2 2
60 2 2
75 2 2
90 2 2
105 2 2
120 2 2
135 2 2
150 2 2
165 3 3
180 3 3
195 3 3
210 3 3
225 3 3
240 3 3
255 3 3
270 3 3
285 3 3
300 3 3
315 3 3
330 3 3
345 3 3
360 3 3
375 3 3
390 3 3
405 3 3
420 3 3
435 3 3
450 3 3
465 3 3
480 3 3
495 3 3
510 2 2
525 2 2
540 2 2
555 2 2
570 2 2
585 3 3
600 3 3
615 3 3
630 3 3
645 3 3
660 3 3
675 3 3
Dashboard Panels (4): prefill comparison, prefill metrics timeline, prefill percentiles, prefill replica timeline

📎 Download artifacts

Environment
  • Cluster: OpenShift (Real GPUs)
  • Model: Qwen/Qwen3-0.6B
  • Accelerator: H100
  • Commit: 000a7d1
  • Scaler: prometheus-adapter
  • Workflow run

V1 analyzer scales by +1 replica per 30s cycle and blocks during pod
transitions, limiting scaling to ~3 replicas in a 600s test. V2 uses
demand-based calculation (ceil(requiredCapacity / perReplicaCapacity))
and can jump to the needed replica count in one decision, matching the
colleague's benchmark behavior.

Made-with: Cursor
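A sketch of the demand-based sizing the V2 analyzer uses, following the formula quoted above; the function name and min/max clamping are illustrative:

```go
import "math"

// Sketch only: a single decision can jump straight to the needed replica count.
func desiredReplicas(requiredCapacity, perReplicaCapacity float64, minR, maxR int) int {
	if perReplicaCapacity <= 0 {
		return minR
	}
	n := int(math.Ceil(requiredCapacity / perReplicaCapacity))
	if n < minR {
		n = minR
	}
	if n > maxR {
		n = maxR
	}
	return n
}
```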