
[sync] upstream llm-d main branch 0ee5ba1 [2026-04-06] #52

Merged
anishasthana merged 14 commits into opendatahub-io:main from zdtsw-forking:sync/upstream-89ef12b
Apr 6, 2026

Conversation


@zdtsw zdtsw commented Apr 2, 2026

Syncs llm-d/llm-d-workload-variant-autoscaler main branch into ODH main branch.

Upstream commit: llm-d@0ee5ba1

This PR maps to https://github.com/llm-d/llm-d-workload-variant-autoscaler/releases/tag/v0.6.0

Summary by CodeRabbit

  • New Features

    • LeaderWorkerSet (LWS) support: chart, manifests, docs, and runtime detection; example sample added.
    • Benchmarking suite and tools: benchmark scenarios, runners, metric dump & report generators, Grafana dashboard support.
  • Tests

    • E2E re-focused on deterministic correctness; load/perf moved to benchmark tooling.
    • Added scale-from-zero and saturation analyzer e2e suites; granular timing/polling config added.
  • Documentation

    • Updated deploy/test guides, new LWS and queueing-model docs, troubleshooting and samples.
  • Chores

    • Install/deploy scripts modularized; DEPLOY_VA/DEPLOY_HPA default to false.
    • CI Trivy action integrated; RBAC and CRD docs updated (accelerator field deprecated).

zdtsw and others added 10 commits March 30, 2026 13:13
* fix: impacted old trivy version

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* Update .github/actions/trivy-scan/action.yml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Wen Zhou <wenzhou@redhat.com>

---------

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…lm-d#937)

* cleanup e2es and remove load tests

Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>

* address review

Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>

* add comment for clarity for prom adapter restarts

Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>

* add saturation v1 focused tests

Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>

* magic numbers to named constants

Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>

* rm hard coded deployment name

Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>

* add comment regarding workload builder for benchmarking

Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>

---------

Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>
* Add shuynh2017 to the OWNERS list

* deprecate desiredOptimizedAlloc.accelerator

* make optional

* remove MinLength
Signed-off-by: Braulio Dumba <Braulio.Dumba@ibm.com>
* fix: call ApplyDefaults before Validate in saturation config parsing

parseSaturationConfig() called Validate() without first calling
ApplyDefaults(), causing V2 configs with analyzerName: saturation
to fail validation because scaleUpThreshold/scaleDownBoundary
default to zero (omitempty) and Validate() rejects zero values.

This caused the engine to skip all models with "Saturation scaling
config not loaded yet for namespace", resulting in no scaling decisions.

* fix: V2 analyzer fallback when vllm:cache_config_info is absent

When the model server does not emit the vllm:cache_config_info metric
(e.g., llm-d-inference-sim), TotalKvCapacityTokens is 0 and the V2
analyzer skipped the replica entirely, resulting in totalDemand=0 and
no scale-up decisions.

Add computeReplicaCapacityFallback that uses the deployment-derived
capacity from the capacity store and estimates demand from KvCacheUsage
percentage. This allows V2 to produce scaling decisions with any
vLLM-compatible server, not just those emitting cache_config_info.

* test: add Ginkgo unit tests for V2 analyzer fallback and config defaults

- computeReplicaCapacityFallback: 7 tests covering nil store, zero capacity,
  KvCacheUsage-based demand, saturation detection, queue demand, field population
- Analyze with fallback: 2 integration tests for full analyzer path without cache_config_info
- ApplyDefaults before Validate: 4 tests verifying the config parsing fix
  (omitted thresholds, validation ordering, explicit value preservation, default priority)

* address review: apply KvCacheThreshold in fallback, move tests to analyzer_test.go

- Apply KvCacheThreshold to stored capacity in fallback path, matching
  the main code path where k1 = totalTokens * threshold. This ensures
  saturation is detected at the same utilization level regardless of
  which path is taken.
- Move fallback and config default tests into analyzer_test.go alongside
  existing tests (per review feedback).
- Add threshold consistency test verifying 90% KvCacheUsage with 0.8
  threshold triggers appropriate demand levels in the fallback path.

* cleanup: remove unused gpuCount param, rename config shadow, tighten test assertions

- Remove unused gpuCount parameter from computeReplicaCapacityFallback
- Rename config parameter to cfg to avoid shadowing the config package
- Rename test describe from 'parseSaturationConfig...' to
  'SaturationScalingConfig...' (tests config types, not controller func)
- Replace loose '> 0' assertion with computed expected value in
  integration test
* Add shuynh2017 to the OWNERS list

* lws support2

* address comments

* fix merge

* update doc

* fix test error

* install lws on openshift

* fix e2e test

* fix bug: missing a change for scaletarget

* fix e2e test

* fix lws test to use same pool as deployment

* address comments

* remove flaky from test

* double go test timeout

* skip multi-var saturation for lws just like for deployment

* bump timeout to 60m

* mark Scale-From-Zero tests as flaky - never worked locally with main branch code

* add missing rbac

* fix merge issues

* fix import

* add scale-from-zero LWS

* add 1 leader, 0 worker tests, clean env before run
* script to extract benchmark data

* re-org benchmark extract and add benchmark spawn

* add customizations

* dump metrics from prom

* changes to script

* clean up readme and reorg

* address review

* fix base path issue

* address typo

coderabbitai Bot commented Apr 2, 2026

📝 Walkthrough

This PR refactors deployment scripts into shared library modules, adds kube-like adapters, and introduces a scale-target abstraction (ScaleTargetAccessor) with LeaderWorkerSet support across controllers, collectors, engines, and tests. It replaces many deployment/test sleeps with reusable wait helpers, centralizes install/cleanup logic, and extends CI/benchmark tooling. Several end-to-end saturation/scale-to-zero/load tests were removed, replaced, or reworked; new e2e suites and fixtures for saturation-path, scale-from-zero, LWS, and inference-objectives were added. Helm charts, CRDs, RBAC, and go module deps were updated to support LWS and optional behaviors.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Security & Architectural Issues

  • CWE-347 (Improper Verification of Cryptographic Signature): SKIP_HELM_REPO_UPDATE allows skipping Helm repo refresh. Action: require explicit opt-in (no default); display an unmistakable security warning at startup; if skipping, perform additional verification (helm chart provenance/signature or fixed allowlist).

  • CWE-59 / CWE-426 (Symbolic Link / Untrusted Search Path): hack/burst_load_generator.sh is a file pointing to another script without validation. Action: replace pointer with a safe symlink creation step in CI or validate resolved path and file owner/permissions at runtime; avoid executing indirect, writable locations.

  • CWE-200 (Exposure of Sensitive Information): Multiple fixtures/scripts use plain HTTP endpoints (e.g., saturation_threshold_trigger.sh, burst generators). Action: validate and enforce HTTPS schemas for TARGET_URL/TARGET_SERVICE or make TLS optional but explicitly gated; fail fast when insecure transport is detected in sensitive configurations.

  • CWE-20 (Missing Input Validation): saturation_threshold_trigger.sh requires many env vars but lacks robust validation. Action: add strict preflight checks (set -u; validate non-empty, numeric ranges, port format), and fail with clear errors before any network activity.

  • CWE-362 / Race Condition: install_core.sh / deploy/install.sh conditional readiness logic (INFRA_ONLY/E2E flow) and background apiservice-guard that patches external.metrics API may introduce race windows where controllers assume metrics presence. Action: replace name-based heuristics with explicit readiness probes (API discovery + readiness conditions), and ensure any APIService patching is performed with leader-election/coordination and idempotent recovery; log and surface failures to calling process.

  • Privilege/Integrity risk in APIService patching (patching external.metrics.k8s.io): the APISERVICE guard alters cluster APIService target in a background loop. Action: restrict this behavior to operators with documented privileges; add RBAC checks and a safe mode; record auditable events when patching APIService; avoid continuous auto-patching without exponential backoff and clear operator consent.

  • Image provenance inconsistency: tests and helpers pin some images (curl pinned) while others remain floating. Action: centralize image pins and require signed images or reproducible digests for CI runs; document acceptable registries and enforce via CI policy.

  • Secret handling and HF token: infra_llmd.sh creates/updates llm-d-hf-token. Action: ensure secrets are created with restricted RBAC, avoid echoing tokens in logs, and follow least-privilege for Helm/terraform steps; consider sealed secrets or external secret stores for CI-sensitive tokens.

Address each item with small, concrete remediation steps and follow-up instrumented logging and CI checks.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
test/e2e/pod_scraping_test.go (1)

59-82: ⚠️ Potential issue | 🟠 Major

Don't let EPP discovery exit the retry loop before a service is found.

Line 81 only rechecks err, which is already nil after a successful List. If no -epp service exists yet, this closure makes no failing assertion, Eventually stops retrying, and the next Expect(eppServiceName).NotTo(BeEmpty()) fails immediately.

Suggested fix
 		Eventually(func(g Gomega) {
+			found := false
+
 			// Try expected EPP service first
 			_, err := k8sClient.CoreV1().Services(cfg.LLMDNamespace).Get(ctx,
 				expectedEPPName, metav1.GetOptions{})
 			if err == nil {
 				eppServiceName = expectedEPPName
-				return
-			}
-
-			// If expected service doesn't exist, discover existing EPP services
-			serviceList, err := k8sClient.CoreV1().Services(cfg.LLMDNamespace).List(ctx, metav1.ListOptions{})
-			g.Expect(err).NotTo(HaveOccurred(), "Should be able to list services")
-
-			// Find first EPP service (service name ends with "-epp")
-			for _, svc := range serviceList.Items {
-				if len(svc.Name) > 4 && svc.Name[len(svc.Name)-4:] == "-epp" {
-					eppServiceName = svc.Name
-					GinkgoWriter.Printf("Discovered EPP service: %s\n", eppServiceName)
-					return
+				found = true
+			} else {
+				// If expected service doesn't exist, discover existing EPP services
+				serviceList, err := k8sClient.CoreV1().Services(cfg.LLMDNamespace).List(ctx, metav1.ListOptions{})
+				g.Expect(err).NotTo(HaveOccurred(), "Should be able to list services")
+
+				// Find first EPP service (service name ends with "-epp")
+				for _, svc := range serviceList.Items {
+					if len(svc.Name) > 4 && svc.Name[len(svc.Name)-4:] == "-epp" {
+						eppServiceName = svc.Name
+						found = true
+						GinkgoWriter.Printf("Discovered EPP service: %s\n", eppServiceName)
+						break
+					}
 				}
 			}
 
-			g.Expect(err).NotTo(HaveOccurred(), "EPP service should exist")
+			g.Expect(found).To(BeTrue(), "EPP service should exist")
 		}).Should(Succeed())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/pod_scraping_test.go` around lines 59 - 82, The closure passed to
Eventually can return without failing when no "-epp" service is found because
the final check only reasserts err (which is nil after List); modify the end of
the closure to assert that eppServiceName is non-empty so the closure fails and
Eventually keeps retrying until a service appears. Concretely, after the
for-loop in the Eventually closure replace the final
g.Expect(err).NotTo(HaveOccurred(), "EPP service should exist") with
g.Expect(eppServiceName).NotTo(BeEmpty(), "EPP service should exist") (or an
explicit g.Fail/g.Expect(false).To(BeTrue())), referencing the existing
eppServiceName, the List call (k8sClient.CoreV1().Services(...).List) and the
closure used by Eventually to ensure retries continue until a service is
discovered.
test/e2e/smoke_test.go (1)

637-651: ⚠️ Potential issue | 🟡 Minor

Reject MetricsAvailable=Unknown in this poll.

Line 637 has no default branch. If the controller leaves MetricsAvailable at Unknown, this Eventually succeeds without checking the state this spec claims to validate.

Suggested fix
 				switch condition.Status {
 				case metav1.ConditionFalse:
 					// If metrics are unavailable, reason should indicate why
 					g.Expect(condition.Reason).To(BeElementOf(
 						variantautoscalingv1alpha1.ReasonMetricsMissing,
 						variantautoscalingv1alpha1.ReasonMetricsStale,
 						variantautoscalingv1alpha1.ReasonPrometheusError,
 						variantautoscalingv1alpha1.ReasonMetricsUnavailable,
 					), "When metrics are unavailable, reason should indicate the cause")
 				case metav1.ConditionTrue:
 					// If metrics are available, reason should be MetricsFound
 					g.Expect(condition.Reason).To(Equal(variantautoscalingv1alpha1.ReasonMetricsFound),
 						"When metrics are available, reason should be MetricsFound")
+				default:
+					g.Expect(condition.Status).To(BeElementOf(
+						metav1.ConditionTrue,
+						metav1.ConditionFalse,
+					), "MetricsAvailable should resolve to True or False")
 				}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/smoke_test.go` around lines 637 - 651, The switch on
condition.Status in the MetricsAvailable check must explicitly reject
metav1.ConditionUnknown so the Eventually cannot succeed with an Unknown state;
update the switch in test/e2e/smoke_test.go (the block that inspects
MetricsAvailable) to add a default or explicit metav1.ConditionUnknown branch
that fails the test (e.g., use the G.Expect assertion to assert condition.Status
is not metav1.ConditionUnknown or call
G.Expect(...).To(Not(Equal(metav1.ConditionUnknown))) with a message), keeping
the existing cases for metav1.ConditionFalse and metav1.ConditionTrue and the
referenced reasons like variantautoscalingv1alpha1.ReasonMetricsFound.
test/e2e/scale_from_zero_test.go (3)

393-434: ⚠️ Potential issue | 🟡 Minor

Potential shell injection via modelID in embedded script.

modelID is interpolated directly into a shell script via fmt.Sprintf. If modelID contains shell metacharacters (e.g., "; rm -rf /;), it could execute arbitrary commands inside the job container.

While modelID is typically controlled by config (cfg.ModelID), defense-in-depth suggests sanitization or using environment variables instead of string interpolation.

🛡️ Safer approach using environment variables
 func createScaleFromZeroTriggerJob(name, namespace, gatewayService, modelID string) *batchv1.Job {
 	backoffLimit := int32(3)
 	numRequests := 10

-	script := fmt.Sprintf(`#!/bin/sh
+	script := fmt.Sprintf(`#!/bin/sh
 echo "Scale-from-zero trigger job starting..."
 echo "Sending %d requests to gateway %s:80"
-echo "Model ID: %s"
+echo "Model ID: $MODEL_ID"
 ...
-    -d '{"model":"%s","prompt":"Test prompt for scale-from-zero","max_tokens":50}' 2>&1)
+    -d "{\"model\":\"$MODEL_ID\",\"prompt\":\"Test prompt for scale-from-zero\",\"max_tokens\":50}" 2>&1)
 ...
-`, numRequests, gatewayService, modelID, numRequests, numRequests, gatewayService, modelID)
+`, numRequests, gatewayService, numRequests, numRequests, gatewayService)

 	return &batchv1.Job{
 		...
 		Spec: batchv1.JobSpec{
 			...
 			Template: corev1.PodTemplateSpec{
 				...
 				Spec: corev1.PodSpec{
 					...
 					Containers: []corev1.Container{
 						{
 							...
+							Env: []corev1.EnvVar{
+								{
+									Name:  "MODEL_ID",
+									Value: modelID,
+								},
+							},
 						},
 					},
 				},
 			},
 		},
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/scale_from_zero_test.go` around lines 393 - 434, The embedded shell
script is vulnerable because fmt.Sprintf directly interpolates modelID (and
other vars) into the script template (see the script variable and fmt.Sprintf
call), allowing shell metacharacters in modelID to execute commands; fix by
removing direct interpolation and instead pass modelID via a safe environment
variable or other typed parameter injection (set an env var like MODEL_ID in the
Job/Pod spec and reference it inside the script), or at minimum escape/sanitize
modelID before insertion; update the creation site where script is built (the
script variable / fmt.Sprintf usage) to use environment variables or a proper
escaping helper for modelID.

66-87: ⚠️ Potential issue | 🟡 Minor

Missing timeout on Eventually block for EPP pods readiness.

Same issue—relies on Gomega's default 1-second timeout for a pod readiness check that may take longer.

🔧 Add explicit timeout
 		Eventually(func(g Gomega) {
 			...
 			g.Expect(hasReadyPod).To(BeTrue(), "At least one EPP pod should be ready")
-		}).Should(Succeed(), "EPP pods should be ready")
+		}, time.Duration(cfg.EventuallyShortSec)*time.Second, time.Duration(cfg.PollIntervalQuickSec)*time.Second).Should(Succeed(), "EPP pods should be ready")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/scale_from_zero_test.go` around lines 66 - 87, The Eventually block
that checks EPP pod readiness currently uses Gomega's default timeout; update
the call to Eventually(...) to include an explicit timeout (and optional polling
interval) to avoid flakiness—e.g., change the invocation of Eventually(func(g
Gomega) { ... }).Should(Succeed(), ...) to Eventually(func(g Gomega) { ... },
2*time.Minute, 5*time.Second).Should(Succeed(), ...) (or another suitable
timeout/interval) so the loop around k8sClient.CoreV1().Pods(...).List and the
readiness checks has enough time to observe a ready pod.

293-305: ⚠️ Potential issue | 🟡 Minor

Missing timeout on job pod readiness Eventually.

This waits for a job pod to be running but uses default timeout. Job scheduling can take several seconds.

🔧 Add explicit timeout
 			Eventually(func(g Gomega) {
 				...
 				g.Expect(pod.Status.Phase).To(Or(
 					Equal(corev1.PodRunning),
 					Equal(corev1.PodSucceeded),
 				), "Job pod should be running or succeeded")
-			}).Should(Succeed())
+			}, time.Duration(cfg.EventuallyShortSec)*time.Second, time.Duration(cfg.PollIntervalQuickSec)*time.Second).Should(Succeed())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/scale_from_zero_test.go` around lines 293 - 305, The Eventually
block that checks the job pod readiness (the closure using
k8sClient.CoreV1().Pods(...), triggerJobName and ctx) relies on the default
Gomega timeout; add an explicit timeout (and optional polling interval) to the
Eventually call so the test waits sufficiently long for scheduling (for example
pass a duration like 30s and a poll interval like 500ms to Eventually). Update
the Eventually(...).Should(Succeed()) invocation to include these
timeout/polling arguments to ensure robust waiting for the job pod to reach
PodRunning or PodSucceeded.
🧹 Nitpick comments (5)
test/e2e/fixtures/saturation_threshold_trigger.sh (1)

39-41: Potential JSON injection if MODEL_ID contains special characters.

MODEL_ID is interpolated directly into JSON without escaping. If it contains ", \, or control characters, the JSON becomes malformed or exploitable (CWE-94). For test fixtures this is low-risk but still violates defensive practices.

Mitigation options

Option 1: Use jq to build JSON safely:

JSON_PAYLOAD=$(jq -n --arg model "$MODEL_ID" --argjson tokens "$MAX_TOKENS" \
  '{model: $model, prompt: "Deterministic saturation threshold crossing prompt", max_tokens: $tokens}')
curl ... -d "$JSON_PAYLOAD"

Option 2: Validate MODEL_ID contains only safe characters:

case "$MODEL_ID" in
  *[\"\\]*) echo "Invalid MODEL_ID"; exit 1 ;;
esac
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/fixtures/saturation_threshold_trigger.sh` around lines 39 - 41, The
curl payload currently interpolates MODEL_ID and MAX_TOKENS directly (in the
RESPONSE assignment) which risks JSON injection; change the shell to build a
safe JSON payload (e.g., create JSON_PAYLOAD using a JSON builder like jq from
MODEL_ID and MAX_TOKENS) and then pass that JSON_PAYLOAD to curl -d, or
alternatively validate MODEL_ID for forbidden characters (quotes, backslashes,
control chars) before constructing the curl command; update the RESPONSE
assignment to use the new JSON_PAYLOAD variable instead of inline interpolation.
deploy/kind-emulator/install.sh (1)

13-14: Consider adding set -u for unbound variable detection.

The script uses set -e and set -o pipefail but not set -u. While unbound variables are handled with ${VAR:-default} syntax in some places, adding set -u would catch accidental use of undefined variables.

 set -e  # Exit on error
 set -o pipefail  # Exit on pipe failure
+set -u  # Exit on unbound variable
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/kind-emulator/install.sh` around lines 13 - 14, Add "set -u" alongside
the existing "set -e" and "set -o pipefail" lines to enable unbound variable
detection; then scan install.sh for intentional uses of unset variables and
update them to safe patterns (e.g., use ${VAR:-default} or explicit checks) or
initialize variables before use so the new -u flag doesn't break expected
behavior.
deploy/install.sh (3)

11-12: Missing set -u for stricter undefined variable handling.

The script uses set -e and set -o pipefail but not set -u. Undefined variable references will silently expand to empty strings, potentially causing subtle bugs.

🔧 Add set -u
 set -e  # Exit on error
 set -o pipefail  # Exit on pipe failure
+set -u  # Exit on undefined variable

Note: This may require auditing uses of ${VAR:-default} pattern for intentionally optional variables.

As per coding guidelines: **/*.sh: SHELL SCRIPT SECURITY (CWE-78) - Use 'set -euo pipefail' at script start.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/install.sh` around lines 11 - 12, The script currently enables 'set
-e' and 'set -o pipefail' but omits 'set -u', so add 'set -u' alongside the
existing 'set -e' and 'set -o pipefail' (i.e., use set -euo pipefail or add a
separate set -u) in deploy/install.sh, then audit usages of unquoted or optional
variables in the script (replace plain $VAR with ${VAR:-default} where
appropriate) to avoid failures from undefined variables and preserve intended
optional behavior.

159-165: Helper function returns string instead of exit code.

should_skip_helm_repo_update() echoes a string, requiring callers to use $() capture and string comparison. Shell idiom would be to return 0/1 for direct use in conditionals.

♻️ Alternative using exit codes
 should_skip_helm_repo_update() {
     local skip="${SKIP_HELM_REPO_UPDATE:-false}"
-    echo "$skip"
+    [ "$skip" = "true" ]
 }

Then callers become:

-if [ "$(should_skip_helm_repo_update)" = "true" ]; then
+if should_skip_helm_repo_update; then
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/install.sh` around lines 159 - 165, The helper
should_skip_helm_repo_update() currently echoes a string; change it to return
POSIX-style exit codes so it can be used directly in conditionals: read
SKIP_HELM_REPO_UPDATE, normalize/compare it against true/1 (case-insensitive)
and if it indicates skipping return 0, otherwise return 1, and do not print
anything to stdout; update callers to use it as a test (e.g., if
should_skip_helm_repo_update; then ...)—referencing the function name
should_skip_helm_repo_update() and the variable SKIP_HELM_REPO_UPDATE to locate
and modify the code.

1165-1169: Helm repo update skip logic duplicated with kind-emulator.

Per context snippet 3, deploy/kind-emulator/install.sh implements its own inline check for SKIP_HELM_REPO_UPDATE rather than using this helper. Consider sourcing this function or documenting the duplication to prevent drift.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/install.sh` around lines 1165 - 1169, The helm repo update skip logic
is duplicated; replace the inline conditional in the KEDA install block with a
call to the existing helper should_skip_helm_repo_update (or source the shared
script that defines it) so both installers use the same check; specifically,
remove the duplicated inline `if [ "$(should_skip_helm_repo_update)" = "true" ];
then ... helm repo update` block and ensure the file sources the common helper
that exports should_skip_helm_repo_update (or add a short comment referencing
the other script if you intentionally keep duplication) so the condition is
centralized and cannot drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/actions/trivy-scan/action.yml:
- Around line 11-14: The Trivy action is filtering severities but not failing
the job; add the exit-code input to enforce a non-zero exit when HIGH or
CRITICAL vulnerabilities are found—update the inputs block that currently
contains image-ref, format and severity (look for keys image-ref, format,
severity in the action inputs) and add an exit-code value (e.g., exit-code: 1)
so the workflow fails when those severities are detected.
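A hedged sketch of that change, assuming the composite action wraps aquasecurity/trivy-action (which documents an `exit-code` input); the step name and input wiring here are illustrative:

```yaml
# Illustrative step for .github/actions/trivy-scan/action.yml: with
# exit-code set, the scan fails the job when findings at the listed
# severities are present instead of merely reporting them.
- name: Run Trivy scan
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: ${{ inputs.image-ref }}
    format: table
    severity: HIGH,CRITICAL
    exit-code: '1'
```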

In `@deploy/install.sh`:
- Line 1116: The kubectl wait invocation uses an unquoted variable LLMD_NS which
can cause word-splitting or glob expansion; update the command that calls
kubectl wait --for=condition=Available deployment --all -n $LLMD_NS
--timeout="$DEPLOY_WAIT_TIMEOUT" to quote the namespace variable (use -n
"$LLMD_NS") and similarly ensure any other shell-expanded variables in that
command are properly quoted so kubectl wait and the DEPLOY_WAIT_TIMEOUT/LLMD_NS
expansions are safe.

In `@test/e2e/fixtures/burst_load_generator.sh`:
- Line 18: The burst load script is using /v1/chat/completions which the
simulator (ghcr.io/llm-d/llm-d-inference-sim:v0.7.1) does not use to update
KV-cache/saturation metrics; change the TARGET_URL (or hardcoded endpoint used
in requests) back to POST /v1/completions so saturation is measured, and in the
curl invocation ensure all shell variables are quoted to prevent command
injection (quote "$CURL_TIMEOUT" and "$TARGET_URL" in the curl command).
- Around line 69-71: The curl invocation in the background burst generator
expands unquoted variables (CURL_TIMEOUT, TARGET_URL, MODEL_ID, MAX_TOKENS)
which allows argument injection; update the command in the
burst_load_generator.sh snippet to pass the destination using --url
"$TARGET_URL", quote all variable expansions ("$CURL_TIMEOUT", "$MODEL_ID",
"$MAX_TOKENS") and any JSON payload pieces, and preserve the existing flags
(e.g. --max-time "$CURL_TIMEOUT" -H "Content-Type: application/json" -d '...')
so that no unquoted variables can introduce additional curl options or change
behavior.

In `@test/e2e/fixtures/saturation_threshold_trigger.sh`:
- Line 12: The script currently uses "set -eu" which doesn't catch failures in
piped commands; update the script startup to include the pipefail option by
enabling "-o pipefail" (or use the combined form "set -euo pipefail") so piped
command failures like in any "curl ... | tail" are detected; change the existing
"set -eu" usage in the saturation_threshold_trigger.sh script to include
pipefail.

In `@test/e2e/saturation_analyzer_path_test.go`:
- Around line 121-133: The helper expectAnalyzerPathLog currently matches any
controller-manager log that already contains "Processing model (V1|V2)" and
modelID, allowing stale logs to satisfy the check; update expectAnalyzerPathLog
(and its use of testutils.PodLogsLabelSelectorContain) to fence to the current
config update by capturing a "since" cursor/timestamp immediately after you
write the ConfigMap and passing that into PodLogsLabelSelectorContain (or add a
sinceTimestamp parameter to expectAnalyzerPathLog) so it only searches logs
emitted after the config change while keeping the existing Eventually timeout
(cfg.EventuallyLongSec) and poll interval (cfg.PollIntervalSec).
- Around line 607-645: Before running runThresholdTriggerJob capture the current
DesiredOptimizedAlloc.NumReplicas as a baseline (use or add a helper like
getDesiredOptimizedReplicas/getDesiredAllocation for vaName) then run
runThresholdTriggerJob and after that use waitForPositiveDesiredAllocation (or a
new waitForDesiredReplicasGreaterThan) to assert the post-trigger
DesiredOptimizedAlloc.NumReplicas is strictly greater than the baseline; update
the spec to compare post-trigger value > baseline rather than just > 0.
- Around line 487-500: The teardown currently deletes and recreates the shared
saturation ConfigMap without waiting or retrying, which can race and leave stale
state; change the logic in the block that references cmExistedBefore,
cmOriginal, saturationConfigMapForRecreate and
k8sClient.CoreV1().ConfigMaps(...) to either: (A) perform an Update with
optimistic retry (use retry.RetryOnConflict or a small loop to Get -> modify ->
Update) against the existing ConfigMap when cmOriginal != nil, or (B) if you
keep the delete/create approach, after Delete poll (e.g., wait.PollImmediate)
until Get returns NotFound before Create, and on Create handle AlreadyExists by
re-trying a Get->Update; ensure failures during teardown cause the test to fail
rather than silently logging warnings.

In `@test/e2e/scale_from_zero_test.go`:
- Around line 60-63: The Eventually call that wraps
k8sClient.CoreV1().Services(...).Get (the EPP service existence check) is
missing an explicit timeout; replace the default with the project's configured
timeout (e.g. use cfg.EventuallyShortSec or cfg.EventuallyTimeout) by passing it
as the first/second argument to Eventually (and optionally a polling interval)
so the check uses the same timeout pattern as other tests in this file; update
the Eventually that asserts the EPP service exists to use cfg.EventuallyShortSec
(and a reasonable interval) to avoid the 1s default.
- Around line 159-167: The Eventually block waiting for VA reconciliation lacks
explicit timeout and polling interval; update the Eventually call that wraps the
anonymous func (which fetches variantautoscalingv1alpha1.VariantAutoscaling via
crClient.Get using client.ObjectKey{Namespace: cfg.LLMDNamespace, Name: vaName}
and asserts GetCondition(..., TypeTargetResolved)) to include explicit timeout
and interval arguments (e.g., Eventually(func(g Gomega) { ... }, 30*time.Second,
1*time.Second) or use WithTimeout/WithPolling) so the test won't hang
indefinitely and will poll at a defined cadence.

In `@test/e2e/suite_test.go`:
- Around line 50-87: The restartPrometheusAdapterPods helper currently only
warns on non-NotFound delete errors and treats success as any single pod being
Ready; change it so non-NotFound delete errors fail the test (make the deleteErr
check for each pod use Expect/delete failure -> NotTo(HaveOccurred()) or
equivalent fatal) and after the pod readiness loop add a discovery check that
the API group/version external.metrics.k8s.io/v1beta1 is available (use the
Kubernetes discovery client via
k8sClient.Discovery().ServerResourcesForGroupVersion("external.metrics.k8s.io/v1beta1")
or similar) and gate the Eventually/Succeed on both pod readiness and successful
discovery of that API; update the same logic in the analogous function/section
around lines 182-223.

In `@test/utils/logging.go`:
- Around line 43-46: The current unbounded io.ReadAll(stream) can OOM; change to
read with a limit by wrapping the stream with io.LimitReader and a defined max
(e.g. maxPodLogBytes) before calling io.ReadAll, keep the existing defer to
stream.Close(), and surface a clear error when content is truncated; locate the
anonymous reader invocation that assigns raw, readErr (the func() { defer
stream.Close(); return io.ReadAll(stream) }()) and replace it with
io.ReadAll(io.LimitReader(stream, maxPodLogBytes)) and add a constant
(maxPodLogBytes) near the top of the file or re-use an existing constant so
tests cannot exhaust memory.

---

Outside diff comments:
In `@test/e2e/pod_scraping_test.go`:
- Around line 59-82: The closure passed to Eventually can return without failing
when no "-epp" service is found because the final check only reasserts err
(which is nil after List); modify the end of the closure to assert that
eppServiceName is non-empty so the closure fails and Eventually keeps retrying
until a service appears. Concretely, after the for-loop in the Eventually
closure replace the final g.Expect(err).NotTo(HaveOccurred(), "EPP service
should exist") with g.Expect(eppServiceName).NotTo(BeEmpty(), "EPP service
should exist") (or an explicit g.Fail/g.Expect(false).To(BeTrue())), referencing
the existing eppServiceName, the List call
(k8sClient.CoreV1().Services(...).List) and the closure used by Eventually to
ensure retries continue until a service is discovered.

In `@test/e2e/scale_from_zero_test.go`:
- Around line 393-434: The embedded shell script is vulnerable because
fmt.Sprintf directly interpolates modelID (and other vars) into the script
template (see the script variable and fmt.Sprintf call), allowing shell
metacharacters in modelID to execute commands; fix by removing direct
interpolation and instead pass modelID via a safe environment variable or other
typed parameter injection (set an env var like MODEL_ID in the Job/Pod spec and
reference it inside the script), or at minimum escape/sanitize modelID before
insertion; update the creation site where script is built (the script variable /
fmt.Sprintf usage) to use environment variables or a proper escaping helper for
modelID.
- Around line 66-87: The Eventually block that checks EPP pod readiness
currently uses Gomega's default timeout; update the call to Eventually(...) to
include an explicit timeout (and optional polling interval) to avoid
flakiness—e.g., change the invocation of Eventually(func(g Gomega) { ...
}).Should(Succeed(), ...) to Eventually(func(g Gomega) { ... }, 2*time.Minute,
5*time.Second).Should(Succeed(), ...) (or another suitable timeout/interval) so
the loop around k8sClient.CoreV1().Pods(...).List and the readiness checks has
enough time to observe a ready pod.
- Around line 293-305: The Eventually block that checks the job pod readiness
(the closure using k8sClient.CoreV1().Pods(...), triggerJobName and ctx) relies
on the default Gomega timeout; add an explicit timeout (and optional polling
interval) to the Eventually call so the test waits sufficiently long for
scheduling (for example pass a duration like 30s and a poll interval like 500ms
to Eventually). Update the Eventually(...).Should(Succeed()) invocation to
include these timeout/polling arguments to ensure robust waiting for the job pod
to reach PodRunning or PodSucceeded.

In `@test/e2e/smoke_test.go`:
- Around line 637-651: The switch on condition.Status in the MetricsAvailable
check must explicitly reject metav1.ConditionUnknown so the Eventually cannot
succeed with an Unknown state; update the switch in test/e2e/smoke_test.go (the
block that inspects MetricsAvailable) to add a default or explicit
metav1.ConditionUnknown branch that fails the test (e.g., use the G.Expect
assertion to assert condition.Status is not metav1.ConditionUnknown or call
G.Expect(...).To(Not(Equal(metav1.ConditionUnknown))) with a message), keeping
the existing cases for metav1.ConditionFalse and metav1.ConditionTrue and the
referenced reasons like variantautoscalingv1alpha1.ReasonMetricsFound.

---

Nitpick comments:
In `@deploy/install.sh`:
- Around line 11-12: The script currently enables 'set -e' and 'set -o pipefail'
but omits 'set -u', so add 'set -u' alongside the existing 'set -e' and 'set -o
pipefail' (i.e., use set -euo pipefail or add a separate set -u) in
deploy/install.sh, then audit usages of unquoted or optional variables in the
script (replace plain $VAR with ${VAR:-default} where appropriate) to avoid
failures from undefined variables and preserve intended optional behavior.
- Around line 159-165: The helper should_skip_helm_repo_update() currently
echoes a string; change it to return POSIX-style exit codes so it can be used
directly in conditionals: read SKIP_HELM_REPO_UPDATE, normalize/compare it
against true/1 (case-insensitive) and if it indicates skipping return 0,
otherwise return 1, and do not print anything to stdout; update callers to use
it as a test (e.g., if should_skip_helm_repo_update; then ...)—referencing the
function name should_skip_helm_repo_update() and the variable
SKIP_HELM_REPO_UPDATE to locate and modify the code.
- Around line 1165-1169: The helm repo update skip logic is duplicated; replace
the inline conditional in the KEDA install block with a call to the existing
helper should_skip_helm_repo_update (or source the shared script that defines
it) so both installers use the same check; specifically, remove the duplicated
inline `if [ "$(should_skip_helm_repo_update)" = "true" ]; then ... helm repo
update` block and ensure the file sources the common helper that exports
should_skip_helm_repo_update (or add a short comment referencing the other
script if you intentionally keep duplication) so the condition is centralized
and cannot drift.

In `@deploy/kind-emulator/install.sh`:
- Around line 13-14: Add "set -u" alongside the existing "set -e" and "set -o
pipefail" lines to enable unbound variable detection; then scan install.sh for
intentional uses of unset variables and update them to safe patterns (e.g., use
${VAR:-default} or explicit checks) or initialize variables before use so the
new -u flag doesn't break expected behavior.

In `@test/e2e/fixtures/saturation_threshold_trigger.sh`:
- Around line 39-41: The curl payload currently interpolates MODEL_ID and
MAX_TOKENS directly (in the RESPONSE assignment) which risks JSON injection;
change the shell to build a safe JSON payload (e.g., create JSON_PAYLOAD using a
JSON builder like jq from MODEL_ID and MAX_TOKENS) and then pass that
JSON_PAYLOAD to curl -d, or alternatively validate MODEL_ID for forbidden
characters (quotes, backslashes, control chars) before constructing the curl
command; update the RESPONSE assignment to use the new JSON_PAYLOAD variable
instead of inline interpolation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: bab8a09b-c01c-4a5a-bfb4-ab8c37b10b6b

📥 Commits

Reviewing files that changed from the base of the PR and between 9e2fe36 and 8b996ee.

📒 Files selected for processing (29)
  • .github/actions/trivy-scan/action.yml
  • Makefile
  • config/prometheus/servicemonitor.yaml
  • deploy/install.sh
  • deploy/kind-emulator/README.md
  • deploy/kind-emulator/install.sh
  • docs/developer-guide/testing.md
  • hack/burst_load_generator.sh
  • test/benchmark/suite_test.go
  • test/e2e/README.md
  • test/e2e/config.go
  • test/e2e/fixtures/burst_load_generator.sh
  • test/e2e/fixtures/infra_builder.go
  • test/e2e/fixtures/model_service_builder.go
  • test/e2e/fixtures/saturation_threshold_trigger.sh
  • test/e2e/fixtures/va_builder.go
  • test/e2e/fixtures/wait.go
  • test/e2e/fixtures/workload_builder.go
  • test/e2e/limiter_test.go
  • test/e2e/parallel_load_scaleup_test.go
  • test/e2e/pod_scraping_test.go
  • test/e2e/saturation_analyzer_path_test.go
  • test/e2e/saturation_test.go
  • test/e2e/scale_from_zero_test.go
  • test/e2e/scale_to_zero_test.go
  • test/e2e/smoke_test.go
  • test/e2e/suite_test.go
  • test/utils/logging.go
💤 Files with no reviewable changes (3)
  • test/e2e/parallel_load_scaleup_test.go
  • test/e2e/scale_to_zero_test.go
  • test/e2e/saturation_test.go

Comment on lines +11 to +14
with:
image-ref: ${{ inputs.image }}
format: "table"
severity: "HIGH,CRITICAL"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

For aquasecurity/trivy-action v0.35.0, what is the default value of exit-code, and does the action fail the workflow when HIGH/CRITICAL vulnerabilities are detected if exit-code is not set?

💡 Result:

For aquasecurity/trivy-action v0.35.0, the default value of exit-code is 0. When exit-code is not set (i.e., using the default of 0), the action does not fail the workflow when HIGH/CRITICAL vulnerabilities are detected. The workflow step succeeds (exit code 0) even if vulnerabilities are found, unless exit-code is explicitly set to a non-zero value like '1' to fail on specified severities.

Citations:


Set Trivy exit-code to enforce vulnerability gate

Lines 11-14 configure severity filtering but the job never fails when HIGH/CRITICAL findings exist, allowing vulnerable images to pass. This weakens release integrity (CWE-693).

Suggested fix
       with:
         image-ref: ${{ inputs.image }}
         format: "table"
         severity: "HIGH,CRITICAL"
+        exit-code: "1"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
with:
image-ref: ${{ inputs.image }}
format: "table"
severity: "HIGH,CRITICAL"
with:
image-ref: ${{ inputs.image }}
format: "table"
severity: "HIGH,CRITICAL"
exit-code: "1"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/actions/trivy-scan/action.yml around lines 11 - 14, The Trivy action
is filtering severities but not failing the job; add the exit-code input to
enforce a non-zero exit when HIGH or CRITICAL vulnerabilities are found—update
the inputs block that currently contains image-ref, format and severity (look
for keys image-ref, format, severity in the action inputs) and add an exit-code
value (e.g., exit-code: 1) so the workflow fails when those severities are
detected.

Comment thread deploy/install.sh Outdated
# wait timeout here must be long enough for the model to finish loading.
local DEPLOY_WAIT_TIMEOUT="${DEPLOY_WAIT_TIMEOUT:-600s}"
log_info "Waiting for llm-d components to initialize (timeout=${DEPLOY_WAIT_TIMEOUT})..."
kubectl wait --for=condition=Available deployment --all -n $LLMD_NS --timeout="$DEPLOY_WAIT_TIMEOUT" || \

⚠️ Potential issue | 🟡 Minor

SC2086: Unquoted variable $LLMD_NS risks word splitting.

If LLMD_NS contains spaces or glob characters, the command will misbehave. Quote the variable.

🔧 Fix quoting
-        kubectl wait --for=condition=Available deployment --all -n $LLMD_NS --timeout="$DEPLOY_WAIT_TIMEOUT" || \
+        kubectl wait --for=condition=Available deployment --all -n "$LLMD_NS" --timeout="$DEPLOY_WAIT_TIMEOUT" || \

As per coding guidelines: **/*.sh: SHELL SCRIPT SECURITY (CWE-78) - Quote all variables to prevent injection.

📝 Committable suggestion


Suggested change
kubectl wait --for=condition=Available deployment --all -n $LLMD_NS --timeout="$DEPLOY_WAIT_TIMEOUT" || \
kubectl wait --for=condition=Available deployment --all -n "$LLMD_NS" --timeout="$DEPLOY_WAIT_TIMEOUT" || \
🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 1116-1116: Double quote to prevent globbing and word splitting.

(SC2086)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/install.sh` at line 1116, The kubectl wait invocation uses an unquoted
variable LLMD_NS which can cause word-splitting or glob expansion; update the
command that calls kubectl wait --for=condition=Available deployment --all -n
$LLMD_NS --timeout="$DEPLOY_WAIT_TIMEOUT" to quote the namespace variable (use
-n "$LLMD_NS") and similarly ensure any other shell-expanded variables in that
command are properly quoted so kubectl wait and the DEPLOY_WAIT_TIMEOUT/LLMD_NS
expansions are safe.

# BATCH_SLEEP: Sleep duration between batches (seconds)
# MODEL_ID: Model ID to use in requests
# TARGET_URL: Target URL for requests (e.g., http://service:port/v1/completions)
# TARGET_URL: Target URL for requests (e.g. http://service:port/v1/chat/completions)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

For ghcr.io/llm-d/llm-d-inference-sim:v0.7.1, do KV-cache usage / saturation metrics update for POST /v1/chat/completions, or only for POST /v1/completions?

💡 Result:

KV-cache usage / saturation metrics (vllm:kv_cache_usage_perc or previously vllm:gpu_cache_usage_perc) update only for POST /v1/completions, not for POST /v1/chat/completions in ghcr.io/llm-d/llm-d-inference-sim:v0.7.1.

Citations:


🏁 Script executed:

head -n 75 test/e2e/fixtures/burst_load_generator.sh | tail -n +65

Repository: opendatahub-io/workload-variant-autoscaler

Length of output: 626


Switch to /v1/chat/completions breaks saturation metrics—KV-cache is only tracked on /v1/completions.

Lines 18 and 71 request /v1/chat/completions, but ghcr.io/llm-d/llm-d-inference-sim:v0.7.1 only updates KV-cache/saturation metrics for POST /v1/completions. This burst load test will no longer measure simulator saturation after the endpoint switch.

Quote all shell variables in curl invocation (CWE-78).

Lines 69–71: $CURL_TIMEOUT and $TARGET_URL are unquoted in the curl command, enabling command injection. Use "$CURL_TIMEOUT" and "$TARGET_URL".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/fixtures/burst_load_generator.sh` at line 18, the burst load script
uses /v1/chat/completions, but the simulator
(ghcr.io/llm-d/llm-d-inference-sim:v0.7.1) only updates KV-cache/saturation
metrics for POST /v1/completions; change the TARGET_URL (or the hardcoded
endpoint used in requests) back to POST /v1/completions so saturation is
measured, and in the curl invocation ensure all shell variables are quoted to
prevent command injection (quote "$CURL_TIMEOUT" and "$TARGET_URL" in the curl command).

Comment on lines 69 to +71
(curl -s -o /dev/null --max-time $CURL_TIMEOUT -X POST $TARGET_URL \
-H "Content-Type: application/json" \
-d "{\"model\":\"$MODEL_ID\",\"prompt\":\"Write a detailed explanation of machine learning algorithms.\",\"max_tokens\":$MAX_TOKENS}" || true) &
-d "{\"model\":\"$MODEL_ID\",\"messages\":[{\"role\":\"user\",\"content\":\"Write a detailed explanation of machine learning algorithms.\"}],\"max_tokens\":$MAX_TOKENS}" || true) &

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

fd -type f -name "burst_load_generator.sh"

Repository: opendatahub-io/workload-variant-autoscaler

Length of output: 256


🏁 Script executed:

fd -t f -name "burst_load_generator.sh"

Repository: opendatahub-io/workload-variant-autoscaler

Length of output: 316


🏁 Script executed:

fd "burst_load_generator.sh"

Repository: opendatahub-io/workload-variant-autoscaler

Length of output: 156


🏁 Script executed:

cat -n test/e2e/fixtures/burst_load_generator.sh | sed -n '60,80p'

Repository: opendatahub-io/workload-variant-autoscaler

Length of output: 971


🏁 Script executed:

cat -n test/e2e/fixtures/burst_load_generator.sh | head -40

Repository: opendatahub-io/workload-variant-autoscaler

Length of output: 2066


🏁 Script executed:

cat -n test/e2e/fixtures/burst_load_generator.sh | grep -E '\$[A-Z_]+' | head -30

Repository: opendatahub-io/workload-variant-autoscaler

Length of output: 1196


🏁 Script executed:

cat -n test/e2e/fixtures/burst_load_generator.sh | sed -n '69,71p'

Repository: opendatahub-io/workload-variant-autoscaler

Length of output: 410


Critical: Quote all variables and use --url to prevent argument injection (CWE-88).

Lines 69–71 expand $CURL_TIMEOUT and $TARGET_URL unquoted into curl arguments. Attacker-controlled environment variables can inject curl flags: TARGET_URL="-o /tmp/leak http://evil.com" redirects output; CURL_TIMEOUT="1 -K /tmp/conf" loads arbitrary config. Quote all variables and pass the URL explicitly via --url.

Fix
-    (curl -s -o /dev/null --max-time $CURL_TIMEOUT -X POST $TARGET_URL \
+    (curl -s -o /dev/null --max-time "$CURL_TIMEOUT" -X POST --url "$TARGET_URL" \
       -H "Content-Type: application/json" \
       -d "{\"model\":\"$MODEL_ID\",\"messages\":[{\"role\":\"user\",\"content\":\"Write a detailed explanation of machine learning algorithms.\"}],\"max_tokens\":$MAX_TOKENS}" || true) &
📝 Committable suggestion


Suggested change
(curl -s -o /dev/null --max-time $CURL_TIMEOUT -X POST $TARGET_URL \
-H "Content-Type: application/json" \
-d "{\"model\":\"$MODEL_ID\",\"prompt\":\"Write a detailed explanation of machine learning algorithms.\",\"max_tokens\":$MAX_TOKENS}" || true) &
-d "{\"model\":\"$MODEL_ID\",\"messages\":[{\"role\":\"user\",\"content\":\"Write a detailed explanation of machine learning algorithms.\"}],\"max_tokens\":$MAX_TOKENS}" || true) &
(curl -s -o /dev/null --max-time "$CURL_TIMEOUT" -X POST --url "$TARGET_URL" \
-H "Content-Type: application/json" \
-d "{\"model\":\"$MODEL_ID\",\"messages\":[{\"role\":\"user\",\"content\":\"Write a detailed explanation of machine learning algorithms.\"}],\"max_tokens\":$MAX_TOKENS}" || true) &
🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 69-69: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 69-69: Double quote to prevent globbing and word splitting.

(SC2086)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/fixtures/burst_load_generator.sh` around lines 69 - 71, The curl
invocation in the background burst generator expands unquoted variables
(CURL_TIMEOUT, TARGET_URL, MODEL_ID, MAX_TOKENS) which allows argument
injection; update the command in the burst_load_generator.sh snippet to pass the
destination using --url "$TARGET_URL", quote all variable expansions
("$CURL_TIMEOUT", "$MODEL_ID", "$MAX_TOKENS") and any JSON payload pieces, and
preserve the existing flags (e.g. --max-time "$CURL_TIMEOUT" -H "Content-Type:
application/json" -d '...') so that no unquoted variables can introduce
additional curl options or change behavior.

#
# EXIT: 0 if at least one completion returned HTTP 200; non-zero otherwise.

set -eu

⚠️ Potential issue | 🟡 Minor

Missing pipefail option (CWE-78).

Add -o pipefail to ensure pipeline failures are caught. Current set -eu won't detect failures in piped commands like curl ... | tail.

-set -eu
+set -euo pipefail

As per coding guidelines: "Use 'set -euo pipefail' at script start".

📝 Committable suggestion


Suggested change
set -eu
set -euo pipefail
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/fixtures/saturation_threshold_trigger.sh` at line 12, The script
currently uses "set -eu" which doesn't catch failures in piped commands; update
the script startup to include the pipefail option by enabling "-o pipefail" (or
use the combined form "set -euo pipefail") so piped command failures like in any
"curl ... | tail" are detected; change the existing "set -eu" usage in the
saturation_threshold_trigger.sh script to include pipefail.

Comment on lines +607 to +645
It("crosses V1 threshold with bounded requests and recommends scale-up", func() {
By("Configuring aggressive V1 thresholds and unsetting analyzerName")
err := upsertSaturationConfigEntry(
ctx,
cmNamespace,
cmName,
cmKey,
buildSaturationConfigYAMLWithThresholds(
"",
saturationV1KVCacheThreshold,
saturationV1QueueLengthThreshold,
saturationV1KVSpareTrigger,
saturationV1QueueSpareTrigger,
saturationV1ScaleUpThreshold,
saturationV1ScaleDownBoundary,
),
)
Expect(err).NotTo(HaveOccurred())

By("Verifying controller is using V1 analyzer path")
expectAnalyzerPathLog("V1", modelID)

By("Preflighting VA controller signal before bounded trigger traffic")
waitForSaturationInfraSignal(ctx, cfg.LLMDNamespace, vaName)

By("Sending bounded traffic to cross V1 queue/KV thresholds once")
runThresholdTriggerJob(
ctx,
cfg.LLMDNamespace,
serviceName,
modelID,
8000,
saturationV1TriggerRequests,
saturationV1TriggerMaxTokens,
)

By("Verifying desired allocation recommends a positive replica count")
waitForPositiveDesiredAllocation(ctx, cfg.LLMDNamespace, vaName)
})

⚠️ Potential issue | 🟠 Major

This spec never proves the trigger caused a scale-up.

Lines 643-644 only wait for DesiredOptimizedAlloc.NumReplicas > 0. If the VA is already positive before runThresholdTriggerJob, this passes even when the aggressive thresholds and bounded workload do nothing. Capture a pre-trigger baseline here and assert the post-trigger recommendation is greater than that value.

Suggested fix
 	It("crosses V1 threshold with bounded requests and recommends scale-up", func() {
+		var baseline int32
+
+		By("Capturing baseline desired replicas before aggressive trigger")
+		Eventually(func(g Gomega) {
+			va := &variantautoscalingv1alpha1.VariantAutoscaling{}
+			g.Expect(crClient.Get(ctx, client.ObjectKey{Name: vaName, Namespace: cfg.LLMDNamespace}, va)).To(Succeed())
+			g.Expect(va.Status.DesiredOptimizedAlloc.NumReplicas).NotTo(BeNil())
+			baseline = *va.Status.DesiredOptimizedAlloc.NumReplicas
+		}, time.Duration(cfg.EventuallyLongSec)*time.Second, time.Duration(cfg.PollIntervalSec)*time.Second).Should(Succeed())
+
 		By("Configuring aggressive V1 thresholds and unsetting analyzerName")
 		err := upsertSaturationConfigEntry(
 			ctx,
 			cmNamespace,
 			cmName,
@@
 		runThresholdTriggerJob(
 			ctx,
 			cfg.LLMDNamespace,
 			serviceName,
 			modelID,
@@
 		)
 
-		By("Verifying desired allocation recommends a positive replica count")
-		waitForPositiveDesiredAllocation(ctx, cfg.LLMDNamespace, vaName)
+		By("Verifying desired allocation increases above the pre-trigger baseline")
+		Eventually(func() int32 {
+			va := &variantautoscalingv1alpha1.VariantAutoscaling{}
+			Expect(crClient.Get(ctx, client.ObjectKey{Name: vaName, Namespace: cfg.LLMDNamespace}, va)).To(Succeed())
+			if va.Status.DesiredOptimizedAlloc.NumReplicas == nil {
+				return -1
+			}
+			return *va.Status.DesiredOptimizedAlloc.NumReplicas
+		}, time.Duration(cfg.EventuallyExtendedSec)*time.Second, time.Duration(cfg.PollIntervalSec)*time.Second).Should(BeNumerically(">", baseline))
 	})
📝 Committable suggestion


Suggested change
It("crosses V1 threshold with bounded requests and recommends scale-up", func() {
By("Configuring aggressive V1 thresholds and unsetting analyzerName")
err := upsertSaturationConfigEntry(
ctx,
cmNamespace,
cmName,
cmKey,
buildSaturationConfigYAMLWithThresholds(
"",
saturationV1KVCacheThreshold,
saturationV1QueueLengthThreshold,
saturationV1KVSpareTrigger,
saturationV1QueueSpareTrigger,
saturationV1ScaleUpThreshold,
saturationV1ScaleDownBoundary,
),
)
Expect(err).NotTo(HaveOccurred())
By("Verifying controller is using V1 analyzer path")
expectAnalyzerPathLog("V1", modelID)
By("Preflighting VA controller signal before bounded trigger traffic")
waitForSaturationInfraSignal(ctx, cfg.LLMDNamespace, vaName)
By("Sending bounded traffic to cross V1 queue/KV thresholds once")
runThresholdTriggerJob(
ctx,
cfg.LLMDNamespace,
serviceName,
modelID,
8000,
saturationV1TriggerRequests,
saturationV1TriggerMaxTokens,
)
By("Verifying desired allocation recommends a positive replica count")
waitForPositiveDesiredAllocation(ctx, cfg.LLMDNamespace, vaName)
})
It("crosses V1 threshold with bounded requests and recommends scale-up", func() {
var baseline int32
By("Capturing baseline desired replicas before aggressive trigger")
Eventually(func(g Gomega) {
va := &variantautoscalingv1alpha1.VariantAutoscaling{}
g.Expect(crClient.Get(ctx, client.ObjectKey{Name: vaName, Namespace: cfg.LLMDNamespace}, va)).To(Succeed())
g.Expect(va.Status.DesiredOptimizedAlloc.NumReplicas).NotTo(BeNil())
baseline = *va.Status.DesiredOptimizedAlloc.NumReplicas
}, time.Duration(cfg.EventuallyLongSec)*time.Second, time.Duration(cfg.PollIntervalSec)*time.Second).Should(Succeed())
By("Configuring aggressive V1 thresholds and unsetting analyzerName")
err := upsertSaturationConfigEntry(
ctx,
cmNamespace,
cmName,
cmKey,
buildSaturationConfigYAMLWithThresholds(
"",
saturationV1KVCacheThreshold,
saturationV1QueueLengthThreshold,
saturationV1KVSpareTrigger,
saturationV1QueueSpareTrigger,
saturationV1ScaleUpThreshold,
saturationV1ScaleDownBoundary,
),
)
Expect(err).NotTo(HaveOccurred())
By("Verifying controller is using V1 analyzer path")
expectAnalyzerPathLog("V1", modelID)
By("Preflighting VA controller signal before bounded trigger traffic")
waitForSaturationInfraSignal(ctx, cfg.LLMDNamespace, vaName)
By("Sending bounded traffic to cross V1 queue/KV thresholds once")
runThresholdTriggerJob(
ctx,
cfg.LLMDNamespace,
serviceName,
modelID,
8000,
saturationV1TriggerRequests,
saturationV1TriggerMaxTokens,
)
By("Verifying desired allocation increases above the pre-trigger baseline")
Eventually(func() int32 {
va := &variantautoscalingv1alpha1.VariantAutoscaling{}
Expect(crClient.Get(ctx, client.ObjectKey{Name: vaName, Namespace: cfg.LLMDNamespace}, va)).To(Succeed())
if va.Status.DesiredOptimizedAlloc.NumReplicas == nil {
return -1
}
return *va.Status.DesiredOptimizedAlloc.NumReplicas
}, time.Duration(cfg.EventuallyExtendedSec)*time.Second, time.Duration(cfg.PollIntervalSec)*time.Second).Should(BeNumerically(">", baseline))
})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/saturation_analyzer_path_test.go` around lines 607 - 645, Before
running runThresholdTriggerJob capture the current
DesiredOptimizedAlloc.NumReplicas as a baseline (use or add a helper like
getDesiredOptimizedReplicas/getDesiredAllocation for vaName) then run
runThresholdTriggerJob and after that use waitForPositiveDesiredAllocation (or a
new waitForDesiredReplicasGreaterThan) to assert the post-trigger
DesiredOptimizedAlloc.NumReplicas is strictly greater than the baseline; update
the spec to compare post-trigger value > baseline rather than just > 0.

Comment on lines 60 to +63
Eventually(func(g Gomega) {
_, err := k8sClient.CoreV1().Services(cfg.LLMDNamespace).Get(ctx, eppServiceName, metav1.GetOptions{})
g.Expect(err).NotTo(HaveOccurred(), "EPP service should exist")
}, 2*time.Minute, 5*time.Second).Should(Succeed(), "EPP service should exist")
}).Should(Succeed(), "EPP service should exist")

⚠️ Potential issue | 🟡 Minor

Missing timeout on Eventually block.

This Eventually uses Gomega's default timeout (1 second) which is likely too short for Kubernetes API operations. Other calls in this file explicitly pass cfg.EventuallyShortSec or similar.

🔧 Add explicit timeout
 		Eventually(func(g Gomega) {
 			_, err := k8sClient.CoreV1().Services(cfg.LLMDNamespace).Get(ctx, eppServiceName, metav1.GetOptions{})
 			g.Expect(err).NotTo(HaveOccurred(), "EPP service should exist")
-		}).Should(Succeed(), "EPP service should exist")
+		}, time.Duration(cfg.EventuallyShortSec)*time.Second, time.Duration(cfg.PollIntervalQuickSec)*time.Second).Should(Succeed(), "EPP service should exist")
📝 Committable suggestion


Suggested change
Eventually(func(g Gomega) {
_, err := k8sClient.CoreV1().Services(cfg.LLMDNamespace).Get(ctx, eppServiceName, metav1.GetOptions{})
g.Expect(err).NotTo(HaveOccurred(), "EPP service should exist")
}, 2*time.Minute, 5*time.Second).Should(Succeed(), "EPP service should exist")
}).Should(Succeed(), "EPP service should exist")
Eventually(func(g Gomega) {
_, err := k8sClient.CoreV1().Services(cfg.LLMDNamespace).Get(ctx, eppServiceName, metav1.GetOptions{})
g.Expect(err).NotTo(HaveOccurred(), "EPP service should exist")
}, time.Duration(cfg.EventuallyShortSec)*time.Second, time.Duration(cfg.PollIntervalQuickSec)*time.Second).Should(Succeed(), "EPP service should exist")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/scale_from_zero_test.go` around lines 60 - 63, The Eventually call
that wraps k8sClient.CoreV1().Services(...).Get (the EPP service existence
check) is missing an explicit timeout; replace the default with the project's
configured timeout (e.g. use cfg.EventuallyShortSec or cfg.EventuallyTimeout) by
passing it as the first/second argument to Eventually (and optionally a polling
interval) so the check uses the same timeout pattern as other tests in this
file; update the Eventually that asserts the EPP service exists to use
cfg.EventuallyShortSec (and a reasonable interval) to avoid the 1s default.

Comment on lines +159 to +167
Eventually(func(g Gomega) {
va := &variantautoscalingv1alpha1.VariantAutoscaling{}
err := crClient.Get(ctx, client.ObjectKey{Namespace: cfg.LLMDNamespace, Name: vaName}, va)
g.Expect(err).NotTo(HaveOccurred())
g.Expect(va.Status.Conditions).NotTo(BeEmpty(), "VA should have status conditions after reconciliation")

targetResolved := variantautoscalingv1alpha1.GetCondition(va, variantautoscalingv1alpha1.TypeTargetResolved)
g.Expect(targetResolved).NotTo(BeNil(), "VA should have TargetResolved condition")
}).Should(Succeed())

⚠️ Potential issue | 🟡 Minor

Missing timeout on VA reconciliation Eventually.

The reconciliation wait also lacks explicit timeout/poll intervals.

🔧 Add explicit timeout
 		Eventually(func(g Gomega) {
 			va := &variantautoscalingv1alpha1.VariantAutoscaling{}
 			err := crClient.Get(ctx, client.ObjectKey{Namespace: cfg.LLMDNamespace, Name: vaName}, va)
 			g.Expect(err).NotTo(HaveOccurred())
 			g.Expect(va.Status.Conditions).NotTo(BeEmpty(), "VA should have status conditions after reconciliation")

 			targetResolved := variantautoscalingv1alpha1.GetCondition(va, variantautoscalingv1alpha1.TypeTargetResolved)
 			g.Expect(targetResolved).NotTo(BeNil(), "VA should have TargetResolved condition")
-		}).Should(Succeed())
+		}, time.Duration(cfg.EventuallyShortSec)*time.Second, time.Duration(cfg.PollIntervalQuickSec)*time.Second).Should(Succeed())
📝 Committable suggestion


Suggested change
Eventually(func(g Gomega) {
	va := &variantautoscalingv1alpha1.VariantAutoscaling{}
	err := crClient.Get(ctx, client.ObjectKey{Namespace: cfg.LLMDNamespace, Name: vaName}, va)
	g.Expect(err).NotTo(HaveOccurred())
	g.Expect(va.Status.Conditions).NotTo(BeEmpty(), "VA should have status conditions after reconciliation")

	targetResolved := variantautoscalingv1alpha1.GetCondition(va, variantautoscalingv1alpha1.TypeTargetResolved)
	g.Expect(targetResolved).NotTo(BeNil(), "VA should have TargetResolved condition")
}, time.Duration(cfg.EventuallyShortSec)*time.Second, time.Duration(cfg.PollIntervalQuickSec)*time.Second).Should(Succeed())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/scale_from_zero_test.go` around lines 159 - 167, The Eventually
block waiting for VA reconciliation lacks explicit timeout and polling interval;
update the Eventually call that wraps the anonymous func (which fetches
variantautoscalingv1alpha1.VariantAutoscaling via crClient.Get using
client.ObjectKey{Namespace: cfg.LLMDNamespace, Name: vaName} and asserts
GetCondition(..., TypeTargetResolved)) to include explicit timeout and interval
arguments (e.g., Eventually(func(g Gomega) { ... }, 30*time.Second,
1*time.Second) or use WithTimeout/WithPolling) so the test won't hang
indefinitely and will poll at a defined cadence.

Comment thread test/e2e/suite_test.go
Comment on lines +50 to +87
// restartPrometheusAdapterPods forces a rollout by deleting adapter pods, then waits for readiness.
func restartPrometheusAdapterPods() {
By("Restarting prometheus-adapter pods")
podList, err := k8sClient.CoreV1().Pods(cfg.MonitoringNS).List(ctx, metav1.ListOptions{
LabelSelector: "app.kubernetes.io/name=prometheus-adapter",
})
Expect(err).NotTo(HaveOccurred(), "Failed to list prometheus-adapter pods")
for _, pod := range podList.Items {
deleteErr := k8sClient.CoreV1().Pods(cfg.MonitoringNS).Delete(ctx, pod.Name, metav1.DeleteOptions{})
if deleteErr != nil && !errors.IsNotFound(deleteErr) {
GinkgoWriter.Printf("Warning: Failed to delete prometheus-adapter pod %s: %v\n", pod.Name, deleteErr)
} else {
GinkgoWriter.Printf("Deleted prometheus-adapter pod: %s\n", pod.Name)
}
}
By("Waiting for prometheus-adapter pods to be ready")
tm := time.Duration(cfg.EventuallyLongSec) * time.Second
poll := time.Duration(cfg.PollIntervalSec) * time.Second
Eventually(func(g Gomega) {
pods, err := k8sClient.CoreV1().Pods(cfg.MonitoringNS).List(ctx, metav1.ListOptions{
LabelSelector: "app.kubernetes.io/name=prometheus-adapter",
})
g.Expect(err).NotTo(HaveOccurred())
g.Expect(pods.Items).NotTo(BeEmpty(), "At least one prometheus-adapter pod should exist")
ready := 0
for _, pod := range pods.Items {
if pod.Status.Phase == corev1.PodRunning {
for _, condition := range pod.Status.Conditions {
if condition.Type == corev1.PodReady && condition.Status == corev1.ConditionTrue {
ready++
break
}
}
}
}
g.Expect(ready).To(BeNumerically(">", 0), "At least one prometheus-adapter pod should be ready")
}, tm, poll).Should(Succeed())
GinkgoWriter.Println("prometheus-adapter pods restarted and ready")

⚠️ Potential issue | 🟠 Major

The adapter recovery path can succeed while external.metrics is still broken.

Lines 57-63 only warn on pod delete failures, and lines 65-86 stop waiting as soon as any adapter pod is Ready. In auto mode that means a failed delete, or a rollout that has not yet re-registered external.metrics.k8s.io/v1beta1, can still look healthy, so later HPA/external-metrics checks fail far from the root cause. Make non-NotFound delete errors fatal and gate success on discovery succeeding after the restart.

Suggested fix
 func restartPrometheusAdapterPods() {
 	By("Restarting prometheus-adapter pods")
 	podList, err := k8sClient.CoreV1().Pods(cfg.MonitoringNS).List(ctx, metav1.ListOptions{
 		LabelSelector: "app.kubernetes.io/name=prometheus-adapter",
 	})
 	Expect(err).NotTo(HaveOccurred(), "Failed to list prometheus-adapter pods")
 	for _, pod := range podList.Items {
 		deleteErr := k8sClient.CoreV1().Pods(cfg.MonitoringNS).Delete(ctx, pod.Name, metav1.DeleteOptions{})
-		if deleteErr != nil && !errors.IsNotFound(deleteErr) {
-			GinkgoWriter.Printf("Warning: Failed to delete prometheus-adapter pod %s: %v\n", pod.Name, deleteErr)
-		} else {
-			GinkgoWriter.Printf("Deleted prometheus-adapter pod: %s\n", pod.Name)
-		}
+		Expect(deleteErr == nil || errors.IsNotFound(deleteErr)).To(BeTrue(),
+			"failed to delete prometheus-adapter pod %s: %v", pod.Name, deleteErr)
+		GinkgoWriter.Printf("Deleted prometheus-adapter pod: %s\n", pod.Name)
 	}
 	By("Waiting for prometheus-adapter pods to be ready")
 	tm := time.Duration(cfg.EventuallyLongSec) * time.Second
 	poll := time.Duration(cfg.PollIntervalSec) * time.Second
 	Eventually(func(g Gomega) {
 		pods, err := k8sClient.CoreV1().Pods(cfg.MonitoringNS).List(ctx, metav1.ListOptions{
 			LabelSelector: "app.kubernetes.io/name=prometheus-adapter",
 		})
 		g.Expect(err).NotTo(HaveOccurred())
 		g.Expect(pods.Items).NotTo(BeEmpty(), "At least one prometheus-adapter pod should exist")
 		ready := 0
 		for _, pod := range pods.Items {
 			if pod.Status.Phase == corev1.PodRunning {
 				for _, condition := range pod.Status.Conditions {
 					if condition.Type == corev1.PodReady && condition.Status == corev1.ConditionTrue {
 						ready++
 						break
 					}
 				}
 			}
 		}
 		g.Expect(ready).To(BeNumerically(">", 0), "At least one prometheus-adapter pod should be ready")
+		_, discErr := k8sClient.Discovery().ServerResourcesForGroupVersion("external.metrics.k8s.io/v1beta1")
+		g.Expect(discErr).NotTo(HaveOccurred(), "external.metrics API should be registered after adapter restart")
 	}, tm, poll).Should(Succeed())
 	GinkgoWriter.Println("prometheus-adapter pods restarted and ready")
 }

Also applies to: 182-223

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/suite_test.go` around lines 50 - 87, The
restartPrometheusAdapterPods helper currently only warns on non-NotFound delete
errors and treats success as any single pod being Ready; change it so
non-NotFound delete errors fail the test (make the deleteErr check for each pod
use Expect/delete failure -> NotTo(HaveOccurred()) or equivalent fatal) and
after the pod readiness loop add a discovery check that the API group/version
external.metrics.k8s.io/v1beta1 is available (use the Kubernetes discovery
client via
k8sClient.Discovery().ServerResourcesForGroupVersion("external.metrics.k8s.io/v1beta1")
or similar) and gate the Eventually/Succeed on both pod readiness and successful
discovery of that API; update the same logic in the analogous function/section
around lines 182-223.

Comment thread test/utils/logging.go
Comment on lines +43 to +46
raw, readErr := func() ([]byte, error) {
defer func() { _ = stream.Close() }()
return io.ReadAll(stream)
}()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the file at test/utils/logging.go
cat -n test/utils/logging.go | head -60

Repository: opendatahub-io/workload-variant-autoscaler

Length of output: 2042


🏁 Script executed:

# Check if this is test code or production code and understand the function context
cd test/utils && pwd && ls -la logging.go

Repository: opendatahub-io/workload-variant-autoscaler

Length of output: 175


🏁 Script executed:

# Look at the call site mentioned in the scratchpad
grep -n "test/e2e/saturation_analyzer_path_test.go" -r . 2>/dev/null || head -150 test/e2e/saturation_analyzer_path_test.go | tail -50

Repository: opendatahub-io/workload-variant-autoscaler

Length of output: 68


Bound pod log reads to prevent memory-exhaustion DoS (CWE-400).

Severity: Major. io.ReadAll(stream) reads unbounded data from Kubernetes pod log streams; a large or noisy pod in e2e/saturation tests can force OOM in CI or local runs, especially in a loop over multiple pods.

Suggested fix
+const maxPodLogBytes int64 = 1 << 20 // 1 MiB per pod
+
 	raw, readErr := func() ([]byte, error) {
 		defer func() { _ = stream.Close() }()
-		return io.ReadAll(stream)
+		limited := io.LimitReader(stream, maxPodLogBytes+1)
+		buf, err := io.ReadAll(limited)
+		if err != nil {
+			return nil, err
+		}
+		if int64(len(buf)) > maxPodLogBytes {
+			buf = buf[:maxPodLogBytes]
+		}
+		return buf, nil
 	}()

As per coding guidelines, **/*.go: "Use io.LimitReader for HTTP response bodies (prevent memory exhaustion)".

📝 Committable suggestion


Suggested change
const maxPodLogBytes int64 = 1 << 20 // 1 MiB per pod

raw, readErr := func() ([]byte, error) {
	defer func() { _ = stream.Close() }()
	limited := io.LimitReader(stream, maxPodLogBytes+1)
	buf, err := io.ReadAll(limited)
	if err != nil {
		return nil, err
	}
	if int64(len(buf)) > maxPodLogBytes {
		buf = buf[:maxPodLogBytes]
	}
	return buf, nil
}()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/utils/logging.go` around lines 43 - 46, The current unbounded
io.ReadAll(stream) can OOM; change to read with a limit by wrapping the stream
with io.LimitReader and a defined max (e.g. maxPodLogBytes) before calling
io.ReadAll, keep the existing defer to stream.Close(), and surface a clear error
when content is truncated; locate the anonymous reader invocation that assigns
raw, readErr (the func() { defer stream.Close(); return io.ReadAll(stream) }())
and replace it with io.ReadAll(io.LimitReader(stream, maxPodLogBytes)) and add a
constant (maxPodLogBytes) near the top of the file or re-use an existing
constant so tests cannot exhaust memory.

vishakha-ramani and others added 4 commits April 2, 2026 15:53
* Added readme for queueing model analyzer

* update readme

* update readme

* restructure readme

* moved slo driven queueing model analyzer readme under /docs

* fix: use ScaleTargetAccessor in processInactiveVariant test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: tantawi <tantawi@us.ibm.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* phase 1 clean up

Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>

* phase 2 decouple

Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>

* skip modelservice install for e2es

Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>

* apply review changes from copilot

Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>

---------

Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>
Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>
# Conflicts:
#	deploy/inference-objective-e2e.yaml
#	docs/developer-guide/troubleshooting.md
@zdtsw zdtsw changed the title [sync] upstream llm-d main branch 89ef12b [2026-04-02] [sync] upstream llm-d main branch 0ee5ba1 [2026-04-06] Apr 6, 2026

@coderabbitai coderabbitai Bot left a comment


Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)
internal/controller/configmap_helpers.go (1)

58-75: ⚠️ Potential issue | 🟠 Major

Add ApplyDefaults() call before validation in parseQMAnalyzerConfig.

QueueingModelScalingConfig has omitempty fields (SLOMultiplier, TuningEnabled, TargetTTFT, TargetITL) with semantic defaults documented in code comments (SLOMultiplier=0 means "use default 3.0"; TuningEnabled pointer indicates nil=default true). However, unlike SaturationScalingConfig, it lacks an ApplyDefaults() method and parseQMAnalyzerConfig doesn't apply defaults before validation. This means zero-valued fields persist in stored configs instead of being set to documented defaults, causing incorrect runtime behavior. Either implement ApplyDefaults() on QueueingModelScalingConfig or apply defaults inline in parseQMAnalyzerConfig after unmarshal and before validate, consistent with the pattern in parseSaturationConfig.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/configmap_helpers.go` around lines 58 - 75, The
parseQMAnalyzerConfig function currently validates unmarshaled
QueueingModelScalingConfig structs without applying semantic defaults, so
zero-valued omitempty fields persist; update parseQMAnalyzerConfig to call
ApplyDefaults() on the qmConfig instance after yaml.Unmarshal and before
qmConfig.Validate(), mirroring the pattern used in parseSaturationConfig, or
alternatively implement an ApplyDefaults() method on the
QueueingModelScalingConfig type and invoke it in parseQMAnalyzerConfig so
SLOMultiplier, TuningEnabled, TargetTTFT, TargetITL get their documented
defaults prior to validation and storage.
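The ApplyDefaults pattern described above can be sketched as follows. This is a minimal illustration, not the project's actual type: the field set and the default values (SLOMultiplier=3.0, TuningEnabled=true) are taken from the review's description, and any other fields of QueueingModelScalingConfig are elided:

```go
package main

import "fmt"

// QueueingModelScalingConfig mirrors only the fields discussed in the
// review; the real type has more fields (TargetTTFT, TargetITL, ...).
type QueueingModelScalingConfig struct {
	SLOMultiplier float64 // 0 means "use default 3.0"
	TuningEnabled *bool   // nil means "default true"
}

// ApplyDefaults fills zero-valued fields with their documented defaults,
// mirroring the pattern the review attributes to SaturationScalingConfig.
// It must run after yaml.Unmarshal and before Validate.
func (c *QueueingModelScalingConfig) ApplyDefaults() {
	if c.SLOMultiplier == 0 {
		c.SLOMultiplier = 3.0
	}
	if c.TuningEnabled == nil {
		enabled := true
		c.TuningEnabled = &enabled
	}
}

func main() {
	var c QueueingModelScalingConfig // as produced by unmarshaling an empty config
	c.ApplyDefaults()
	fmt.Println(c.SLOMultiplier, *c.TuningEnabled)
}
```

With this in place, parseQMAnalyzerConfig would call `qmConfig.ApplyDefaults()` between unmarshal and validate, so stored configs never carry semantic zero values.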
charts/workload-variant-autoscaler/values.yaml (1)

40-47: ⚠️ Potential issue | 🟠 Major

Default insecureSkipVerify: true enables MITM attacks (CWE-295).

The default value disables TLS certificate verification for Prometheus connections. While marked as "Development," shipping insecure defaults risks accidental production deployment.

Consider defaulting to false and requiring explicit opt-in for development:

     tls:
-      insecureSkipVerify: true   # Development: true, Production: false
+      insecureSkipVerify: false  # Set to true only for local development

As per coding guidelines: "No InsecureSkipVerify in TLS configs (enables MITM attacks)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/workload-variant-autoscaler/values.yaml` around lines 40 - 47, The TLS
config currently defaults tls.insecureSkipVerify: true which disables
certificate verification; change the default to tls.insecureSkipVerify: false
and require explicit opt-in for development (e.g., document or provide a
separate dev-values override) and ensure the existing caCert/caCertPath fields
(caCert, caCertPath) are available and documented so users can supply a CA
rather than disabling verification; update the values description to reflect
secure-by-default behavior and how to enable relaxed verification only for
local/dev environments.
deploy/kubernetes/install.sh (1)

1-10: ⚠️ Potential issue | 🟠 Major

Add set -u to enforce strict variable checking.

This sourced script uses $MONITORING_NAMESPACE (line 17) without guaranteeing it exists. With only set -e and pipefail, a missing variable silently expands to empty, resulting in malformed URLs like https://kube-prometheus-stack-prometheus..svc.cluster.local:9090. Replace set -e and set -o pipefail with set -euo pipefail on lines 9–10, or document that callers must enable set -u before sourcing this file. This aligns with the required shell script security guideline (CWE-78).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/kubernetes/install.sh` around lines 1 - 10, The script currently sets
strict error and pipefail handling but not unset-variable checking, which can
let undefined variables like MONITORING_NAMESPACE expand to empty; update the
shell options in this sourced script (install.sh) to enable unset-variable
checking by replacing the current "set -e" and "set -o pipefail" with "set -euo
pipefail" (or, if you prefer not to force it in a sourced file, add a clear
comment at the top documenting that callers must enable "set -u" before sourcing
and validate MONITORING_NAMESPACE is defined); ensure any references to
MONITORING_NAMESPACE are validated or defaulted if you adopt the first approach.
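The first approach can be sketched like this; the default namespace value is illustrative, not taken from the repo:

```shell
#!/usr/bin/env bash
# With set -u, referencing an unset variable aborts the script instead of
# silently expanding to an empty string and producing a malformed URL.
set -euo pipefail

# Validate-or-default the variable the review flags; alternatively use
# : "${MONITORING_NAMESPACE:?must be set}" to hard-fail with a message.
MONITORING_NAMESPACE="${MONITORING_NAMESPACE:-monitoring}"

url="https://kube-prometheus-stack-prometheus.${MONITORING_NAMESPACE}.svc.cluster.local:9090"
echo "$url"
```

Without the default (or the `:?` guard), `set -u` alone would make the script exit at the first reference to the unset variable, which is still preferable to emitting `https://kube-prometheus-stack-prometheus..svc.cluster.local:9090`.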
deploy/kubernetes/README.md (1)

107-113: ⚠️ Potential issue | 🟡 Minor

Orphaned line appears to be a merge artifact.

Line 110 (export ACCELERATOR_TYPE="A100") is outside the code block (which ends at line 107) and duplicates the ACCELERATOR_TYPE export already shown at line 101. This appears to be leftover from editing—remove it or move it inside a code block.

🔧 Remove orphaned line
 # Performance tuning (optional)
 export VLLM_MAX_NUM_SEQS=64                 # vLLM max concurrent sequences (batch size)

For a complete list of all configuration options, see the Configuration Reference in the main deployment guide.
-export ACCELERATOR_TYPE="A100" # GPU type (auto-detected)
-export GATEWAY_PROVIDER="istio" # Gateway: istio or kgateway
-```

Deployment flags - Control which components to deploy:

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/kubernetes/README.md` around lines 107 - 113, Remove the orphaned duplicate export ACCELERATOR_TYPE="A100" that sits outside the fenced code block (it duplicates the export at line ~101); either delete that stray line or re-add it inside the surrounding triple-backtick code fence so both exports (ACCELERATOR_TYPE and GATEWAY_PROVIDER) are contained within the same code block and the closing fence is correctly positioned; verify the only remaining ACCELERATOR_TYPE export is the intended one and that GATEWAY_PROVIDER remains inside the code fence.

config/rbac/role.yaml (1)

29-39: ⚠️ Potential issue | 🟠 Major

Remove unnecessary list and watch verbs from secrets permission; scope to controller namespace.

The code at internal/collector/source/pod/pod_scraping_source.go:323 only reads specific secrets via client.Get() within a configurable namespace (defaulting to ServiceNamespace). The RBAC directive includes list and watch verbs that are not used and expose all cluster secrets. Replace the ClusterRole with a namespaced Role limited to the controller's namespace, or document specific secret names if cross-namespace access is required. This reduces CWE-639 (authorization bypass through user-controlled key) risk.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/rbac/role.yaml` around lines 29 - 39, The RBAC currently allows list/watch on secrets and is cluster-scoped; change it to a namespaced Role scoped to the controller namespace and remove unnecessary verbs for secrets: in config/rbac/role.yaml replace the ClusterRole (or cluster-scoped secret rules) with a Role that targets only the controller's namespace (ServiceNamespace) and for the secrets resource include only "get" (no "list" or "watch"); reference the code path internal/collector/source/pod/pod_scraping_source.go (line ~323) which uses client.Get() for specific secrets to justify the tighter, namespaced permission set.
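The namespaced alternative described for the secrets permission would look roughly like this; the Role name and namespace below are illustrative placeholders, not values from the repo:

```yaml
# Sketch: replace the cluster-scoped secrets rule with a namespaced Role.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: wva-secret-reader          # illustrative name
  namespace: wva-system            # the controller's namespace (ServiceNamespace)
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get"]                   # no list/watch; the code only calls client.Get()
```

A RoleBinding in the same namespace would then grant this Role to the controller's ServiceAccount, leaving cluster-wide secret enumeration impossible.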
internal/utils/utils.go (1)

370-401: ⚠️ Potential issue | 🟠 Major

Early return bypasses VA label fallback.

When scaleTarget != nil but GetLeaderPodTemplateSpec() returns nil (lines 374-376), the function returns "" immediately without checking the va.Labels[AcceleratorNameLabel] fallback. This could cause accelerator name resolution to fail unnecessarily.

🔧 Move the nil check to allow fallback
 func GetAcceleratorNameFromScaleTarget(va *llmdVariantAutoscalingV1alpha1.VariantAutoscaling, scaleTarget scaletarget.ScaleTargetAccessor) string {
 	// Check scaleTarget for accelerator name if it's not nil
 	if scaleTarget != nil {
 		podTemplateSpec := scaleTarget.GetLeaderPodTemplateSpec()
-		if podTemplateSpec == nil {
-			return ""
-		}
-		// Check nodeSelector first
-		if podTemplateSpec.Spec.NodeSelector != nil {
-			for _, key := range constants.GpuProductKeys {
-				if val, ok := podTemplateSpec.Spec.NodeSelector[key]; ok {
-					return val
+		if podTemplateSpec != nil {
+			// Check nodeSelector first
+			if podTemplateSpec.Spec.NodeSelector != nil {
+				for _, key := range constants.GpuProductKeys {
+					if val, ok := podTemplateSpec.Spec.NodeSelector[key]; ok {
+						return val
+					}
 				}
 			}
-		}

-		// Check nodeAffinity
-		if podTemplateSpec.Spec.Affinity != nil && podTemplateSpec.Spec.Affinity.NodeAffinity != nil {
-			if val := extractGPUFromNodeAffinity(podTemplateSpec.Spec.Affinity.NodeAffinity, constants.GpuProductKeys); val != "" {
-				return val
+			// Check nodeAffinity
+			if podTemplateSpec.Spec.Affinity != nil && podTemplateSpec.Spec.Affinity.NodeAffinity != nil {
+				if val := extractGPUFromNodeAffinity(podTemplateSpec.Spec.Affinity.NodeAffinity, constants.GpuProductKeys); val != "" {
+					return val
+				}
 			}
 		}
 	}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/utils/utils.go` around lines 370 - 401, In GetAcceleratorNameFromScaleTarget, don't return early when scaleTarget.GetLeaderPodTemplateSpec() is nil; instead skip nodeSelector/nodeAffinity checks and allow falling back to va.Labels[AcceleratorNameLabel]. Remove the immediate return in the podTemplateSpec nil branch so the function proceeds to check va (VariantAutoscaling) labels; keep existing logic for checking podTemplateSpec.Spec.NodeSelector, podTemplateSpec.Spec.Affinity (and extractGPUFromNodeAffinity with constants.GpuProductKeys) when podTemplateSpec is non-nil, but always evaluate the AcceleratorNameLabel fallback before returning.
internal/datastore/datastore.go (1)

209-211: ⚠️ Potential issue | 🟠 Major

Clear() must unregister metrics sources too.

Lines 209-211 only clear ds.pools. After this change, ds.registry still keeps the namespaced MetricsSources alive, so a datastore reset can keep scraping deleted pools and leak stale state into the next reconciliation.

Suggested fix
 func (ds *datastore) Clear() {
+	ds.pools.Range(func(key, _ any) bool {
+		if namespacedName, ok := key.(string); ok {
+			if err := ds.registry.Unregister(namespacedName); err != nil {
+				ctrl.Log.V(logging.DEBUG).Info("Failed to unregister metrics source during clear", "namespacedName", namespacedName, "error", err)
+			}
+		}
+		return true
+	})
 	ds.pools.Clear()
 }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/datastore/datastore.go` around lines 209 - 211, The Clear method currently only clears ds.pools but must also unregister any MetricsSource objects from ds.registry to avoid stale metrics; update datastore.Clear to iterate over whatever collection holds the registered metrics sources (e.g., a map or slice you use when creating namespaced MetricsSource instances), call ds.registry.Unregister(...) for each MetricsSource, and then clear that collection in addition to ds.pools so both the registry and pool state are fully reset (refer to datastore.Clear, ds.registry, ds.pools, and your MetricsSource registration points).
♻️ Duplicate comments (2)
test/e2e/scale_from_zero_test.go (2)

817-825: ⚠️ Potential issue | 🟡 Minor

Missing timeout on VA reconciliation Eventually in LWS test.

 		Eventually(func(g Gomega) {
 			va := &variantautoscalingv1alpha1.VariantAutoscaling{}
 			err := crClient.Get(ctx, client.ObjectKey{Namespace: cfg.LLMDNamespace, Name: vaName}, va)
 			g.Expect(err).NotTo(HaveOccurred())
 			g.Expect(va.Status.Conditions).NotTo(BeEmpty(), "VA should have status conditions after reconciliation")

 			targetResolved := variantautoscalingv1alpha1.GetCondition(va, variantautoscalingv1alpha1.TypeTargetResolved)
 			g.Expect(targetResolved).NotTo(BeNil(), "VA should have TargetResolved condition")
-		}).Should(Succeed())
+		}, time.Duration(cfg.EventuallyShortSec)*time.Second, time.Duration(cfg.PollIntervalQuickSec)*time.Second).Should(Succeed())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/scale_from_zero_test.go` around lines 817 - 825, The Eventually
block checking VA reconciliation lacks an explicit timeout; update the
Eventually call in scale_from_zero_test.go (the block that reads va :=
&variantautoscalingv1alpha1.VariantAutoscaling{}, uses crClient.Get(ctx,
client.ObjectKey{Namespace: cfg.LLMDNamespace, Name: vaName}, va) and
variantautoscalingv1alpha1.GetCondition) to pass a reasonable timeout and
optional polling interval (e.g., 2m timeout, 5s interval) so the test fails fast
instead of hanging; ensure the timeout values are applied to the Eventually
invocation that asserts va.Status.Conditions and TargetResolved.

1062-1065: ⚠️ Potential issue | 🟡 Minor

Missing timeout on EPP service Eventually in single-node LWS test.

 		Eventually(func(g Gomega) {
 			_, err := k8sClient.CoreV1().Services(cfg.LLMDNamespace).Get(ctx, eppServiceName, metav1.GetOptions{})
 			g.Expect(err).NotTo(HaveOccurred(), "EPP service should exist")
-		}).Should(Succeed(), "EPP service should exist")
+		}, time.Duration(cfg.EventuallyShortSec)*time.Second, time.Duration(cfg.PollIntervalQuickSec)*time.Second).Should(Succeed(), "EPP service should exist")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/scale_from_zero_test.go` around lines 1062 - 1065, The Eventually
call that checks for the EPP service (wrapping
k8sClient.CoreV1().Services(...).Get with cfg.LLMDNamespace and eppServiceName)
lacks an explicit timeout, so add a timeout (and optional interval) to the
Eventually invocation (e.g. Eventually(func(g Gomega) { ... }, 30*time.Second,
1*time.Second)). Also ensure the time package is imported if not already; update
the Eventually call surrounding the Get to include these timeout/interval
parameters.
🟠 Major comments (22)
hack/benchmark/requirements.txt-1-3 (1)

1-3: ⚠️ Potential issue | 🟠 Major

Pin pillow≥10.2.0 to control security-critical transitive dependency.

matplotlib==3.10.8 declares pillow>=8, allowing pillow 9.5.0 (containing CVE-2023-4863, GHSA-3f63-hfp8-52jq, and GHSA-44wm-f244-xhp3). While default pip resolution fetches a patched version, explicit pinning prevents supply chain surprises. Add pillow>=10.2.0 to requirements.txt to cover CVE-2023-4863 (libwebp heap overflow) and CVE-2023-50447 (PIL.ImageMath.eval arbitrary code execution).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/benchmark/requirements.txt` around lines 1 - 3, Add an explicit pillow
pin to the requirements list by adding the entry "pillow>=10.2.0" alongside the
existing matplotlib, numpy, and PyYAML entries; update the same requirements.txt
so that the transitive pillow version is constrained to 10.2.0 or later to avoid
older vulnerable releases referenced by matplotlib (ensure the new line follows
the same format as existing entries).
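The resulting pinning would look roughly like this; only the matplotlib and pillow lines are taken from the review, the other entries are elided placeholders:

```text
# requirements.txt fragment (sketch)
matplotlib==3.10.8
pillow>=10.2.0   # explicit floor: CVE-2023-4863 (libwebp), CVE-2023-50447 (ImageMath.eval)
# numpy and PyYAML entries unchanged
```

Pinning the transitive dependency directly means pip's resolver can never satisfy matplotlib's loose `pillow>=8` constraint with a vulnerable release, even from a stale mirror or cached wheel.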
charts/workload-variant-autoscaler/templates/hpa.yaml-11-22 (1)

11-22: ⚠️ Potential issue | 🟠 Major

Add fail-fast validation for scaleTargetKind values.

The current if/else pattern silently falls back to Deployment for any unrecognized value, including typos like "leaderworkerset" (lowercase). This produces a valid manifest targeting the wrong resource kind, causing autoscaling to silently misconfigure at runtime.

🔧 Suggested change
   scaleTargetRef:
-    {{- if eq .Values.llmd.scaleTargetKind "LeaderWorkerSet" }}
+    {{- $kind := default "Deployment" .Values.llmd.scaleTargetKind }}
+    {{- if eq $kind "LeaderWorkerSet" }}
     apiVersion: leaderworkerset.x-k8s.io/v1
     kind: LeaderWorkerSet
-    {{- else }}
+    {{- else if eq $kind "Deployment" }}
     apiVersion: apps/v1
     kind: Deployment
+    {{- else }}
+    {{- fail (printf "llmd.scaleTargetKind must be Deployment or LeaderWorkerSet, got %q" $kind) }}
     {{- end }}

Apply the same guard in templates/variantautoscaling.yaml (lines 20–26).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/workload-variant-autoscaler/templates/hpa.yaml` around lines 11 - 22,
The template silently falls back to Deployment for any
.Values.llmd.scaleTargetKind, allowing typos to generate wrong manifests; add
explicit validation that .Values.llmd.scaleTargetKind is either
"LeaderWorkerSet" or "Deployment" (or unset/default) and call Helm's fail
function with a clear message if it's any other value; update the same guard in
the variant autoscaling template handling .Values.llmd.scaleTargetKind (the
block currently switching between LeaderWorkerSet and Deployment) so mis-typed
values fail-fast instead of silently producing a Deployment.
.github/workflows/ci-e2e-openshift.yaml-659-666 (1)

659-666: ⚠️ Potential issue | 🟠 Major

Tighten resource selection to match exact Helm release, not substring (CWE-269: Improper Access Control).

The jq filter contains("workload-variant-autoscaler") matches any resource whose name contains that substring—e.g., team-workload-variant-autoscaler-prod. On a shared cluster, re-annotating another release's ClusterRole/ClusterRoleBinding transfers Helm ownership and can break that release's controller RBAC. Select by the meta.helm.sh/release-name annotation instead.

🔧 Suggested change
           for kind in clusterrole clusterrolebinding; do
             kubectl get "$kind" -o json 2>/dev/null | \
-              jq -r '.items[] | select(.metadata.name | contains("workload-variant-autoscaler")) | select(.metadata.name | startswith("wva-e2e-") | not) | select(.metadata.annotations["meta.helm.sh/release-namespace"] != null) | .metadata.name' 2>/dev/null | \
+              jq -r '.items[]
+                | select(.metadata.annotations["meta.helm.sh/release-name"] == "workload-variant-autoscaler")
+                | select((.metadata.annotations["meta.helm.sh/release-namespace"] // "") != "")
+                | .metadata.name' 2>/dev/null | \
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci-e2e-openshift.yaml around lines 659 - 666, The jq
selection currently matches resources by name substring
(contains("workload-variant-autoscaler")), which can accidentally pick other
releases; instead filter by the Helm release annotation
meta.helm.sh/release-name equal to "workload-variant-autoscaler" while keeping
the existing exclusion for names starting with "wva-e2e-". Update the kubectl |
jq pipeline used in the loop (the part that builds the jq -r '.items[] |
select(...) | .metadata.name') to replace the contains(...) check with a check
on .metadata.annotations["meta.helm.sh/release-name"] ==
"workload-variant-autoscaler" so only the exact Helm release is selected for
clusterrole/clusterrolebinding adoption.
internal/utils/allocation.go-28-33 (1)

28-33: ⚠️ Potential issue | 🟠 Major

Nil pointer dereference if scaleTarget is nil.

scaleTarget.GetReplicas() is called without nil validation. If a nil accessor is passed, this panics. The test suite at capacity_store_test.go:93-96 shows nil is a valid input.

Proposed fix
 func BuildAllocationFromMetrics(
 	metrics interfaces.OptimizerMetrics,
 	va *v1alpha1.VariantAutoscaling,
 	scaleTarget scaletarget.ScaleTargetAccessor,
 	acceleratorCost float64,
 ) (interfaces.Allocation, error) {
+	if scaleTarget == nil {
+		return interfaces.Allocation{}, fmt.Errorf("scaleTarget cannot be nil for VA: %s", va.Name)
+	}
 	// Extract K8s information
 	// Number of replicas
 	var numReplicas int
 	if scaleTarget.GetReplicas() != nil {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/utils/allocation.go` around lines 28 - 33, The code dereferences
scaleTarget by calling scaleTarget.GetReplicas() without checking for nil, which
can panic; update the logic around numReplicas to first check if scaleTarget is
nil (or scaleTarget.GetReplicas() is nil) and use constants.SpecReplicasFallback
when nil; modify the block that assigns numReplicas (currently referencing
scaleTarget.GetReplicas(), numReplicas, and constants.SpecReplicasFallback) to
safely handle a nil scaleTarget by falling back to
constants.SpecReplicasFallback and only converting the pointer to int when
non-nil.
cmd/main.go-91-106 (1)

91-106: ⚠️ Potential issue | 🟠 Major

Move LWS CRD check after timeout configuration and probe group-version directly.

checkLeaderWorkerSetCRD runs at line 113 before restConfig.Timeout is set at line 223. The call to ServerGroupsAndResources() walks every APIService without a timeout, so a slow or broken aggregated API can hang startup or return partial results that omit the LeaderWorkerSet group, causing a false negative (the manager operates in deployment-only mode even though LWS is available).

Reconfigure the flow: set restConfig.Timeout before calling the CRD check, then probe using ServerResourcesForGroupVersion(constants.LeaderWorkerSetAPIVersion) to target only the LWS API group. Treat only explicit not-found errors as "LWS disabled"; propagate or log other failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/main.go` around lines 91 - 106, Move the restConfig.Timeout setup to
occur before calling checkLeaderWorkerSetCRD, and update checkLeaderWorkerSetCRD
to probe the specific LWS API group using
discoveryClient.ServerResourcesForGroupVersion(constants.LeaderWorkerSetAPIVersion)
instead of discoveryClient.ServerGroupsAndResources(); treat a concrete "not
found" response as LWS disabled but surface or log other errors (and avoid
swallowing partial-result errors), and ensure discovery client calls respect
restConfig.Timeout so slow aggregated APIs cannot hang startup.
deploy/lib/cli.sh-84-103 (1)

84-103: ⚠️ Potential issue | 🟠 Major

Guard value-taking options against missing or flag-like arguments.

Options --wva-image, --model, --accelerator, --release-name, and --environment consume $2 without validation. A call like ./install.sh --release-name --undeploy silently swallows --undeploy as the release name, leaving UNDEPLOY unset, so deployment proceeds normally instead of undeploying. Add a guard such as [[ $# -ge 2 && $2 != -* ]] before each shift 2 and exit with an error when the value is missing or starts with -.

Also, the script lacks set -euo pipefail at start—add it per shell script security standards.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/lib/cli.sh` around lines 84 - 103, Add "set -euo pipefail" near the
top of the script and harden all value-taking option branches (-i|--wva-image,
-m|--model, -a|--accelerator, -r|--release-name, -e|--environment) by checking
that a non-flag argument exists before consuming $2 (e.g., verify [[ $# -ge 2 &&
$2 != -* ]] ); if the check fails call log_error with a clear message and exit 1
instead of shifting, then proceed to parse WVA_IMAGE_REPO/WVA_IMAGE_TAG,
MODEL_ID, ACCELERATOR_TYPE, WVA_RELEASE_NAME, or ENVIRONMENT and use shift 2 as
currently implemented; keep using the existing containsElement and log_error
helpers for validation and error reporting.
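A minimal sketch of the guard described above (option names mirror the script; the real log_error helper is assumed, with a plain echo standing in here):

```shell
#!/usr/bin/env bash
# Sketch: reject value-taking flags whose value is missing or looks like another flag.

require_value() {
    # $1 = flag name, $2 = candidate value, $3 = number of args remaining at the call site
    if [[ $3 -lt 2 || $2 == -* ]]; then
        echo "ERROR: $1 requires a value" >&2   # the real script would use log_error
        return 1
    fi
}

parse_args() {
    while [[ $# -gt 0 ]]; do
        case "$1" in
            -r|--release-name)
                require_value "$1" "${2:-}" "$#" || return 1
                WVA_RELEASE_NAME="$2"
                shift 2
                ;;
            --undeploy)
                UNDEPLOY="true"
                shift
                ;;
            *)
                shift
                ;;
        esac
    done
}
```

With this guard in place, `parse_args --release-name --undeploy` fails fast instead of consuming `--undeploy` as the release name.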
internal/resources/resource.go-49-60 (1)

49-60: ⚠️ Potential issue | 🟠 Major

Classify permanent API errors and prevent timeout masking.

All non-NotFound errors are currently retried, including permanent failures (RBAC, validation, etc.), which trigger unnecessary backoff loops. When retries exhaust, wait.ErrWaitTimeout masks the actual API error. Additionally, .V(logging.DEBUG).Error() logs each retry at ERROR severity—the .V() verbosity level does not downshift the log level.

Classify errors explicitly: return permanent errors (Forbidden, Invalid) immediately without retry; retry only transient errors (Conflict, timeout, etc.); log retries at Info level; wrap and return the last transient error when backoff expires instead of the generic timeout.

See UpdateStatusWithBackoff in internal/utils/utils.go for the correct pattern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/resources/resource.go` around lines 49 - 60, The current backoff
loop in the wait.ExponentialBackoffWithContext wrapper retries all non-NotFound
errors and logs each retry at ERROR; change it to classify permanent API errors
(use apierrors.IsForbidden(err), apierrors.IsInvalid(err),
apierrors.IsUnauthorized(err) and any other non-retryable/api validation errors)
and return them immediately (no retry), while only treating transient errors
(Conflict, Timeout, etc.) as retryable; change the retry log call from
ctrl.LoggerFrom(ctx).V(logging.DEBUG).Error(...) to an Info-level log (e.g.,
ctrl.LoggerFrom(ctx).V(logging.DEBUG).Info(...)) so retries aren’t logged as
errors; track the last transient error and when the backoff expires, wrap and
return that last error instead of allowing wait.ErrWaitTimeout to mask it
(follow the pattern from UpdateStatusWithBackoff in internal/utils/utils.go).
deploy/lib/cli.sh-73-90 (1)

73-90: ⚠️ Potential issue | 🟠 Major

Image splitting breaks semantic separation of repository and tag.

IFS=':' read -r WVA_IMAGE_REPO WVA_IMAGE_TAG splits on the first :, causing:

  • quay.example:8443/ns/wva:v1 → repo=quay.example, tag=8443/ns/wva:v1
  • registry.io:5000/namespace/image:tag → repo=registry.io, tag=5000/namespace/image:tag

The variables are then passed separately to Helm (lines 75–76), where Helm expects repo to be the full image path and tag to be the version only. While Helm's template concatenates them back into a valid image reference, each variable ends up holding the wrong content. Fix by splitting on the last : instead, or accept the full image reference as a single value without splitting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/lib/cli.sh` around lines 73 - 90, The current split using IFS=':' read
... on IMG and the -i/--wva-image arg splits on the first colon and misassigns
WVA_IMAGE_REPO and WVA_IMAGE_TAG; replace that logic to split on the last ':'
instead (so WVA_IMAGE_REPO becomes the full image path up to the final colon and
WVA_IMAGE_TAG becomes only the portion after the final colon) using Bash
parameter expansion when handling both IMG and the "$2" override; update the
branches where WVA_IMAGE_REPO and WVA_IMAGE_TAG are set (the block that
currently uses IFS and the similar block inside the -i|--wva-image case)
accordingly, leaving the error/warning handling for missing ':' in place.
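The last-colon split can be done with Bash parameter expansion; a sketch (variable names mirror the script):

```shell
#!/usr/bin/env bash
# Sketch: split an image reference on the LAST colon so registry ports stay in the repo part.
split_image() {
    local img="$1"
    # Only treat the trailing segment as a tag if it contains no '/',
    # so "registry.io:5000/image" (port, no tag) is not mis-split.
    if [[ $img == *:* && ${img##*:} != */* ]]; then
        WVA_IMAGE_REPO="${img%:*}"    # everything before the final ':'
        WVA_IMAGE_TAG="${img##*:}"    # everything after the final ':'
    else
        WVA_IMAGE_REPO="$img"
        WVA_IMAGE_TAG="latest"        # fallback; the real script warns on a missing tag
    fi
}
```

Note this sketch still mis-handles digest references (`...@sha256:...`); if those can appear, check for `@` before splitting.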
deploy/lib/install_core.sh-73-75 (1)

73-75: ⚠️ Potential issue | 🟠 Major

Same issue: missing exit 1 after error log for missing environment script.

Proposed fix
     else
         log_error "Environment script not found: $SCRIPT_DIR/$ENVIRONMENT/install.sh"
+        exit 1
     fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/lib/install_core.sh` around lines 73 - 75, The else branch that logs
missing environment script uses log_error "Environment script not found:
$SCRIPT_DIR/$ENVIRONMENT/install.sh" but does not stop execution; update the
branch so that after calling log_error you call exit 1 to terminate on this
fatal error (i.e., ensure the else handling around the existence check for the
environment install script invokes exit 1 after the log_error).
deploy/lib/install_core.sh-33-41 (1)

33-41: ⚠️ Potential issue | 🟠 Major

Missing exit 1 after error log - script continues without required functions.

When the environment-specific script is not found, the script logs an error but continues to call cleanup which may be undefined, causing confusing failures.

Proposed fix
         source "$SCRIPT_DIR/$ENVIRONMENT/install.sh"
     else
         log_error "Environment-specific script not found: $SCRIPT_DIR/$ENVIRONMENT/install.sh"
+        exit 1
     fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/lib/install_core.sh` around lines 33 - 41, The error handling for
missing environment-specific script logs with log_error but then proceeds to
call cleanup and exit 0, which can invoke undefined functions; modify the branch
where [ -f "$SCRIPT_DIR/$ENVIRONMENT/install.sh" ] is false to exit with a
non-zero status immediately after logging the error (use exit 1) so the script
does not attempt to call cleanup or continue; ensure the log_error call remains
and place exit 1 right after it to abort execution when the install.sh is not
found.
deploy/lib/deploy_prometheus_kube_stack.sh-20-36 (1)

20-36: ⚠️ Potential issue | 🟠 Major

CWE-377: Predictable temporary file paths enable symlink attacks.

Using hardcoded /tmp/prometheus-tls.{key,crt} allows local attackers to pre-create symlinks, potentially causing sensitive key material to be written to attacker-controlled locations or overwriting arbitrary files.

Proposed fix using mktemp
     log_info "Creating self-signed TLS certificate for Prometheus"
+    local TMPDIR
+    TMPDIR=$(mktemp -d)
+    local TLS_KEY="$TMPDIR/prometheus-tls.key"
+    local TLS_CRT="$TMPDIR/prometheus-tls.crt"
+    trap 'rm -rf "$TMPDIR"' EXIT
+
     openssl req -x509 -newkey rsa:2048 -nodes \
-        -keyout /tmp/prometheus-tls.key \
-        -out /tmp/prometheus-tls.crt \
+        -keyout "$TLS_KEY" \
+        -out "$TLS_CRT" \
         -days 365 \
         -subj "/CN=prometheus" \
         -addext "subjectAltName=DNS:kube-prometheus-stack-prometheus.${MONITORING_NAMESPACE}.svc.cluster.local,DNS:kube-prometheus-stack-prometheus.${MONITORING_NAMESPACE}.svc,DNS:prometheus,DNS:localhost" \
-        &> /dev/null
+        || { log_error "Failed to generate TLS certificate"; return 1; }

     log_info "Creating Kubernetes secret for Prometheus TLS"
     kubectl create secret tls "$PROMETHEUS_SECRET_NAME" \
-        --cert=/tmp/prometheus-tls.crt \
-        --key=/tmp/prometheus-tls.key \
+        --cert="$TLS_CRT" \
+        --key="$TLS_KEY" \
         -n "$MONITORING_NAMESPACE" \
         --dry-run=client -o yaml | kubectl apply -f - &> /dev/null
-
-    rm -f /tmp/prometheus-tls.key /tmp/prometheus-tls.crt
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/lib/deploy_prometheus_kube_stack.sh` around lines 20 - 36, Replace the
hardcoded temp paths /tmp/prometheus-tls.key and /tmp/prometheus-tls.crt with
securely created temporary files (use mktemp to generate unique filenames and
store them in variables, e.g., PROM_KEY_FILE and PROM_CRT_FILE), set strict
permissions (chmod 600) on those files before writing, update the openssl
command and kubectl tls secret creation to reference these temp file variables
(maintain use of PROMETHEUS_SECRET_NAME and MONITORING_NAMESPACE), and add a
trap/cleanup to securely remove the temp files on exit/failure to prevent
symlink attacks.
deploy/lib/cleanup.sh-80-86 (1)

80-86: ⚠️ Potential issue | 🟠 Major

Unsafe rm -rf without variable validation (SC2115, CWE-22).

If $WVA_PROJECT or $LLM_D_PROJECT is empty/unset, this could expand to rm -rf / or delete unintended paths. Use parameter expansion with :? to fail-fast on empty values.

Proposed fix
     log_info "Deleting llm-d cloned repository..."
-    if [ ! -d "$WVA_PROJECT/$LLM_D_PROJECT" ]; then
+    if [ ! -d "${WVA_PROJECT:?}/${LLM_D_PROJECT:?}" ]; then
         log_warning "llm-d repository directory not found, skipping deletion"
     else
-        rm -rf "$WVA_PROJECT/$LLM_D_PROJECT" 2>/dev/null || \
+        rm -rf "${WVA_PROJECT:?}/${LLM_D_PROJECT:?}" 2>/dev/null || \
             log_warning "Failed to delete llm-d repository directory"
     fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/lib/cleanup.sh` around lines 80 - 86, The rm -rf call can operate on
unintended paths if WVA_PROJECT or LLM_D_PROJECT are empty; update the cleanup
block that uses log_info/log_warning to validate these variables before deletion
by using parameter expansion (e.g., ${WVA_PROJECT:?} and ${LLM_D_PROJECT:?}) or
an explicit non-empty check, construct the target path only after validation
(e.g., target="${WVA_PROJECT}/${LLM_D_PROJECT}"), and then run rm -rf on that
validated target while keeping the existing fallback log_warning on failure;
ensure you reference WVA_PROJECT and LLM_D_PROJECT and preserve the current
log_info/log_warning behavior.
hack/benchmark/dump_epp_fc_metrics/dump_all_metrics.py-83-91 (1)

83-91: ⚠️ Potential issue | 🟠 Major

Duplicate shell=True pattern in query_prometheus_range.

Same CWE-78 concern as query_prometheus. Extract a shared helper to avoid duplication and apply the safer argument-list approach once.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/benchmark/dump_epp_fc_metrics/dump_all_metrics.py` around lines 83 - 91,
Both query_prometheus_range and query_prometheus duplicate unsafe subprocess
calls using shell=True to find the prometheus pod; extract a shared helper
(e.g., get_prometheus_pod or run_oc_pod_lookup) and reimplement the pod lookup
with a safe argument-list subprocess.run (shell=False) building an args list
(["oc","get","pods","-n", namespace, "-l", label_selector, "--no-headers"]) and
parse stdout in Python instead of using grep/awk; update both
query_prometheus_range and query_prometheus to call this helper and keep the
same fallback logic that sets pod = "prometheus-user-workload-0" if
user_workload else "prometheus-k8s-0" on errors.
hack/benchmark/dump_epp_fc_metrics/dump_all_metrics.py-17-17 (1)

17-17: ⚠️ Potential issue | 🟠 Major

--insecure-skip-tls-verify=true disables certificate validation (CWE-295).

This enables MITM attacks against the OpenShift API server. Consider making this configurable or removing it entirely, requiring users to have proper CA trust configured.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/benchmark/dump_epp_fc_metrics/dump_all_metrics.py` at line 17, The
subprocess.run call that logs into OC currently hardcodes the unsafe flag
"--insecure-skip-tls-verify=true", which disables certificate validation; change
the login invocation in dump_all_metrics.py so the flag is not always passed:
introduce a configurable boolean (e.g., insecure_skip_tls or an env/CLI option)
and only append f"--insecure-skip-tls-verify=true" to the arguments when that
boolean is explicitly true, otherwise omit the flag so that proper CA trust is
required; update the code that builds the subprocess.run args around the
existing subprocess.run([... f"--server={server}", ...], capture_output=True,
check=True) call and ensure documentation or default behavior requires valid
server certificates.
deploy/lib/infra_llmd.sh-1-9 (1)

1-9: ⚠️ Potential issue | 🟠 Major

Missing set -euo pipefail for script robustness.

Per coding guidelines, shell scripts should use set -euo pipefail to fail fast on errors, undefined variables, and pipeline failures.

 #!/usr/bin/env bash
 #
 # Shared llm-d infrastructure deployment helpers for deploy/install.sh.
+set -euo pipefail
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/lib/infra_llmd.sh` around lines 1 - 9, The script lacks safe-shell
flags; add "set -euo pipefail" immediately after the shebang (#!/usr/bin/env
bash) at the top of deploy/lib/infra_llmd.sh so the script fails fast on errors,
treats undefined variables as errors, and catches pipeline failures; ensure this
change does not break callers that expect relaxed behavior around variables
(e.g., code referencing LLMD_NS, WVA_NS, EXAMPLE_DIR, WVA_PROJECT,
GATEWAY_PROVIDER, LLM_D_*), and run a quick CI test to confirm functions like
containsElement(), wait_deployment_available_nonfatal(), and
detect_inference_pool_api_group() still behave as expected.
hack/benchmark/dump_epp_fc_metrics/dump_all_metrics.py-50-58 (1)

50-58: ⚠️ Potential issue | 🟠 Major

shell=True with string interpolation is a command injection vector (CWE-78).

While namespace and label_selector are currently hardcoded, this pattern is fragile. If these values ever derive from user input, it becomes exploitable.

🔧 Use subprocess with argument list instead
-        cmd_str = f"oc get pods -n {namespace} -l {label_selector} --no-headers | grep -v Terminating | awk '{{print $1}}' | head -n 1"
-        pod_res = subprocess.run(cmd_str, shell=True, capture_output=True, text=True, check=True)
-        pod = pod_res.stdout.strip()
+        pod_res = subprocess.run(
+            ["oc", "get", "pods", "-n", namespace, "-l", label_selector, 
+             "--no-headers", "-o", "jsonpath={.items[?(@.status.phase!='Terminating')].metadata.name}"],
+            capture_output=True, text=True, check=True
+        )
+        pods = pod_res.stdout.strip().split()
+        pod = pods[0] if pods else ""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/benchmark/dump_epp_fc_metrics/dump_all_metrics.py` around lines 50 - 58,
The code builds a shell string and calls subprocess.run with shell=True
(cmd_str, subprocess.run(..., shell=True)), which is a command-injection risk;
change it to call subprocess.run with an argument list and avoid shell usage:
construct args like ['oc','get','pods','-n', namespace, '-l', label_selector,
'-o','jsonpath={.items[0].metadata.name}'] (or use explicit
['oc','get','pods','-n', namespace, '-l', label_selector, '--no-headers'] and
parse stdout safely) and call subprocess.run(args, capture_output=True,
text=True, check=True) (no shell=True); also validate/sanitize namespace and
label_selector before using them and keep the existing fallback logic around
pod, pod_res and the exception handling.
hack/benchmark/extract/get_benchmark_report.py-143-162 (1)

143-162: ⚠️ Potential issue | 🟠 Major

Don’t turn missing latency metrics into a perfect autoscaling score.

Lines 143-162 coerce absent TTFT/ITL values to 0, and Lines 419-480 score those zeros. A run with no successful latency samples can therefore report Final Score = 0.00, which is the best possible result, instead of “unavailable”.

Suggested fix
-            if has_results and len(ttft_p99) > 0 and len(itl_p99) > 0:
+            valid_ttft_p99 = [value for value in ttft_p99 if value > 0]
+            valid_itl_p99 = [value for value in itl_p99 if value > 0]
+            if has_results and valid_ttft_p99 and valid_itl_p99:
                 # Find maximum worst-case P99 latencies across all benchmarks
-                max_ttft_p99 = max(ttft_p99)
-                max_itl_p99 = max(itl_p99)
+                max_ttft_p99 = max(valid_ttft_p99)
+                max_itl_p99 = max(valid_itl_p99)
+            elif has_results:
+                report_text += "\nAutoscaling Run Score unavailable: missing TTFT/ITL P99 metrics.\n"

Also applies to: 419-480

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/benchmark/extract/get_benchmark_report.py` around lines 143 - 162, The
code currently coerces missing latency metrics (ttft and itl) to 0 by appending
0 into ttft_mean, ttft_p99, itl_mean, itl_p99 (and similarly for reqlat) which
lets the scoring logic later (the autoscaling scoring block around the 419-480
region) treat absent data as a perfect score; instead, stop coercing missing
values to 0: when metrics.get(...).get('successful', {}) yields no data, append
a sentinel (e.g., None) or skip appending to
ttft_mean/ttft_p99/itl_mean/itl_p99/req_lat_mean, and then update the scoring
logic that reads those arrays to treat missing/None/empty lists as “unavailable”
(returning unavailable or NaN) rather than computing a numeric score—use the
variable names ttft, itl, reqlat, ttft_mean, itl_mean, req_lat_mean and adjust
the scoring block that computes Final Score accordingly.
hack/benchmark/extract/get_benchmark_report.py-200-225 (1)

200-225: ⚠️ Potential issue | 🟠 Major

Make --results-dir required before using it.

Line 222 unconditionally dereferences args.results_dir, but the parser still defaults that flag to None. Omitting -r/--results-dir raises a TypeError before the intended missing-file error path runs.

Suggested fix
     parser.add_argument(
         "-r", "--results-dir",
-        default=None,
+        required=True,
         help="Path to a GuideLLM exp-docs folder to parse results.json for succeed/failed requests."
     )
     args = parser.parse_args()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/benchmark/extract/get_benchmark_report.py` around lines 200 - 225, The
code unconditionally uses args.results_dir when building dump_file causing a
TypeError if the CLI flag is omitted; update the parser.add_argument that
defines "-r/--results-dir" (the argument for results_dir) to make it required
(add required=True) or add an explicit guard right after args =
parser.parse_args() that checks if args.results_dir is None and prints a clear
error + sys.exit(1) before calling os.path.join; modify either the parser
definition or the check around dump_file creation in get_benchmark_report.py
(references: the "-r", "--results-dir" parser.add_argument and the dump_file =
os.path.join(args.results_dir, "all_metrics_dump.json") line).
hack/benchmark/extract/get_benchmark_report.py-86-97 (1)

86-97: ⚠️ Potential issue | 🟠 Major

Treat incomplete RPS as a failure signal in serving-capacity gating.

Lines 90-97 only gate on errored.mean. A run with 0 errored RPS and non-zero incomplete RPS still counts as healthy here, so the reported “true serving capacity” can jump past the point where requests are already being dropped.

Suggested fix
             achieved_rps = metrics.get('requests_per_second', {}).get('successful', {}).get('mean', 0)
             err_rps = metrics.get('requests_per_second', {}).get('errored', {}).get('mean', 0)
+            incomplete_rps = metrics.get('requests_per_second', {}).get('incomplete', {}).get('mean', 0)
             tokens_sec = metrics.get('tokens_per_second', {}).get('successful', {}).get('mean', 0)
+            total_bad_rps = err_rps + incomplete_rps
             
-            details.append(f"Rate: {rate} RPS | Achieved: {achieved_rps:.2f} RPS | Errors: {err_rps:.2f} RPS | Tokens/s: {tokens_sec:.2f}")
+            details.append(
+                f"Rate: {rate} RPS | Achieved: {achieved_rps:.2f} RPS | "
+                f"Errored: {err_rps:.2f} RPS | Incomplete: {incomplete_rps:.2f} RPS | "
+                f"Tokens/s: {tokens_sec:.2f}"
+            )
             
-            if err_rps == 0 and achieved_rps > max_achieved_rps:
+            if total_bad_rps == 0 and achieved_rps > max_achieved_rps:
                 max_achieved_rps = achieved_rps
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/benchmark/extract/get_benchmark_report.py` around lines 86 - 97, The
gating logic currently only checks errored RPS (err_rps) when updating
max_achieved_rps; modify the loop that processes each benchmark (the block using
variables rate, achieved_rps, err_rps, tokens_sec and appending to details) to
also extract incomplete_rps from metrics.get('requests_per_second',
{}).get('incomplete', {}).get('mean', 0) and include it in the printed details
string, and change the condition that updates max_achieved_rps to require both
err_rps == 0 and incomplete_rps == 0 (i.e., treat any non-zero incomplete_rps as
a failure signal in the serving-capacity gating).
hack/benchmark/extract/get_benchmark_report.py-243-275 (1)

243-275: ⚠️ Potential issue | 🟠 Major

Stop hard-coding phase _3.yaml as the benchmark end.

Lines 245-251 assume the benchmark always emits _0.yaml and _3.yaml. Any run with a different phase count clamps start_time/end_time to the wrong interval, so the Prometheus range no longer matches the benchmark that was actually executed.

Suggested fix
     if results_dir:
-        import yaml
-        yaml_first = os.path.join(results_dir, "benchmark_report,_results.json_0.yaml")
-        yaml_first_alt = os.path.join(results_dir, "benchmark_report_v0.2,_results.json_0.yaml")
-        start_yaml = yaml_first if os.path.exists(yaml_first) else (yaml_first_alt if os.path.exists(yaml_first_alt) else None)
-        
-        yaml_last = os.path.join(results_dir, "benchmark_report,_results.json_3.yaml")
-        yaml_last_alt = os.path.join(results_dir, "benchmark_report_v0.2,_results.json_3.yaml")
-        end_yaml = yaml_last if os.path.exists(yaml_last) else (yaml_last_alt if os.path.exists(yaml_last_alt) else None)
+        import glob
+        import yaml
+        yaml_files = sorted(glob.glob(os.path.join(results_dir, "benchmark_report*_results.json_*.yaml")))
+        start_yaml = yaml_files[0] if yaml_files else None
+        end_yaml = yaml_files[-1] if yaml_files else None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/benchmark/extract/get_benchmark_report.py` around lines 243 - 275, The
code incorrectly hard-codes `_0.yaml` and `_3.yaml` as start/end files; update
the logic in get_benchmark_report.py (around results_dir handling where
yaml_first/yaml_last and start_yaml/end_yaml are set) to dynamically discover
all matching benchmark YAMLs in results_dir (e.g., match patterns like
"benchmark_report*_results.json_*.yaml" including v0.2 variants), parse the
phase index from each filename, then select the file with the smallest phase
index as start_yaml and the largest as end_yaml; keep the existing
timestamp/metrics parsing that uses start_yaml/end_yaml and fall back to the
previous behavior if no matches are found.
deploy/lib/infra_wva.sh-12-39 (1)

12-39: ⚠️ Potential issue | 🟠 Major

OpenShift deploys with certificate verification disabled, and PEM validation fails open.

CWE-295 (improper certificate validation). Lines 23–29 unconditionally set SKIP_TLS_VERIFY=true for OpenShift even when CA extraction succeeds. Lines 255–257 merely warn on an invalid PEM format and continue anyway. An attacker can spoof in-cluster metrics endpoints, feeding false data into autoscaling decisions. Respect pre-set environment overrides (check before overwriting), default OpenShift to verification-enabled, and fail closed on PEM validation failures.

Suggested fix
 set_tls_verification() {
     log_info "Setting TLS verification..."
 
+    if [ -n "${SKIP_TLS_VERIFY:-}" ]; then
+        export SKIP_TLS_VERIFY
+        log_success "Using caller-provided TLS verification setting: $SKIP_TLS_VERIFY"
+        return
+    fi
+
     # Auto-detect TLS verification setting if not specified
     if ! containsElement "$ENVIRONMENT" "${NON_EMULATED_ENV_LIST[@]}"; then
             SKIP_TLS_VERIFY="true"
 @@ -26,7 +31,5 @@ set_tls_verification() {
             "openshift")
                 # For OpenShift, we can use proper TLS verification since we have the Service CA
-                # However, defaulting to true for now to match current behavior
-                # TODO: Set to false once Service CA certificate extraction is fully validated
-                SKIP_TLS_VERIFY="true"
+                SKIP_TLS_VERIFY="false"
                 log_info "OpenShift cluster - TLS verification setting: $SKIP_TLS_VERIFY"
                 ;;
@@ -253,8 +258,8 @@ extract_openshift_prometheus_ca() {
 
     # Verify the certificate is valid PEM format
     if ! openssl x509 -in "$PROM_CA_CERT_PATH" -text -noout &> /dev/null; then
-        log_warning "Certificate file may not be in valid PEM format, but continuing..."
-        log_warning "If TLS errors occur, verify the certificate format is correct"
+        log_error "Extracted certificate is not valid PEM"
+        exit 1
     else
         # Log certificate details for debugging
         local cert_subject

Also: line 73 has unquoted $VALUES_FILE in helm command; should be "$VALUES_FILE".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/lib/infra_wva.sh` around lines 12 - 39, The script currently forces
SKIP_TLS_VERIFY="true" for OpenShift and when environment is unknown and does
not respect pre-set overrides, and PEM validation only warns (lines around the
PEM check) instead of failing; update the logic in the block that sets
SKIP_TLS_VERIFY (the containsElement check and the case branch for "openshift")
to first check if SKIP_TLS_VERIFY is already exported/set and avoid overwriting
it, change the OpenShift default to enable verification
(SKIP_TLS_VERIFY="false") when CA extraction succeeds (keep a fallback to true
only if extraction fails), and modify the PEM validation routine (the function
or code that validates the PEM) to exit non‑zero on invalid PEM instead of only
logging a warning; finally, fix the helm invocation that uses $VALUES_FILE (line
~73) to quote the variable as "$VALUES_FILE" to prevent word-splitting.
deploy/lib/infra_wva.sh-71-104 (1)

71-104: ⚠️ Potential issue | 🟠 Major

Quote all variables and build Helm arguments as an array to prevent word-splitting injection.

CWE-78. Lines 71-104 have unquoted variables that can split on whitespace: $VALUES_FILE (line 73), ${WVA_PROJECT} (line 71), and conditional expansions like ${CONTROLLER_INSTANCE:+--set wva.controllerInstance=$CONTROLLER_INSTANCE} (lines 101-104). A values file or project path with spaces will fragment into spurious CLI arguments. Although the caller (deploy/install.sh) runs with set -e, this sourced library should not rely on callers for error handling.

Suggested fix
-    helm upgrade -i "$WVA_RELEASE_NAME" ${WVA_PROJECT}/charts/workload-variant-autoscaler \
-        -n "$WVA_NS" \
-        --values $VALUES_FILE \
+    local -a helm_args=(
+        upgrade -i "$WVA_RELEASE_NAME" "${WVA_PROJECT}/charts/workload-variant-autoscaler"
+        -n "$WVA_NS"
+        --values "$VALUES_FILE"
         --set-file wva.prometheus.caCert="$PROM_CA_CERT_PATH" \
         --set wva.image.repository="$WVA_IMAGE_REPO" \
         --set wva.image.tag="$WVA_IMAGE_TAG" \
         --set wva.imagePullPolicy="$WVA_IMAGE_PULL_POLICY" \
         --set wva.baseName="$WELL_LIT_PATH_NAME" \
@@
-        --set wva.metrics.secure="$WVA_METRICS_SECURE" \
-        --set wva.scaleToZero="$ENABLE_SCALE_TO_ZERO" \
-        ${CONTROLLER_INSTANCE:+--set wva.controllerInstance=$CONTROLLER_INSTANCE} \
-        ${POOL_GROUP:+--set wva.poolGroup=$POOL_GROUP} \
-        ${KV_SPARE_TRIGGER:+--set wva.capacityScaling.default.kvSpareTrigger=$KV_SPARE_TRIGGER} \
-        ${QUEUE_SPARE_TRIGGER:+--set wva.capacityScaling.default.queueSpareTrigger=$QUEUE_SPARE_TRIGGER}
+        --set wva.metrics.secure="$WVA_METRICS_SECURE"
+        --set wva.scaleToZero="$ENABLE_SCALE_TO_ZERO"
+    )
+    [ -n "${CONTROLLER_INSTANCE:-}" ] && helm_args+=(--set "wva.controllerInstance=$CONTROLLER_INSTANCE")
+    [ -n "${POOL_GROUP:-}" ] && helm_args+=(--set "wva.poolGroup=$POOL_GROUP")
+    [ -n "${KV_SPARE_TRIGGER:-}" ] && helm_args+=(--set "wva.capacityScaling.default.kvSpareTrigger=$KV_SPARE_TRIGGER")
+    [ -n "${QUEUE_SPARE_TRIGGER:-}" ] && helm_args+=(--set "wva.capacityScaling.default.queueSpareTrigger=$QUEUE_SPARE_TRIGGER")
+
+    if ! helm "${helm_args[@]}"; then
+        log_error "WVA Helm install failed"
+        return 1
+    fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/lib/infra_wva.sh` around lines 71 - 104, The helm upgrade invocation
uses unquoted and directly-expanded variables (e.g., $VALUES_FILE,
${WVA_PROJECT}, and the conditional expansions like ${CONTROLLER_INSTANCE:+--set
wva.controllerInstance=$CONTROLLER_INSTANCE}) which allows
word-splitting/injection; change the deployment code to build the Helm CLI as an
array and append each argument as a separate element, quoting/escaping variable
values when adding them (e.g., construct an array like helm_args=(upgrade -i
"$WVA_RELEASE_NAME" "${WVA_PROJECT}/charts/workload-variant-autoscaler" -n
"$WVA_NS") then append "--values" "$VALUES_FILE", "--set"
"wva.image.repository=$WVA_IMAGE_REPO", and conditionally append "--set"
"wva.controllerInstance=$CONTROLLER_INSTANCE" only when $CONTROLLER_INSTANCE is
non-empty; finally invoke Helm with: helm "${helm_args[@]}". This ensures no
word-splitting or injection and does not rely on the caller's shell flags.
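The word-splitting hazard behind this finding is easy to demonstrate in isolation. The snippet below is a standalone illustration (`count_args` is a made-up helper, the path is invented); it shows why an unquoted expansion fragments an argument containing spaces while an args array preserves it, which is the same reason the helm invocation should use `helm "${helm_args[@]}"`.

```shell
# count_args prints how many positional arguments it received.
count_args() { echo "$#"; }

VALUES_FILE="/tmp/my values.yaml"   # a path containing a space

# Unquoted expansion: the path splits into two words, so helm would
# see "--values", "/tmp/my", "values.yaml" as three separate arguments.
unquoted=$(count_args --values $VALUES_FILE)

# Array with quoted elements: each entry stays a single argument.
args=(--values "$VALUES_FILE")
quoted=$(count_args "${args[@]}")

echo "unquoted=$unquoted quoted=$quoted"   # → unquoted=3 quoted=2
```

Because each array element is one word regardless of its contents, appending conditional flags with `helm_args+=(--set "key=$VALUE")` is safe even when `$VALUE` contains whitespace.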


@vivekk16 vivekk16 left a comment


/lgtm
/approve

Member

@anishasthana anishasthana left a comment


/lgtm

@anishasthana anishasthana merged commit 12cb73b into opendatahub-io:main Apr 6, 2026
3 checks passed

9 participants