[sync] upstream llm-d main branch 8be94cd [2026-04-13] #64

zdtsw wants to merge 10 commits into opendatahub-io:main
Conversation
Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>
The llm-d/llm-d repo already runs `Nightly - WVA E2E (CKS)` using the same reusable workflow, same cluster, and same namespace. This copy has never passed (30/30 failures) and wastes runner time. Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>
…e metrics server in tests, since it's not needed for indexer testing. (llm-d#976) Signed-off-by: greg pereira <grpereir@redhat.com>
Signed-off-by: Braulio Dumba <Braulio.Dumba@ibm.com>
- docs/user-guide/LeaderWorkerSet-support.md: shoud -> should
- internal/engines/analyzers/queueingmodel/tuner/utils.go: expextedObservations -> expectedObservations (4 occurrences)

Note: `PNGs` in test/benchmark/config.go is correct (plural of PNG), not a typo. Closes llm-d#972
Resolve all linter violations reported by golangci-lint-v2, including:
- perfsprint: replace fmt.Errorf with errors.New for static strings,
fmt.Sprintf("%d",...) with strconv.Itoa, and fmt.Sprintf with string
concatenation where applicable
- goconst: extract repeated string literals into package-level constants
- gocritic: rewrite if-else chains as switch statements, fix captLocal,
assignOp, appendAssign, singleCaseSwitch, elseif, deprecatedComment
- ginkgolinter: replace len(x) assertions with HaveLen/BeEmpty
- prealloc: pre-allocate slices where capacity is known
- gofmt: auto-fix formatting
- loggercheck: fix odd key-value pairs in structured logging calls
- revive: fix comment spacing (//TODO -> // TODO, /** */ -> //)
- fatcontext: add nolint directives for test suite setup patterns
- unparam: handle unused parameters
- copyloopvar: remove unnecessary tc := tc copies (Go 1.22+)
- misspell: fix throughtput -> throughput
- unconvert: remove unnecessary type conversions
- staticcheck: add .golangci.yml exclusion for deprecated Accelerator field
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Lionel Villard <villard@us.ibm.com>
# Conflicts: # docs/user-guide/scale-from-zero.md
📝 Walkthrough

This pull request introduces OpenShift platform support for the CI benchmark workflow with manual trigger capability, adds a new …

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

Issues
- CI Pipeline Security: The …
- Error Context Loss: Widespread conversion from …
- Unterminated Timeout in …
- Missing Metrics Validation in …
- Gateway Connectivity Verification Fragility in …
- Documentation Link Integrity: Multiple …
- Slice Reuse Risk in …

🚥 Pre-merge checks: ✅ Passed checks (2 passed)
Actionable comments posted: 13
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (11)
test/utils/pod_scraping_test_helpers.go (1)
555-566: ⚠️ Potential issue | 🟠 Major: Use io.LimitReader for unbounded pod log reads (CWE-400, Uncontrolled Resource Consumption).

Line 565 uses `DoRaw(ctx)` without size constraints, allowing arbitrary log payloads to exhaust heap memory. A pod emitting large logs (intentionally or through malfunction) will OOM the test process.

Proposed fix
```diff
 import (
 	"context"
 	_ "embed"
 	"fmt"
+	"io"
 	"net/http"
 	"time"
@@
-	logBytes, err := logsReq.DoRaw(ctx)
+	logStream, err := logsReq.Stream(ctx)
+	if err != nil {
+		g.Expect(err).NotTo(gom.HaveOccurred(), "Should be able to stream job logs")
+	}
+	defer func() {
+		_ = logStream.Close()
+	}()
+
+	const maxLogBytes = 1 << 20 // 1 MiB cap
+	logBytes, err := io.ReadAll(io.LimitReader(logStream, maxLogBytes))
 	g.Expect(err).NotTo(gom.HaveOccurred(), "Should be able to get job logs")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/utils/pod_scraping_test_helpers.go` around lines 555 - 566, The pod log read using logsReq.DoRaw(ctx) is unbounded and can OOM; change to a streaming read via logsReq.Stream(ctx), obtain the io.ReadCloser (e.g., stream), wrap it with io.LimitReader(stream, <MAX_BYTES>) to cap the bytes (choose a sensible MAX_BYTES like a few MBs), read from that limited reader into your logBytes, and ensure you close the stream; update references to logsReq and logBytes accordingly so tests still assert on the returned bytes.

test/benchmark/grafana.go (1)
80-86: ⚠️ Potential issue | 🟡 Minor: `InsecureSkipVerify: true` disables TLS certificate verification (CWE-295).

Per coding guidelines, `InsecureSkipVerify` enables MITM attacks. While this is test/benchmark code connecting to a localhost port-forward, consider adding a code comment documenting why this is acceptable here (local-only traffic through kubectl port-forward to in-cluster Grafana).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/benchmark/grafana.go` around lines 80 - 86, The code sets TLSClientConfig.InsecureSkipVerify = true on the httpClient which disables TLS cert verification; instead of changing behavior, add a clear inline comment next to the TLSClientConfig/InsecureSkipVerify setting (or above the httpClient construction) explaining that this is intentionally allowed because the benchmark/test uses localhost kubectl port-forwarding to an in-cluster Grafana instance (local-only traffic), and note any constraints (e.g., only for tests/benchmarks, not production) so reviewers understand the risk is accepted for this context.

internal/collector/source/pod/pod_scraping_source.go (1)
293-296: ⚠️ Potential issue | 🟠 Major: Missing `io.LimitReader` on HTTP response body, a potential memory exhaustion (CWE-400).

The response body from pod metrics endpoints is read without size limits. A compromised or malicious pod could return an arbitrarily large response, causing memory exhaustion in the controller.
Proposed fix
```diff
+const maxMetricsResponseSize = 10 * 1024 * 1024 // 10MB limit for metrics response
+
 // parsePrometheusMetrics parses Prometheus text format into MetricResult.
 func (p *PodScrapingSource) parsePrometheusMetrics(reader io.Reader, podName string) (*source.MetricResult, error) {
+	limitedReader := io.LimitReader(reader, maxMetricsResponseSize)
 	parser := expfmt.NewTextParser(model.UTF8Validation)
-	metricFamilies, err := parser.TextToMetricFamilies(reader)
+	metricFamilies, err := parser.TextToMetricFamilies(limitedReader)
```

As per coding guidelines: "Use io.LimitReader for HTTP response bodies (prevent memory exhaustion)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/collector/source/pod/pod_scraping_source.go` around lines 293 - 296, The code reads the HTTP response body directly into parsePrometheusMetrics (call site: p.parsePrometheusMetrics(resp.Body, pod.Name)), which risks memory exhaustion; wrap resp.Body with io.LimitReader using a sane max bytes constant (e.g., maxMetricsBodyBytes) and pass that limited reader to parsePrometheusMetrics instead, ensure the response body is still closed (resp.Body.Close()) after reading, and optionally return a clear error if the reader hits the limit so callers know the metrics were truncated.

test/utils/e2eutils.go (1)
209-223: ⚠️ Potential issue | 🟠 Major: Do not suppress all TLS secret creation failures.

`createTLSCertificateSecret()` still returns `nil` when `kubectl create secret tls` fails for reasons other than "already exists". `InstallPrometheusOperator()` will keep going and fail later with a much less obvious root cause.

🐛 Proposed fix
```diff
-func createTLSCertificateSecret() error { //nolint:unparam // error return kept for interface consistency
+func createTLSCertificateSecret() error {
 	certFile := "hack/tls-certs/prometheus-cert.pem"
 	keyFile := "hack/tls-certs/prometheus-key.pem"
 	cmd := exec.Command("kubectl", "create", "secret", "tls", "prometheus-tls",
 		"--cert="+certFile, "--key="+keyFile, "-n", monitoringNamespace)
 	if _, err := Run(cmd); err != nil {
-		// Secret might already exist, which is fine
-		fmt.Println("TLS secret already exists or creation failed (this is usually OK)")
+		if strings.Contains(err.Error(), "already exists") {
+			return nil
+		}
+		return fmt.Errorf("failed to create prometheus TLS secret: %w", err)
 	}
 	return nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/utils/e2eutils.go` around lines 209 - 223, The createTLSCertificateSecret function currently swallows all errors from running kubectl; change it to check the returned err from Run(cmd) and only ignore it when it indicates the secret already exists (e.g., error text contains "already exists"); for any other error return that error to the caller so InstallPrometheusOperator can fail fast. Update the error handling around the Run(cmd) call in createTLSCertificateSecret() (and keep the function signature) to log the specific failure and return the non-"already exists" error rather than always returning nil.

deploy/lib/llm_d_nightly_install.sh (1)
77-80: ⚠️ Potential issue | 🟡 Minor: Fixed 3-second sleep after Job delete is fragile.

The `time.Sleep(3 * time.Second)` after deleting the previous Job assumes deletion completes within 3 seconds. Under resource pressure, this may not hold, causing the subsequent `Create` to fail with "already exists".

Proposed fix: poll for Job absence
```diff
-	_ = k8sClient.BatchV1().Jobs(namespace).Delete(ctx, jobName, metav1.DeleteOptions{
-		PropagationPolicy: ptr.To(metav1.DeletePropagationBackground),
-	})
-	time.Sleep(3 * time.Second)
+	_ = k8sClient.BatchV1().Jobs(namespace).Delete(ctx, jobName, metav1.DeleteOptions{
+		PropagationPolicy: ptr.To(metav1.DeletePropagationForeground),
+	})
+	// Wait for Job to be fully deleted
+	_ = wait.PollUntilContextTimeout(ctx, 500*time.Millisecond, 30*time.Second, true, func(ctx context.Context) (bool, error) {
+		_, err := k8sClient.BatchV1().Jobs(namespace).Get(ctx, jobName, metav1.GetOptions{})
+		if err != nil && strings.Contains(err.Error(), "not found") {
+			return true, nil
+		}
+		return false, nil
+	})
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/lib/llm_d_nightly_install.sh` around lines 77 - 80, Replace the fragile time.Sleep(3 * time.Second) after deleting the Job with a polling loop that waits until the Job no longer exists before calling Create: after calling Delete (e.g., jobClient.Delete or DeleteJob), repeatedly attempt to Get the Job (or call jobClient.Get/GetJob) with a small backoff and timeout (e.g., check every 500ms up to a configurable deadline) and break when a NotFound error is returned; if the deadline elapses return or propagate an error instead of proceeding to Create (e.g., jobClient.Create or CreateJob). Ensure you only proceed to Create once the Get returns a NotFound to avoid the "already exists" race.

internal/controller/configmap_helpers.go (1)
111-115: ⚠️ Potential issue | 🟠 Major: `isNamespaceExcluded` fails open on namespace read errors (CWE-285, CWE-754).

On API/RBAC read failure, the function returns `false`, effectively treating the namespace as not excluded. Exploit scenario: temporary permission/API faults can bypass exclusion controls.

Suggested fix
```diff
 import (
 	"context"

 	"github.com/go-logr/logr"
 	yaml "gopkg.in/yaml.v3"
 	corev1 "k8s.io/api/core/v1"
+	apierrors "k8s.io/apimachinery/pkg/api/errors"
 	"sigs.k8s.io/controller-runtime/pkg/client"
@@
 	var ns corev1.Namespace
 	if err := c.Get(ctx, client.ObjectKey{Name: namespace}, &ns); err != nil {
-		// If namespace doesn't exist or we can't read it, default to not excluded
-		// This is safe - we'll proceed with normal logic
-		return false
+		// Fail closed on unexpected read errors for exclusion checks.
+		// Namespace not found is treated as not excluded.
+		return !apierrors.IsNotFound(err)
 	}
```

As per coding guidelines, "K8S CONTROLLER SECURITY: Proper error handling for security-critical operations."
Also applies to: 122-123
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/configmap_helpers.go` around lines 111 - 115, The isNamespaceExcluded function currently treats any client.Get error as "not excluded" (failing open); change it to fail closed: when calling c.Get(ctx, client.ObjectKey{Name: namespace}, &ns) (and the similar call at the other site), return false only if the error is a NotFound (use apierrors.IsNotFound(err)); for any other error log the failure with context (namespace and error) and return true to enforce exclusion on read/permission/API failures. Ensure you reference the function isNamespaceExcluded and update both occurrences noted in the diff.

internal/controller/predicates.go (1)
215-225: ⚠️ Potential issue | 🟠 Major: Fail-open on namespace lookup can bypass exclusion policy (CWE-285, CWE-754).

If the namespace `Get` fails, the predicate continues and may reconcile resources from excluded namespaces. Exploit scenario: transient API/RBAC failures let excluded namespaces pass policy checks.

Suggested fix
```diff
 	if namespace != "" {
 		var ns corev1.Namespace
 		// Use background context for predicate (no cancellation needed)
-		if err := k8sClient.Get(context.Background(), client.ObjectKey{Name: namespace}, &ns); err == nil {
-			annotations := ns.GetAnnotations()
-			if annotations != nil {
-				if value, exists := annotations[constants.NamespaceExcludeAnnotationKey]; exists && value == constants.AnnotationValueTrue {
-					// Namespace is excluded - filter out this VA
-					return false
-				}
-			}
-		}
-		// If namespace fetch fails, proceed with other checks (fail open)
+		if err := k8sClient.Get(context.Background(), client.ObjectKey{Name: namespace}, &ns); err != nil {
+			// Fail closed for exclusion policy checks
+			return false
+		}
+		annotations := ns.GetAnnotations()
+		if annotations != nil {
+			if value, exists := annotations[constants.NamespaceExcludeAnnotationKey]; exists && value == constants.AnnotationValueTrue {
+				// Namespace is excluded - filter out this VA
+				return false
+			}
+		}
 	}
```

As per coding guidelines, "K8S CONTROLLER SECURITY: Proper error handling for security-critical operations."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/predicates.go` around lines 215 - 225, The current predicate fails open when k8sClient.Get(...) for the namespace returns an error, which can let excluded namespaces pass; change the logic in the predicate that calls k8sClient.Get to treat any Get error as a failure-closed case: if k8sClient.Get(ctx, client.ObjectKey{Name: namespace}, &ns) returns an error, log the error and return false (exclude the resource) instead of continuing; keep the existing annotation check using constants.NamespaceExcludeAnnotationKey and constants.AnnotationValueTrue when Get succeeds, and explicitly handle apierrors.IsNotFound if you want a different behavior (otherwise treat NotFound as excluded as well).

internal/engines/saturation/engine_v2.go (1)
109-118: ⚠️ Potential issue | 🟠 Major: Reject unsupported enabled analyzers instead of silently zeroing their weight.

Any enabled analyzer whose name is not `interfaces.SaturationAnalyzerName` now contributes nothing to `baseResult.Score`, which can flatten model priority and change allocation order without any signal. Either keep explicit handling for every supported analyzer or fail fast on unknown enabled names.

Suggested fix
```diff
 	totalWeighted := 0.0
 	for _, aw := range config.Analyzers {
 		if aw.Enabled != nil && !*aw.Enabled {
 			continue
 		}
-		if aw.Name == interfaces.SaturationAnalyzerName {
-			totalWeighted += baseResult.RequiredCapacity * aw.Score
-			// future: add "throughput", "slo" cases
-		}
+		switch aw.Name {
+		case interfaces.SaturationAnalyzerName:
+			totalWeighted += baseResult.RequiredCapacity * aw.Score
+		default:
+			return nil, fmt.Errorf("unsupported analyzer %q in V2 scoring", aw.Name)
+		}
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/engines/saturation/engine_v2.go` around lines 109 - 118, The loop over config.Analyzers currently ignores any enabled analyzer whose name != interfaces.SaturationAnalyzerName, which silently zeroes their contribution; update the logic in the function containing that loop (the code that computes the weighted score using totalWeighted, baseResult.RequiredCapacity and config.Analyzers) to reject unknown enabled analyzer names instead of skipping them—when aw.Enabled is true and aw.Name is not interfaces.SaturationAnalyzerName return or surface an error (or explicitly log and fail fast) so callers know an unsupported analyzer was configured; keep the existing handling for interfaces.SaturationAnalyzerName and explicitly enumerate any future supported analyzer names in the same check.

docs/user-guide/keda-integration.md (1)
236-314: ⚠️ Potential issue | 🟠 Major: Remove `unsafeSsl: "true"` from the primary KEDA example.

This bakes TLS verification bypass into the default path. That is CWE-295: a compromised in-cluster proxy, service, or network path can spoof Prometheus responses and drive arbitrary scale decisions. Keep the secure example as the default, and move any verification-bypass workaround into an explicitly labeled local-dev troubleshooting note.
Suggested fix
```diff
-      unsafeSsl: "true"  # Skip SSL verification for self-signed certificates
+      unsafeSsl: "false"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/user-guide/keda-integration.md` around lines 236 - 314, The KEDA example currently includes unsafeSsl: "true" under the prometheus trigger metadata which bypasses TLS verification; remove the unsafeSsl field from the primary ScaledObject sample (look for the triggers -> type: prometheus block and its metadata including serverAddress, query, threshold, activationThreshold, metricType) so the default example enforces TLS verification, and instead add a clearly labeled local-dev/troubleshooting note elsewhere that explains how to temporarily set unsafeSsl for self-signed certs (do not keep it in the default config).

docs/user-guide/configuration.md (1)
163-167: ⚠️ Potential issue | 🟠 Major: Don't claim immutable ConfigMaps prevent malicious changes.

This overstates the guarantee. `immutable: true` blocks updates, but anyone who still has delete/create on ConfigMaps can replace the object and change controller settings anyway. Reword this to say immutability only prevents in-place edits and must be paired with RBAC that denies delete/recreate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/user-guide/configuration.md` around lines 163 - 167, The bullet claiming immutable ConfigMaps "Protects against malicious modifications" overstates guarantees; update the wording to state that setting immutable: true only prevents in-place edits (it blocks updates to the existing object) and does not stop actors with delete/create permissions from replacing the ConfigMap, and add a note that immutability must be paired with RBAC that denies delete/recreate to prevent replacement; modify the "Protects against malicious modifications" bullet and/or add a line under "Security Benefits" to reflect this limitation and required RBAC pairing.

internal/engines/saturation/engine.go (1)
251-275: ⚠️ Potential issue | 🟠 Major: Honor `EnableLimiter` per effective namespace config.

`enableLimiter` is taken only from the global `"default"` saturation config, but the V2 and queueing-model paths later build requests from namespace-local configs. If one namespace enables limited mode and the global default does not, that tenant still goes through the unrestricted optimizer for the entire cycle, so GPU constraints are silently skipped.

Either make limiter selection controller-global and reject namespace overrides, or partition requests by effective limiter mode and optimize each group separately.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/engines/saturation/engine.go` around lines 251 - 275, The current code reads enableLimiter only from the global "default" config (via e.Config.SaturationConfig() and cfg.EnableLimiter) but later uses namespace-local configs when building V2/queueing-model requests, so limiter is not honored per-namespace; update the optimizer-selection logic so it respects the effective per-namespace limiter mode instead of a single global flag: either (A) enforce a controller-global policy by rejecting or overriding namespace EnableLimiter overrides when computing e.optimizer, or (B) partition the request set by effective namespace EnableLimiter and run optimization separately for each partition (creating/choosing the optimizer based on enableLimiter for each partition) before merging results; locate helper symbols analyzerName, enableLimiter, hasQMAnalyzerConfig, e.optimizer, pipeline.NewGreedyByScoreOptimizer(), and pipeline.NewCostAwareOptimizer() to implement the chosen approach.
🟡 Minor comments (11)
pkg/solver/solver_test.go-617-619 (1)
617-619: ⚠️ Potential issue | 🟡 Minor: Strengthen assertion at Line 617 to validate the actual minimum allocation.

The current condition only rejects values `> 50.0`; it can still pass when a non-minimum allocation is selected but happens to be `<= 50.0`.

Proposed fix
```diff
+	expectedMin := allocs[0].Value()
+	for _, alloc := range allocs[1:] {
+		if alloc.Value() < expectedMin {
+			expectedMin = alloc.Value()
+		}
+	}
+
 	selectedAlloc := server.Allocation()
 	if selectedAlloc == nil {
 		t.Error("Expected server to have an allocation after SolveUnlimited")
-	} else if selectedAlloc.Value() > 50.0 {
+	} else if selectedAlloc.Value() != expectedMin {
 		// The allocation with minimum value should be selected
-		t.Errorf("Expected allocation with minimum value to be selected, got value %f", selectedAlloc.Value())
+		t.Errorf("Expected allocation with minimum value %f to be selected, got %f", expectedMin, selectedAlloc.Value())
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/solver/solver_test.go` around lines 617 - 619, The test currently only checks selectedAlloc.Value() > 50.0 which can miss non-minimal selections <=50.0; update the test to compute the actual minimum allocation by iterating the allocations slice (e.g., allocations or allocs), find the minimal allocation value (expectedMin), then assert that selectedAlloc.Value() equals expectedMin within a small epsilon (use math.Abs(selectedAlloc.Value()-expectedMin) < 1e-9) and fail the test (t.Errorf/t.Fatalf) if it does not; reference selectedAlloc and Value() when performing the comparison so the assertion verifies the true minimum selection.

deploy/lib/infra_wva.sh-103-104 (1)
103-104: ⚠️ Potential issue | 🟡 Minor: Unquoted variable expansion in Helm arguments.

The `$KV_CACHE_THRESHOLD` and `$QUEUE_LENGTH_THRESHOLD` values are not quoted. If these contain spaces or shell metacharacters, command parsing could be affected.

```diff
-    ${KV_CACHE_THRESHOLD:+--set wva.capacityScaling.default.kvCacheThreshold=$KV_CACHE_THRESHOLD} \
-    ${QUEUE_LENGTH_THRESHOLD:+--set wva.capacityScaling.default.queueLengthThreshold=$QUEUE_LENGTH_THRESHOLD} \
+    ${KV_CACHE_THRESHOLD:+--set "wva.capacityScaling.default.kvCacheThreshold=$KV_CACHE_THRESHOLD"} \
+    ${QUEUE_LENGTH_THRESHOLD:+--set "wva.capacityScaling.default.queueLengthThreshold=$QUEUE_LENGTH_THRESHOLD"} \
```

This follows the same unquoted pattern as existing lines (101-102, 105-106), so risk is low if values are numeric. Consider quoting all such expansions for defense-in-depth. As per coding guidelines: "Quote all variables to prevent injection".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/lib/infra_wva.sh` around lines 103 - 104, The Helm --set expansions for KV_CACHE_THRESHOLD and QUEUE_LENGTH_THRESHOLD should be quoted to prevent word-splitting or shell metacharacter injection; update the conditional expansions that set wva.capacityScaling.default.kvCacheThreshold and wva.capacityScaling.default.queueLengthThreshold to wrap the variable expansions in double quotes (i.e., quote the ${KV_CACHE_THRESHOLD} and ${QUEUE_LENGTH_THRESHOLD} expansions) so each --set value is passed as a single, safe argument.

docs/user-guide/hpa-integration.md-490-490 (1)
490-490: ⚠️ Potential issue | 🟡 Minor: Update the earlier in-page anchor after renaming this heading.

This heading now generates a different fragment ID, but Line 66 still links to the old `...hpa-integrationyaml` anchor, so that note above no longer jumps to the HPA example.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/user-guide/hpa-integration.md` at line 490, The in-page anchor pointing to the HPA example needs updating because the heading "### HPA Configuration Example (`config/samples/hpa/hpa.yaml`)" changed its fragment ID; find the link referenced earlier (the note that currently targets the old anchor `...hpa-integrationyaml`) and update its href/anchor to match the new fragment generated from this heading (typically derived from the heading text, e.g., `hpa-configuration-example-configsampleshpa-hpa-yaml`), ensuring the note on Line 66 jumps to the HPA example correctly.

deploy/lib/llm_d_nightly_install.sh-43-45 (1)
43-45: ⚠️ Potential issue | 🟡 Minor: `WVA_METRICS_SECURE=false` disables authentication on the metrics endpoint.

Setting `WVA_METRICS_SECURE=false` for OpenShift removes bearer token auth from `/metrics`. If this endpoint is network-accessible, metrics (including potentially sensitive operational data) become publicly readable. Verify this is intentional for the nightly environment and that network policies restrict access.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/lib/llm_d_nightly_install.sh` around lines 43 - 45, The defaulting of WVA_METRICS_SECURE to "false" disables auth on /metrics; change the logic so OpenShift does not silently disable metrics auth: set WVA_METRICS_SECURE default to "true" or only allow "false" when explicitly overridden (e.g., if [ -z "${WVA_METRICS_SECURE+x}" ] then export WVA_METRICS_SECURE=true), add a clear comment next to WVA_METRICS_SECURE and a runtime warning if ENVIRONMENT=openshift and WVA_METRICS_SECURE=false, and ensure MONITORING_NAMESPACE and ENVIRONMENT checks remain unchanged so the variable names (WVA_METRICS_SECURE, MONITORING_NAMESPACE, ENVIRONMENT) are easy to locate for this fix.

test/benchmark/job_waiter.go-77-80 (1)
77-80: ⚠️ Potential issue | 🟡 Minor: Race condition: fixed sleep after Job deletion.

`time.Sleep(3 * time.Second)` assumes the Job is deleted within 3 seconds. Under cluster load, deletion may take longer, causing `Create` to fail with "already exists". Use `DeletePropagationForeground` and poll for Job absence, or handle `AlreadyExists` error on create.

Proposed fix: poll for deletion completion
```diff
 	_ = k8sClient.BatchV1().Jobs(namespace).Delete(ctx, jobName, metav1.DeleteOptions{
-		PropagationPolicy: ptr.To(metav1.DeletePropagationBackground),
+		PropagationPolicy: ptr.To(metav1.DeletePropagationForeground),
 	})
-	time.Sleep(3 * time.Second)
+	// Poll until Job is gone
+	_ = wait.PollUntilContextTimeout(ctx, 500*time.Millisecond, 30*time.Second, true, func(ctx context.Context) (bool, error) {
+		_, err := k8sClient.BatchV1().Jobs(namespace).Get(ctx, jobName, metav1.GetOptions{})
+		return apierrors.IsNotFound(err), nil
+	})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/benchmark/job_waiter.go` around lines 77 - 80, The fixed 3s sleep after calling k8sClient.BatchV1().Jobs(namespace).Delete(...) is a race; replace the Background propagation and sleep with a deterministic wait: call Delete with metav1.DeletePropagationForeground (replace ptr.To(metav1.DeletePropagationBackground)), then poll using the Jobs(namespace).Get(ctx, jobName, ...) until it returns NotFound (or timeout) before proceeding to create, or alternatively catch and handle the AlreadyExists error on create (Jobs(...).Create) by retrying the poll; update the logic around the Delete call and any code that uses time.Sleep to use the polling/timeout approach instead.

internal/engines/analyzers/saturation_v2/capacity_store.go-181-181 (1)
181-181: ⚠️ Potential issue | 🟡 Minor: Inconsistent live-source check can break priority selection.
Line 181 still uses
"live"literal onbest.LearnedFromwhile other paths uselearnedFromLive. If the constant changes, live records may stop being preferred.Suggested fix
```diff
-	if best == nil || (best.LearnedFrom != "live" && rec.LearnedFrom == learnedFromLive) {
+	if best == nil || (best.LearnedFrom != learnedFromLive && rec.LearnedFrom == learnedFromLive) {
 		best = rec
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/engines/analyzers/saturation_v2/capacity_store.go` at line 181, The conditional uses a string literal "live" when checking best.LearnedFrom which is inconsistent with the constant learnedFromLive used elsewhere; update the comparison so both sides use the learnedFromLive constant (i.e., replace the literal "live" in the if condition involving best.LearnedFrom and rec.LearnedFrom) to ensure consistent live-source priority logic for the best/rec selection code in capacity_store.go.

test/benchmark/suite_test.go-91-103 (1)
91-103: ⚠️ Potential issue | 🟡 Minor: Fail fast when
`thanos-querier` has no `web` port. If the service lookup succeeds but the named port is absent, this silently falls back to
`9090` and the later port-forward failure points at the wrong problem. Track whether `web` was found and `Expect` it before continuing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/benchmark/suite_test.go` around lines 91 - 103, The code currently looks up the thanos-querier Service and scans svc.Spec.Ports for a port with port.Name == "web" but silently leaves promServicePort at its default (9090) if not found; modify the block that uses k8sClient.CoreV1().Services(promNamespace).Get(...) and the loop over svc.Spec.Ports to track a boolean (e.g., webPortFound), set promServicePort when you find the "web" port, then call Expect(webPortFound).To(BeTrue(), "thanos-querier service missing named port 'web'") (or equivalent) before continuing so the test fails fast and clearly when the named port is absent.

docs/user-guide/keda-integration.md-44-44 (1)
44-44: ⚠️ Potential issue | 🟡 Minor: Fix the broken intra-doc anchor.
The fragment on Line 44 does not resolve to the heading on Line 236, so the link at the top of the guide is dead. Use a simple heading/fragment pair or add an explicit HTML anchor.
Also applies to: 236-236
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/user-guide/keda-integration.md` at line 44, The intra-doc anchor used in the link "at the end of this doc](`#keda-scaledobject-configuration-example-configsampleskeda-scaledobjectyaml`)" is broken; update the target to match the actual heading slug or add an explicit HTML anchor above the target heading. Locate the referenced end-of-doc heading (the heading around line 236, e.g. "KEDA ScaledObject configuration example — config/samples/keda/scaledobject.yaml") and either change the link to the real fragment generated by the heading text or insert an HTML anchor like <a id="keda-scaledobject-configuration-example-configsampleskeda-scaledobjectyaml"></a> immediately before that heading so the top-of-doc link resolves correctly.

docs/user-guide/saturation-analyzer.md-243-268 (1)
243-268: ⚠️ Potential issue | 🟡 Minor

Fenced code block missing language identifier.
The ASCII diagram code block (line 243) lacks a language specifier. Add `text` or `plaintext` to satisfy linters and improve rendering consistency.

Proposed fix
````diff
-```
+```text
 ┌─────────────┐
````
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/user-guide/saturation-analyzer.md` around lines 243 - 268, The fenced ASCII diagram block beginning with "┌─────────────┐" should include a language identifier to satisfy linters and ensure consistent rendering; update the opening fence from ``` to ```text (or ```plaintext) so the block becomes a labeled plaintext code block.

docs/user-guide/saturation-analyzer.md-27-27 (1)
27-27: ⚠️ Potential issue | 🟡 Minor

Missing heading for Table of Contents entry.
Line 27 references `#multi-variant-analysis` in the TOC, but no corresponding heading exists in the document. Either remove this TOC entry or add the missing `## Multi-Variant Analysis` heading.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/user-guide/saturation-analyzer.md` at line 27, The TOC contains a link to `#multi-variant-analysis` but there is no matching heading; fix by adding a document heading that exactly matches the anchor (e.g., add a line "## Multi-Variant Analysis" where the section content should go) or remove the TOC entry; ensure the heading text "Multi-Variant Analysis" matches the TOC anchor so the link resolves.

docs/developer-guide/BENCHMARK_PHASE_1_AND_2.md-42-46 (1)
42-46: ⚠️ Potential issue | 🟡 Minor

Drop the hard-coded `gh pr checkout 900` step.

This locks the doc to a historical PR and will put readers on the wrong branch once that PR is closed or no longer matches the current benchmark code. The instructions should tell users to check out the branch or commit they actually want to benchmark.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/developer-guide/BENCHMARK_PHASE_1_AND_2.md` around lines 42 - 46, Remove the hard-coded git command "gh pr checkout 900" and replace it with a generic instruction telling users to checkout the specific branch or commit they intend to benchmark (e.g., "git checkout <branch-or-commit>" or "gh pr checkout <PR-number> if applicable"); update the step text around the unique string "gh pr checkout 900" so it advises users to verify and select the correct branch/commit for the benchmark instead of using a fixed PR number.
🧹 Nitpick comments (7)
internal/config/config_test.go (1)
69-69: Drop the ignored goroutine parameter entirely.

Passing `i` while discarding it (`_ int`) adds noise and suggests a captured-loop-index concern that no longer exists.

Suggested cleanup

```diff
- go func(_ int) {
+ go func() {
  	defer wg.Done()
  	for j := 0; j < iterations; j++ {
@@
- }(i)
+ }()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/config/config_test.go` at line 69, Remove the unused goroutine parameter: change the literal "go func(_ int) {" to a parameterless "go func() {" and remove the corresponding argument where it's invoked (the trailing "}(i)" should become "}") so the goroutine no longer declares or passes an ignored loop index; update the closure invocation accordingly in internal/config/config_test.go where "go func(_ int) {" and "}(i)" appear.

internal/utils/scaletarget/fetch_test.go (1)
836-838: Make `errorType` assertion handling exhaustive to prevent false-positive tests.

Line 836 currently validates only `"NotFound"`. Any new non-empty `errorType` value will bypass type-specific checks and may let regressions pass unnoticed. As per coding guidelines, this addresses review priority "Bug-prone patterns and error handling gaps".

Proposed diff

```diff
- if tt.errorType == "NotFound" {
- 	assert.True(t, apierrors.IsNotFound(err), "expected NotFound error")
- }
+ switch tt.errorType {
+ case "NotFound":
+ 	assert.True(t, apierrors.IsNotFound(err), "expected NotFound error")
+ case "":
+ 	// no type-specific assertion
+ default:
+ 	t.Fatalf("unhandled errorType in test case: %q", tt.errorType)
+ }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/utils/scaletarget/fetch_test.go` around lines 836 - 838, The test only checks tt.errorType == "NotFound" and ignores any other non-empty errorType, which can mask regressions; update the assertion logic around tt.errorType in the test (the block that currently calls apierrors.IsNotFound(err)) to be exhaustive: if tt.errorType == "NotFound" assert apierrors.IsNotFound(err), else if tt.errorType == "" assert no error (or use assert.NoError), and otherwise fail the test (t.Fatalf or assert.Failf) with a clear message about the unexpected tt.errorType value so any new errorType values must be handled explicitly.

internal/engines/pipeline/cost_aware_optimizer.go (1)
109-109: Consider applying the same pattern to `costAwareScaleDown` for consistency.

The augmented assignment operator is now used here, but line 188 in `costAwareScaleDown` still uses the explicit form. For consistent style:

Consistency fix for scale-down

```diff
- targets[vc.VariantName] = current - replicasToRemove
+ targets[vc.VariantName] -= replicasToRemove
  remaining -= float64(replicasToRemove) * vc.PerReplicaCapacity
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/engines/pipeline/cost_aware_optimizer.go` at line 109, The style is inconsistent: you used the augmented assignment operator in the scale-up code (targets[vc.VariantName] += replicasNeeded) but the costAwareScaleDown function still uses the explicit form; update costAwareScaleDown to use the corresponding compound assignment (e.g., targets[vc.VariantName] -= <replicasToRemove variable>) when decreasing targets so it matches the pattern and references the same targets and vc.VariantName identifiers.

internal/config/helpers.go (1)
104-104: Resolve the unresolved TODO in the config source path.Line 104 leaves runtime behavior validation as a TODO without a tracked issue/owner. Please convert this into a tracked issue (or implement/remove it) to avoid silent drift in config behavior.
I can draft the issue text (scope, acceptance criteria, and test checks) if you want me to open one.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/config/helpers.go` at line 104, Replace the stray TODO comment "TODO: check setting QUEUEING_MODEL_CONFIG_MAP_NAME" that sits next to the `return name` line with an actionable change: either implement the env-var/config override logic (read QUEUEING_MODEL_CONFIG_MAP_NAME and use it to compute/validate the returned name in the helper function) or create a tracked issue and remove the TODO. If you implement it, update the helper to prefer QUEUEING_MODEL_CONFIG_MAP_NAME when present, add unit tests that cover default and overridden behavior, and add a short comment describing the decision; if you create an issue, include scope, acceptance criteria, and test checks in the issue and replace the TODO with a one-line reference to that issue ID. Ensure the change touches the code that currently returns `name` and the TODO text so future readers see either implemented behavior or a tracked task.

config/samples/hpa/kustomization.yaml (1)
3-4: Non-standard `metadata` block in Kustomization.

The `kustomize.config.k8s.io/v1beta1` `Kustomization` kind does not use a `metadata` section. While kustomize will ignore it, this deviates from the standard schema. Consider removing it for cleanliness:

Suggested fix

```diff
 apiVersion: kustomize.config.k8s.io/v1beta1
 kind: Kustomization
-metadata:
-  name: hpa-sample
 resources:
   - va.yaml
   - hpa.yaml
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/samples/hpa/kustomization.yaml` around lines 3 - 4, Remove the non-standard metadata block from the Kustomization by deleting the top-level "metadata" mapping (the "metadata:" key and its "name: hpa-sample" entry) so the file contains only valid Kustomization fields (e.g., apiVersion, kind, resources, etc.); locate the Kustomization that currently includes the "metadata" block and remove that block to conform to the kustomize.config.k8s.io/v1beta1 Kustomization schema.

test/benchmark/job_waiter.go (1)
29-34: Use typed constants for condition status comparison.

Comparing `cond.Status` to string `"True"` works but is less idiomatic than using `corev1.ConditionTrue`.

Suggested change

```diff
- if cond.Type == batchv1.JobComplete && cond.Status == "True" {
+ if cond.Type == batchv1.JobComplete && cond.Status == corev1.ConditionTrue {
  	return true, nil
  }
- if cond.Type == batchv1.JobFailed && cond.Status == "True" {
+ if cond.Type == batchv1.JobFailed && cond.Status == corev1.ConditionTrue {
  	return false, fmt.Errorf("job failed: %s", cond.Message)
  }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/benchmark/job_waiter.go` around lines 29 - 34, The condition checks in job_waiter.go compare cond.Status to the literal "True"; update those comparisons to use the typed constant corev1.ConditionTrue instead (replace occurrences where cond.Status == "True" in the branches handling batchv1.JobComplete and batchv1.JobFailed), ensuring you import corev1 if not already imported and preserve the existing return values and fmt.Errorf call.

test/benchmark/workload.go (1)
22-22: Hardcoded image version may become stale.

Image `ghcr.io/vllm-project/guidellm:v0.5.4` is hardcoded. Consider parameterizing or documenting the version update policy for benchmark tooling.
Verify each finding against the current code and only fix it if needed. In `@test/benchmark/workload.go` at line 22, The hardcoded Docker image string assigned to the image variable ("ghcr.io/vllm-project/guidellm:v0.5.4") in workload.go can become stale; change it to be configurable (e.g., read from an environment variable, CLI flag, or config struct) and fall back to a sensible default, or add a clear comment/doc explaining the version update policy; update references to image where it's used so they consume the new configurable source (look for the image variable in workload.go and any functions that pass it along).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: e7cf4881-0b1e-4bc3-8bca-1cf711441107
⛔ Files ignored due to path filters (5)
- `docs/developer-guide/benchmark-panels/panel-desired-replicas.png` is excluded by `!**/*.png`
- `docs/developer-guide/benchmark-panels/panel-kv-cache.png` is excluded by `!**/*.png`
- `docs/developer-guide/benchmark-panels/panel-queue-depth.png` is excluded by `!**/*.png`
- `docs/developer-guide/benchmark-panels/panel-replicas.png` is excluded by `!**/*.png`
- `docs/developer-guide/benchmark-panels/panel-saturation.png` is excluded by `!**/*.png`
📒 Files selected for processing (130)
.github/workflows/ci-benchmark.yaml, .github/workflows/nightly-e2e-cks.yaml, .gitignore, .golangci.yml, CONTRIBUTING.md, Makefile, README.md, api/v1alpha1/variantautoscaling_types.go, charts/workload-variant-autoscaler/README.md, cmd/main.go, config/samples/hpa/hpa.yaml, config/samples/hpa/kustomization.yaml, config/samples/hpa/va.yaml, config/samples/keda/kustomization.yaml, config/samples/keda/scaledobject.yaml, config/samples/keda/va.yaml, deploy/README.md, deploy/install.sh, deploy/lib/infra_llmd.sh, deploy/lib/infra_wva.sh, deploy/lib/llm_d_nightly_install.sh, docs/CHANGELOG-v0.5.0.md, docs/README.md, docs/developer-guide/BENCHMARK_PHASE_1_AND_2.md, docs/developer-guide/agentic-workflows.md, docs/developer-guide/benchmark-summary.md, docs/developer-guide/testing.md, docs/tutorials/demo.md, docs/tutorials/guidellm-sample.md, docs/tutorials/parameter-estimation.md, docs/tutorials/vllm-samples.md, docs/user-guide/LeaderWorkerSet-support.md, docs/user-guide/configuration.md, docs/user-guide/hpa-integration.md, docs/user-guide/installation.md, docs/user-guide/keda-integration.md, docs/user-guide/multi-controller-isolation.md, docs/user-guide/saturation-analyzer.md, docs/user-guide/scale-from-zero.md, internal/actuator/actuator.go, internal/actuator/actuator_test.go, internal/actuator/suite_test.go, internal/collector/collector.go, internal/collector/source/pod/pod_scraping_source.go, internal/collector/source/pod_va_mapper_test.go, internal/collector/source/query_template.go, internal/config/config.go, internal/config/config_test.go, internal/config/helpers.go, internal/config/loader.go, internal/config/saturation_scaling.go, internal/config/scale_to_zero.go, internal/config/validation.go, internal/constants/labels.go, internal/controller/configmap_bootstrap.go, internal/controller/configmap_bootstrap_test.go, internal/controller/configmap_helpers.go, internal/controller/configmap_reconciler.go, internal/controller/configmap_reconciler_test.go, internal/controller/indexers/indexers_test.go, internal/controller/indexers/suite_test.go, internal/controller/predicates.go, internal/controller/suite_test.go, internal/controller/variantautoscaling_controller.go, internal/discovery/k8s_with_gpu_operator_test.go, internal/engines/analyzers/queueingmodel/analyzer.go, internal/engines/analyzers/queueingmodel/tuner/defaults.go, internal/engines/analyzers/queueingmodel/tuner/stasher.go, internal/engines/analyzers/queueingmodel/tuner/tuner.go, internal/engines/analyzers/queueingmodel/tuner/utils.go, internal/engines/analyzers/saturation_v2/analyzer.go, internal/engines/analyzers/saturation_v2/capacity_store.go, internal/engines/analyzers/saturation_v2/types.go, internal/engines/pipeline/cost_aware_optimizer.go, internal/engines/pipeline/default_limiter_test.go, internal/engines/pipeline/enforcer.go, internal/engines/pipeline/greedy_score_optimizer.go, internal/engines/pipeline/greedy_score_optimizer_test.go, internal/engines/pipeline/type_inventory.go, internal/engines/saturation/engine.go, internal/engines/saturation/engine_test.go, internal/engines/saturation/engine_v2.go, internal/engines/saturation/suite_test.go, internal/engines/scalefromzero/engine.go, internal/engines/scalefromzero/engine_test.go, internal/interfaces/saturation_analyzer.go, internal/metrics/metrics.go, internal/saturation/analyzer.go, internal/utils/scaletarget/fetch_test.go, internal/utils/tls.go, internal/utils/utils.go, internal/utils/utils_test.go, pkg/analyzer/mm1kmodel.go, pkg/analyzer/mm1modelstatedependent.go, pkg/analyzer/queueanalyzer.go, pkg/config/defaults.go, pkg/core/allocation.go, pkg/core/allocation_test.go, pkg/core/server_test.go, pkg/core/system_test.go, pkg/solver/greedy.go, pkg/solver/greedy_test.go, pkg/solver/optimizer.go, pkg/solver/solver_test.go, test/benchmark/config.go, test/benchmark/epp_config.go, test/benchmark/grafana.go, test/benchmark/hpa_helpers.go, test/benchmark/job_waiter.go, test/benchmark/prefill_heavy_benchmark_test.go, test/benchmark/scale_up_latency_benchmark_test.go, test/benchmark/setup_test.go, test/benchmark/suite_test.go, test/benchmark/workload.go, test/chart/client_only_install_test.go, test/e2e/fixtures/hpa_builder.go, test/e2e/fixtures/model_service_builder.go, test/e2e/fixtures/scaled_object_builder.go, test/e2e/fixtures/workload_builder.go, test/e2e/limiter_test.go, test/e2e/pod_scraping_test.go, test/e2e/saturation_analyzer_path_test.go, test/e2e/scale_from_zero_test.go, test/e2e/smoke_test.go, test/e2e/suite_test.go, test/utils/debug_helpers.go, test/utils/e2eutils.go, test/utils/feature_gates.go, test/utils/pod_scraping_test_helpers.go, test/utils/resources/llmdsim.go
💤 Files with no reviewable changes (8)
- internal/engines/pipeline/enforcer.go
- docs/tutorials/guidellm-sample.md
- docs/tutorials/vllm-samples.md
- docs/developer-guide/agentic-workflows.md
- docs/tutorials/demo.md
- docs/tutorials/parameter-estimation.md
- .github/workflows/nightly-e2e-cks.yaml
- internal/utils/utils_test.go
```yaml
  build-image:
    needs: gate
    if: needs.gate.outputs.run_benchmark == 'true' && (needs.gate.outputs.platform == 'openshift' || github.event.inputs.platform == 'openshift')
    runs-on: ubuntu-latest
    outputs:
      image_tag: ${{ steps.build.outputs.image_tag }}
    steps:
```
Add explicit job-level token permissions to build-image.
This new job inherits the repository default GITHUB_TOKEN scope. On a job that also handles GHCR credentials, that's unnecessary privilege expansion (CWE-250/CWE-272) if any later step is compromised.
Suggested fix
```diff
   build-image:
     needs: gate
+    permissions:
+      contents: read
     if: needs.gate.outputs.run_benchmark == 'true' && (needs.gate.outputs.platform == 'openshift' || github.event.inputs.platform == 'openshift')
     runs-on: ubuntu-latest
```

As per coding guidelines, "Set least-privilege permissions per job, not workflow level".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/ci-benchmark.yaml around lines 164 - 170, The build-image
job currently inherits full repository token scopes; restrict GITHUB_TOKEN by
adding a job-level permissions mapping under the build-image job (referencing
the job name build-image) to grant only the least privileges required for the
steps that interact with GHCR and the repo — e.g. set permissions.contents: read
and permissions.packages: write (and any other specific minimal permission your
steps need) so the job no longer uses the default elevated token; place this
permissions block directly under the build-image job definition alongside
needs/runs-on/outputs/steps.
```yaml
      - name: Checkout source
        uses: actions/checkout@v4
        with:
          ref: ${{ needs.gate.outputs.pr_head_sha }}

      - name: Log in to GHCR
        uses: docker/login-action@v3
```
🧩 Analysis chain
🏁 Script executed:
```shell
find . -name "ci-benchmark.yaml" -o -name "ci-benchmark.yml"
```

Repository: opendatahub-io/workload-variant-autoscaler
Length of output: 123
🏁 Script executed:
```shell
wc -l ./.github/workflows/ci-benchmark.yaml
```

Repository: opendatahub-io/workload-variant-autoscaler
Length of output: 128
🏁 Script executed:
```shell
sed -n '171,177p' ./.github/workflows/ci-benchmark.yaml
sed -n '473,476p' ./.github/workflows/ci-benchmark.yaml
sed -n '489,499p' ./.github/workflows/ci-benchmark.yaml
sed -n '978,993p' ./.github/workflows/ci-benchmark.yaml
```

Repository: opendatahub-io/workload-variant-autoscaler
Length of output: 1440
🏁 Script executed:
```shell
rg 'uses:' ./.github/workflows/ci-benchmark.yaml
```

Repository: opendatahub-io/workload-variant-autoscaler
Length of output: 718
Pin all GitHub Actions by full commit SHA.
This workflow uses mutable version tags (@v3, @v4, @v6, @v7) for all actions. Supply chain attacks can retarget these tags to inject malicious code with access to GHCR secrets, PR write permissions, and cluster credentials (CWE-494).
Replace version tags with full commit SHAs:
- `actions/checkout@v4` → commit SHA
- `docker/login-action@v3` → commit SHA
- `actions/github-script@v7` → commit SHA
- `actions/setup-go@v6` → commit SHA
- `actions/upload-artifact@v4` → commit SHA
- `docker/setup-buildx-action@v3` → commit SHA
Applies to lines 171-177, 473-476, 489-499, 978-993 and all other action invocations in this workflow.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/ci-benchmark.yaml around lines 171 - 177, The workflow
uses mutable action tags (e.g., actions/checkout@v4, docker/login-action@v3,
actions/github-script@v7, actions/setup-go@v6, actions/upload-artifact@v4,
docker/setup-buildx-action@v3); replace each of these version tags with the
corresponding full commit SHA to pin the action to an immutable revision across
the entire file (not just lines shown). Locate every invocation of those action
names (e.g., the steps named "Checkout source", "Log in to GHCR", any steps
using actions/github-script, actions/setup-go, actions/upload-artifact,
docker/setup-buildx-action) and update the ref to the exact commit SHA from the
official action repository release commit. Ensure all occurrences are updated
consistently and verify the SHAs are correct by cross-checking the upstream
action repos before committing.
```yaml
      - name: Install tools (kubectl, oc, helm, make)
        run: |
          sudo apt-get update && sudo apt-get install -y make
          KUBECTL_VERSION="v1.31.0"
          curl -fsSL --retry 3 --retry-delay 5 -o kubectl "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl"
          chmod +x kubectl
          sudo mv kubectl /usr/local/bin/
          curl -fsSL --retry 3 --retry-delay 5 -O "https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz"
          tar -xzf openshift-client-linux.tar.gz
          sudo mv oc /usr/local/bin/
          rm -f openshift-client-linux.tar.gz kubectl README.md
          curl -fsSL --retry 3 --retry-delay 5 https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
```
Replace curl | bash Helm installation on the self-hosted runner.
Downloading and executing a remote script from main is a direct supply-chain/RCE path (CWE-494). In this job it runs before cluster operations and after repository checkout, so a compromised upstream script gets full access to the runner and benchmark credentials.
Suggested fix
```diff
- curl -fsSL --retry 3 --retry-delay 5 https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
+ HELM_VERSION="v3.17.0"
+ curl -fsSLo helm.tar.gz "https://get.helm.sh/helm-${HELM_VERSION}-linux-amd64.tar.gz"
+ echo "<published-sha256>  helm.tar.gz" | sha256sum -c -
+ tar -xzf helm.tar.gz linux-amd64/helm
+ sudo install linux-amd64/helm /usr/local/bin/helm
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/ci-benchmark.yaml around lines 501 - 513, The Helm install
step uses an unsafe curl | bash pipeline (the line invoking
"https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash");
replace it with a deterministic, verifiable installation: download a specific
Helm release archive from the official GitHub releases (e.g., a pinned tag),
verify the downloaded archive against the accompanying SHA256 checksum or GPG
signature, extract the binary and move it to /usr/local/bin; update the step
labeled "Install tools (kubectl, oc, helm, make)" to perform these explicit
download + checksum verification + extraction actions instead of piping remote
script to bash.
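The pin-and-verify idea can also be sketched in Go (illustrative only; the helper name is not from the repo), mirroring the `sha256sum -c` step in the suggested fix:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// verifySHA256 reports whether data hashes to the expected hex digest,
// so a downloaded archive is rejected unless it matches the published checksum.
func verifySHA256(data []byte, wantHex string) bool {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]) == wantHex
}

func main() {
	payload := []byte("hello")
	// The digest below is the well-known SHA-256 of "hello".
	fmt.Println(verifySHA256(payload, "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"))
}
```

Verification must happen before the binary is executed or installed; a failed check should abort the job.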
```shell
# Start APIService guard: KEDA on this cluster continuously reclaims the
# external.metrics.k8s.io APIService. This background loop re-patches it
# every 8 seconds so the HPA can read wva_desired_replicas during the benchmark.
# Key fix: caBundle must be set to null because KEDA sets it, and Kubernetes
# rejects insecureSkipTLSVerify=true when caBundle is present.
MONITORING_NS="openshift-user-workload-monitoring"
(
  while true; do
    sleep 8
    current_svc=$(kubectl get apiservice v1beta1.external.metrics.k8s.io -o jsonpath='{.spec.service.name}' 2>/dev/null)
    current_ns=$(kubectl get apiservice v1beta1.external.metrics.k8s.io -o jsonpath='{.spec.service.namespace}' 2>/dev/null)
    if [ "$current_svc" != "prometheus-adapter" ] || [ "$current_ns" != "$MONITORING_NS" ]; then
      echo "[apiservice-guard] KEDA reclaimed (now: $current_svc/$current_ns), re-patching..."
      kubectl patch apiservice v1beta1.external.metrics.k8s.io --type=merge -p "{
        \"spec\": {
          \"caBundle\": null,
          \"insecureSkipTLSVerify\": true,
          \"service\": {
            \"name\": \"prometheus-adapter\",
            \"namespace\": \"$MONITORING_NS\"
          }
        }
      }" 2>&1 || true
```
Don't disable TLS verification on the cluster-wide external metrics APIService.
This rewires external.metrics.k8s.io to an endpoint with insecureSkipTLSVerify: true, which turns off certificate validation for a shared cluster API surface (CWE-295). Anything able to impersonate or intercept that service can spoof metrics and drive HPA decisions while the benchmark is running.
Use a real CA bundle from the adapter service certificate instead of nulling caBundle and forcing insecure TLS.
Suggested fix
```diff
-          "caBundle": null,
-          "insecureSkipTLSVerify": true,
+          "caBundle": "$PROM_ADAPTER_CA_BUNDLE_B64",
+          "insecureSkipTLSVerify": false,
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/ci-benchmark.yaml around lines 606 - 628, The patch loop
currently forces insecureSkipTLSVerify: true and clears caBundle for the
apiservice v1beta1.external.metrics.k8s.io (in the kubectl patch call inside the
while loop using MONITORING_NS and service prometheus-adapter), which disables
TLS verification; instead, retrieve the adapter's serving CA bundle (e.g., from
the prometheus-adapter TLS secret or the Service/Endpoints certificate chain),
base64-encode that CA data, and set "spec.caBundle" to that encoded value in the
kubectl patch payload while removing or setting "spec.insecureSkipTLSVerify" to
false; keep the same apiservice name and service fields (prometheus-adapter and
MONITORING_NS) but replace the null-ca/insecureSkipTLSVerify logic with code
that fetches the real CA and applies it in the patch.
```shell
# Override llm-d container image tags if set (e.g. upgrade from v0.3.0 to v0.6.0)
if [ -n "$LLMD_IMAGE_TAG" ]; then
  log_info "Overriding llm-d image tags to $LLMD_IMAGE_TAG"
  yq eval ".decode.containers[0].image = \"ghcr.io/llm-d/llm-d-cuda:${LLMD_IMAGE_TAG}\"" -i "$LLM_D_MODELSERVICE_VALUES"
  yq eval ".routing.proxy.image = \"ghcr.io/llm-d/llm-d-routing-sidecar:${LLMD_IMAGE_TAG}\"" -i "$LLM_D_MODELSERVICE_VALUES"
fi
```
LLMD_IMAGE_TAG is rewriting repositories, not just tags.
Setting a tag override here hardcodes ghcr.io/llm-d/... for decode and routing, so any mirrored registry or alternative base image in LLM_D_MODELSERVICE_VALUES gets silently replaced. It also leaves the prefill image on the old version, which can roll out mixed llm-d components.
🐛 Proposed fix
```diff
-yq eval ".decode.containers[0].image = \"ghcr.io/llm-d/llm-d-cuda:${LLMD_IMAGE_TAG}\"" -i "$LLM_D_MODELSERVICE_VALUES"
-yq eval ".routing.proxy.image = \"ghcr.io/llm-d/llm-d-routing-sidecar:${LLMD_IMAGE_TAG}\"" -i "$LLM_D_MODELSERVICE_VALUES"
+yq eval '
+  .decode.containers[0].image |= sub(":[^:@]+$"; ":" + strenv(LLMD_IMAGE_TAG)) |
+  (.prefill.containers[0].image? |= sub(":[^:@]+$"; ":" + strenv(LLMD_IMAGE_TAG))) |
+  .routing.proxy.image |= sub(":[^:@]+$"; ":" + strenv(LLMD_IMAGE_TAG))
+' -i "$LLM_D_MODELSERVICE_VALUES"
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```shell
# Override llm-d container image tags if set (e.g. upgrade from v0.3.0 to v0.6.0)
if [ -n "$LLMD_IMAGE_TAG" ]; then
  log_info "Overriding llm-d image tags to $LLMD_IMAGE_TAG"
  yq eval '
    .decode.containers[0].image |= sub(":[^:@]+$"; ":" + strenv(LLMD_IMAGE_TAG)) |
    (.prefill.containers[0].image? |= sub(":[^:@]+$"; ":" + strenv(LLMD_IMAGE_TAG))) |
    .routing.proxy.image |= sub(":[^:@]+$"; ":" + strenv(LLMD_IMAGE_TAG))
  ' -i "$LLM_D_MODELSERVICE_VALUES"
fi
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@deploy/lib/infra_llmd.sh` around lines 104 - 109, The override currently
hardcodes the ghcr.io/llm-d/... repositories by replacing the entire image
string when LLMD_IMAGE_TAG is set; change it to only replace the image tag
portion for decode and routing (and also update the prefill image) so mirrored
registries or custom repos are preserved. Locate the yq calls that set images in
LLM_D_MODELSERVICE_VALUES (the lines referencing LLMD_IMAGE_TAG and yq eval for
.decode.containers[0].image and .routing.proxy.image), and modify them to update
just the part after the ":" (or append ":${LLMD_IMAGE_TAG}" if no tag exists)
rather than overwriting the full repository; do the same for the prefill image
field to ensure all llm-d components use the same tag.
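The tag-only substitution can be illustrated with a small Go helper (hypothetical name; the deploy script itself uses yq). The point is that only the tag after the final `:` changes, so mirrored registries — including registries with ports — are preserved:

```go
package main

import (
	"fmt"
	"strings"
)

// retag replaces only the tag of an image reference, preserving whatever
// registry/repository is configured. A ":" before the last "/" belongs to a
// registry port, not a tag, so it is left alone.
func retag(img, tag string) string {
	colon := strings.LastIndex(img, ":")
	slash := strings.LastIndex(img, "/")
	if colon > slash {
		return img[:colon+1] + tag
	}
	return img + ":" + tag // no tag present; append one
}

func main() {
	fmt.Println(retag("registry.example.com:5000/mirror/llm-d-cuda:v0.3.0", "v0.6.0"))
}
```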
```go
if strings.Contains(cm.Data[configKey], "flowControl") {
	return nil
}

cm.Data[configKey] = DesiredEPPConfig
```
Don't use a flowControl substring as the idempotence check.
This returns early whenever the current config mentions flowControl, even if the scorer weights differ from DesiredEPPConfig. The benchmark can then report the 2/2/3 policy while actually running with a different EPP configuration.
Suggested fix
```diff
-if strings.Contains(cm.Data[configKey], "flowControl") {
+if cm.Data[configKey] == DesiredEPPConfig {
 	return nil
 }
 cm.Data[configKey] = DesiredEPPConfig
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
if cm.Data[configKey] == DesiredEPPConfig {
	return nil
}
cm.Data[configKey] = DesiredEPPConfig
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/benchmark/epp_config.go` around lines 73 - 77, The early return uses a
substring check on cm.Data[configKey] for "flowControl" which is too loose;
replace this with a proper idempotence check by comparing the current config to
DesiredEPPConfig (either exact string equality of cm.Data[configKey] ==
DesiredEPPConfig or, preferably, unmarshal cm.Data[configKey] and
DesiredEPPConfig and compare the relevant fields such as scorer weights) so you
only skip writing when the configurations truly match; update the logic around
configKey and DesiredEPPConfig accordingly.
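The semantic-comparison option can be sketched as follows (illustrative only; real EPP configs are YAML, which would need a YAML decoder rather than `encoding/json`):

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// configsEqual reports whether two JSON documents are semantically equal,
// ignoring key order and whitespace, so the update is skipped only when the
// stored config truly matches the desired one.
func configsEqual(a, b string) bool {
	var va, vb any
	if json.Unmarshal([]byte(a), &va) != nil {
		return false
	}
	if json.Unmarshal([]byte(b), &vb) != nil {
		return false
	}
	return reflect.DeepEqual(va, vb)
}

func main() {
	fmt.Println(configsEqual(
		`{"weights":{"queue":2,"kv-cache":2,"prefix-cache":3}}`,
		`{ "weights": { "prefix-cache": 3, "kv-cache": 2, "queue": 2 } }`,
	))
}
```

Exact string equality (`cm.Data[configKey] == DesiredEPPConfig`) is the simpler fix; the structural comparison additionally tolerates formatting differences.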
```go
curlCmd := fmt.Sprintf(
	`echo "=== Connectivity check to %s ===" && `+
		`echo "Attempting request with model=%s ..." && `+
		`HTTP_CODE=$(curl -v -sS -o /tmp/body.txt -w "%%{http_code}" -m 60 -X POST "%s/v1/completions" `+
		`-H "Content-Type: application/json" `+
		`-d '{"model":"%s","prompt":"Hello","max_tokens":5}' 2>/tmp/curl_debug.txt) && `+
		`echo "HTTP_CODE=$HTTP_CODE" && echo "--- Response Body ---" && cat /tmp/body.txt && echo "" && `+
		`echo "--- Curl Debug (last 20 lines) ---" && tail -20 /tmp/curl_debug.txt && `+
		`if [ "$HTTP_CODE" -ge 200 ] && [ "$HTTP_CODE" -lt 300 ]; then echo "OK"; else echo "FAIL: HTTP $HTTP_CODE"; exit 1; fi`,
	gatewayURL, modelID, gatewayURL, modelID,
)
```
Shell command injection via unsanitized gatewayURL and modelID (CWE-78).
User-controlled gatewayURL and modelID are interpolated directly into a shell command string via fmt.Sprintf. If either contains shell metacharacters (e.g., "; malicious_cmd;"), arbitrary commands execute in the Job container.
While this is test code and inputs likely come from trusted sources, the pattern is dangerous if reused. Sanitize inputs or pass them as environment variables.
Proposed fix: use environment variables instead of interpolation
```diff
-curlCmd := fmt.Sprintf(
-	`echo "=== Connectivity check to %s ===" && `+
-	// ... interpolated command
-	gatewayURL, modelID, gatewayURL, modelID,
-)
+curlCmd := `echo "=== Connectivity check to $GATEWAY_URL ===" && ` +
+	`echo "Attempting request with model=$MODEL_ID ..." && ` +
+	`HTTP_CODE=$(curl -v -sS -o /tmp/body.txt -w "%{http_code}" -m 60 -X POST "$GATEWAY_URL/v1/completions" ` +
+	`-H "Content-Type: application/json" ` +
+	`-d "{\"model\":\"$MODEL_ID\",\"prompt\":\"Hello\",\"max_tokens\":5}" 2>/tmp/curl_debug.txt) && ` +
+	// ... rest of command using $GATEWAY_URL and $MODEL_ID
```

Then add environment variables to the container spec:
Env: []corev1.EnvVar{
{Name: "GATEWAY_URL", Value: gatewayURL},
{Name: "MODEL_ID", Value: modelID},
},🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/benchmark/job_waiter.go` around lines 82 - 92: the curlCmd currently
injects gatewayURL and modelID directly into a shell string (via curlCmd), which
risks shell injection; instead, stop interpolating those values and pass them
into the Job container as environment variables (e.g., GATEWAY_URL and MODEL_ID)
and update curlCmd to reference those env vars (e.g., "$GATEWAY_URL" and
"$MODEL_ID") or use a here-doc that reads from env, ensuring gatewayURL and
modelID are set in the container spec Env slice (Env:
[]corev1.EnvVar{{Name:"GATEWAY_URL",Value:gatewayURL},{Name:"MODEL_ID",Value:modelID}})
and remove direct fmt.Sprintf interpolation of gatewayURL/modelID in the curlCmd
variable.
```go
By("Patching EPP ConfigMap with flowControl + scorer weights (queue=2, kv-cache=2, prefix-cache=3)")
eppDeployName, findErr := FindEPPDeployment(ctx, k8sClient, benchCfg.LLMDNamespace)
Expect(findErr).NotTo(HaveOccurred(), "Failed to find EPP deployment")
patchErr := PatchEPPConfigMap(ctx, k8sClient, benchCfg.LLMDNamespace, eppDeployName)
if patchErr != nil {
	GinkgoWriter.Printf("WARNING: EPP ConfigMap patch failed (non-fatal): %v\n", patchErr)
} else {
	GinkgoWriter.Println("EPP ConfigMap patched successfully — flowControl enabled, weights 2/2/3")
	eppPatched = true
}
```
Fail fast when the EPP config patch does not apply.
This benchmark is explicitly measuring the flow-control path, but patch failures are only logged as warnings. That lets the run continue and publish artifacts as a "prefill-heavy" result even when the required EPP policy was never enabled.
Suggested fix
```diff
 patchErr := PatchEPPConfigMap(ctx, k8sClient, benchCfg.LLMDNamespace, eppDeployName)
-if patchErr != nil {
-	GinkgoWriter.Printf("WARNING: EPP ConfigMap patch failed (non-fatal): %v\n", patchErr)
-} else {
-	GinkgoWriter.Println("EPP ConfigMap patched successfully — flowControl enabled, weights 2/2/3")
-	eppPatched = true
-}
+Expect(patchErr).NotTo(HaveOccurred(), "Prefill-heavy benchmark requires the EPP flowControl config")
+GinkgoWriter.Println("EPP ConfigMap patched successfully — flowControl enabled, weights 2/2/3")
+eppPatched = true
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/benchmark/prefill_heavy_benchmark_test.go` around lines 390 - 399: the
test currently treats PatchEPPConfigMap failures as non-fatal warnings, which
allows runs to continue without the required EPP policy; change this to fail
fast by making the patch error fatal: replace the warning branch so that when
PatchEPPConfigMap(ctx, k8sClient, benchCfg.LLMDNamespace, eppDeployName) returns
an error you call Gomega's Expect/Fail (e.g.,
Expect(patchErr).NotTo(HaveOccurred()) or Fail with a clear message) so the test
stops immediately; keep the success branch setting eppPatched = true and the
success log.
```go
By("Checking external metrics API (best-effort, non-blocking)")
externalMetricsOK := false
Eventually(func() bool {
	result, err := k8sClient.RESTClient().
		Get().
		AbsPath("/apis/external.metrics.k8s.io/v1beta1/namespaces/" + benchCfg.LLMDNamespace + "/wva_desired_replicas").
		DoRaw(ctx)
	g.Expect(err).NotTo(HaveOccurred(), "External metrics API should be accessible")
	g.Expect(string(result)).To(ContainSubstring("wva_desired_replicas"), "Metric should be available")
	g.Expect(string(result)).To(ContainSubstring(res.VAName), "Metric should reference the benchmark VA")
	GinkgoWriter.Printf("External metrics API confirmed: wva_desired_replicas available for %s\n", res.VAName)
}, 5*time.Minute, 10*time.Second).Should(Succeed())
	if err != nil {
		GinkgoWriter.Printf(" External metrics API check: %v\n", err)
		return false
	}
	s := string(result)
	if strings.Contains(s, "wva_desired_replicas") && strings.Contains(s, res.VAName) {
		GinkgoWriter.Printf("External metrics API confirmed: wva_desired_replicas available for %s\n", res.VAName)
		externalMetricsOK = true
		return true
	}
	return false
}, 3*time.Minute, 10*time.Second).Should(Or(BeTrue(), Not(BeTrue())))
if !externalMetricsOK {
	GinkgoWriter.Println("WARNING: External metrics API not available — HPA may not scale. Proceeding anyway.")
}
```

```go
By("Waiting for Prometheus to scrape simulator metrics")
Eventually(func(g Gomega) {
By("Checking Prometheus metrics (best-effort, non-blocking)")
promOK := false
Eventually(func() bool {
	_, err := promClient.QueryWithRetry(ctx, `vllm:kv_cache_usage_perc`)
	g.Expect(err).NotTo(HaveOccurred(), "Prometheus should have KV cache metrics from simulator")
	GinkgoWriter.Println("Prometheus confirmed: vllm:kv_cache_usage_perc is available")
}, 5*time.Minute, 15*time.Second).Should(Succeed())
	if err == nil {
		GinkgoWriter.Println("Prometheus confirmed: vllm:kv_cache_usage_perc is available")
		promOK = true
		return true
	}
	return false
}, 2*time.Minute, 15*time.Second).Should(Or(BeTrue(), Not(BeTrue())))
if !promOK {
	GinkgoWriter.Println("WARNING: Prometheus vLLM metrics not yet available — KV cache data may be incomplete.")
}
```
These Eventually checks never actually wait.
Or(BeTrue(), Not(BeTrue())) matches both boolean outcomes, so each Eventually passes on the first poll. That removes the intended warm-up window and can start the benchmark before either metrics pipeline is ready. Use a real polling helper that returns a boolean after the timeout instead of a tautological matcher.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/benchmark/setup_test.go` around lines 114 - 150: the Eventually calls
use the tautological matcher Or(BeTrue(), Not(BeTrue())), so they succeed
immediately; change those to a real wait by replacing Or(BeTrue(),
Not(BeTrue())) with BeTrue() so Eventually actually polls until the metric check
returns true or the timeout expires. Specifically update the two Eventually
invocations that run the anonymous funcs which set externalMetricsOK and promOK
(the k8sClient.RESTClient() check for "wva_desired_replicas" and
promClient.QueryWithRetry for `vllm:kv_cache_usage_perc`) to use
.Should(BeTrue()) and keep the existing post-Eventually warnings that examine
externalMetricsOK/promOK.
```diff
 expectedVersion := os.Getenv("K8S_EXPECTED_VERSION")
 if !clusterExists {
-	scriptCmd := exec.Command("bash", "deploy/kind-emulator/setup.sh", "-g", fmt.Sprintf("%d", maxGPUs), "-t", gpuType, "K8S_VERSION="+expectedVersion)
+	scriptCmd := exec.Command("bash", "deploy/kind-emulator/setup.sh", "-g", strconv.Itoa(maxGPUs), "-t", gpuType, "K8S_VERSION="+expectedVersion)
```

🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
rg -n -C2 'K8S_VERSION|CheckIfClusterExistsOrCreate|exec\.Command\("bash", "deploy/kind-emulator/setup\.sh"' test/utils/e2eutils.go deploy/kind-emulator/setup.sh
```

🏁 Script executed:

```shell
head -30 test/utils/e2eutils.go | grep -E "^import|^package"
```

🏁 Script executed:

```shell
sed -n '1,40p' test/utils/e2eutils.go
```
Pass K8S_VERSION as environment variable, not positional argument.
Line 316 passes "K8S_VERSION="+expectedVersion as a positional argument to bash. The exec.Command function treats all trailing strings as argv entries, not environment variables. The setup.sh script expects K8S_VERSION via environment variable (line 23: k8s_version="${K8S_VERSION:-$DEFAULT_K8S_VERSION}"). As written, the cluster will always use DEFAULT_K8S_VERSION regardless of expectedVersion.
Fix

```diff
- scriptCmd := exec.Command("bash", "deploy/kind-emulator/setup.sh", "-g", strconv.Itoa(maxGPUs), "-t", gpuType, "K8S_VERSION="+expectedVersion)
+ scriptCmd := exec.Command("bash", "deploy/kind-emulator/setup.sh", "-g", strconv.Itoa(maxGPUs), "-t", gpuType)
+ scriptCmd.Env = append(os.Environ(), "K8S_VERSION="+expectedVersion)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/utils/e2eutils.go` at line 316: the exec.Command call that creates
scriptCmd passes "K8S_VERSION="+expectedVersion as a positional argv entry instead
of an environment variable; update the code around the scriptCmd creation so the
command arguments are only "bash", "deploy/kind-emulator/setup.sh", "-g",
strconv.Itoa(maxGPUs), "-t", gpuType (remove the "K8S_VERSION=..." arg) and set
the environment on scriptCmd (e.g., scriptCmd.Env = append(os.Environ(),
"K8S_VERSION="+expectedVersion)) so the setup.sh reads K8S_VERSION via the
environment as intended.
Syncs llm-d/llm-d-workload-variant-autoscaler main branch into ODH main branch.
Upstream commit: llm-d@8be94cd
Generated by #41
Upstream commits included:
Summary by CodeRabbit
New Features
Documentation
Improvements