Skip to content

[sync] upstream llm-d/llm-d-router f7e88524 [2026-06-22]#261

Open
zdtsw wants to merge 10 commits into
opendatahub-io:mainfrom
zdtsw-forking:sync/upstream-f7e88524
Open

[sync] upstream llm-d/llm-d-router f7e88524 [2026-06-22]#261
zdtsw wants to merge 10 commits into
opendatahub-io:mainfrom
zdtsw-forking:sync/upstream-f7e88524

Conversation

@zdtsw

@zdtsw zdtsw commented Jun 22, 2026

Copy link
Copy Markdown
Member

Syncs llm-d/llm-d-router main into opendatahub-io/llm-d-router main.

Upstream commit: llm-d@f7e8852

Summary by CodeRabbit

  • New Features

    • Added OpenTelemetry tracing instrumentation to outbound proxy HTTP requests for improved observability.
  • Documentation

    • New container sizing guide for router deployment with CPU/memory scaling recommendations.
    • Updated architecture documentation with additional resource references.
  • Chores

    • Updated llm-d-kv-cache dependency to stable release.
    • Refactored E2E testing infrastructure and router validation test suite.
    • Simplified Helm chart configuration for standalone deployments.

gyliu513 and others added 9 commits June 18, 2026 15:44
…-d#1616)

add recommendation for agentic inference requests for CPU and memory.
add recommendation for resource consumption for prefix block size
add section on envoy proxy resource usage.

Signed-off-by: Jacob Murry <jacobmurry@google.com>
Update github.com/llm-d/llm-d-kv-cache from v0.9.0-rc.1 to the
released v0.9.0.

Signed-off-by: Maroon Ayoub <mayoub@redhat.com>
…m-d#1696)

The GA InferencePool API (inference.networking.k8s.io/v1) nests pod
labels under spec.selector.matchLabels, while the deprecated alpha API
(inference.networking.x-k8s.io/v1alpha2) uses a flat spec.selector map.

The allowlist validator read spec.selector unconditionally and
fmt.Sprintf'd nested map values into label strings, producing invalid
selectors like "map[app.kubernetes.io/name:...]" that Kubernetes
rejected for exceeding the 63-character label value limit.

Branch the selector path based on the detected API group and use
NestedStringMap for type-safe extraction.

Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>
…is disabled (llm-d#1695)

When router.inferencePool.create=false the EPP --endpoint-selector is
rendered solely from router.modelServers.matchLabels. If matchLabels is
omitted the selector renders empty and EPP rejects it at startup, because
standalone mode requires either --pool-name or a non-empty
--endpoint-selector.

The existing matchLabels validation lives in _inferencepool.yaml, which
only renders when an InferencePool is created, so the create=false path
had no guard and produced a broken EPP deployment silently.

Add the check to the standalone validations so helm template fails fast
with an actionable message, and add a negative test case to
hack/verify-helm.sh covering the reproducer from the issue.

Fixes llm-d#1665

Signed-off-by: Jonathan Wrede <wrede.jonathan00@gmail.com>
* test: consolidate and rename e2e runner scripts

Signed-off-by: roytman <roytman@il.ibm.com>

* fix comments

Signed-off-by: roytman <roytman@il.ibm.com>

* Update test/scripts/test-e2e-router.sh

Co-authored-by: Shmuel Kallner <kallner@il.ibm.com>
Signed-off-by: Alexey Roytman <roytman@il.ibm.com>

* Update test/scripts/e2e-common.sh

Co-authored-by: Shmuel Kallner <kallner@il.ibm.com>
Signed-off-by: Alexey Roytman <roytman@il.ibm.com>

---------

Signed-off-by: roytman <roytman@il.ibm.com>
Signed-off-by: Alexey Roytman <roytman@il.ibm.com>
Co-authored-by: Shmuel Kallner <kallner@il.ibm.com>
… PR 2905 (llm-d#1600)

* fix(standalone): rework agentgateway configuration to use derived pseudo-service

Signed-off-by: abhay1999 <abhaychaurasiya19@gmail.com>

* chore: trigger ci re-run

Signed-off-by: abhay1999 <abhaychaurasiya19@gmail.com>

* fix(standalone): update agentgateway validation error and add verification assertions

Signed-off-by: abhay1999 <abhaychaurasiya19@gmail.com>

---------

Signed-off-by: abhay1999 <abhaychaurasiya19@gmail.com>
Signed-off-by: weizhoublue <weizhou.lan@daocloud.io>
…ting (llm-d#1683)

Signed-off-by: weizhoublue <weizhou.lan@daocloud.io>
@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

This PR replaces the scheduler-based e2e test infrastructure with a router-based one: a shared e2e-common.sh helper is introduced, a new test-e2e-router.sh script is added, the test-e2e-gaie.sh script is refactored to use the shared helpers, and Makefile targets plus CI workflow are updated accordingly. The agentgateway Helm configuration is refactored to derive backend service names and ports from modelServers labels rather than explicit agentgateway.service.* values, removing the agentgateway Service template and tightening validations. The sidecar proxy's outbound HTTP transport is wrapped with otelhttp.NewTransport for W3C trace context propagation. AllowlistValidator.poolSelector is introduced to handle both GA (spec.selector.matchLabels) and deprecated alpha (spec.selector flat map) InferencePool API shapes. Supporting changes include a new docs/operations.md sizing guide, a go.mod bump to stable v0.9.0, a yq path fix in push-chart.sh, and an OWNED_IMAGE_RE regex fix.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 8 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
No Injection Vectors ⚠️ Warning CWE-78 shell command injection in hack/verify-helm.sh: Multiple eval statements execute dynamically constructed commands with the HELM environment variable, which can contain arbitrary shell code i... Replace eval statements with bash arrays and direct execution: cmd=("${HELM}" template ...); "${cmd[@]}" to eliminate shell metacharacter expansion.
Title check ❓ Inconclusive The PR title uses a generic sync/upstream marker format that does not convey the substantive changes included in this pull request. Replace with a title describing the main functional changes, such as 'Add router e2e tests, fix InferencePool selector parsing, and inject W3C trace context' or similar.
✅ Passed checks (8 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Contribution Quality And Spam Detection ✅ Passed Legitimate upstream sync: coherent refactorings across config/code/tests; real security fixes (CWE-284, CWE-78, CWE-706, CWE-20/918) with test coverage; zero spam/AI signals detected.
No Hardcoded Secrets ✅ Passed No hardcoded secrets, API keys, tokens, passwords, or embedded credentials detected across all modified files; Kubernetes secrets properly referenced.
No Weak Cryptography ✅ Passed No banned cryptographic primitives (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time secret comparisons found. TLS uses strong ECDHE ciphers with TLS 1...
No Privileged Containers ✅ Passed PR contains only justified privileged configurations: Istio control plane sidecar proxy debugged conditionally on IPTABLES_TRACE_LOGGING flag, and production containers run as non-root (USER 65532).
No Sensitive Data In Logs ✅ Passed No logging statements expose passwords, tokens, API keys, PII, session IDs, bearer tokens, or customer data. Structured logging of pod names/IPs and infrastructure identifiers is acceptable for int...

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/sidecar/proxy/proxy.go (1)

404-433: ⚠️ Potential issue | 🔴 Critical

Harden TLS verification in decoder proxy transport (CWE-295).

Line 421 directly passes user-configurable insecureSkipVerify to tls.Config.InsecureSkipVerify. Operators can enable this via --tls-insecure-skip-verify=decoder or YAML config, disabling certificate validation on decoder (vLLM) HTTPS traffic. Enables MITM attacks: prompt tampering, response injection, and data theft.

Remediation
 func (s *Server) newProxyTransport(scheme string, insecureSkipVerify bool) http.RoundTripper {
 	if scheme == schemeHTTPS {
+		// Certificate verification is mandatory. Operators must use proper CA roots.
 		t.TLSClientConfig = &tls.Config{
-			InsecureSkipVerify: insecureSkipVerify, //nolint:gosec
+			InsecureSkipVerify: false,
 			MinVersion:         tls.VersionTLS12,

Per **/*.go security guideline: "No InsecureSkipVerify in TLS configs (enables MITM attacks)."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/sidecar/proxy/proxy.go` around lines 404 - 433, The newProxyTransport
method directly passes the user-configurable insecureSkipVerify parameter to
tls.Config.InsecureSkipVerify, which violates security guidelines and enables
MITM attacks. Remove the insecureSkipVerify parameter from the newProxyTransport
function signature and ensure that InsecureSkipVerify in the tls.Config is
always set to false (or omitted entirely, since false is the default). This
prevents operators from disabling certificate validation on decoder HTTPS
traffic regardless of configuration options like --tls-insecure-skip-verify.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@config/charts/llm-d-router-standalone/templates/_validations.tpl`:
- Around line 21-27: The current validation in the _validations.tpl template
only checks if matchLabels exists, but does not validate the specific
matchLabels.app field which is required for agentgateway backend naming. When
router.proxy.proxyType is set to agentgateway and router.inferencePool.create is
false (standalone mode), add an additional validation check after the existing
matchLabels check to ensure that matchLabels.app is present and valid. This
validation should fail with a clear error message if matchLabels.app is missing
or empty when using agentgateway in standalone mode, preventing silent routing
errors at runtime.

In `@docs/operations.md`:
- Around line 111-136: The Helm configuration example in the documentation
includes a router.epp.resources section that does not exist in the actual chart
schema. Remove the entire EPP container resources block (lines showing epp: with
its nested resources, requests for cpu and memory, and limits) and retain only
the valid router.proxy.resources configuration block. This will ensure users
copying the example get valid YAML that matches the actual chart structure which
only supports router.proxy resources configuration.

In `@hack/verify-helm.sh`:
- Around line 151-186: The code uses `eval` to execute dynamically constructed
command strings, creating a command injection vulnerability. Replace each
command variable (missing_endpoint_selector_command, invalid_proxy_command,
deprecated_agentgateway_service_command,
unsupported_agentgateway_llm_d_router_gateway_command,
unsupported_agentgateway_listener_port_command,
mismatched_agentgateway_listener_target_port_command) with bash arrays instead
of strings, and execute them using the array syntax `if "${cmd_array[@]}"
>/dev/null; then` instead of `if eval "${cmd_string}"; then`. This eliminates
the security risk while preserving the same command execution behavior.

In `@pkg/sidecar/proxy/allowlist_test.go`:
- Around line 46-177: The poolSelector Context is missing regression tests for
empty selector fail-closed behavior. Add two new test cases within the
poolSelector Context: one that verifies the poolSelector method on an
AllowlistValidator returns an error when the GA InferencePool has an empty
matchLabels map in its spec.selector field, and another that verifies it returns
an error when the alpha InferencePool has an empty selector map in its spec
field. Both test cases should follow the existing error case pattern (using the
same pool setup structure and Expect assertions for HaveOccurred and error
message validation) to ensure match-all selector regressions are prevented.

In `@pkg/sidecar/proxy/allowlist.go`:
- Around line 253-257: The early return at lines 253-257 when av.poolSelector
fails bypasses the informer replacement and stop path that occurs at lines
288-295, creating a security vulnerability where a malformed pool update can
leave the previous selector active and keep stale targets authorized. Instead of
returning immediately on selector extraction error, execute the same informer
teardown logic used in the informer replacement path (including stopping and
clearing the informer) plus recompute the target state to ensure the system
fails closed rather than leaving stale authorization rules in place.
- Around line 275-280: The code currently accepts empty selectorData which
results in an empty label selector that matches all resources in a namespace,
creating a security vulnerability. After successfully retrieving selectorData
from the spec using unstructured.NestedStringMap (when the found check passes),
add a validation to reject empty selectorData maps by checking if
len(selectorData) == 0 and returning an appropriate error message before the
return statement that calls labels.Set(selectorData).AsSelector(). This ensures
the allowlist scope is properly enforced and prevents namespace-wide resource
access.

In `@pkg/sidecar/proxy/transport_test.go`:
- Around line 34-35: The test mutates global OTel propagator state via
otel.SetTextMapPropagator without restoring it, causing order-dependent failures
in other tests. Save the original propagator returned by
otel.GetTextMapPropagator before calling otel.SetTextMapPropagator, then restore
it in t.Cleanup() following the pattern already established in
pkg/common/observability/tracing/telemetry_test.go. This ensures the global
state is properly reset after the test completes.

In `@test/scripts/test-e2e-gaie.sh`:
- Around line 24-26: The SIM_IMAGE variable assignment uses the wrong
environment variable reference. On the line where SIM_IMAGE is being set with a
default value, it currently references ${VLLM_IMAGE:-...} as the fallback, but
it should reference ${SIM_IMAGE:-...} instead. Change the variable name in the
parameter expansion from VLLM_IMAGE to SIM_IMAGE so that environment variable
overrides for SIM_IMAGE are properly honored.

In `@test/scripts/test-e2e-router.sh`:
- Around line 24-30: The trap handler for INT and TERM signals hard-codes the
cluster name "e2e-tests" and unconditionally calls e2e_handle_interrupt, which
can delete an unrelated cluster with the same name if the script is running
against an existing K8S_CONTEXT. Gate the cleanup to only occur when the script
actually created the cluster by conditionally setting the trap or passing a flag
to e2e_handle_interrupt that indicates ownership. The cleanup should only be
active when K8S_CONTEXT is unset (meaning the script created the cluster), not
when using an existing context.

---

Outside diff comments:
In `@pkg/sidecar/proxy/proxy.go`:
- Around line 404-433: The newProxyTransport method directly passes the
user-configurable insecureSkipVerify parameter to tls.Config.InsecureSkipVerify,
which violates security guidelines and enables MITM attacks. Remove the
insecureSkipVerify parameter from the newProxyTransport function signature and
ensure that InsecureSkipVerify in the tls.Config is always set to false (or
omitted entirely, since false is the default). This prevents operators from
disabling certificate validation on decoder HTTPS traffic regardless of
configuration options like --tls-insecure-skip-verify.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 1241f0e4-d5b4-4890-a93b-743af2c3acdf

📥 Commits

Reviewing files that changed from the base of the PR and between bb5f430 and c085bd7.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum, !**/*.sum
📒 Files selected for processing (22)
  • .github/workflows/ci-pr-checks.yaml
  • Makefile
  • README.md
  • config/charts/README.md
  • config/charts/llm-d-router-standalone/templates/_validations.tpl
  • config/charts/llm-d-router-standalone/templates/agentgateway-service.yaml
  • config/charts/llm-d-router-standalone/values.yaml
  • config/charts/routerlib/templates/_helpers.tpl
  • docs/architecture.md
  • docs/operations.md
  • go.mod
  • hack/push-chart.sh
  • hack/verify-helm.sh
  • pkg/sidecar/proxy/allowlist.go
  • pkg/sidecar/proxy/allowlist_test.go
  • pkg/sidecar/proxy/proxy.go
  • pkg/sidecar/proxy/transport_test.go
  • scripts/check-latest-tags.sh
  • test/scripts/e2e-common.sh
  • test/scripts/run_e2e.sh
  • test/scripts/test-e2e-gaie.sh
  • test/scripts/test-e2e-router.sh
💤 Files with no reviewable changes (2)
  • config/charts/llm-d-router-standalone/templates/agentgateway-service.yaml
  • test/scripts/run_e2e.sh

Comment on lines +21 to +27
{{- /* Without an InferencePool the EPP --endpoint-selector is rendered from modelServers.matchLabels; an empty selector is rejected by EPP at startup, so require it here. */ -}}
{{- $useInferencePool := ne .Values.router.inferencePool.create false -}}
{{- if not $useInferencePool -}}
{{- if or (empty .Values.router.modelServers) (not .Values.router.modelServers.matchLabels) -}}
{{- fail ".Values.router.modelServers.matchLabels is required when .Values.router.inferencePool.create=false: standalone mode renders the EPP --endpoint-selector from matchLabels and cannot start with an empty selector" -}}
{{- end -}}
{{- end -}}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Require and validate modelServers.matchLabels.app for agentgateway backend naming (CWE-20).

Line 21-Line 27 only enforces that matchLabels exists. In this PR, agentgateway backend identity is derived from matchLabels.app; if app is missing/invalid, routing can silently target the wrong backend name at runtime. Add a hard fail for missing/invalid matchLabels.app when router.proxy.proxyType=agentgateway and router.inferencePool.create=false.

Proposed validation hardening
 {{- if not $useInferencePool -}}
   {{- if or (empty .Values.router.modelServers) (not .Values.router.modelServers.matchLabels) -}}
     {{- fail ".Values.router.modelServers.matchLabels is required when .Values.router.inferencePool.create=false: standalone mode renders the EPP --endpoint-selector from matchLabels and cannot start with an empty selector" -}}
   {{- end -}}
+  {{- if eq ($proxy.proxyType | default "envoy") "agentgateway" -}}
+    {{- $appLabel := index (.Values.router.modelServers.matchLabels | default dict) "app" | default "" -}}
+    {{- if empty $appLabel -}}
+      {{- fail ".Values.router.modelServers.matchLabels.app is required when proxyType=agentgateway and inferencePool.create=false" -}}
+    {{- end -}}
+    {{- if or (gt (len $appLabel) 63) (not (regexMatch "^[a-z0-9]([-a-z0-9]*[a-z0-9])?$" $appLabel)) -}}
+      {{- fail ".Values.router.modelServers.matchLabels.app must be a DNS-1123 label for agentgateway backend naming" -}}
+    {{- end -}}
+  {{- end -}}
 {{- end -}}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@config/charts/llm-d-router-standalone/templates/_validations.tpl` around
lines 21 - 27, The current validation in the _validations.tpl template only
checks if matchLabels exists, but does not validate the specific matchLabels.app
field which is required for agentgateway backend naming. When
router.proxy.proxyType is set to agentgateway and router.inferencePool.create is
false (standalone mode), add an additional validation check after the existing
matchLabels check to ensure that matchLabels.app is present and valid. This
validation should fail with a clear error message if matchLabels.app is missing
or empty when using agentgateway in standalone mode, preventing silent routing
errors at runtime.

Comment thread docs/operations.md
Comment on lines +111 to +136
```yaml
router:
# Endpoint Picker (EPP) Container Resources
epp:
resources:
requests:
cpu: "32"
memory: "64Gi"
limits:
memory: "128Gi"

# Envoy Proxy Container Resources
proxy:
resources:
requests:
cpu: "8"
memory: "2Gi"
limits:
memory: "4Gi"
```

To apply these values during deployment, run the Helm install or upgrade command with your custom values file:

```bash
helm install optimize-baseline ./config/charts/llm-d-router-standalone -f resource_overrides.yaml
```

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify Helm values.yaml schema contains router.epp.resources and router.proxy.resources keys.

# Check for epp resources path
rg -n 'router:' config/charts/llm-d-router-standalone/values.yaml -A 30 | grep -E '(epp:|proxy:|resources:)'

# Confirm exact nesting structure
yq '.router.epp.resources' config/charts/llm-d-router-standalone/values.yaml 2>/dev/null || echo "Path does not exist"
yq '.router.proxy.resources' config/charts/llm-d-router-standalone/values.yaml 2>/dev/null || echo "Path does not exist"

Repository: opendatahub-io/llm-d-router

Length of output: 181


🏁 Script executed:

cat -n config/charts/llm-d-router-standalone/values.yaml | head -60

Repository: opendatahub-io/llm-d-router

Length of output: 2190


🏁 Script executed:

# Search for 'epp:' in the values.yaml file
rg -n 'epp:' config/charts/llm-d-router-standalone/values.yaml

Repository: opendatahub-io/llm-d-router

Length of output: 53


🏁 Script executed:

# Get the complete router section structure
rg -n '^  [a-z]' config/charts/llm-d-router-standalone/values.yaml | head -20

Repository: opendatahub-io/llm-d-router

Length of output: 124


🏁 Script executed:

# Get full values.yaml to understand all top-level keys under router
cat config/charts/llm-d-router-standalone/values.yaml | grep -E '^  [a-z]' | sort -u

Repository: opendatahub-io/llm-d-router

Length of output: 117


Remove invalid router.epp.resources configuration from Helm example; retain only valid router.proxy.resources.

The chart values.yaml does not contain a router.epp key. The documented configuration at lines 111–136 references router.epp.resources which does not exist in the chart schema. Users copying this example will have invalid YAML that fails deployment. The router.proxy.resources configuration is valid; remove the non-existent EPP block.

Actual chart structure
router:
  modelServers:
  extraServicePorts:
  proxy:
    resources:          # <-- only this path exists
      requests:
        cpu: "4"
        memory: 8Gi
      limits:
        memory: 16Gi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/operations.md` around lines 111 - 136, The Helm configuration example in
the documentation includes a router.epp.resources section that does not exist in
the actual chart schema. Remove the entire EPP container resources block (lines
showing epp: with its nested resources, requests for cpu and memory, and limits)
and retain only the valid router.proxy.resources configuration block. This will
ensure users copying the example get valid YAML that matches the actual chart
structure which only supports router.proxy resources configuration.

Comment thread hack/verify-helm.sh
Comment on lines +151 to +186
missing_endpoint_selector_command="${HELM} template ${SCRIPT_ROOT}/config/charts/llm-d-router-standalone --set router.inferencePool.create=false --set router.modelServers.type=vllm --set 'router.modelServers.targetPorts[0].number=8000' >/dev/null"
echo "Executing: ${missing_endpoint_selector_command}"
if eval "${missing_endpoint_selector_command}"; then
echo "Helm template unexpectedly succeeded for inferencePool.create=false without modelServers.matchLabels"
exit 1
fi

invalid_proxy_command="${HELM} template ${SCRIPT_ROOT}/config/charts/llm-d-router-standalone --set router.modelServers.matchLabels.app=llm-instance-gateway --set router.inferencePool.create=false --set router.proxy.proxyType=bogus >/dev/null"
echo "Executing: ${invalid_proxy_command}"
if eval "${invalid_proxy_command}"; then
echo "Helm template unexpectedly succeeded for invalid proxyType"
exit 1
fi

missing_agentgateway_service_command="${HELM} template ${SCRIPT_ROOT}/config/charts/llm-d-router-standalone --set router.modelServers.matchLabels.app=llm-instance-gateway --set router.inferencePool.create=false --set router.proxy.proxyType=agentgateway >/dev/null"
echo "Executing: ${missing_agentgateway_service_command}"
if eval "${missing_agentgateway_service_command}"; then
echo "Helm template unexpectedly succeeded for missing agentgateway service.name"
deprecated_agentgateway_service_command="${HELM} template ${SCRIPT_ROOT}/config/charts/llm-d-router-standalone --set router.modelServers.matchLabels.app=llm-instance-gateway --set router.inferencePool.create=false --set router.proxy.proxyType=agentgateway --set router.proxy.agentgateway.service.name=foo >/dev/null"
echo "Executing: ${deprecated_agentgateway_service_command}"
if eval "${deprecated_agentgateway_service_command}"; then
echo "Helm template unexpectedly succeeded for deprecated agentgateway.service configuration"
exit 1
fi

unsupported_agentgateway_llm_d_router_gateway_command="${HELM} template ${SCRIPT_ROOT}/config/charts/llm-d-router-standalone --set router.proxy.proxyType=agentgateway --set router.proxy.agentgateway.service.name=llm-instance-gateway --set 'router.proxy.agentgateway.service.ports[0]=8000' --set router.inferencePool.create=true --set router.modelServers.matchLabels.app=llm-instance-gateway >/dev/null"
unsupported_agentgateway_llm_d_router_gateway_command="${HELM} template ${SCRIPT_ROOT}/config/charts/llm-d-router-standalone --set router.proxy.proxyType=agentgateway --set router.inferencePool.create=true --set router.modelServers.matchLabels.app=llm-instance-gateway >/dev/null"
echo "Executing: ${unsupported_agentgateway_llm_d_router_gateway_command}"
if eval "${unsupported_agentgateway_llm_d_router_gateway_command}"; then
echo "Helm template unexpectedly succeeded for unsupported agentgateway createInferencePool=true configuration"
exit 1
fi

mismatched_agentgateway_ports_command="${HELM} template ${SCRIPT_ROOT}/config/charts/llm-d-router-standalone --set router.proxy.proxyType=agentgateway --set router.proxy.agentgateway.service.name=llm-instance-gateway --set 'router.proxy.agentgateway.service.ports[0]=8001' --set router.modelServers.matchLabels.app=llm-instance-gateway --set router.inferencePool.create=false --set 'router.modelServers.targetPorts[0].number=8000' >/dev/null"
echo "Executing: ${mismatched_agentgateway_ports_command}"
if eval "${mismatched_agentgateway_ports_command}"; then
echo "Helm template unexpectedly succeeded for mismatched agentgateway service.ports"
exit 1
fi

unsupported_agentgateway_listener_port_command="${HELM} template ${SCRIPT_ROOT}/config/charts/llm-d-router-standalone --set router.proxy.proxyType=agentgateway --set router.proxy.agentgateway.service.name=llm-instance-gateway --set 'router.proxy.agentgateway.service.ports[0]=8000' --set router.modelServers.matchLabels.app=llm-instance-gateway --set router.inferencePool.create=false --set 'router.modelServers.targetPorts[0].number=8000' --set 'router.extraServicePorts[0].name=proxy' --set 'router.extraServicePorts[0].port=9000' --set 'router.extraServicePorts[0].protocol=TCP' --set 'router.extraServicePorts[0].targetPort=9000' >/dev/null"
unsupported_agentgateway_listener_port_command="${HELM} template ${SCRIPT_ROOT}/config/charts/llm-d-router-standalone --set router.proxy.proxyType=agentgateway --set router.modelServers.matchLabels.app=llm-instance-gateway --set router.inferencePool.create=false --set 'router.modelServers.targetPorts[0].number=8000' --set 'router.extraServicePorts[0].name=proxy' --set 'router.extraServicePorts[0].port=9000' --set 'router.extraServicePorts[0].protocol=TCP' --set 'router.extraServicePorts[0].targetPort=9000' >/dev/null"
echo "Executing: ${unsupported_agentgateway_listener_port_command}"
if eval "${unsupported_agentgateway_listener_port_command}"; then
echo "Helm template unexpectedly succeeded without an agentgateway listener Service port named http"
exit 1
fi

mismatched_agentgateway_listener_target_port_command="${HELM} template ${SCRIPT_ROOT}/config/charts/llm-d-router-standalone --set router.proxy.proxyType=agentgateway --set router.proxy.agentgateway.service.name=llm-instance-gateway --set 'router.proxy.agentgateway.service.ports[0]=8000' --set router.modelServers.matchLabels.app=llm-instance-gateway --set router.inferencePool.create=false --set 'router.modelServers.targetPorts[0].number=8000' --set 'router.extraServicePorts[0].name=http' --set 'router.extraServicePorts[0].port=9000' --set 'router.extraServicePorts[0].protocol=TCP' --set 'router.extraServicePorts[0].targetPort=9001' >/dev/null"
mismatched_agentgateway_listener_target_port_command="${HELM} template ${SCRIPT_ROOT}/config/charts/llm-d-router-standalone --set router.proxy.proxyType=agentgateway --set router.modelServers.matchLabels.app=llm-instance-gateway --set router.inferencePool.create=false --set 'router.modelServers.targetPorts[0].number=8000' --set 'router.extraServicePorts[0].name=http' --set 'router.extraServicePorts[0].port=9000' --set 'router.extraServicePorts[0].protocol=TCP' --set 'router.extraServicePorts[0].targetPort=9001' >/dev/null"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove eval command execution in test invocations (CWE-78).

Line 153, Line 160, Line 167, Line 174, and Line 181 execute dynamically built strings with eval. That creates avoidable command-injection surface and quoting fragility in CI scripts. Execute commands via arrays and "$\{cmd[@]\}" instead.

Safer execution pattern (apply to each command block)
-missing_endpoint_selector_command="${HELM} template ${SCRIPT_ROOT}/config/charts/llm-d-router-standalone --set router.inferencePool.create=false --set router.modelServers.type=vllm --set 'router.modelServers.targetPorts[0].number=8000' >/dev/null"
-echo "Executing: ${missing_endpoint_selector_command}"
-if eval "${missing_endpoint_selector_command}"; then
+missing_endpoint_selector_cmd=(
+  "${HELM}" template "${SCRIPT_ROOT}/config/charts/llm-d-router-standalone"
+  --set router.inferencePool.create=false
+  --set router.modelServers.type=vllm
+  --set 'router.modelServers.targetPorts[0].number=8000'
+)
+echo "Executing: ${missing_endpoint_selector_cmd[*]}"
+if "${missing_endpoint_selector_cmd[@]}" >/dev/null; then
   echo "Helm template unexpectedly succeeded for inferencePool.create=false without modelServers.matchLabels"
   exit 1
 fi

As per coding guidelines, **/*.sh: “Avoid eval and dynamic command execution” and “Quote all variables to prevent injection.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hack/verify-helm.sh` around lines 151 - 186, The code uses `eval` to execute
dynamically constructed command strings, creating a command injection
vulnerability. Replace each command variable (missing_endpoint_selector_command,
invalid_proxy_command, deprecated_agentgateway_service_command,
unsupported_agentgateway_llm_d_router_gateway_command,
unsupported_agentgateway_listener_port_command,
mismatched_agentgateway_listener_target_port_command) with bash arrays instead
of strings, and execute them using the array syntax `if "${cmd_array[@]}"
>/dev/null; then` instead of `if eval "${cmd_string}"; then`. This eliminates
the security risk while preserving the same command execution behavior.

Source: Coding guidelines

Comment on lines +46 to +177
Context("poolSelector", func() {
It("should extract selector from GA InferencePool (matchLabels)", func() {
av := &AllowlistValidator{
gvr: schema.GroupVersionResource{
Group: routing.InferencePoolAPIGroup,
Version: "v1",
Resource: "inferencepools",
},
}
pool := &unstructured.Unstructured{
Object: map[string]interface{}{
"apiVersion": "inference.networking.k8s.io/v1",
"kind": "InferencePool",
"metadata": map[string]interface{}{"name": "test-pool"},
"spec": map[string]interface{}{
"selector": map[string]interface{}{
"matchLabels": map[string]interface{}{
"app.kubernetes.io/name": "my-model",
"component": "serving",
},
},
},
},
}

selector, err := av.poolSelector(pool)
Expect(err).ToNot(HaveOccurred())
Expect(selector.String()).To(SatisfyAll(
ContainSubstring("app.kubernetes.io/name=my-model"),
ContainSubstring("component=serving"),
))
})

It("should extract selector from deprecated alpha InferencePool (flat map)", func() {
av := &AllowlistValidator{
gvr: schema.GroupVersionResource{
Group: "inference.networking.x-k8s.io",
Version: "v1alpha2",
Resource: "inferencepools",
},
}
pool := &unstructured.Unstructured{
Object: map[string]interface{}{
"apiVersion": "inference.networking.x-k8s.io/v1alpha2",
"kind": "InferencePool",
"metadata": map[string]interface{}{"name": "test-pool"},
"spec": map[string]interface{}{
"selector": map[string]interface{}{
"app.kubernetes.io/name": "my-model",
"component": "serving",
},
},
},
}

selector, err := av.poolSelector(pool)
Expect(err).ToNot(HaveOccurred())
Expect(selector.String()).To(SatisfyAll(
ContainSubstring("app.kubernetes.io/name=my-model"),
ContainSubstring("component=serving"),
))
})

It("should fail for GA pool with flat selector (no matchLabels)", func() {
av := &AllowlistValidator{
gvr: schema.GroupVersionResource{
Group: routing.InferencePoolAPIGroup,
Version: "v1",
Resource: "inferencepools",
},
}
pool := &unstructured.Unstructured{
Object: map[string]interface{}{
"apiVersion": "inference.networking.k8s.io/v1",
"kind": "InferencePool",
"metadata": map[string]interface{}{"name": "test-pool"},
"spec": map[string]interface{}{
"selector": map[string]interface{}{
"app": "my-model",
},
},
},
}

_, err := av.poolSelector(pool)
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("matchLabels"))
})

It("should fail when spec is missing", func() {
av := &AllowlistValidator{
gvr: schema.GroupVersionResource{
Group: routing.InferencePoolAPIGroup,
Version: "v1",
Resource: "inferencepools",
},
}
pool := &unstructured.Unstructured{
Object: map[string]interface{}{
"apiVersion": "inference.networking.k8s.io/v1",
"kind": "InferencePool",
"metadata": map[string]interface{}{"name": "test-pool"},
},
}

_, err := av.poolSelector(pool)
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("spec"))
})

It("should fail when selector is missing", func() {
av := &AllowlistValidator{
gvr: schema.GroupVersionResource{
Group: "inference.networking.x-k8s.io",
Version: "v1alpha2",
Resource: "inferencepools",
},
}
pool := &unstructured.Unstructured{
Object: map[string]interface{}{
"apiVersion": "inference.networking.x-k8s.io/v1alpha2",
"kind": "InferencePool",
"metadata": map[string]interface{}{"name": "test-pool"},
"spec": map[string]interface{}{},
},
}

_, err := av.poolSelector(pool)
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("selector"))
})
})

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add regression tests for empty selector fail-closed behavior (CWE-284).

Current cases validate missing fields but not matchLabels: {} (GA) or selector: {} (alpha). Add explicit assertions that these return errors, so match-all selector regressions are blocked.

Suggested test additions
 Context("poolSelector", func() {
+	It("should fail for GA pool with empty matchLabels", func() {
+		av := &AllowlistValidator{
+			gvr: schema.GroupVersionResource{
+				Group:    routing.InferencePoolAPIGroup,
+				Version:  "v1",
+				Resource: "inferencepools",
+			},
+		}
+		pool := &unstructured.Unstructured{
+			Object: map[string]interface{}{
+				"apiVersion": "inference.networking.k8s.io/v1",
+				"kind":       "InferencePool",
+				"metadata":   map[string]interface{}{"name": "test-pool"},
+				"spec": map[string]interface{}{
+					"selector": map[string]interface{}{
+						"matchLabels": map[string]interface{}{},
+					},
+				},
+			},
+		}
+		_, err := av.poolSelector(pool)
+		Expect(err).To(HaveOccurred())
+	})
+
+	It("should fail for alpha pool with empty selector", func() {
+		av := &AllowlistValidator{
+			gvr: schema.GroupVersionResource{
+				Group:    "inference.networking.x-k8s.io",
+				Version:  "v1alpha2",
+				Resource: "inferencepools",
+			},
+		}
+		pool := &unstructured.Unstructured{
+			Object: map[string]interface{}{
+				"apiVersion": "inference.networking.x-k8s.io/v1alpha2",
+				"kind":       "InferencePool",
+				"metadata":   map[string]interface{}{"name": "test-pool"},
+				"spec": map[string]interface{}{
+					"selector": map[string]interface{}{},
+				},
+			},
+		}
+		_, err := av.poolSelector(pool)
+		Expect(err).To(HaveOccurred())
+	})

As per coding guidelines, "**: REVIEW PRIORITIES: Security vulnerabilities (provide severity, exploit scenario, and remediation code)."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/sidecar/proxy/allowlist_test.go` around lines 46 - 177, The poolSelector
Context is missing regression tests for empty selector fail-closed behavior. Add
two new test cases within the poolSelector Context: one that verifies the
poolSelector method on an AllowlistValidator returns an error when the GA
InferencePool has an empty matchLabels map in its spec.selector field, and
another that verifies it returns an error when the alpha InferencePool has an
empty selector map in its spec field. Both test cases should follow the existing
error case pattern (using the same pool setup structure and Expect assertions
for HaveOccurred and error message validation) to ensure match-all selector
regressions are prevented.

Source: Coding guidelines

Comment on lines +253 to 257
selector, err := av.poolSelector(poolObj)
if err != nil {
av.logger.Error(err, "failed to extract selector from InferencePool", "name", poolName)
return
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Fail-open informer state on selector parse errors (CWE-284).

Line 253-Line 257 returns before the informer replacement/stop path (Line 288-Line 295), so a malformed pool update can leave the previous selector active and keep stale targets authorized by SSRF checks.

Use the same teardown path used for informer replacement (plus target-state recomputation) before returning, so this path fails closed.

As per coding guidelines, "**: REVIEW PRIORITIES: Security vulnerabilities (provide severity, exploit scenario, and remediation code)."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/sidecar/proxy/allowlist.go` around lines 253 - 257, The early return at
lines 253-257 when av.poolSelector fails bypasses the informer replacement and
stop path that occurs at lines 288-295, creating a security vulnerability where
a malformed pool update can leave the previous selector active and keep stale
targets authorized. Instead of returning immediately on selector extraction
error, execute the same informer teardown logic used in the informer replacement
path (including stopping and clearing the informer) plus recompute the target
state to ensure the system fails closed rather than leaving stale authorization
rules in place.

Source: Coding guidelines

Comment on lines +275 to +280
selectorData, found, err := unstructured.NestedStringMap(spec, selectorPath...)
if err != nil || !found {
return nil, fmt.Errorf("missing or invalid selector field at %v (found=%t): %w", selectorPath, found, err)
}

// Create or update pod informer for this selector
av.createPodInformer(poolName, labelSelector.AsSelector())
return labels.Set(selectorData).AsSelector(), nil

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

In Kubernetes, does an empty ListOptions.labelSelector match all resources? Please cite official Kubernetes API docs.

💡 Result:

Yes, in the Kubernetes API, an empty label selector effectively matches all resources [1][2][3]. When performing a list or watch operation, the label selector in ListOptions is used to filter resources. If the label selector is not specified or is left empty, the filter is effectively a no-op, meaning it does not restrict the returned set of objects, and all resources are returned [4][5][6]. Official Kubernetes API documentation defines a LabelSelector as a query over a set of resources; specifically, it states that "an empty label selector matches all objects" [1][2][3]. Sources: [1] https://kubernetes.io/docs/reference/kubernetes-api/common-definitions/label-selector/ [2] https://kubernetes.io/docs/reference/kubernetes-api/definitions/label-selector-v1-meta/ [4] https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ [3] https://kubernetes.io/docs/reference/kubernetes-api/definitions/ [5] kubernetes/kubectl#545 [6] https://kubernetes.website.cncfstack.com/docs/reference/kubernetes-api/common-parameters/common-parameters/

Citations:


🏁 Script executed:

cd pkg/sidecar/proxy && wc -l allowlist.go

Repository: opendatahub-io/llm-d-router

Length of output: 88


🏁 Script executed:

cd pkg/sidecar/proxy && sed -n '275,310p' allowlist.go

Repository: opendatahub-io/llm-d-router

Length of output: 1422


🏁 Script executed:

cd pkg/sidecar/proxy && sed -n '282,320p' allowlist.go

Repository: opendatahub-io/llm-d-router

Length of output: 1444


🏁 Script executed:

cd pkg/sidecar/proxy && rg -n "createPodInformer|fromRawExtension" allowlist.go -B 3 -A 3

Repository: opendatahub-io/llm-d-router

Length of output: 577


Reject empty selector maps to prevent namespace-wide allowlisting (CWE-284, CWE-918).

Line 280 accepts empty selectorData. An empty selector.String() results in an empty LabelSelector passed to List/Watch operations at lines 300 and 308. Per Kubernetes API documentation, an empty label selector matches all resources in the namespace, bypassing the intended SSRF allowlist scope.

Suggested fix
 selectorData, found, err := unstructured.NestedStringMap(spec, selectorPath...)
 if err != nil || !found {
 	return nil, fmt.Errorf("missing or invalid selector field at %v (found=%t): %w", selectorPath, found, err)
 }
+if len(selectorData) == 0 {
+	return nil, fmt.Errorf("empty selector field at %v", selectorPath)
+}
 
 return labels.Set(selectorData).AsSelector(), nil
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/sidecar/proxy/allowlist.go` around lines 275 - 280, The code currently
accepts empty selectorData which results in an empty label selector that matches
all resources in a namespace, creating a security vulnerability. After
successfully retrieving selectorData from the spec using
unstructured.NestedStringMap (when the found check passes), add a validation to
reject empty selectorData maps by checking if len(selectorData) == 0 and
returning an appropriate error message before the return statement that calls
labels.Set(selectorData).AsSelector(). This ensures the allowlist scope is
properly enforced and prevents namespace-wide resource access.

Source: Coding guidelines

Comment on lines +34 to +35
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find global propagator mutations and check whether each has cleanup/restore.
rg -n -C3 --type=go 'SetTextMapPropagator\s*\('

Repository: opendatahub-io/llm-d-router

Length of output: 1840


🏁 Script executed:

#!/bin/bash
# Check if request_test.go has cleanup for its SetTextMapPropagator call
rg -n -A 20 'SetTextMapPropagator' pkg/epp/handlers/request_test.go | head -40

Repository: opendatahub-io/llm-d-router

Length of output: 640


🏁 Script executed:

#!/bin/bash
# Look for t.Cleanup patterns in test files to understand cleanup practices
rg -n 't\.Cleanup' --type=go pkg/sidecar/proxy/transport_test.go pkg/epp/handlers/request_test.go

Repository: opendatahub-io/llm-d-router

Length of output: 53


🏁 Script executed:

#!/bin/bash
# View the full test function in request_test.go to check for cleanup
sed -n '35,80p' pkg/epp/handlers/request_test.go

Repository: opendatahub-io/llm-d-router

Length of output: 1235


🏁 Script executed:

#!/bin/bash
# View the full test function in transport_test.go to confirm the issue
sed -n '31,60p' pkg/sidecar/proxy/transport_test.go

Repository: opendatahub-io/llm-d-router

Length of output: 1181


🏁 Script executed:

#!/bin/bash
# Check if there are ANY tests in the codebase using proper cleanup for global state
rg -B5 -A5 'prev.*otel|t\.Cleanup.*otel' --type=go

Repository: opendatahub-io/llm-d-router

Length of output: 945


Restore the global OTel propagator after this test.

Line 34 mutates process-global state and does not reset it, creating order-dependent failures in other tests. The same issue exists in pkg/epp/handlers/request_test.go:36. The codebase already establishes the correct pattern in pkg/common/observability/tracing/telemetry_test.go: save the original state and restore it in t.Cleanup().

Suggested remediation
 func TestNewProxyTransportInjectsTraceContext(t *testing.T) {
+	prev := otel.GetTextMapPropagator()
+	t.Cleanup(func() { otel.SetTextMapPropagator(prev) })
 	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))
func TestNewProxyTransportInjectsTraceContext(t *testing.T) {
prev := otel.GetTextMapPropagator()
t.Cleanup(func() { otel.SetTextMapPropagator(prev) })
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/sidecar/proxy/transport_test.go` around lines 34 - 35, The test mutates
global OTel propagator state via otel.SetTextMapPropagator without restoring it,
causing order-dependent failures in other tests. Save the original propagator
returned by otel.GetTextMapPropagator before calling otel.SetTextMapPropagator,
then restore it in t.Cleanup() following the pattern already established in
pkg/common/observability/tracing/telemetry_test.go. This ensures the global
state is properly reset after the test completes.

Comment on lines 24 to +26
EPP_IMAGE="${EPP_IMAGE:-ghcr.io/llm-d/llm-d-router-endpoint-picker:dev}"
SIM_IMAGE="${VLLM_IMAGE:-ghcr.io/llm-d/llm-d-inference-sim:v0.9.2}"
MANIFEST_PATH="${MANIFEST_PATH:-${DIR}/../test/testdata/sim-deployment.yaml}"
MANIFEST_PATH="${MANIFEST_PATH:-${DIR}/../testdata/sim-deployment.yaml}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix simulator image override key mismatch.

Line 25 assigns SIM_IMAGE from VLLM_IMAGE, so SIM_IMAGE=... overrides are ignored.

Proposed fix
-SIM_IMAGE="${VLLM_IMAGE:-ghcr.io/llm-d/llm-d-inference-sim:v0.9.2}"
+SIM_IMAGE="${SIM_IMAGE:-${VLLM_IMAGE:-ghcr.io/llm-d/llm-d-inference-sim:v0.9.2}}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
EPP_IMAGE="${EPP_IMAGE:-ghcr.io/llm-d/llm-d-router-endpoint-picker:dev}"
SIM_IMAGE="${VLLM_IMAGE:-ghcr.io/llm-d/llm-d-inference-sim:v0.9.2}"
MANIFEST_PATH="${MANIFEST_PATH:-${DIR}/../test/testdata/sim-deployment.yaml}"
MANIFEST_PATH="${MANIFEST_PATH:-${DIR}/../testdata/sim-deployment.yaml}"
EPP_IMAGE="${EPP_IMAGE:-ghcr.io/llm-d/llm-d-router-endpoint-picker:dev}"
SIM_IMAGE="${SIM_IMAGE:-${VLLM_IMAGE:-ghcr.io/llm-d/llm-d-inference-sim:v0.9.2}}"
MANIFEST_PATH="${MANIFEST_PATH:-${DIR}/../testdata/sim-deployment.yaml}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/scripts/test-e2e-gaie.sh` around lines 24 - 26, The SIM_IMAGE variable
assignment uses the wrong environment variable reference. On the line where
SIM_IMAGE is being set with a default value, it currently references
${VLLM_IMAGE:-...} as the fallback, but it should reference ${SIM_IMAGE:-...}
instead. Change the variable name in the parameter expansion from VLLM_IMAGE to
SIM_IMAGE so that environment variable overrides for SIM_IMAGE are properly
honored.

Comment on lines +24 to +30
# Set trap only for interruption signals.
# Normally kind cluster cleanup is done by AfterSuite; this trap deletes the
# e2e-tests cluster on Ctrl-C. The delete is unconditional and only meaningful
# when the suite created that cluster (K8S_CONTEXT unset); when running against
# an existing context it is a no-op unless a cluster named e2e-tests happens to
# exist.
trap 'e2e_handle_interrupt "e2e-tests"' INT TERM

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Gate interrupt cleanup to owned clusters only (CWE-706).

Line 30 hard-codes e2e-tests in the trap. On SIGINT/SIGTERM, this can delete an unrelated local kind cluster with the same name.

Proposed fix
-# Set trap only for interruption signals.
-# Normally kind cluster cleanup is done by AfterSuite; this trap deletes the
-# e2e-tests cluster on Ctrl-C. The delete is unconditional and only meaningful
-# when the suite created that cluster (K8S_CONTEXT unset); when running against
-# an existing context it is a no-op unless a cluster named e2e-tests happens to
-# exist.
-trap 'e2e_handle_interrupt "e2e-tests"' INT TERM
+# Set trap only for interruption signals.
+# Only mark e2e-tests as owned when not targeting an existing context.
+CREATED_CLUSTER=""
+if [ -z "${K8S_CONTEXT:-}" ]; then
+  CREATED_CLUSTER="e2e-tests"
+fi
+trap 'e2e_handle_interrupt "${CREATED_CLUSTER}"' INT TERM
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Set trap only for interruption signals.
# Normally kind cluster cleanup is done by AfterSuite; this trap deletes the
# e2e-tests cluster on Ctrl-C. The delete is unconditional and only meaningful
# when the suite created that cluster (K8S_CONTEXT unset); when running against
# an existing context it is a no-op unless a cluster named e2e-tests happens to
# exist.
trap 'e2e_handle_interrupt "e2e-tests"' INT TERM
# Set trap only for interruption signals.
# Only mark e2e-tests as owned when not targeting an existing context.
CREATED_CLUSTER=""
if [ -z "${K8S_CONTEXT:-}" ]; then
CREATED_CLUSTER="e2e-tests"
fi
trap 'e2e_handle_interrupt "${CREATED_CLUSTER}"' INT TERM
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/scripts/test-e2e-router.sh` around lines 24 - 30, The trap handler for
INT and TERM signals hard-codes the cluster name "e2e-tests" and unconditionally
calls e2e_handle_interrupt, which can delete an unrelated cluster with the same
name if the script is running against an existing K8S_CONTEXT. Gate the cleanup
to only occur when the script actually created the cluster by conditionally
setting the trap or passing a flag to e2e_handle_interrupt that indicates
ownership. The cleanup should only be active when K8S_CONTEXT is unset (meaning
the script created the cluster), not when using an existing context.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants