test(llmisvc): collect pod and init-container logs on failure#1495
test(llmisvc): collect pod and init-container logs on failure#1495jlost wants to merge 41 commits into
Conversation
- Override BASE_IMG to UBI in Makefile.overrides.mk so `make docker-build-*` uses the same base image the Dockerfiles expect (fixes dnf-not-found on Debian slim) - Deploy SeaweedFS separately in ODH operator mode since the operator doesn't include it, but the E2E tests need a local S3 backend Made-with: Cursor Signed-off-by: James Ostrander <jostrand@redhat.com>
Extract operator installation logic into install-operator.sh (shared by deploy.odh.sh and VS Code tasks) and DSCI/DSC application into apply-dsci-dsc.sh. Enhance the setup-e2e-ocp Makefile target to accept OPERATOR_TYPE, OPERATOR_VERSION, and CATALOG_SOURCE, translating them into INSTALL_ODH_OPERATOR for setup-e2e-tests.sh. MIRROR_IMAGES is now auto-derived (true when OPERATOR_TYPE=rhoai and CATALOG_SOURCE is an FBC fragment image). setup-e2e-tests.sh is unchanged -- backward compatible with existing CI. Made-with: Cursor Signed-off-by: James Ostrander <jostrand@redhat.com>
- Add `build-images-ocp` and `reset-e2e-ocp` Makefile targets - Remove deploy.serverless.sh and deploy.ossm.sh (dead code) - Remove ServiceMeshMember cleanup from teardown-ci-namespace.sh - Remove serverless/servicemesh references from version.sh - Drop unused DEPLOYMENT_TYPE arg from setup-ci-namespace.sh - Update fvt-tests.md to document Makefile-based E2E workflow Made-with: Cursor Signed-off-by: James Ostrander <jostrand@redhat.com>
- deploy.odh.sh: use OPERATOR_NAME for CSV lookup and dynamic pod selector from deployment matchLabels (was hardcoded to ODH) - copy-kserve-manifests-to-pvc.sh: derive operator deployment name from OPERATOR_TYPE instead of hardcoding ODH label - setup-e2e-tests.sh: set KSERVE_NAMESPACE per operator type (redhat-ods-applications for RHOAI, opendatahub for ODH); skip DSCI apply when one already exists (RHOAI auto-creates it) - Makefile.overrides.mk: forward COPY_PR_MANIFESTS to setup script; add $(strip) to all forwarded variables for VSCode space defaults Made-with: Cursor Signed-off-by: James Ostrander <jostrand@redhat.com>
- Replace the deletion of the S3 init job with a forceful replacement to ensure the job is updated correctly. - This change improves the reliability of the S3 initialization process during E2E tests. Made-with: Cursor Signed-off-by: James Ostrander <jostrand@redhat.com>
- Removed unused INSTALL_ODH_OPERATOR variable and refactored setup-e2e-tests.sh to derive installation mode from OPERATOR_TYPE. - Updated teardown-e2e-setup.sh to support both raw and operator-based deployments, ensuring a more robust cleanup process. - Enhanced install-operator.sh to dynamically handle channel detection and CSV validation, improving operator installation reliability. - Adjusted deploy.odh.sh to remove the CHANNEL_OVERRIDE variable, simplifying the script. Made-with: Cursor Signed-off-by: James Ostrander <jostrand@redhat.com>
When CATALOG_SOURCE is an FBC fragment image (e.g. quay.io/rhoai/rhoai-fbc-fragment:rhoai-3.5-ea.1) and OPERATOR_VERSION is not explicitly set, extract the major.minor version hint from the image tag and use it to find the matching channel/CSV instead of silently falling back to the default channel. Also: - Error on explicit OPERATOR_VERSION mismatch instead of silent fallback - Clean up stale subscription after auto-detect sets OPERATOR_VERSION - Fix teardown ordering: CI namespace first, webhooks before namespaces - Strip finalizers on stuck resources during CI namespace teardown - Trim redundant comments from teardown script - Prevent e2e-ocp from re-running setup (SETUP_E2E=false) Signed-off-by: James Ostrander <jostrand@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>
- Use .status.installPlanRef.name instead of deprecated .status.installplan.name - Resolve CSV from Subscription.status.installedCSV instead of grep|head - Enforce set -euo pipefail unconditionally in install-operator.sh (guard pipefail-sensitive pipelines with || true) Signed-off-by: James Ostrander <jostrand@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>
query_catalog_info only checked currentCSV (channel head), so pinning an older version like ODH 3.3.0 or RHOAI 3.4.0 failed even though the version existed in the channel entries. Now searches entries when the head doesn't match. Also: - Add clean-setup-e2e-ocp Makefile target (teardown + setup) - Add help text to teardown-e2e-ocp target Signed-off-by: James Ostrander <jostrand@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>
…ardown Replace hardcoded "openshift-operators" with configurable OPERATOR_NAMESPACE variable (defaulting to openshift-operators) across install, deploy, teardown, and manifest copy scripts. Add install plan cleanup to teardown to prevent stale plans from interfering with subsequent installs. Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: James Ostrander <jostrand@redhat.com>
- Move OpenShift E2E variables and targets into Makefile.e2e-ocp.mk, included from Makefile.overrides.mk. - Always use Manual install plan approval to avoid race conditions. - Wait for RHOAI auto-created DSCI instead of racing to apply our own. - Delete KServe CRDs on teardown to avoid storedVersions conflicts on version downgrades. Signed-off-by: James Ostrander <jostrand@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>
- config/overlays/odh-test/dsc.yaml: change defaultDeploymentMode from Serverless to RawDeployment. Serverless is incompatible with serving.managementState: Removed (no Knative), causing RHOAI 2.x installs to leave the CSV stuck in Pending. All code paths that used this file either immediately patched it to RawDeployment (manual mode) or silently broke on older operators. - teardown-e2e-setup.sh: also delete *.opendatahub.io CRDs on teardown. The previous teardown only removed *.serving.kserve.io CRDs, leaving ODH platform CRDs intact. On a version downgrade (e.g. RHOAI 3.x -> 2.25), those CRDs carry storedVersions from the newer API (e.g. v2) that the older operator's install plan refuses to overwrite, failing with "risk of data loss". Deleting them lets the next install recreate them cleanly. - Remove apply-dsci-dsc.sh: unused script with no callers; its KUEUE_STATE envsubst logic was never wired into dsc.yaml. Signed-off-by: James Ostrander <jostrand@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>
On HyperShift and ROSA HCP clusters there is no 'master' MachineConfigPool -- the ImageDigestMirrorSet takes effect without an MCP rollout. The previous code would spin for 300s waiting for an MCP that never exists. Guard the wait with a check for the MCP's existence so the mirroring path works on both CRC/ROSA Classic (MCP present) and HyperShift/ROSA HCP (no MCP). Signed-off-by: James Ostrander <jostrand@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>
- Updated Makefile.e2e-ocp.mk to include RUNNING_LOCAL and KSERVE_NAMESPACE variables for better local testing support. - Modified deploy.odh.sh to derive KSERVE_NAMESPACE based on OPERATOR_TYPE and improved logging for operator installation. - Enhanced install-operator.sh to wait for ODH/RHOAI CRDs to be established before proceeding. - Refined run-e2e-tests.sh to export RUNNING_LOCAL and set KSERVE_NAMESPACE based on the operator type. - Cleaned up setup-e2e-tests.sh by removing redundant PR manifest handling logic. This update improves the flexibility and reliability of the E2E testing process across different operator types. Signed-off-by: James Ostrander <jostrand@redhat.com>
…sting - Added Makefile.ocp.mk to define targets for building KServe images, setting up KServe on OpenShift, and running E2E tests. - Updated Makefile.overrides.mk to include the new Makefile.ocp.mk for better organization. - Enhanced image handling in copy-kserve-manifests-to-pvc.sh to prefer locally-built images when QUAY_REPO is set. - Created deploy.kserve-manual.sh for manual KServe installation via kustomize, allowing for more flexible deployment options. - Refined setup-e2e-tests.sh and setup-kserve.sh to streamline the installation process and improve environment setup. This update enhances the deployment and testing capabilities for KServe on OpenShift, providing a more robust framework for developers. Signed-off-by: James Ostrander <jostrand@redhat.com>
…annel detection - Introduced test-detect-channel.sh to validate the detect_channel() function without requiring a cluster, covering various scenarios for version detection and channel resolution. - Added test-install-operator.sh to test the install_operator() function, simulating the full installation pipeline and ensuring correct Subscription YAML generation. - Both scripts utilize mock functions to simulate OpenShift CLI behavior, enhancing the reliability of the tests. These additions improve the testing framework for operator installation and channel detection, ensuring better coverage and reliability in the deployment process. Signed-off-by: James Ostrander <jostrand@redhat.com>
The comment incorrectly stated CI mode uses Automatic install-plan approval. The script always uses Manual approval and explicitly approves the generated InstallPlan in wait_for_operator_ready(). Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: James Ostrander <jostrand@redhat.com>
- install-operator.sh: fix header comment -- approval mode is always Manual (script explicitly approves the InstallPlan); correct CI mode description to say startingCSV is omitted rather than Automatic approval - teardown-e2e-setup.sh: add WARNING comment on ODH CRD deletion block clarifying this is intentional but requires a dedicated test cluster - setup-kserve.sh: remove redundant DSC RawDeployment patch (deploy.kserve- manual.sh already applies dsc.yaml which sets RawDeployment); fix hardcoded 'namespace: opendatahub' in ODH MC kustomization override -- use KSERVE_NAMESPACE to stay consistent with the base kustomization.yaml - copy-kserve-manifests-to-pvc.sh, deploy.odh.sh: replace python3 -c JSON parsing for pod selector extraction with oc go-template, matching the pattern already established in version.sh Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: James Ostrander <jostrand@redhat.com>
- Makefile.ocp.mk: clarify COPY_PR_MANIFESTS comment -- manual mode always uses local branch overlays, so the flag is inapplicable there - install-operator.sh: remove unused _INSTALL_OPERATOR_SOURCED sentinel Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: James Ostrander <jostrand@redhat.com>
- Correct Makefile reference: targets are defined in Makefile.ocp.mk (included by Makefile.overrides.mk), not directly in Makefile.overrides.mk - Remove spurious E2E_MARKER from teardown-e2e-ocp example; the target does not use that variable Signed-off-by: James Ostrander <jostrand@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>
The RHOAI operator path has not been verified with this setup. Replace the pinned-version example with an equivalent ODH FBC fragment example. Signed-off-by: James Ostrander <jostrand@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>
- install-operator.sh help text: swap quay.io/rhoai/rhoai-fbc-fragment (private, auth-gated) with a single quay.io/opendatahub/opendatahub-operator-catalog example (public, verified pullable) - fvt-tests.md: same substitution for the pinned-catalog example The rhoai-fbc-fragment references in test-install-operator.sh and test-detect-channel.sh are intentional -- they test rhoai-specific code paths and never pull the image (query_catalog_info is mocked). Case G in test-install-operator.sh confirms the no-FBC RHOAI path works using the default redhat-operators catalog. Signed-off-by: James Ostrander <jostrand@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>
- SC2295: quote inner expansion in CSV_VERSION prefix-strip to prevent OPERATOR_NAME from being interpreted as a glob pattern - stale job sweep: match only Failed state (drop || $2 == "0/1" which incorrectly matched in-progress jobs), remove head -5 cap - bash -c payload: switch to single-quoted string and pass OPERATOR_NAMESPACE/CONTROLLER_DEPLOYMENT as positional args ($1/$2) to fix broken quoting caused by double-quotes inside double-quoted string Signed-off-by: James Ostrander <jostrand@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>
- build-kserve-images.sh: validate BUILDER against docker|podman whitelist before invocation to prevent command injection (CWE-78) - copy-kserve-manifests-to-pvc.sh: fail fast when pod selector is empty to prevent silent use of a wrong pod when deployment lookup fails - teardown-e2e-setup.sh: strip finalizers from stuck KServe/ODH resources when application namespaces remain in Terminating after initial delete, matching the pattern already used in teardown-ci-namespace.sh Signed-off-by: James Ostrander <jostrand@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: James Ostrander <jostrand@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jlost The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/hold |
…erage (kserve#5187) Signed-off-by: Bartosz Majsak <bartosz.majsak@gmail.com>
…isibility pytest-xdist always captures worker stdout even with -s, so print() calls in _collect_diagnostics were only visible in the final "Captured stdout call" section - not in the real-time tee log and truncated in large runs. Switch _collect_diagnostics, diagnostic.py helpers, and the except-block error message from print() to logger.info/error. Log records are forwarded in real-time by xdist via --log-cli, so they appear immediately in the live log stream and the tee'd log file. Also make wait_timeout configurable via LLMISVC_WAIT_TIMEOUT env var (default: 900s) so local dev iterations don't need a 15-minute wait. Signed-off-by: James Ostrander <jostrand@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/llmisvc/diagnostic.py`:
- Around line 25-27: The function print_all_events_table currently ignores the
max_events parameter and logs all events; update the logic in
print_all_events_table to enforce max_events by limiting the events
iterated/logged (e.g., slice the events list or stop after emitting max_events
entries) before calling the logging callable (log / _log.info), and ensure any
summary or count reflects the applied limit so diagnostics remain bounded and
performant.
- Around line 192-194: The current exception handler logs container output
verbatim via log("%s", logs or "(empty)"), which can leak sensitive tokens;
change this to redact secrets before emitting. Add or call a helper (e.g.,
redact_sensitive_tokens(logs)) that applies regex replacements to mask
Authorization/Bearer headers, "sig|se|sasl|token" query params in signed URLs,
AWS/Google keys and long base64-like strings, then call log("%s", redacted_logs
or "(empty)") inside the except block (keep the same log call and variable
names). Ensure the helper is used wherever raw container logs are printed in
this module so all usages of log(..., logs) are protected.
In `@test/e2e/llmisvc/test_llm_inference_service.py`:
- Around line 531-533: The except block in maybe_delete_llmisvc currently
swallows all exceptions and only prints a warning; update it to surface failures
instead of hiding them by logging the error (including service_name and
exception details) and then re-raising the exception (or calling pytest.fail) so
cleanup failures fail the test and do not leak resources; target the except
Exception as e block that references service_name and ensure the error is not
silently ignored.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 6137b1f9-45bc-46f5-bae4-89c1459ddba8
📒 Files selected for processing (6)
test/e2e/common/http_retry.pytest/e2e/llmisvc/diagnostic.pytest/e2e/llmisvc/fixtures.pytest/e2e/llmisvc/test_llm_inference_service.pytest/e2e/llmisvc/test_llm_inference_service_stop.pytest/e2e/pytest.ini
| def print_all_events_table( | ||
| namespace: str, max_events: int = 50, log: Callable = _log.info | ||
| ): |
There was a problem hiding this comment.
Apply max_events limit; it is currently ignored.
Line 26 advertises a bounded event dump, but Lines 34-45 log every event. This makes diagnostics noisy and can degrade CI performance.
Proposed fix
- events = core.list_namespaced_event(namespace=namespace).items
+ events = core.list_namespaced_event(namespace=namespace).items
+ events = sorted(
+ events,
+ key=lambda ev: ev.last_timestamp or ev.first_timestamp or datetime.min,
+ reverse=True,
+ )[:max_events]As per coding guidelines, **: REVIEW PRIORITIES: 4. Performance problems.
Also applies to: 34-45
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/e2e/llmisvc/diagnostic.py` around lines 25 - 27, The function
print_all_events_table currently ignores the max_events parameter and logs all
events; update the logic in print_all_events_table to enforce max_events by
limiting the events iterated/logged (e.g., slice the events list or stop after
emitting max_events entries) before calling the logging callable (log /
_log.info), and ensure any summary or count reflects the applied limit so
diagnostics remain bounded and performant.
| log("# -- logs (%s) --", label) | ||
| log("%s", logs or "(empty)") | ||
| except Exception as e: |
There was a problem hiding this comment.
Redact sensitive tokens before emitting raw container logs (CWE-532 / CWE-200).
Line 193 writes container logs verbatim to CI output; this can expose bearer tokens, credentials, or signed URLs in build artifacts.
Proposed fix
+_REDACTION_PATTERNS = [
+ re.compile(r"(?i)(authorization:\s*bearer\s+)[^\s]+"),
+ re.compile(r"(?i)(token=)[^&\s]+"),
+ re.compile(r"(?i)(password=)[^&\s]+"),
+]
+
+def _redact_sensitive(text: str) -> str:
+ redacted = text
+ for pattern in _REDACTION_PATTERNS:
+ redacted = pattern.sub(r"\1***REDACTED***", redacted)
+ return redacted
+
def _emit_container_logs(
@@
- log("%s", logs or "(empty)")
+ log("%s", _redact_sensitive(logs or "(empty)"))As per coding guidelines, **: REVIEW PRIORITIES: 1. Security vulnerabilities.
🧰 Tools
🪛 Ruff (0.15.12)
[warning] 194-194: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/e2e/llmisvc/diagnostic.py` around lines 192 - 194, The current exception
handler logs container output verbatim via log("%s", logs or "(empty)"), which
can leak sensitive tokens; change this to redact secrets before emitting. Add or
call a helper (e.g., redact_sensitive_tokens(logs)) that applies regex
replacements to mask Authorization/Bearer headers, "sig|se|sasl|token" query
params in signed URLs, AWS/Google keys and long base64-like strings, then call
log("%s", redacted_logs or "(empty)") inside the except block (keep the same log
call and variable names). Ensure the helper is used wherever raw container logs
are printed in this module so all usages of log(..., logs) are protected.
| except Exception as e: | ||
| print(f"⚠️ Warning: Failed to cleanup service {service_name}: {e}") | ||
|
|
There was a problem hiding this comment.
Do not silently swallow cleanup failures in maybe_delete_llmisvc.
Line 531 catches all exceptions and only prints a warning, which can hide leaked resources and cascade into later test failures.
Proposed fix
- except Exception as e:
- print(f"⚠️ Warning: Failed to cleanup service {service_name}: {e}")
+ except client.rest.ApiException as e:
+ logger.exception(
+ "Failed to cleanup service %s (status=%s): %s",
+ service_name,
+ e.status,
+ e,
+ )
+ if not test_failed and e.status != 404:
+ raise
+ except Exception as e:
+ logger.exception("Unexpected cleanup failure for service %s: %s", service_name, e)
+ if not test_failed:
+ raiseAs per coding guidelines, **: REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps.
🧰 Tools
🪛 Ruff (0.15.12)
[warning] 531-531: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/e2e/llmisvc/test_llm_inference_service.py` around lines 531 - 533, The
except block in maybe_delete_llmisvc currently swallows all exceptions and only
prints a warning; update it to surface failures instead of hiding them by
logging the error (including service_name and exception details) and then
re-raising the exception (or calling pytest.fail) so cleanup failures fail the
test and do not leak resources; target the except Exception as e block that
references service_name and ensure the error is not silently ignored.
- Changed the default value of KSERVE_NAMESPACE from 'opendatahub' to 'kserve' to align with the updated deployment configuration. Signed-off-by: James Ostrander <jostrand@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: James Ostrander <jostrand@redhat.com>
…ce downloads Pre-download facebook/opt-125m once during cluster setup via a storage-initializer init container in the s3-init job, then sync to s3://example-models/facebook/opt-125m/. Add seaweedfs-s3-creds secret with S3 endpoint annotations and link it to the default service account in kserve-ci-e2e-test, so the LLMISVC storage-initializer can authenticate to SeaweedFS for s3:// model URIs. Change all hf://facebook/opt-125m references in llmisvc e2e tests (13 total across fixtures.py and 3 test files) to s3://example-models/facebook/opt-125m. Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: James Ostrander <jostrand@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>
7eb535a to
e49ac26
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
test/scripts/openshift-ci/tls/setup-s3-tls.sh (2)
76-84:⚠️ Potential issue | 🟠 Major | ⚡ Quick winCWE-78: Unquoted variables enable command injection.
Variables
$DEPLOYMENT_NAMEand$SECRET_NAMEare unquoted in command contexts. If these contain spaces or shell metacharacters, command parsing fails or enables injection attacks.🛡️ Proposed fix: Quote all variable expansions
-if oc get deployment $DEPLOYMENT_NAME -n kserve > /dev/null 2>&1; then +if oc get deployment "$DEPLOYMENT_NAME" -n kserve > /dev/null 2>&1; then echo "Deleting existing $DEPLOYMENT_NAME deployment" - oc delete deployment $DEPLOYMENT_NAME -n kserve --ignore-not-found || true + oc delete deployment "$DEPLOYMENT_NAME" -n kserve --ignore-not-found || true fi # Delete TLS secret if it exists (only for custom cert - serving cert is managed by OpenShift) if [[ "$CERT_TYPE" == "custom" ]]; then - if oc get secret $SECRET_NAME -n kserve > /dev/null 2>&1; then + if oc get secret "$SECRET_NAME" -n kserve > /dev/null 2>&1; then echo "Deleting existing $SECRET_NAME secret" - oc delete secret $SECRET_NAME -n kserve --ignore-not-found || true + oc delete secret "$SECRET_NAME" -n kserve --ignore-not-found || true fi fiApply similar quoting to lines 88-90, 94, 99, 108-109, 115-117, 121, 127, 135.
As per coding guidelines: SHELL SCRIPT SECURITY (CWE-78) - Quote all variables to prevent injection.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/scripts/openshift-ci/tls/setup-s3-tls.sh` around lines 76 - 84, The shell commands using unquoted variables DEPLOYMENT_NAME and SECRET_NAME (e.g., in the oc get deployment/oc delete deployment and oc get secret/oc delete secret calls) allow command injection and break on spaces; update each invocation to quote the variable expansions (e.g., use "$DEPLOYMENT_NAME" and "$SECRET_NAME") and apply the same quoting pattern to all similar occurrences in this script (including the other oc get/delete lines mentioned in the review) so every command argument that uses these variables is wrapped in double quotes.
58-58:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winCWE-494: Download and execute code without integrity verification.
This downloads and pipes an installer script directly to bash without verifying its signature or checksum. If the kubernetes-sigs repository is compromised or a MITM attack occurs, arbitrary code executes with the script's privileges.
🔒 Proposed fix: Verify checksum or use pinned version
- curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash -s -- 5.0.1 $HOME/.local/bin + # Download installer, verify checksum, then execute + curl -sSL -o /tmp/install_kustomize.sh "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" + echo "EXPECTED_CHECKSUM_HERE /tmp/install_kustomize.sh" | sha256sum -c - + bash /tmp/install_kustomize.sh 5.0.1 "$HOME/.local/bin" + rm /tmp/install_kustomize.shAlternatively, download kustomize binary directly from GitHub releases with checksum verification.
As per coding guidelines: SHELL SCRIPT SECURITY (CWE-78) - Avoid eval and dynamic command execution.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/scripts/openshift-ci/tls/setup-s3-tls.sh` at line 58, The script currently pipes the remote installer into bash via the command "curl -s \"https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh\" | bash -s -- 5.0.1 $HOME/.local/bin", which is CWE-494; instead, download a pinned kustomize release and its published checksum (or the installer script and its checksum) using curl -LO, verify the checksum (sha256/sha512) locally with sha256sum/sha512sum (or gpg signature if available), and only then run the installer locally (or extract the verified binary to $HOME/.local/bin); ensure the script records the pinned version (5.0.1) and fails fast if checksum verification fails.test/scripts/openshift-ci/setup-e2e-tests.sh (1)
39-55:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winValidate
$1before the firstnounsetaccess.
set -o nounsetis active, and$1is dereferenced at lines 50, 97, 145, and 152. Invoking this script without an argument fails with a generic$1: unbound variablefrom whatever the first reachable branch happens to be, rather than a clear usage error. Worse, since line 50 is inside theRUNNING_LOCAL=trueblock, missing-arg behavior differs by environment, which complicates CI triage.Validate the deploy-mode argument once at the top:
🛡️ Proposed guard
source "$SCRIPT_DIR/common.sh" source "$SCRIPT_DIR/version.sh" +if [[ $# -lt 1 || -z "${1:-}" ]]; then + echo "Usage: $(basename "$0") <deploy-mode>" >&2 + exit 2 +fi +DEPLOY_MODE="$1" + # Always print cluster / operator snapshot on exit (success or failure) for CI triage. trap print_e2e_environment_summary EXITThen reference
"$DEPLOY_MODE"instead of"$1"at lines 50, 97, 145, 152 for readability.Also applies to: 145-152
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/scripts/openshift-ci/setup-e2e-tests.sh` around lines 39 - 55, Add an explicit guard that validates the positional deploy-mode argument before any `set -o nounset`-sensitive access: read `$1` into a new variable (e.g., DEPLOY_MODE) at the top of the script, check it’s non-empty and matches expected values (or print a clear usage error and exit), and then replace every direct `$1` reference in the script with `"$DEPLOY_MODE"` (notably the checks inside the RUNNING_LOCAL block and the conditions using BUILD_GRAPH_IMAGES/BUILD_KSERVE_IMAGES). This ensures the script never dereferences an unbound `$1` and provides a clear error message on missing/invalid deploy-mode while preserving existing logic around RUNNING_LOCAL, BUILD_KSERVE_IMAGES, and BUILD_GRAPH_IMAGES.
🧹 Nitpick comments (1)
test/scripts/openshift-ci/setup-e2e-tests.sh (1)
145-145: 💤 Low valueQuoted RHS of
=~performs literal match, not regex.In Bash 3.2+, quoting the right-hand side of
=~forces a literal substring comparison rather than ERE matching. The expression here happens to work becausekserve_on_openshifthas no regex metacharacters, but the form is misleading. If a substring/equality check is intended, prefer an explicit comparison:-if [[ "$1" =~ "kserve_on_openshift" ]]; then +if [[ "$1" == "kserve_on_openshift" ]]; thenUse
==if$1is expected to equal that literal, or== *kserve_on_openshift*if a substring match is genuinely wanted.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/scripts/openshift-ci/setup-e2e-tests.sh` at line 145, The conditional currently uses the regex-match operator with a quoted right-hand side (the test against the first positional arg for kserve_on_openshift), which forces a literal match; change that test to an explicit string comparison: use equality (when you expect $1 to exactly equal kserve_on_openshift) or use a glob-style equality (when you want to detect kserve_on_openshift as a substring) instead of the quoted =~ form so the intent is clear and portable.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@config/overlays/test/s3-local-backend/seaweedfs-init-job.yaml`:
- Around line 38-39: Replace the insecure --no-verify-ssl usage in the aws s3
cp/sync commands by either switching the S3 endpoint to plain HTTP for non-TLS
local S3 (ensure AWS_ENDPOINT_URL_S3 is set to http://... instead of https://)
or by enabling proper TLS verification: remove the --no-verify-ssl flags and add
an AWS_CA_BUNDLE environment variable (e.g., AWS_CA_BUNDLE pointing to the CA
bundle path) in the job spec so the aws CLI can verify certificates; update the
aws s3 cp and aws s3 sync invocations accordingly and confirm the env var name
AWS_ENDPOINT_URL_S3 is configured consistently.
In `@config/overlays/test/s3-local-backend/seaweedfs-s3-creds-secret.yaml`:
- Around line 10-12: The Secret file seaweedfs-s3-creds-secret.yaml currently
hardcodes AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; remove those static
values and instead generate credentials at CI/test setup (e.g., in
setup-ci-namespace.sh) and create the k8s Secret dynamically (or use
SealedSecrets/ExternalSecrets) so secrets never live in repo history; ensure the
setup script outputs/exports the generated values and applies them to a Secret
with the same name so SeaweedFS reads AWS_ACCESS_KEY_ID and
AWS_SECRET_ACCESS_KEY from the runtime Secret.
In `@test/scripts/openshift-ci/setup-e2e-tests.sh`:
- Around line 82-91: The script contains unquoted variable expansions (notably
PROJECT_ROOT, SCRIPT_DIR and KSERVE_NAMESPACE) causing potential
word-splitting/globbing; update all occurrences to use quoted expansions (e.g.
"$PROJECT_ROOT", "$SCRIPT_DIR", "${KSERVE_NAMESPACE}") including in commands
like make -C "$PROJECT_ROOT", pushd "$PROJECT_ROOT", export
PATH="${PROJECT_ROOT}/bin:${PATH}", and any other uses in the same file (around
the referenced ranges 82–91, 114–139, 152). Ensure every instance of these
variables is wrapped in double quotes so pushd/popd, make, export and other
shell commands receive a single, safe argument.
---
Outside diff comments:
In `@test/scripts/openshift-ci/setup-e2e-tests.sh`:
- Around line 39-55: Add an explicit guard that validates the positional
deploy-mode argument before any `set -o nounset`-sensitive access: read `$1`
into a new variable (e.g., DEPLOY_MODE) at the top of the script, check it’s
non-empty and matches expected values (or print a clear usage error and exit),
and then replace every direct `$1` reference in the script with `"$DEPLOY_MODE"`
(notably the checks inside the RUNNING_LOCAL block and the conditions using
BUILD_GRAPH_IMAGES/BUILD_KSERVE_IMAGES). This ensures the script never
dereferences an unbound `$1` and provides a clear error message on
missing/invalid deploy-mode while preserving existing logic around
RUNNING_LOCAL, BUILD_KSERVE_IMAGES, and BUILD_GRAPH_IMAGES.
In `@test/scripts/openshift-ci/tls/setup-s3-tls.sh`:
- Around line 76-84: The shell commands using unquoted variables DEPLOYMENT_NAME
and SECRET_NAME (e.g., in the oc get deployment/oc delete deployment and oc get
secret/oc delete secret calls) allow command injection and break on spaces;
update each invocation to quote the variable expansions (e.g., use
"$DEPLOYMENT_NAME" and "$SECRET_NAME") and apply the same quoting pattern to all
similar occurrences in this script (including the other oc get/delete lines
mentioned in the review) so every command argument that uses these variables is
wrapped in double quotes.
- Line 58: The script currently pipes the remote installer into bash via the
command "curl -s
\"https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh\"
| bash -s -- 5.0.1 $HOME/.local/bin", which is CWE-494; instead, download a
pinned kustomize release and its published checksum (or the installer script and
its checksum) using curl -LO, verify the checksum (sha256/sha512) locally with
sha256sum/sha512sum (or gpg signature if available), and only then run the
installer locally (or extract the verified binary to $HOME/.local/bin); ensure
the script records the pinned version (5.0.1) and fails fast if checksum
verification fails.
---
Nitpick comments:
In `@test/scripts/openshift-ci/setup-e2e-tests.sh`:
- Line 145: The conditional currently uses the regex-match operator with a
quoted right-hand side (the test against the first positional arg for
kserve_on_openshift), which forces a literal match; change that test to an
explicit string comparison: use equality (when you expect $1 to exactly equal
kserve_on_openshift) or use a glob-style equality (when you want to detect
kserve_on_openshift as a substring) instead of the quoted =~ form so the intent
is clear and portable.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 1f0225bd-38f7-4b4c-86d3-c330bbc49d67
📒 Files selected for processing (10)
config/overlays/test/s3-local-backend/seaweedfs-init-job.yamlconfig/overlays/test/s3-local-backend/seaweedfs-s3-creds-secret.yamltest/e2e/llmisvc/fixtures.pytest/e2e/llmisvc/test_llm_inference_service_conversion.pytest/e2e/llmisvc/test_pod_watch.pytest/e2e/llmisvc/test_storage_version_migration.pytest/scripts/openshift-ci/run-e2e-tests.shtest/scripts/openshift-ci/setup-ci-namespace.shtest/scripts/openshift-ci/setup-e2e-tests.shtest/scripts/openshift-ci/tls/setup-s3-tls.sh
🚧 Files skipped from review as they are similar to previous changes (1)
- test/e2e/llmisvc/fixtures.py
| aws s3 cp /tmp/sklearn-model.joblib s3://example-models/sklearn/model.joblib --no-verify-ssl | ||
| aws s3 sync /shared/facebook/opt-125m s3://example-models/facebook/opt-125m/ --no-verify-ssl |
There was a problem hiding this comment.
CWE-295: SSL certificate verification disabled.
The --no-verify-ssl flag disables certificate validation for S3 connections, enabling man-in-the-middle attacks. An attacker on the network can intercept traffic, inject malicious model files, or steal credentials.
Even in test environments, this is risky:
- Malicious model artifacts can exploit deserialization vulnerabilities
- Test networks may not be fully isolated
- Code patterns propagate to production
🔐 Proposed fix: Use proper TLS with CA bundle
For non-TLS local S3 (SeaweedFS), use HTTP explicitly instead of disabling SSL verification:
- aws s3 cp /tmp/sklearn-model.joblib s3://example-models/sklearn/model.joblib --no-verify-ssl
- aws s3 sync /shared/facebook/opt-125m s3://example-models/facebook/opt-125m/ --no-verify-ssl
+ aws s3 cp /tmp/sklearn-model.joblib s3://example-models/sklearn/model.joblib
+ aws s3 sync /shared/facebook/opt-125m s3://example-models/facebook/opt-125m/And ensure AWS_ENDPOINT_URL_S3 (line 54) uses http:// not https:// when TLS is not configured. For TLS-enabled S3, provide the CA certificate:
env:
- name: AWS_CA_BUNDLE
value: /etc/ssl/certs/ca-bundle.crtAs per coding guidelines: KUBERNETES CONFIG SECURITY - SecurityContext configured with appropriate restrictions.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| aws s3 cp /tmp/sklearn-model.joblib s3://example-models/sklearn/model.joblib --no-verify-ssl | |
| aws s3 sync /shared/facebook/opt-125m s3://example-models/facebook/opt-125m/ --no-verify-ssl | |
| aws s3 cp /tmp/sklearn-model.joblib s3://example-models/sklearn/model.joblib | |
| aws s3 sync /shared/facebook/opt-125m s3://example-models/facebook/opt-125m/ |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@config/overlays/test/s3-local-backend/seaweedfs-init-job.yaml` around lines
38 - 39, Replace the insecure --no-verify-ssl usage in the aws s3 cp/sync
commands by either switching the S3 endpoint to plain HTTP for non-TLS local S3
(ensure AWS_ENDPOINT_URL_S3 is set to http://... instead of https://) or by
enabling proper TLS verification: remove the --no-verify-ssl flags and add an
AWS_CA_BUNDLE environment variable (e.g., AWS_CA_BUNDLE pointing to the CA
bundle path) in the job spec so the aws CLI can verify certificates; update the
aws s3 cp and aws s3 sync invocations accordingly and confirm the env var name
AWS_ENDPOINT_URL_S3 is configured consistently.
| stringData: | ||
| AWS_ACCESS_KEY_ID: s3admin | ||
| AWS_SECRET_ACCESS_KEY: s3admin123 |
There was a problem hiding this comment.
CWE-798: Hardcoded credentials in source control.
Embedding credentials directly in YAML manifests creates multiple security risks:
- Credentials persist in git history even if later removed
- Anyone with repository access obtains credentials
- Establishes an anti-pattern that may be copied to production
- Weak credentials (
s3admin/s3admin123) are easily guessable
Even in test environments, hardcoded credentials enable unauthorized access if the test S3 endpoint is network-reachable or credentials are reused.
🔐 Proposed fix: Generate credentials during setup
Replace the static Secret with dynamic credential generation in setup-ci-namespace.sh:
-# Apply SeaweedFS S3 credentials secret and link to default SA (used by LLMISVC s3:// model URIs)
-echo "Applying SeaweedFS S3 credentials secret"
-: "${KSERVE_NAMESPACE:=kserve}"
-sed "s/s3-service.kserve/s3-service.${KSERVE_NAMESPACE}/" \
- "$PROJECT_ROOT/config/overlays/test/s3-local-backend/seaweedfs-s3-creds-secret.yaml" | \
- oc apply -f - -n "$NAMESPACE"
+# Generate random S3 credentials for this test run
+echo "Generating SeaweedFS S3 credentials secret"
+: "${KSERVE_NAMESPACE:=kserve}"
+S3_ACCESS_KEY="s3-$(openssl rand -hex 16)"
+S3_SECRET_KEY="$(openssl rand -base64 32)"
+cat <<EOF | oc apply -f - -n "$NAMESPACE"
+apiVersion: v1
+kind: Secret
+metadata:
+ name: seaweedfs-s3-creds
+ annotations:
+ serving.kserve.io/s3-endpoint: "s3-service.${KSERVE_NAMESPACE}:8333"
+ serving.kserve.io/s3-usehttps: "0"
+ serving.kserve.io/s3-verifyssl: "0"
+type: Opaque
+stringData:
+ AWS_ACCESS_KEY_ID: "${S3_ACCESS_KEY}"
+ AWS_SECRET_ACCESS_KEY: "${S3_SECRET_KEY}"
+EOFThen configure SeaweedFS to accept these credentials or use ephemeral token-based auth.
As per coding guidelines: KUBERNETES CONFIG SECURITY - No secrets in plain text (use SealedSecrets or external-secrets).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@config/overlays/test/s3-local-backend/seaweedfs-s3-creds-secret.yaml` around
lines 10 - 12, The Secret file seaweedfs-s3-creds-secret.yaml currently
hardcodes AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; remove those static
values and instead generate credentials at CI/test setup (e.g., in
setup-ci-namespace.sh) and create the k8s Secret dynamically (or use
SealedSecrets/ExternalSecrets) so secrets never live in repo history; ensure the
setup script outputs/exports the generated values and applies them to a Secret
with the same name so SeaweedFS reads AWS_ACCESS_KEY_ID and
AWS_SECRET_ACCESS_KEY from the runtime Secret.
| $PROJECT_ROOT/hack/setup/cli/install-kustomize.sh | ||
| make -C "$PROJECT_ROOT" yq | ||
| export PATH="${PROJECT_ROOT}/bin:${PATH}" | ||
|
|
||
| echo "Installing KServe Python SDK ..." | ||
| pushd $PROJECT_ROOT >/dev/null | ||
| ./test/scripts/gh-actions/setup-uv.sh | ||
| # Add bin directory to PATH so uv command is available | ||
| export PATH="${PROJECT_ROOT}/bin:${PATH}" | ||
| popd | ||
| pushd $PROJECT_ROOT/python/kserve >/dev/null |
There was a problem hiding this comment.
Quote all variable expansions (CWE-78).
Multiple unquoted expansions of $PROJECT_ROOT, $SCRIPT_DIR, and ${KSERVE_NAMESPACE} remain. Today the values are safe k8s namespace strings and CI-controlled paths, but the project-wide rule is to quote unconditionally so a future repo-clone path containing whitespace, or an externally provided KSERVE_NAMESPACE, does not cause word-splitting/globbing surprises. Shellcheck SC2086 already flags lines 82, 87, 91.
🔒 Suggested quoting (representative diff)
-$PROJECT_ROOT/hack/setup/cli/install-kustomize.sh
+"$PROJECT_ROOT/hack/setup/cli/install-kustomize.sh"
make -C "$PROJECT_ROOT" yq
export PATH="${PROJECT_ROOT}/bin:${PATH}"
echo "Installing KServe Python SDK ..."
-pushd $PROJECT_ROOT >/dev/null
+pushd "$PROJECT_ROOT" >/dev/null
./test/scripts/gh-actions/setup-uv.sh
export PATH="${PROJECT_ROOT}/bin:${PATH}"
popd
-pushd $PROJECT_ROOT/python/kserve >/dev/null
+pushd "$PROJECT_ROOT/python/kserve" >/dev/null
@@
- oc apply -n ${KSERVE_NAMESPACE} -f -
+ oc apply -n "${KSERVE_NAMESPACE}" -f -
@@
-oc rollout status deployment/seaweedfs -n ${KSERVE_NAMESPACE} --timeout=300s
+oc rollout status deployment/seaweedfs -n "${KSERVE_NAMESPACE}" --timeout=300s
@@
-if oc wait --for=condition=complete job/s3-init -n ${KSERVE_NAMESPACE} --timeout=60s 2>/dev/null; then
+if oc wait --for=condition=complete job/s3-init -n "${KSERVE_NAMESPACE}" --timeout=60s 2>/dev/null; then
@@
- sed "s/s3-service.kserve/s3-service.${KSERVE_NAMESPACE}/" \
+ sed "s/s3-service.kserve/s3-service.${KSERVE_NAMESPACE}/" \
"$PROJECT_ROOT/config/overlays/test/s3-local-backend/seaweedfs-init-job.yaml" | \
sed "s|kserve/storage-initializer:latest|${STORAGE_INITIALIZER_IMAGE}|" | \
- oc replace --force -n ${KSERVE_NAMESPACE} -f -
+ oc replace --force -n "${KSERVE_NAMESPACE}" -f -
echo "Waiting for S3 init job to complete..."
- if ! oc wait --for=condition=complete job/s3-init -n ${KSERVE_NAMESPACE} --timeout=300s; then
+ if ! oc wait --for=condition=complete job/s3-init -n "${KSERVE_NAMESPACE}" --timeout=300s; then
echo "S3 init job failed. Pod status and logs:"
- oc get pods -l job-name=s3-init -n ${KSERVE_NAMESPACE}
- oc logs -l job-name=s3-init -n ${KSERVE_NAMESPACE} --tail=50 || true
+ oc get pods -l job-name=s3-init -n "${KSERVE_NAMESPACE}"
+ oc logs -l job-name=s3-init -n "${KSERVE_NAMESPACE}" --tail=50 || true
exit 1
fi
fi
@@
-$SCRIPT_DIR/setup-ci-namespace.sh "$1"
+"$SCRIPT_DIR/setup-ci-namespace.sh" "$1"As per coding guidelines: "SHELL SCRIPT SECURITY (CWE-78): … Quote all variables to prevent injection".
Also applies to: 114-139, 152-152
🧰 Tools
🪛 Shellcheck (0.11.0)
[info] 82-82: Double quote to prevent globbing and word splitting.
(SC2086)
[info] 87-87: Double quote to prevent globbing and word splitting.
(SC2086)
[info] 91-91: Double quote to prevent globbing and word splitting.
(SC2086)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/scripts/openshift-ci/setup-e2e-tests.sh` around lines 82 - 91, The
script contains unquoted variable expansions (notably PROJECT_ROOT, SCRIPT_DIR
and KSERVE_NAMESPACE) causing potential word-splitting/globbing; update all
occurrences to use quoted expansions (e.g. "$PROJECT_ROOT", "$SCRIPT_DIR",
"${KSERVE_NAMESPACE}") including in commands like make -C "$PROJECT_ROOT", pushd
"$PROJECT_ROOT", export PATH="${PROJECT_ROOT}/bin:${PATH}", and any other uses
in the same file (around the referenced ranges 82–91, 114–139, 152). Ensure
every instance of these variables is wrapped in double quotes so pushd/popd,
make, export and other shell commands receive a single, safe argument.
Move ODH-specific SeaweedFS resources (init job with opt-125m download, s3-creds secret) into test/overlays/openshift-ci/ and parametrize the model URI via OPT_125M_MODEL_URI env var (defaults to hf://) so upstream xKS tests continue working without SeaweedFS model caching. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: James Ostrander <jostrand@redhat.com>
Signed-off-by: James Ostrander <jostrand@redhat.com>
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
test/scripts/openshift-ci/setup-e2e-tests.sh (1)
86-86:⚠️ Potential issue | 🟠 Major | ⚡ Quick winQuote all variable expansions to prevent word-splitting/globbing (CWE-78).
Multiple unquoted expansions of
$PROJECT_ROOT,${KSERVE_NAMESPACE}, and${STORAGE_INITIALIZER_IMAGE}remain. The coding guidelines require unconditional quoting to prevent injection if a future path contains whitespace or an external value is provided. Shellcheck SC2086 flags lines 86, 91, 95.Quote all variable expansions throughout the script.
🔒 Suggested quoting
# Install kustomize and yq (also needed for SeaweedFS kustomize build below) -$PROJECT_ROOT/hack/setup/cli/install-kustomize.sh +"$PROJECT_ROOT/hack/setup/cli/install-kustomize.sh" make -C "$PROJECT_ROOT" yq export PATH="${PROJECT_ROOT}/bin:${PATH}" echo "Installing KServe Python SDK ..." -pushd $PROJECT_ROOT >/dev/null +pushd "$PROJECT_ROOT" >/dev/null ./test/scripts/gh-actions/setup-uv.sh export PATH="${PROJECT_ROOT}/bin:${PATH}" popd -pushd $PROJECT_ROOT/python/kserve >/dev/null +pushd "$PROJECT_ROOT/python/kserve" >/dev/null uv sync --active --group test uv pip install timeout-sampler popd @@ if [[ "$SEAWEEDFS_BUNDLED" == "false" ]]; then echo "Deploying SeaweedFS S3 backend for tests..." kustomize build "$PROJECT_ROOT/config/overlays/test/s3-local-backend" | \ sed "s/namespace: kserve/namespace: ${KSERVE_NAMESPACE}/" | \ - oc apply -n ${KSERVE_NAMESPACE} -f - + oc apply -n "${KSERVE_NAMESPACE}" -f - fi echo "Waiting for SeaweedFS deployment to be ready..." -oc rollout status deployment/seaweedfs -n ${KSERVE_NAMESPACE} --timeout=300s +oc rollout status deployment/seaweedfs -n "${KSERVE_NAMESPACE}" --timeout=300s # The s3-init job is already created by the kustomize build above. # It may have failed if SeaweedFS wasn't ready yet, so check and re-create if needed. -if oc wait --for=condition=complete job/s3-init -n ${KSERVE_NAMESPACE} --timeout=60s 2>/dev/null; then +if oc wait --for=condition=complete job/s3-init -n "${KSERVE_NAMESPACE}" --timeout=60s 2>/dev/null; then echo "S3 init job already completed successfully" else echo "S3 init job not completed, re-creating..." sed "s/s3-service.kserve/s3-service.${KSERVE_NAMESPACE}/" \ "$PROJECT_ROOT/test/overlays/openshift-ci/seaweedfs-init-job-odh.yaml" | \ sed "s|kserve/storage-initializer:latest|${STORAGE_INITIALIZER_IMAGE}|" | \ - oc replace --force -n ${KSERVE_NAMESPACE} -f - + oc replace --force -n "${KSERVE_NAMESPACE}" -f - echo "Waiting for S3 init job to complete..." - if ! oc wait --for=condition=complete job/s3-init -n ${KSERVE_NAMESPACE} --timeout=300s; then + if ! oc wait --for=condition=complete job/s3-init -n "${KSERVE_NAMESPACE}" --timeout=300s; then echo "S3 init job failed. Pod status and logs:" - oc get pods -l job-name=s3-init -n ${KSERVE_NAMESPACE} - oc logs -l job-name=s3-init -n ${KSERVE_NAMESPACE} --tail=50 || true + oc get pods -l job-name=s3-init -n "${KSERVE_NAMESPACE}" + oc logs -l job-name=s3-init -n "${KSERVE_NAMESPACE}" --tail=50 || true exit 1 fi fi @@ echo "Prepare CI namespace and install ServingRuntimes" -$SCRIPT_DIR/setup-ci-namespace.sh "$1" +"$SCRIPT_DIR/setup-ci-namespace.sh" "$1"As per coding guidelines: "SHELL SCRIPT SECURITY (CWE-78): … Quote all variables to prevent injection".
Also applies to: 91-91, 95-95, 122-122, 126-126, 130-130, 137-137, 140-140, 142-143, 156-156
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/scripts/openshift-ci/setup-e2e-tests.sh` at line 86, Quote every variable expansion in the script to prevent word-splitting/globbing: replace unquoted $PROJECT_ROOT, ${KSERVE_NAMESPACE}, ${STORAGE_INITIALIZER_IMAGE} (and any other bare expansions) with "$PROJECT_ROOT", "${KSERVE_NAMESPACE}", "${STORAGE_INITIALIZER_IMAGE}" in all commands, assignments, and command substitutions (including the call to install-kustomize.sh and kubectl/oc invocations); ensure concatenations use quoted expansions (e.g., "${VAR}/path") and update all other occurrences flagged (e.g., other ${...} usages) so no unquoted variable remains.
🧹 Nitpick comments (1)
test/overlays/openshift-ci/seaweedfs-init-job-odh.yaml (1)
10-10: 💤 Low valueNon-reproducible image tag in CI manifest.
The
kserve/storage-initializer:latesttag is mutable and defeats reproducibility. The AI summary notes this image is rewritten viaSTORAGE_INITIALIZER_IMAGEbysetup-e2e-tests.sh, but the manifest itself should document the substitution pattern or use a concrete default.Consider adding a comment explaining the image substitution, or use a concrete tag as the default.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/overlays/openshift-ci/seaweedfs-init-job-odh.yaml` at line 10, The manifest currently pins a mutable image tag "kserve/storage-initializer:latest"; update the manifest entry that sets image: kserve/storage-initializer:latest to either a concrete, reproducible tag (e.g., kserve/storage-initializer:<version>) or add an inline comment next to that image line documenting that the image is overwritten by the setup-e2e-tests.sh script using the STORAGE_INITIALIZER_IMAGE environment variable, so reviewers and CI users understand the substitution pattern; ensure the chosen fix references the same image field and the setup-e2e-tests.sh / STORAGE_INITIALIZER_IMAGE mechanism.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/controller/v1alpha2/llmisvc/workload_storage.go`:
- Around line 258-260: The HF_HUB_VERBOSITY env var is being added
unconditionally which triggers pod restarts during operator upgrades; update the
code so the utils.AddEnvVars(initContainer, []corev1.EnvVar{{Name:
"HF_HUB_VERBOSITY", Value: "debug"}}) call is moved inside the same conditional
block that handles skipping HF default env vars during upgrades (the block that
guards adding HF defaults), so HF_HUB_VERBOSITY is only added when defaults are
applied and not during upgrade-only reconciliations; locate the conditional
around the HF defaults and place the HF_HUB_VERBOSITY AddEnvVars call there,
preserving the initContainer and env var name/value.
In `@test/overlays/openshift-ci/seaweedfs-init-job-odh.yaml`:
- Around line 8-27: The init container "download-hf-model" is missing a
securityContext allowing it to run as root; add a securityContext under the
initContainer definition for download-hf-model to enforce non-root and minimal
privileges (e.g., set runAsNonRoot: true and runAsUser to a non-zero UID like
1000), disallow privilege escalation (allowPrivilegeEscalation: false), drop all
Linux capabilities (capabilities: drop: ["ALL"]), and enforce a read-only root
filesystem if possible (readOnlyRootFilesystem: true) so the init container
cannot escalate privileges.
- Around line 29-57: The s3-init main container is missing a securityContext
allowing it to run as non-root; update the container named "s3-init" to include
a securityContext that enforces non-root and least privilege (e.g.,
runAsNonRoot: true and runAsUser: 1000), disallows privilege escalation
(allowPrivilegeEscalation: false), drops capabilities (capabilities.drop:
["ALL"]), and enables readOnlyRootFilesystem: true (and set runAsGroup if
needed); apply these settings to the container spec for s3-init to prevent
running as root and limit privileges.
---
Duplicate comments:
In `@test/scripts/openshift-ci/setup-e2e-tests.sh`:
- Line 86: Quote every variable expansion in the script to prevent
word-splitting/globbing: replace unquoted $PROJECT_ROOT, ${KSERVE_NAMESPACE},
${STORAGE_INITIALIZER_IMAGE} (and any other bare expansions) with
"$PROJECT_ROOT", "${KSERVE_NAMESPACE}", "${STORAGE_INITIALIZER_IMAGE}" in all
commands, assignments, and command substitutions (including the call to
install-kustomize.sh and kubectl/oc invocations); ensure concatenations use
quoted expansions (e.g., "${VAR}/path") and update all other occurrences flagged
(e.g., other ${...} usages) so no unquoted variable remains.
---
Nitpick comments:
In `@test/overlays/openshift-ci/seaweedfs-init-job-odh.yaml`:
- Line 10: The manifest currently pins a mutable image tag
"kserve/storage-initializer:latest"; update the manifest entry that sets image:
kserve/storage-initializer:latest to either a concrete, reproducible tag (e.g.,
kserve/storage-initializer:<version>) or add an inline comment next to that
image line documenting that the image is overwritten by the setup-e2e-tests.sh
script using the STORAGE_INITIALIZER_IMAGE environment variable, so reviewers
and CI users understand the substitution pattern; ensure the chosen fix
references the same image field and the setup-e2e-tests.sh /
STORAGE_INITIALIZER_IMAGE mechanism.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 418296c6-b8e3-421b-955a-0494893f939c
📒 Files selected for processing (11)
pkg/controller/v1alpha2/llmisvc/workload_storage.gopython/storage-initializer/scripts/initializer-entrypointtest/e2e/llmisvc/fixtures.pytest/e2e/llmisvc/test_llm_inference_service_conversion.pytest/e2e/llmisvc/test_pod_watch.pytest/e2e/llmisvc/test_storage_version_migration.pytest/overlays/openshift-ci/seaweedfs-init-job-odh.yamltest/overlays/openshift-ci/seaweedfs-s3-creds-secret.yamltest/scripts/openshift-ci/setup-ci-namespace.shtest/scripts/openshift-ci/setup-e2e-tests.shtest/scripts/openshift-ci/tls/setup-s3-tls.sh
✅ Files skipped from review due to trivial changes (1)
- test/overlays/openshift-ci/seaweedfs-s3-creds-secret.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
- test/scripts/openshift-ci/tls/setup-s3-tls.sh
- test/e2e/llmisvc/test_llm_inference_service_conversion.py
- test/scripts/openshift-ci/setup-ci-namespace.sh
- test/e2e/llmisvc/fixtures.py
| utils.AddEnvVars(initContainer, []corev1.EnvVar{ | ||
| {Name: "HF_HUB_VERBOSITY", Value: "debug"}, | ||
| }) |
There was a problem hiding this comment.
Inconsistent upgrade behavior: HF_HUB_VERBOSITY forces pod restart during operator upgrades.
Lines 252-256 conditionally skip adding HF default env vars when upgrading from an older operator (to avoid unnecessary pod restarts), but lines 258-260 add HF_HUB_VERBOSITY unconditionally. This forces a pod restart during upgrades even when the model hasn't changed.
Move the HF_HUB_VERBOSITY assignment inside the same conditional block (after line 255) to maintain consistency with the upgrade-avoidance pattern.
🔧 Proposed fix
// Add HF default env vars when the init container is new or already has HF_ env vars
// from a previous reconciliation. Skip only when upgrading from an older operator version
// that didn't set these vars, to avoid unnecessary pod restarts.
if currentInitContainer == nil || slices.ContainsFunc(currentInitContainer.Env, func(e corev1.EnvVar) bool {
return strings.HasPrefix(e.Name, "HF_")
}) {
utils.AddDefaultHuggingFaceEnvVars(initContainer)
+ utils.AddEnvVars(initContainer, []corev1.EnvVar{
+ {Name: "HF_HUB_VERBOSITY", Value: "debug"},
+ })
}
- utils.AddEnvVars(initContainer, []corev1.EnvVar{
- {Name: "HF_HUB_VERBOSITY", Value: "debug"},
- })
-
if containerName == tokenizerContainerName {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/controller/v1alpha2/llmisvc/workload_storage.go` around lines 258 - 260,
The HF_HUB_VERBOSITY env var is being added unconditionally which triggers pod
restarts during operator upgrades; update the code so the
utils.AddEnvVars(initContainer, []corev1.EnvVar{{Name: "HF_HUB_VERBOSITY",
Value: "debug"}}) call is moved inside the same conditional block that handles
skipping HF default env vars during upgrades (the block that guards adding HF
defaults), so HF_HUB_VERBOSITY is only added when defaults are applied and not
during upgrade-only reconciliations; locate the conditional around the HF
defaults and place the HF_HUB_VERBOSITY AddEnvVars call there, preserving the
initContainer and env var name/value.
| initContainers: | ||
| - name: download-hf-model | ||
| image: kserve/storage-initializer:latest | ||
| command: | ||
| - python3 | ||
| - -c | ||
| - | | ||
| from huggingface_hub import snapshot_download | ||
| snapshot_download( | ||
| "facebook/opt-125m", | ||
| local_dir="/shared/facebook/opt-125m", | ||
| ignore_patterns=["*.msgpack", "*.h5", "flax_model*", "tf_model*", "rust_model*"], | ||
| ) | ||
| print("Model download complete") | ||
| env: | ||
| - name: HF_HUB_DISABLE_PROGRESS_BARS | ||
| value: "1" | ||
| volumeMounts: | ||
| - name: model-cache | ||
| mountPath: /shared |
There was a problem hiding this comment.
Missing security context on init container allows privilege escalation.
The download-hf-model init container lacks a securityContext. This allows it to run as root and escalate privileges (CWE-250). Even in a CI-only test job, containers should run as non-root with minimal privileges.
Add a security context to the init container.
🔒 Proposed security hardening
- name: download-hf-model
image: kserve/storage-initializer:latest
+ securityContext:
+ runAsNonRoot: true
+ allowPrivilegeEscalation: false
+ capabilities:
+ drop:
+ - ALL
command:
- python3🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/overlays/openshift-ci/seaweedfs-init-job-odh.yaml` around lines 8 - 27,
The init container "download-hf-model" is missing a securityContext allowing it
to run as root; add a securityContext under the initContainer definition for
download-hf-model to enforce non-root and minimal privileges (e.g., set
runAsNonRoot: true and runAsUser to a non-zero UID like 1000), disallow
privilege escalation (allowPrivilegeEscalation: false), drop all Linux
capabilities (capabilities: drop: ["ALL"]), and enforce a read-only root
filesystem if possible (readOnlyRootFilesystem: true) so the init container
cannot escalate privileges.
| - name: s3-init | ||
| image: docker.io/amazon/aws-cli:2.33.11@sha256:88ea087837da1b31a3afeeb4538888244a3087a867ddb1909848643fcc4b324e | ||
| command: | ||
| - /bin/sh | ||
| - -c | ||
| - | | ||
| aws s3 mb s3://logger-output || true | ||
| aws s3 mb s3://example-models || true | ||
| curl -L https://storage.googleapis.com/kfserving-examples/models/sklearn/1.0/model/model.joblib -o /tmp/sklearn-model.joblib | ||
| aws s3 cp /tmp/sklearn-model.joblib s3://example-models/sklearn/model.joblib --no-verify-ssl | ||
| aws s3 sync /shared/facebook/opt-125m s3://example-models/facebook/opt-125m/ --no-verify-ssl | ||
| env: | ||
| - name: AWS_ACCESS_KEY_ID | ||
| valueFrom: | ||
| secretKeyRef: | ||
| name: mlpipeline-s3-artifact | ||
| key: accesskey | ||
| - name: AWS_SECRET_ACCESS_KEY | ||
| valueFrom: | ||
| secretKeyRef: | ||
| name: mlpipeline-s3-artifact | ||
| key: secretkey | ||
| - name: AWS_DEFAULT_REGION | ||
| value: "us-east-1" | ||
| - name: AWS_ENDPOINT_URL_S3 | ||
| value: "http://s3-service.kserve:8333" | ||
| volumeMounts: | ||
| - name: model-cache | ||
| mountPath: /shared |
There was a problem hiding this comment.
Missing security context on main container allows privilege escalation.
The s3-init container lacks a securityContext. This allows it to run as root and escalate privileges (CWE-250). Even in a CI-only test job, containers should run as non-root with minimal privileges.
Add a security context to the main container.
🔒 Proposed security hardening
- name: s3-init
image: docker.io/amazon/aws-cli:2.33.11@sha256:88ea087837da1b31a3afeeb4538888244a3087a867ddb1909848643fcc4b324e
+ securityContext:
+ runAsNonRoot: true
+ allowPrivilegeEscalation: false
+ capabilities:
+ drop:
+ - ALL
command:
- /bin/sh🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/overlays/openshift-ci/seaweedfs-init-job-odh.yaml` around lines 29 - 57,
The s3-init main container is missing a securityContext allowing it to run as
non-root; update the container named "s3-init" to include a securityContext that
enforces non-root and least privilege (e.g., runAsNonRoot: true and runAsUser:
1000), disallows privilege escalation (allowPrivilegeEscalation: false), drops
capabilities (capabilities.drop: ["ALL"]), and enables readOnlyRootFilesystem:
true (and set runAsGroup if needed); apply these settings to the container spec
for s3-init to prevent running as root and limit privileges.
Set HF_HUB_VERBOSITY=debug so that huggingface_hub emits detailed download progress, HTTP request info, and transfer diagnostics. This helps diagnose hf:// download timeouts that currently produce no actionable logs. Two injection points cover all deployment scenarios: - Go controller (llmisvc): injects the env var into the init container pod spec, effective even when the storage-initializer image is not rebuilt from source (ODH CI). - Python entrypoint: sets the env var before huggingface_hub is imported, effective when the image IS rebuilt (upstream CI). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: James Ostrander <jostrand@redhat.com>
8442c3d to
c016d6d
Compare
Increase XET logging verbosity from info to debug to capture per-term download details (xorb hashes, byte ranges) that reveal where silent hangs occur. Add a background thread to the storage-initializer entrypoint that logs `du -sh` every 10s so we can see how far a download got before the container is killed. Also move OPT_125M_MODEL_URI from setup-e2e-tests.sh (subprocess) to run-e2e-tests.sh (parent), fixing the env var propagation bug. Commented out for now to keep reproducing the XET hang. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: James Ostrander <jostrand@redhat.com>
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
test/scripts/openshift-ci/setup-e2e-tests.sh (1)
82-91: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winQuote the remaining variable expansions.
Lines 82, 87, 91, 118, 122, 126, 133, 136, 138, 139, and 152 still pass unquoted expansions into commands. That leaves word-splitting/globbing on repo paths and namespace/image inputs, and it is exactly what the shell rule in this repo forbids.
Representative fix
-$PROJECT_ROOT/hack/setup/cli/install-kustomize.sh +"$PROJECT_ROOT/hack/setup/cli/install-kustomize.sh" @@ -pushd $PROJECT_ROOT >/dev/null +pushd "$PROJECT_ROOT" >/dev/null @@ -pushd $PROJECT_ROOT/python/kserve >/dev/null +pushd "$PROJECT_ROOT/python/kserve" >/dev/null @@ - oc apply -n ${KSERVE_NAMESPACE} -f - + oc apply -n "${KSERVE_NAMESPACE}" -f - @@ -oc rollout status deployment/seaweedfs -n ${KSERVE_NAMESPACE} --timeout=300s +oc rollout status deployment/seaweedfs -n "${KSERVE_NAMESPACE}" --timeout=300s @@ -if oc wait --for=condition=complete job/s3-init -n ${KSERVE_NAMESPACE} --timeout=60s 2>/dev/null; then +if oc wait --for=condition=complete job/s3-init -n "${KSERVE_NAMESPACE}" --timeout=60s 2>/dev/null; then @@ - oc replace --force -n ${KSERVE_NAMESPACE} -f - + oc replace --force -n "${KSERVE_NAMESPACE}" -f - @@ - if ! oc wait --for=condition=complete job/s3-init -n ${KSERVE_NAMESPACE} --timeout=300s; then + if ! oc wait --for=condition=complete job/s3-init -n "${KSERVE_NAMESPACE}" --timeout=300s; then @@ - oc get pods -l job-name=s3-init -n ${KSERVE_NAMESPACE} - oc logs -l job-name=s3-init -n ${KSERVE_NAMESPACE} --tail=50 || true + oc get pods -l job-name=s3-init -n "${KSERVE_NAMESPACE}" + oc logs -l job-name=s3-init -n "${KSERVE_NAMESPACE}" --tail=50 || true @@ -$SCRIPT_DIR/setup-ci-namespace.sh "$1" +"$SCRIPT_DIR/setup-ci-namespace.sh" "$1"#!/bin/bash set -euo pipefail shellcheck -f gcc test/scripts/openshift-ci/setup-e2e-tests.sh | rg 'SC2086'As per coding guidelines, "SHELL SCRIPT SECURITY (CWE-78): ... Quote all variables to prevent injection".
Also applies to: 114-152
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/scripts/openshift-ci/setup-e2e-tests.sh` around lines 82 - 91, The script passes several unquoted variable expansions which can cause word-splitting and globbing; update all occurrences of $PROJECT_ROOT and any other variable expansions used in commands (e.g., in the calls to the install-kustomize script, make -C "$PROJECT_ROOT", export PATH assignments, pushd/popd invocations, and ./test/scripts/gh-actions/setup-uv.sh and pushd $PROJECT_ROOT/python/kserve) to be quoted (e.g., "$PROJECT_ROOT", "${PROJECT_ROOT}/bin", and similar) throughout the file (including the remaining occurrences around lines 114–152) so every command argument and PATH export uses quoted expansions and prevents shell word-splitting/globbing.pkg/controller/v1alpha2/llmisvc/workload_storage.go (1)
258-263:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep the new debug env vars inside the existing upgrade guard.
These additions sit outside the conditional on Lines 249-256, so reconciling an older deployment still mutates the init container env and rolls the workload even when the model has not changed. That defeats the upgrade-stability logic already in this function.
Possible fix
if currentInitContainer == nil || slices.ContainsFunc(currentInitContainer.Env, func(e corev1.EnvVar) bool { return strings.HasPrefix(e.Name, "HF_") }) { utils.AddDefaultHuggingFaceEnvVars(initContainer) + utils.AddEnvVars(initContainer, []corev1.EnvVar{ + {Name: "HF_HUB_VERBOSITY", Value: "debug"}, + {Name: "HF_DEBUG", Value: "1"}, + {Name: "RUST_LOG", Value: "debug"}, + {Name: "RUST_BACKTRACE", Value: "1"}, + }) } - - utils.AddEnvVars(initContainer, []corev1.EnvVar{ - {Name: "HF_HUB_VERBOSITY", Value: "debug"}, - {Name: "HF_DEBUG", Value: "1"}, - {Name: "RUST_LOG", Value: "debug"}, - {Name: "RUST_BACKTRACE", Value: "1"}, - })🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/controller/v1alpha2/llmisvc/workload_storage.go` around lines 258 - 263, The new debug env vars added via utils.AddEnvVars(initContainer, ...) are placed outside the existing upgrade guard and will trigger unnecessary rollouts; move the four EnvVar entries ("HF_HUB_VERBOSITY", "HF_DEBUG", "RUST_LOG", "RUST_BACKTRACE") so they are added only inside the existing conditional block that protects initContainer updates (the upgrade guard surrounding the init container mutation logic), ensuring the env mutation occurs only when that guard's condition is met and leaving all other logic unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@python/storage-initializer/scripts/initializer-entrypoint`:
- Around line 27-44: The recursive progress scanner _log_download_progress is
always started (progress_thread using dest_paths), causing du -sh to scan large
trees every 10s; gate this behavior behind an opt-in flag (e.g. environment
variable or config like ENABLE_DOWNLOAD_PROGRESS) and only create/start
progress_thread when that flag is truthy. Update the startup path to check the
flag before instantiating threading.Thread(target=_log_download_progress,
args=(dest_paths,), daemon=True) so the scanner never runs in normal runs unless
explicitly enabled.
---
Duplicate comments:
In `@pkg/controller/v1alpha2/llmisvc/workload_storage.go`:
- Around line 258-263: The new debug env vars added via
utils.AddEnvVars(initContainer, ...) are placed outside the existing upgrade
guard and will trigger unnecessary rollouts; move the four EnvVar entries
("HF_HUB_VERBOSITY", "HF_DEBUG", "RUST_LOG", "RUST_BACKTRACE") so they are added
only inside the existing conditional block that protects initContainer updates
(the upgrade guard surrounding the init container mutation logic), ensuring the
env mutation occurs only when that guard's condition is met and leaving all
other logic unchanged.
In `@test/scripts/openshift-ci/setup-e2e-tests.sh`:
- Around line 82-91: The script passes several unquoted variable expansions
which can cause word-splitting and globbing; update all occurrences of
$PROJECT_ROOT and any other variable expansions used in commands (e.g., in the
calls to the install-kustomize script, make -C "$PROJECT_ROOT", export PATH
assignments, pushd/popd invocations, and ./test/scripts/gh-actions/setup-uv.sh
and pushd $PROJECT_ROOT/python/kserve) to be quoted (e.g., "$PROJECT_ROOT",
"${PROJECT_ROOT}/bin", and similar) throughout the file (including the remaining
occurrences around lines 114–152) so every command argument and PATH export uses
quoted expansions and prevents shell word-splitting/globbing.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 4d993c91-dba9-455a-a4fc-122304b97513
📒 Files selected for processing (4)
pkg/controller/v1alpha2/llmisvc/workload_storage.gopython/storage-initializer/scripts/initializer-entrypointtest/scripts/openshift-ci/run-e2e-tests.shtest/scripts/openshift-ci/setup-e2e-tests.sh
🚧 Files skipped from review as they are similar to previous changes (1)
- test/scripts/openshift-ci/run-e2e-tests.sh
| def _log_download_progress(paths, interval=10): | ||
| while True: | ||
| time.sleep(interval) | ||
| for p in paths: | ||
| if not os.path.isdir(p): | ||
| continue | ||
| result = subprocess.run( | ||
| ["du", "-sh", p], capture_output=True, text=True | ||
| ) | ||
| if result.returncode == 0: | ||
| ts = datetime.datetime.now(datetime.timezone.utc).isoformat() | ||
| logger.info("download progress [%s]: %s", ts, result.stdout.strip()) | ||
|
|
||
|
|
||
| progress_thread = threading.Thread( | ||
| target=_log_download_progress, args=(dest_paths,), daemon=True | ||
| ) | ||
| progress_thread.start() |
There was a problem hiding this comment.
Gate the recursive progress scanner behind an opt-in flag.
du -sh walks the full model tree every 10s while the download is writing into the same directory. On large pulls this adds avoidable I/O contention to the startup path for every storage init, not just debug runs.
Possible fix
-def _log_download_progress(paths, interval=10):
+def _log_download_progress(paths, interval=60):
while True:
time.sleep(interval)
for p in paths:
if not os.path.isdir(p):
continue
@@
-progress_thread = threading.Thread(
- target=_log_download_progress, args=(dest_paths,), daemon=True
-)
-progress_thread.start()
+if os.getenv("STORAGE_INITIALIZER_PROGRESS_LOGGING") == "1":
+ progress_thread = threading.Thread(
+ target=_log_download_progress, args=(dest_paths,), daemon=True
+ )
+ progress_thread.start()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/storage-initializer/scripts/initializer-entrypoint` around lines 27 -
44, The recursive progress scanner _log_download_progress is always started
(progress_thread using dest_paths), causing du -sh to scan large trees every
10s; gate this behavior behind an opt-in flag (e.g. environment variable or
config like ENABLE_DOWNLOAD_PROGRESS) and only create/start progress_thread when
that flag is truthy. Update the startup path to check the flag before
instantiating threading.Thread(target=_log_download_progress,
args=(dest_paths,), daemon=True) so the scanner never runs in normal runs unless
explicitly enabled.
… issues hf-xet has upstream bugs that cause intermittent model download failures. Pinning hf-xet>=1.5.0 as an explicit dependency in kserve-storage to ensure all packages resolve to a stable version. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Andres Llausas <allausas@redhat.com>
The test-llmisvc kustomize overlay was not patching the inferenceservice-config configmap, causing it to fall back to the base kserve/storage-initializer:latest image instead of the freshly built image from the PR. This meant xet or other storage-initializer fixes were never exercised in the llmisvc CI jobs. Add a minimal configmap patch to the test-llmisvc overlay that sets the storage-initializer image, and update update-test-overlays.sh and manage.kserve-kustomize.sh to replace `latest` with the built TAG — mirroring the pattern already used by the test overlay. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Andres Llausas <allausas@redhat.com>
…Face tests hf-xet 1.5.0 (introduced via kserve-storage dependency) is slower for direct HuggingFace Hub downloads than 1.1.3, causing model downloads via --model_id to consistently exceed the default 600s wait_isvc_ready timeout. Increase per-ISVC readiness timeout to 1200s in test_huggingface.py and test_huggingface_vllm_cpu.py. Also raise GHA step timeouts to match: test-llm (2 workers, 12 tests) to 60 min, test-huggingface-server-vllm (1 worker, 5 tests) to 90 min. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Andres Llausas <allausas@redhat.com>
… HuggingFace tests" This reverts commit 58df838. Signed-off-by: Andres Llausas <allausas@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: James Ostrander <jostrand@redhat.com>
The cherry-pick of the hf-xet 1.5.0 upgrade removed the hf-transfer package block but left behind the huggingface-hub optional-dependencies section referencing it, causing uv to fail with a missing source error. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: James Ostrander <jostrand@redhat.com>
|
/retest |
Signed-off-by: James Ostrander <jostrand@redhat.com>
|
/retest |
|
@jlost: The following test has Failed: OCI Artifact Browser URLInspecting Test Artifacts ManuallyTo inspect your test artifacts manually, follow these steps:
mkdir -p oras-artifacts
cd oras-artifacts
oras pull quay.io/opendatahub/odh-ci-artifacts:kserve-group-test-fp5gp |
Summary
collect_pod_logs()totest/e2e/llmisvc/diagnostic.pythat lists all pods matching the LLMInferenceService labels and dumps current (and previous, when restarted) logs for every init container and regular container, with a 200-line tail._collect_diagnosticsintest_llm_inference_service.py, which runs on any test failure before the service is deleted -- so the storage-initializer and vLLM container logs are captured while the pod still exists.Motivation
Recent CI runs have shown flaky failures in
workload-single-cpuandworkload-pd-cputest cases where the pod's storage-initializer and/or main container logs are absent from must-gather (because the pods are deleted 20+ minutes beforeoc adm inspectruns). This PR ensures the logs are emitted directly to the build log at the moment of failure.Test plan
pull-ci-opendatahub-io-kserve-master-e2e-llm-inference-servicejob; on next flaky failure, verify storage-initializer and main container logs appear in build-log.txt.Made with Cursor
Summary by CodeRabbit
New Features
Improvements
Documentation