Skip to content

test(llmisvc): collect pod and init-container logs on failure#1495

Closed
jlost wants to merge 41 commits into
opendatahub-io:masterfrom
jlost:debug/collect-storage-init-logs
Closed

test(llmisvc): collect pod and init-container logs on failure#1495
jlost wants to merge 41 commits into
opendatahub-io:masterfrom
jlost:debug/collect-storage-init-logs

Conversation

@jlost

@jlost jlost commented May 12, 2026

Copy link
Copy Markdown

Summary

  • Adds collect_pod_logs() to test/e2e/llmisvc/diagnostic.py that lists all pods matching the LLMInferenceService labels and dumps current (and previous, when restarted) logs for every init container and regular container, with a 200-line tail.
  • Calls it from _collect_diagnostics in test_llm_inference_service.py, which runs on any test failure before the service is deleted -- so the storage-initializer and vLLM container logs are captured while the pod still exists.

Motivation

Recent CI runs have shown flaky failures in workload-single-cpu and workload-pd-cpu test cases where the pod's storage-initializer and/or main container logs are absent from must-gather (because the pods are deleted 20+ minutes before oc adm inspect runs). This PR ensures the logs are emitted directly to the build log at the moment of failure.

Test plan

  • Trigger the pull-ci-opendatahub-io-kserve-master-e2e-llm-inference-service job; on next flaky failure, verify storage-initializer and main container logs appear in build-log.txt.

Made with Cursor

Summary by CodeRabbit

  • New Features

    • OpenShift-focused Makefile targets to install, build images, run, reset, and tear down E2E tests
    • Manual KServe install option and operator installer flow for multiple operator types
    • Pod log collection integrated into E2E diagnostics; HTTP retry helper for test requests
    • New E2E manifests for SeaweedFS S3 test backend and secrets
  • Improvements

    • Streamlined E2E setup/teardown and clearer CI deployment modes
    • More robust image build/tag/push flow and deployment readiness checks
    • Added model URI configurability and safer name sanitization for K8s resources
  • Documentation

    • FVT guide updated with Makefile quick start and configuration table

jlost and others added 25 commits May 11, 2026 14:44
- Override BASE_IMG to UBI in Makefile.overrides.mk so `make
  docker-build-*` uses the same base image the Dockerfiles expect
  (fixes dnf-not-found on Debian slim)
- Deploy SeaweedFS separately in ODH operator mode since the operator
  doesn't include it, but the E2E tests need a local S3 backend

Made-with: Cursor
Signed-off-by: James Ostrander <jostrand@redhat.com>
Extract operator installation logic into install-operator.sh (shared by
deploy.odh.sh and VS Code tasks) and DSCI/DSC application into
apply-dsci-dsc.sh. Enhance the setup-e2e-ocp Makefile target to accept
OPERATOR_TYPE, OPERATOR_VERSION, and CATALOG_SOURCE, translating them
into INSTALL_ODH_OPERATOR for setup-e2e-tests.sh.

MIRROR_IMAGES is now auto-derived (true when OPERATOR_TYPE=rhoai and
CATALOG_SOURCE is an FBC fragment image).

setup-e2e-tests.sh is unchanged -- backward compatible with existing CI.

Made-with: Cursor
Signed-off-by: James Ostrander <jostrand@redhat.com>
- Add `build-images-ocp` and `reset-e2e-ocp` Makefile targets
- Remove deploy.serverless.sh and deploy.ossm.sh (dead code)
- Remove ServiceMeshMember cleanup from teardown-ci-namespace.sh
- Remove serverless/servicemesh references from version.sh
- Drop unused DEPLOYMENT_TYPE arg from setup-ci-namespace.sh
- Update fvt-tests.md to document Makefile-based E2E workflow

Made-with: Cursor
Signed-off-by: James Ostrander <jostrand@redhat.com>
- deploy.odh.sh: use OPERATOR_NAME for CSV lookup and dynamic pod
  selector from deployment matchLabels (was hardcoded to ODH)
- copy-kserve-manifests-to-pvc.sh: derive operator deployment name
  from OPERATOR_TYPE instead of hardcoding ODH label
- setup-e2e-tests.sh: set KSERVE_NAMESPACE per operator type
  (redhat-ods-applications for RHOAI, opendatahub for ODH);
  skip DSCI apply when one already exists (RHOAI auto-creates it)
- Makefile.overrides.mk: forward COPY_PR_MANIFESTS to setup script;
  add $(strip) to all forwarded variables for VSCode space defaults

Made-with: Cursor
Signed-off-by: James Ostrander <jostrand@redhat.com>
- Replace the deletion of the S3 init job with a forceful replacement to ensure the job is updated correctly.
- This change improves the reliability of the S3 initialization process during E2E tests.

Made-with: Cursor
Signed-off-by: James Ostrander <jostrand@redhat.com>
- Removed unused INSTALL_ODH_OPERATOR variable and refactored setup-e2e-tests.sh to derive installation mode from OPERATOR_TYPE.
- Updated teardown-e2e-setup.sh to support both raw and operator-based deployments, ensuring a more robust cleanup process.
- Enhanced install-operator.sh to dynamically handle channel detection and CSV validation, improving operator installation reliability.
- Adjusted deploy.odh.sh to remove the CHANNEL_OVERRIDE variable, simplifying the script.

Made-with: Cursor
Signed-off-by: James Ostrander <jostrand@redhat.com>
When CATALOG_SOURCE is an FBC fragment image (e.g.
quay.io/rhoai/rhoai-fbc-fragment:rhoai-3.5-ea.1) and OPERATOR_VERSION
is not explicitly set, extract the major.minor version hint from the
image tag and use it to find the matching channel/CSV instead of
silently falling back to the default channel.

Also:
- Error on explicit OPERATOR_VERSION mismatch instead of silent fallback
- Clean up stale subscription after auto-detect sets OPERATOR_VERSION
- Fix teardown ordering: CI namespace first, webhooks before namespaces
- Strip finalizers on stuck resources during CI namespace teardown
- Trim redundant comments from teardown script
- Prevent e2e-ocp from re-running setup (SETUP_E2E=false)

Signed-off-by: James Ostrander <jostrand@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
- Use .status.installPlanRef.name instead of deprecated .status.installplan.name
- Resolve CSV from Subscription.status.installedCSV instead of grep|head
- Enforce set -euo pipefail unconditionally in install-operator.sh
  (guard pipefail-sensitive pipelines with || true)

Signed-off-by: James Ostrander <jostrand@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
query_catalog_info only checked currentCSV (channel head), so pinning
an older version like ODH 3.3.0 or RHOAI 3.4.0 failed even though
the version existed in the channel entries. Now searches entries when
the head doesn't match.

Also:
- Add clean-setup-e2e-ocp Makefile target (teardown + setup)
- Add help text to teardown-e2e-ocp target

Signed-off-by: James Ostrander <jostrand@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…ardown

Replace hardcoded "openshift-operators" with configurable OPERATOR_NAMESPACE
variable (defaulting to openshift-operators) across install, deploy, teardown,
and manifest copy scripts. Add install plan cleanup to teardown to prevent
stale plans from interfering with subsequent installs.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: James Ostrander <jostrand@redhat.com>
- Move OpenShift E2E variables and targets into Makefile.e2e-ocp.mk,
  included from Makefile.overrides.mk.
- Always use Manual install plan approval to avoid race conditions.
- Wait for RHOAI auto-created DSCI instead of racing to apply our own.
- Delete KServe CRDs on teardown to avoid storedVersions conflicts on
  version downgrades.

Signed-off-by: James Ostrander <jostrand@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
- config/overlays/odh-test/dsc.yaml: change defaultDeploymentMode from
  Serverless to RawDeployment. Serverless is incompatible with
  serving.managementState: Removed (no Knative), causing RHOAI 2.x installs
  to leave the CSV stuck in Pending. All code paths that used this file
  either immediately patched it to RawDeployment (manual mode) or silently
  broke on older operators.

- teardown-e2e-setup.sh: also delete *.opendatahub.io CRDs on teardown.
  The previous teardown only removed *.serving.kserve.io CRDs, leaving ODH
  platform CRDs intact. On a version downgrade (e.g. RHOAI 3.x -> 2.25),
  those CRDs carry storedVersions from the newer API (e.g. v2) that the
  older operator's install plan refuses to overwrite, failing with
  "risk of data loss". Deleting them lets the next install recreate them
  cleanly.

- Remove apply-dsci-dsc.sh: unused script with no callers; its KUEUE_STATE
  envsubst logic was never wired into dsc.yaml.

Signed-off-by: James Ostrander <jostrand@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
On HyperShift and ROSA HCP clusters there is no 'master'
MachineConfigPool -- the ImageDigestMirrorSet takes effect without an
MCP rollout. The previous code would spin for 300s waiting for an MCP
that never exists.

Guard the wait with a check for the MCP's existence so the mirroring
path works on both CRC/ROSA Classic (MCP present) and HyperShift/ROSA
HCP (no MCP).

Signed-off-by: James Ostrander <jostrand@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
- Updated Makefile.e2e-ocp.mk to include RUNNING_LOCAL and KSERVE_NAMESPACE variables for better local testing support.
- Modified deploy.odh.sh to derive KSERVE_NAMESPACE based on OPERATOR_TYPE and improved logging for operator installation.
- Enhanced install-operator.sh to wait for ODH/RHOAI CRDs to be established before proceeding.
- Refined run-e2e-tests.sh to export RUNNING_LOCAL and set KSERVE_NAMESPACE based on the operator type.
- Cleaned up setup-e2e-tests.sh by removing redundant PR manifest handling logic.

This update improves the flexibility and reliability of the E2E testing process across different operator types.

Signed-off-by: James Ostrander <jostrand@redhat.com>
…sting

- Added Makefile.ocp.mk to define targets for building KServe images, setting up KServe on OpenShift, and running E2E tests.
- Updated Makefile.overrides.mk to include the new Makefile.ocp.mk for better organization.
- Enhanced image handling in copy-kserve-manifests-to-pvc.sh to prefer locally-built images when QUAY_REPO is set.
- Created deploy.kserve-manual.sh for manual KServe installation via kustomize, allowing for more flexible deployment options.
- Refined setup-e2e-tests.sh and setup-kserve.sh to streamline the installation process and improve environment setup.

This update enhances the deployment and testing capabilities for KServe on OpenShift, providing a more robust framework for developers.

Signed-off-by: James Ostrander <jostrand@redhat.com>
…annel detection

- Introduced test-detect-channel.sh to validate the detect_channel() function without requiring a cluster, covering various scenarios for version detection and channel resolution.
- Added test-install-operator.sh to test the install_operator() function, simulating the full installation pipeline and ensuring correct Subscription YAML generation.
- Both scripts utilize mock functions to simulate OpenShift CLI behavior, enhancing the reliability of the tests.

These additions improve the testing framework for operator installation and channel detection, ensuring better coverage and reliability in the deployment process.

Signed-off-by: James Ostrander <jostrand@redhat.com>
The comment incorrectly stated CI mode uses Automatic install-plan
approval. The script always uses Manual approval and explicitly
approves the generated InstallPlan in wait_for_operator_ready().

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: James Ostrander <jostrand@redhat.com>
- install-operator.sh: fix header comment -- approval mode is always
  Manual (script explicitly approves the InstallPlan); correct CI mode
  description to say startingCSV is omitted rather than Automatic approval
- teardown-e2e-setup.sh: add WARNING comment on ODH CRD deletion block
  clarifying this is intentional but requires a dedicated test cluster
- setup-kserve.sh: remove redundant DSC RawDeployment patch (deploy.kserve-
  manual.sh already applies dsc.yaml which sets RawDeployment); fix hardcoded
  'namespace: opendatahub' in ODH MC kustomization override -- use
  KSERVE_NAMESPACE to stay consistent with the base kustomization.yaml
- copy-kserve-manifests-to-pvc.sh, deploy.odh.sh: replace python3 -c JSON
  parsing for pod selector extraction with oc go-template, matching the
  pattern already established in version.sh

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: James Ostrander <jostrand@redhat.com>
- Makefile.ocp.mk: clarify COPY_PR_MANIFESTS comment -- manual mode always
  uses local branch overlays, so the flag is inapplicable there
- install-operator.sh: remove unused _INSTALL_OPERATOR_SOURCED sentinel

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: James Ostrander <jostrand@redhat.com>
- Correct Makefile reference: targets are defined in Makefile.ocp.mk
  (included by Makefile.overrides.mk), not directly in Makefile.overrides.mk
- Remove spurious E2E_MARKER from teardown-e2e-ocp example; the target
  does not use that variable

Signed-off-by: James Ostrander <jostrand@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
The RHOAI operator path has not been verified with this setup. Replace
the pinned-version example with an equivalent ODH FBC fragment example.

Signed-off-by: James Ostrander <jostrand@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
- install-operator.sh help text: swap quay.io/rhoai/rhoai-fbc-fragment
  (private, auth-gated) with a single quay.io/opendatahub/opendatahub-operator-catalog
  example (public, verified pullable)
- fvt-tests.md: same substitution for the pinned-catalog example

The rhoai-fbc-fragment references in test-install-operator.sh and
test-detect-channel.sh are intentional -- they test rhoai-specific code
paths and never pull the image (query_catalog_info is mocked). Case G
in test-install-operator.sh confirms the no-FBC RHOAI path works using
the default redhat-operators catalog.

Signed-off-by: James Ostrander <jostrand@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
- SC2295: quote inner expansion in CSV_VERSION prefix-strip to prevent
  OPERATOR_NAME from being interpreted as a glob pattern
- stale job sweep: match only Failed state (drop || $2 == "0/1" which
  incorrectly matched in-progress jobs), remove head -5 cap
- bash -c payload: switch to single-quoted string and pass
  OPERATOR_NAMESPACE/CONTROLLER_DEPLOYMENT as positional args ($1/$2)
  to fix broken quoting caused by double-quotes inside double-quoted string

Signed-off-by: James Ostrander <jostrand@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
- build-kserve-images.sh: validate BUILDER against docker|podman whitelist
  before invocation to prevent command injection (CWE-78)
- copy-kserve-manifests-to-pvc.sh: fail fast when pod selector is empty
  to prevent silent use of a wrong pod when deployment lookup fails
- teardown-e2e-setup.sh: strip finalizers from stuck KServe/ODH resources
  when application namespaces remain in Terminating after initial delete,
  matching the pattern already used in teardown-ci-namespace.sh

Signed-off-by: James Ostrander <jostrand@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: James Ostrander <jostrand@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@openshift-ci

openshift-ci Bot commented May 12, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlost

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jlost

jlost commented May 12, 2026

Copy link
Copy Markdown
Author

/hold

bartoszmajsak and others added 2 commits May 12, 2026 09:38
…erage (kserve#5187)

Signed-off-by: Bartosz Majsak <bartosz.majsak@gmail.com>
…isibility

pytest-xdist always captures worker stdout even with -s, so print() calls
in _collect_diagnostics were only visible in the final "Captured stdout
call" section - not in the real-time tee log and truncated in large runs.

Switch _collect_diagnostics, diagnostic.py helpers, and the except-block
error message from print() to logger.info/error. Log records are forwarded
in real-time by xdist via --log-cli, so they appear immediately in the
live log stream and the tee'd log file.

Also make wait_timeout configurable via LLMISVC_WAIT_TIMEOUT env var
(default: 900s) so local dev iterations don't need a 15-minute wait.

Signed-off-by: James Ostrander <jostrand@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/llmisvc/diagnostic.py`:
- Around line 25-27: The function print_all_events_table currently ignores the
max_events parameter and logs all events; update the logic in
print_all_events_table to enforce max_events by limiting the events
iterated/logged (e.g., slice the events list or stop after emitting max_events
entries) before calling the logging callable (log / _log.info), and ensure any
summary or count reflects the applied limit so diagnostics remain bounded and
performant.
- Around line 192-194: The current exception handler logs container output
verbatim via log("%s", logs or "(empty)"), which can leak sensitive tokens;
change this to redact secrets before emitting. Add or call a helper (e.g.,
redact_sensitive_tokens(logs)) that applies regex replacements to mask
Authorization/Bearer headers, "sig|se|sasl|token" query params in signed URLs,
AWS/Google keys and long base64-like strings, then call log("%s", redacted_logs
or "(empty)") inside the except block (keep the same log call and variable
names). Ensure the helper is used wherever raw container logs are printed in
this module so all usages of log(..., logs) are protected.

In `@test/e2e/llmisvc/test_llm_inference_service.py`:
- Around line 531-533: The except block in maybe_delete_llmisvc currently
swallows all exceptions and only prints a warning; update it to surface failures
instead of hiding them by logging the error (including service_name and
exception details) and then re-raising the exception (or calling pytest.fail) so
cleanup failures fail the test and do not leak resources; target the except
Exception as e block that references service_name and ensure the error is not
silently ignored.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 6137b1f9-45bc-46f5-bae4-89c1459ddba8

📥 Commits

Reviewing files that changed from the base of the PR and between 6d8bb60 and 63f92cc.

📒 Files selected for processing (6)
  • test/e2e/common/http_retry.py
  • test/e2e/llmisvc/diagnostic.py
  • test/e2e/llmisvc/fixtures.py
  • test/e2e/llmisvc/test_llm_inference_service.py
  • test/e2e/llmisvc/test_llm_inference_service_stop.py
  • test/e2e/pytest.ini

Comment on lines +25 to +27
def print_all_events_table(
namespace: str, max_events: int = 50, log: Callable = _log.info
):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Apply max_events limit; it is currently ignored.

Line 26 advertises a bounded event dump, but Lines 34-45 log every event. This makes diagnostics noisy and can degrade CI performance.

Proposed fix
-        events = core.list_namespaced_event(namespace=namespace).items
+        events = core.list_namespaced_event(namespace=namespace).items
+        events = sorted(
+            events,
+            key=lambda ev: ev.last_timestamp or ev.first_timestamp or datetime.min,
+            reverse=True,
+        )[:max_events]

As per coding guidelines, **: REVIEW PRIORITIES: 4. Performance problems.

Also applies to: 34-45

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/llmisvc/diagnostic.py` around lines 25 - 27, The function
print_all_events_table currently ignores the max_events parameter and logs all
events; update the logic in print_all_events_table to enforce max_events by
limiting the events iterated/logged (e.g., slice the events list or stop after
emitting max_events entries) before calling the logging callable (log /
_log.info), and ensure any summary or count reflects the applied limit so
diagnostics remain bounded and performant.

Comment on lines +192 to +194
log("# -- logs (%s) --", label)
log("%s", logs or "(empty)")
except Exception as e:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Redact sensitive tokens before emitting raw container logs (CWE-532 / CWE-200).

Line 193 writes container logs verbatim to CI output; this can expose bearer tokens, credentials, or signed URLs in build artifacts.

Proposed fix
+_REDACTION_PATTERNS = [
+    re.compile(r"(?i)(authorization:\s*bearer\s+)[^\s]+"),
+    re.compile(r"(?i)(token=)[^&\s]+"),
+    re.compile(r"(?i)(password=)[^&\s]+"),
+]
+
+def _redact_sensitive(text: str) -> str:
+    redacted = text
+    for pattern in _REDACTION_PATTERNS:
+        redacted = pattern.sub(r"\1***REDACTED***", redacted)
+    return redacted
+
 def _emit_container_logs(
@@
-            log("%s", logs or "(empty)")
+            log("%s", _redact_sensitive(logs or "(empty)"))

As per coding guidelines, **: REVIEW PRIORITIES: 1. Security vulnerabilities.

🧰 Tools
🪛 Ruff (0.15.12)

[warning] 194-194: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/llmisvc/diagnostic.py` around lines 192 - 194, The current exception
handler logs container output verbatim via log("%s", logs or "(empty)"), which
can leak sensitive tokens; change this to redact secrets before emitting. Add or
call a helper (e.g., redact_sensitive_tokens(logs)) that applies regex
replacements to mask Authorization/Bearer headers, "sig|se|sasl|token" query
params in signed URLs, AWS/Google keys and long base64-like strings, then call
log("%s", redacted_logs or "(empty)") inside the except block (keep the same log
call and variable names). Ensure the helper is used wherever raw container logs
are printed in this module so all usages of log(..., logs) are protected.

Comment on lines +531 to +533
except Exception as e:
print(f"⚠️ Warning: Failed to cleanup service {service_name}: {e}")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not silently swallow cleanup failures in maybe_delete_llmisvc.

Line 531 catches all exceptions and only prints a warning, which can hide leaked resources and cascade into later test failures.

Proposed fix
-    except Exception as e:
-        print(f"⚠️ Warning: Failed to cleanup service {service_name}: {e}")
+    except client.rest.ApiException as e:
+        logger.exception(
+            "Failed to cleanup service %s (status=%s): %s",
+            service_name,
+            e.status,
+            e,
+        )
+        if not test_failed and e.status != 404:
+            raise
+    except Exception as e:
+        logger.exception("Unexpected cleanup failure for service %s: %s", service_name, e)
+        if not test_failed:
+            raise

As per coding guidelines, **: REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps.

🧰 Tools
🪛 Ruff (0.15.12)

[warning] 531-531: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/llmisvc/test_llm_inference_service.py` around lines 531 - 533, The
except block in maybe_delete_llmisvc currently swallows all exceptions and only
prints a warning; update it to surface failures instead of hiding them by
logging the error (including service_name and exception details) and then
re-raising the exception (or calling pytest.fail) so cleanup failures fail the
test and do not leak resources; target the except Exception as e block that
references service_name and ensure the error is not silently ignored.

jlost and others added 2 commits May 12, 2026 12:54
- Changed the default value of KSERVE_NAMESPACE from 'opendatahub' to 'kserve' to align with the updated deployment configuration.

Signed-off-by: James Ostrander <jostrand@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: James Ostrander <jostrand@redhat.com>
…ce downloads

Pre-download facebook/opt-125m once during cluster setup via a storage-initializer
init container in the s3-init job, then sync to s3://example-models/facebook/opt-125m/.

Add seaweedfs-s3-creds secret with S3 endpoint annotations and link it to the default
service account in kserve-ci-e2e-test, so the LLMISVC storage-initializer can
authenticate to SeaweedFS for s3:// model URIs.

Change all hf://facebook/opt-125m references in llmisvc e2e tests (13 total across
fixtures.py and 3 test files) to s3://example-models/facebook/opt-125m.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: James Ostrander <jostrand@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@jlost jlost force-pushed the debug/collect-storage-init-logs branch from 7eb535a to e49ac26 Compare May 12, 2026 18:54

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
test/scripts/openshift-ci/tls/setup-s3-tls.sh (2)

76-84: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

CWE-78: Unquoted variables enable command injection.

Variables $DEPLOYMENT_NAME and $SECRET_NAME are unquoted in command contexts. If these contain spaces or shell metacharacters, command parsing fails or enables injection attacks.

🛡️ Proposed fix: Quote all variable expansions
-if oc get deployment $DEPLOYMENT_NAME -n kserve > /dev/null 2>&1; then
+if oc get deployment "$DEPLOYMENT_NAME" -n kserve > /dev/null 2>&1; then
     echo "Deleting existing $DEPLOYMENT_NAME deployment"
-    oc delete deployment $DEPLOYMENT_NAME -n kserve --ignore-not-found || true
+    oc delete deployment "$DEPLOYMENT_NAME" -n kserve --ignore-not-found || true
 fi
 # Delete TLS secret if it exists (only for custom cert - serving cert is managed by OpenShift)
 if [[ "$CERT_TYPE" == "custom" ]]; then
-    if oc get secret $SECRET_NAME -n kserve > /dev/null 2>&1; then
+    if oc get secret "$SECRET_NAME" -n kserve > /dev/null 2>&1; then
         echo "Deleting existing $SECRET_NAME secret"
-        oc delete secret $SECRET_NAME -n kserve --ignore-not-found || true
+        oc delete secret "$SECRET_NAME" -n kserve --ignore-not-found || true
     fi
 fi

Apply similar quoting to lines 88-90, 94, 99, 108-109, 115-117, 121, 127, 135.

As per coding guidelines: SHELL SCRIPT SECURITY (CWE-78) - Quote all variables to prevent injection.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/scripts/openshift-ci/tls/setup-s3-tls.sh` around lines 76 - 84, The
shell commands using unquoted variables DEPLOYMENT_NAME and SECRET_NAME (e.g.,
in the oc get deployment/oc delete deployment and oc get secret/oc delete secret
calls) allow command injection and break on spaces; update each invocation to
quote the variable expansions (e.g., use "$DEPLOYMENT_NAME" and "$SECRET_NAME")
and apply the same quoting pattern to all similar occurrences in this script
(including the other oc get/delete lines mentioned in the review) so every
command argument that uses these variables is wrapped in double quotes.

58-58: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

CWE-494: Download and execute code without integrity verification.

This downloads and pipes an installer script directly to bash without verifying its signature or checksum. If the kubernetes-sigs repository is compromised or a MITM attack occurs, arbitrary code executes with the script's privileges.

🔒 Proposed fix: Verify checksum or use pinned version
-  curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash -s -- 5.0.1 $HOME/.local/bin
+  # Download installer, verify checksum, then execute
+  curl -sSL -o /tmp/install_kustomize.sh "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh"
+  echo "EXPECTED_CHECKSUM_HERE /tmp/install_kustomize.sh" | sha256sum -c -
+  bash /tmp/install_kustomize.sh 5.0.1 "$HOME/.local/bin"
+  rm /tmp/install_kustomize.sh

Alternatively, download kustomize binary directly from GitHub releases with checksum verification.

As per coding guidelines: SHELL SCRIPT SECURITY (CWE-78) - Avoid eval and dynamic command execution.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/scripts/openshift-ci/tls/setup-s3-tls.sh` at line 58, The script
currently pipes the remote installer into bash via the command "curl -s
\"https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh\"
| bash -s -- 5.0.1 $HOME/.local/bin", which is CWE-494; instead, download a
pinned kustomize release and its published checksum (or the installer script and
its checksum) using curl -LO, verify the checksum (sha256/sha512) locally with
sha256sum/sha512sum (or gpg signature if available), and only then run the
installer locally (or extract the verified binary to $HOME/.local/bin); ensure
the script records the pinned version (5.0.1) and fails fast if checksum
verification fails.
test/scripts/openshift-ci/setup-e2e-tests.sh (1)

39-55: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate $1 before the first nounset access.

set -o nounset is active, and $1 is dereferenced at lines 50, 97, 145, and 152. Invoking this script without an argument fails with a generic $1: unbound variable from whatever the first reachable branch happens to be, rather than a clear usage error. Worse, since line 50 is inside the RUNNING_LOCAL=true block, missing-arg behavior differs by environment, which complicates CI triage.

Validate the deploy-mode argument once at the top:

🛡️ Proposed guard
 source "$SCRIPT_DIR/common.sh"
 source "$SCRIPT_DIR/version.sh"

+if [[ $# -lt 1 || -z "${1:-}" ]]; then
+  echo "Usage: $(basename "$0") <deploy-mode>" >&2
+  exit 2
+fi
+DEPLOY_MODE="$1"
+
 # Always print cluster / operator snapshot on exit (success or failure) for CI triage.
 trap print_e2e_environment_summary EXIT

Then reference "$DEPLOY_MODE" instead of "$1" at lines 50, 97, 145, 152 for readability.

Also applies to: 145-152

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/scripts/openshift-ci/setup-e2e-tests.sh` around lines 39 - 55, Add an
explicit guard that validates the positional deploy-mode argument before any
`set -o nounset`-sensitive access: read `$1` into a new variable (e.g.,
DEPLOY_MODE) at the top of the script, check it’s non-empty and matches expected
values (or print a clear usage error and exit), and then replace every direct
`$1` reference in the script with `"$DEPLOY_MODE"` (notably the checks inside
the RUNNING_LOCAL block and the conditions using
BUILD_GRAPH_IMAGES/BUILD_KSERVE_IMAGES). This ensures the script never
dereferences an unbound `$1` and provides a clear error message on
missing/invalid deploy-mode while preserving existing logic around
RUNNING_LOCAL, BUILD_KSERVE_IMAGES, and BUILD_GRAPH_IMAGES.
🧹 Nitpick comments (1)
test/scripts/openshift-ci/setup-e2e-tests.sh (1)

145-145: 💤 Low value

Quoted RHS of =~ performs literal match, not regex.

In Bash 3.2+, quoting the right-hand side of =~ forces a literal substring comparison rather than ERE matching. The expression here happens to work because kserve_on_openshift has no regex metacharacters, but the form is misleading. If a substring/equality check is intended, prefer an explicit comparison:

-if [[ "$1" =~ "kserve_on_openshift" ]]; then
+if [[ "$1" == "kserve_on_openshift" ]]; then

Use == if $1 is expected to equal that literal, or == *kserve_on_openshift* if a substring match is genuinely wanted.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/scripts/openshift-ci/setup-e2e-tests.sh` at line 145, The conditional
currently uses the regex-match operator with a quoted right-hand side (the test
against the first positional arg for kserve_on_openshift), which forces a
literal match; change that test to an explicit string comparison: use equality
(when you expect $1 to exactly equal kserve_on_openshift) or use a glob-style
equality (when you want to detect kserve_on_openshift as a substring) instead of
the quoted =~ form so the intent is clear and portable.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@config/overlays/test/s3-local-backend/seaweedfs-init-job.yaml`:
- Around line 38-39: Replace the insecure --no-verify-ssl usage in the aws s3
cp/sync commands by either switching the S3 endpoint to plain HTTP for non-TLS
local S3 (ensure AWS_ENDPOINT_URL_S3 is set to http://... instead of https://)
or by enabling proper TLS verification: remove the --no-verify-ssl flags and add
an AWS_CA_BUNDLE environment variable (e.g., AWS_CA_BUNDLE pointing to the CA
bundle path) in the job spec so the aws CLI can verify certificates; update the
aws s3 cp and aws s3 sync invocations accordingly and confirm the env var name
AWS_ENDPOINT_URL_S3 is configured consistently.

In `@config/overlays/test/s3-local-backend/seaweedfs-s3-creds-secret.yaml`:
- Around line 10-12: The Secret file seaweedfs-s3-creds-secret.yaml currently
hardcodes AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; remove those static
values and instead generate credentials at CI/test setup (e.g., in
setup-ci-namespace.sh) and create the k8s Secret dynamically (or use
SealedSecrets/ExternalSecrets) so secrets never live in repo history; ensure the
setup script outputs/exports the generated values and applies them to a Secret
with the same name so SeaweedFS reads AWS_ACCESS_KEY_ID and
AWS_SECRET_ACCESS_KEY from the runtime Secret.

In `@test/scripts/openshift-ci/setup-e2e-tests.sh`:
- Around line 82-91: The script contains unquoted variable expansions (notably
PROJECT_ROOT, SCRIPT_DIR and KSERVE_NAMESPACE) causing potential
word-splitting/globbing; update all occurrences to use quoted expansions (e.g.
"$PROJECT_ROOT", "$SCRIPT_DIR", "${KSERVE_NAMESPACE}") including in commands
like make -C "$PROJECT_ROOT", pushd "$PROJECT_ROOT", export
PATH="${PROJECT_ROOT}/bin:${PATH}", and any other uses in the same file (around
the referenced ranges 82–91, 114–139, 152). Ensure every instance of these
variables is wrapped in double quotes so pushd/popd, make, export and other
shell commands receive a single, safe argument.

---

Outside diff comments:
In `@test/scripts/openshift-ci/setup-e2e-tests.sh`:
- Around line 39-55: Add an explicit guard that validates the positional
deploy-mode argument before any `set -o nounset`-sensitive access: read `$1`
into a new variable (e.g., DEPLOY_MODE) at the top of the script, check it’s
non-empty and matches expected values (or print a clear usage error and exit),
and then replace every direct `$1` reference in the script with `"$DEPLOY_MODE"`
(notably the checks inside the RUNNING_LOCAL block and the conditions using
BUILD_GRAPH_IMAGES/BUILD_KSERVE_IMAGES). This ensures the script never
dereferences an unbound `$1` and provides a clear error message on
missing/invalid deploy-mode while preserving existing logic around
RUNNING_LOCAL, BUILD_KSERVE_IMAGES, and BUILD_GRAPH_IMAGES.

In `@test/scripts/openshift-ci/tls/setup-s3-tls.sh`:
- Around line 76-84: The shell commands using unquoted variables DEPLOYMENT_NAME
and SECRET_NAME (e.g., in the oc get deployment/oc delete deployment and oc get
secret/oc delete secret calls) allow command injection and break on spaces;
update each invocation to quote the variable expansions (e.g., use
"$DEPLOYMENT_NAME" and "$SECRET_NAME") and apply the same quoting pattern to all
similar occurrences in this script (including the other oc get/delete lines
mentioned in the review) so every command argument that uses these variables is
wrapped in double quotes.
- Line 58: The script currently pipes the remote installer into bash via the
command "curl -s
\"https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh\"
| bash -s -- 5.0.1 $HOME/.local/bin", which is CWE-494; instead, download a
pinned kustomize release and its published checksum (or the installer script and
its checksum) using curl -LO, verify the checksum (sha256/sha512) locally with
sha256sum/sha512sum (or gpg signature if available), and only then run the
installer locally (or extract the verified binary to $HOME/.local/bin); ensure
the script records the pinned version (5.0.1) and fails fast if checksum
verification fails.

---

Nitpick comments:
In `@test/scripts/openshift-ci/setup-e2e-tests.sh`:
- Line 145: The conditional currently uses the regex-match operator with a
quoted right-hand side (the test against the first positional arg for
kserve_on_openshift), which forces a literal match; change that test to an
explicit string comparison: use equality (when you expect $1 to exactly equal
kserve_on_openshift) or use a glob-style equality (when you want to detect
kserve_on_openshift as a substring) instead of the quoted =~ form so the intent
is clear and portable.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 1f0225bd-38f7-4b4c-86d3-c330bbc49d67

📥 Commits

Reviewing files that changed from the base of the PR and between 63f92cc and e49ac26.

📒 Files selected for processing (10)
  • config/overlays/test/s3-local-backend/seaweedfs-init-job.yaml
  • config/overlays/test/s3-local-backend/seaweedfs-s3-creds-secret.yaml
  • test/e2e/llmisvc/fixtures.py
  • test/e2e/llmisvc/test_llm_inference_service_conversion.py
  • test/e2e/llmisvc/test_pod_watch.py
  • test/e2e/llmisvc/test_storage_version_migration.py
  • test/scripts/openshift-ci/run-e2e-tests.sh
  • test/scripts/openshift-ci/setup-ci-namespace.sh
  • test/scripts/openshift-ci/setup-e2e-tests.sh
  • test/scripts/openshift-ci/tls/setup-s3-tls.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/llmisvc/fixtures.py

Comment on lines +38 to +39
aws s3 cp /tmp/sklearn-model.joblib s3://example-models/sklearn/model.joblib --no-verify-ssl
aws s3 sync /shared/facebook/opt-125m s3://example-models/facebook/opt-125m/ --no-verify-ssl

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

CWE-295: SSL certificate verification disabled.

The --no-verify-ssl flag disables certificate validation for S3 connections, enabling man-in-the-middle attacks. An attacker on the network can intercept traffic, inject malicious model files, or steal credentials.

Even in test environments, this is risky:

  1. Malicious model artifacts can exploit deserialization vulnerabilities
  2. Test networks may not be fully isolated
  3. Code patterns propagate to production
🔐 Proposed fix: Use proper TLS with CA bundle

For non-TLS local S3 (SeaweedFS), use HTTP explicitly instead of disabling SSL verification:

-          aws s3 cp /tmp/sklearn-model.joblib s3://example-models/sklearn/model.joblib --no-verify-ssl
-          aws s3 sync /shared/facebook/opt-125m s3://example-models/facebook/opt-125m/ --no-verify-ssl
+          aws s3 cp /tmp/sklearn-model.joblib s3://example-models/sklearn/model.joblib
+          aws s3 sync /shared/facebook/opt-125m s3://example-models/facebook/opt-125m/

And ensure AWS_ENDPOINT_URL_S3 (line 54) uses http:// not https:// when TLS is not configured. For TLS-enabled S3, provide the CA certificate:

env:
- name: AWS_CA_BUNDLE
  value: /etc/ssl/certs/ca-bundle.crt

As per coding guidelines: KUBERNETES CONFIG SECURITY - SecurityContext configured with appropriate restrictions.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
aws s3 cp /tmp/sklearn-model.joblib s3://example-models/sklearn/model.joblib --no-verify-ssl
aws s3 sync /shared/facebook/opt-125m s3://example-models/facebook/opt-125m/ --no-verify-ssl
aws s3 cp /tmp/sklearn-model.joblib s3://example-models/sklearn/model.joblib
aws s3 sync /shared/facebook/opt-125m s3://example-models/facebook/opt-125m/
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@config/overlays/test/s3-local-backend/seaweedfs-init-job.yaml` around lines
38 - 39, Replace the insecure --no-verify-ssl usage in the aws s3 cp/sync
commands by either switching the S3 endpoint to plain HTTP for non-TLS local S3
(ensure AWS_ENDPOINT_URL_S3 is set to http://... instead of https://) or by
enabling proper TLS verification: remove the --no-verify-ssl flags and add an
AWS_CA_BUNDLE environment variable (e.g., AWS_CA_BUNDLE pointing to the CA
bundle path) in the job spec so the aws CLI can verify certificates; update the
aws s3 cp and aws s3 sync invocations accordingly and confirm the env var name
AWS_ENDPOINT_URL_S3 is configured consistently.

Comment on lines +10 to +12
stringData:
AWS_ACCESS_KEY_ID: s3admin
AWS_SECRET_ACCESS_KEY: s3admin123

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

CWE-798: Hardcoded credentials in source control.

Embedding credentials directly in YAML manifests creates multiple security risks:

  1. Credentials persist in git history even if later removed
  2. Anyone with repository access obtains credentials
  3. Establishes an anti-pattern that may be copied to production
  4. Weak credentials (s3admin/s3admin123) are easily guessable

Even in test environments, hardcoded credentials enable unauthorized access if the test S3 endpoint is network-reachable or credentials are reused.

🔐 Proposed fix: Generate credentials during setup

Replace the static Secret with dynamic credential generation in setup-ci-namespace.sh:

-# Apply SeaweedFS S3 credentials secret and link to default SA (used by LLMISVC s3:// model URIs)
-echo "Applying SeaweedFS S3 credentials secret"
-: "${KSERVE_NAMESPACE:=kserve}"
-sed "s/s3-service.kserve/s3-service.${KSERVE_NAMESPACE}/" \
-  "$PROJECT_ROOT/config/overlays/test/s3-local-backend/seaweedfs-s3-creds-secret.yaml" | \
-  oc apply -f - -n "$NAMESPACE"
+# Generate random S3 credentials for this test run
+echo "Generating SeaweedFS S3 credentials secret"
+: "${KSERVE_NAMESPACE:=kserve}"
+S3_ACCESS_KEY="s3-$(openssl rand -hex 16)"
+S3_SECRET_KEY="$(openssl rand -base64 32)"
+cat <<EOF | oc apply -f - -n "$NAMESPACE"
+apiVersion: v1
+kind: Secret
+metadata:
+  name: seaweedfs-s3-creds
+  annotations:
+    serving.kserve.io/s3-endpoint: "s3-service.${KSERVE_NAMESPACE}:8333"
+    serving.kserve.io/s3-usehttps: "0"
+    serving.kserve.io/s3-verifyssl: "0"
+type: Opaque
+stringData:
+  AWS_ACCESS_KEY_ID: "${S3_ACCESS_KEY}"
+  AWS_SECRET_ACCESS_KEY: "${S3_SECRET_KEY}"
+EOF

Then configure SeaweedFS to accept these credentials or use ephemeral token-based auth.

As per coding guidelines: KUBERNETES CONFIG SECURITY - No secrets in plain text (use SealedSecrets or external-secrets).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@config/overlays/test/s3-local-backend/seaweedfs-s3-creds-secret.yaml` around
lines 10 - 12, The Secret file seaweedfs-s3-creds-secret.yaml currently
hardcodes AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; remove those static
values and instead generate credentials at CI/test setup (e.g., in
setup-ci-namespace.sh) and create the k8s Secret dynamically (or use
SealedSecrets/ExternalSecrets) so secrets never live in repo history; ensure the
setup script outputs/exports the generated values and applies them to a Secret
with the same name so SeaweedFS reads AWS_ACCESS_KEY_ID and
AWS_SECRET_ACCESS_KEY from the runtime Secret.

Comment on lines 82 to 91
$PROJECT_ROOT/hack/setup/cli/install-kustomize.sh
make -C "$PROJECT_ROOT" yq
export PATH="${PROJECT_ROOT}/bin:${PATH}"

echo "Installing KServe Python SDK ..."
pushd $PROJECT_ROOT >/dev/null
./test/scripts/gh-actions/setup-uv.sh
# Add bin directory to PATH so uv command is available
export PATH="${PROJECT_ROOT}/bin:${PATH}"
popd
pushd $PROJECT_ROOT/python/kserve >/dev/null

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Quote all variable expansions (CWE-78).

Multiple unquoted expansions of $PROJECT_ROOT, $SCRIPT_DIR, and ${KSERVE_NAMESPACE} remain. Today the values are safe k8s namespace strings and CI-controlled paths, but the project-wide rule is to quote unconditionally so a future repo-clone path containing whitespace, or an externally provided KSERVE_NAMESPACE, does not cause word-splitting/globbing surprises. Shellcheck SC2086 already flags lines 82, 87, 91.

🔒 Suggested quoting (representative diff)
-$PROJECT_ROOT/hack/setup/cli/install-kustomize.sh
+"$PROJECT_ROOT/hack/setup/cli/install-kustomize.sh"
 make -C "$PROJECT_ROOT" yq
 export PATH="${PROJECT_ROOT}/bin:${PATH}"

 echo "Installing KServe Python SDK ..."
-pushd $PROJECT_ROOT >/dev/null
+pushd "$PROJECT_ROOT" >/dev/null
   ./test/scripts/gh-actions/setup-uv.sh
   export PATH="${PROJECT_ROOT}/bin:${PATH}"
 popd
-pushd $PROJECT_ROOT/python/kserve >/dev/null
+pushd "$PROJECT_ROOT/python/kserve" >/dev/null
@@
-    oc apply -n ${KSERVE_NAMESPACE} -f -
+    oc apply -n "${KSERVE_NAMESPACE}" -f -
@@
-oc rollout status deployment/seaweedfs -n ${KSERVE_NAMESPACE} --timeout=300s
+oc rollout status deployment/seaweedfs -n "${KSERVE_NAMESPACE}" --timeout=300s
@@
-if oc wait --for=condition=complete job/s3-init -n ${KSERVE_NAMESPACE} --timeout=60s 2>/dev/null; then
+if oc wait --for=condition=complete job/s3-init -n "${KSERVE_NAMESPACE}" --timeout=60s 2>/dev/null; then
@@
-  sed "s/s3-service.kserve/s3-service.${KSERVE_NAMESPACE}/" \
+  sed "s/s3-service.kserve/s3-service.${KSERVE_NAMESPACE}/" \
     "$PROJECT_ROOT/config/overlays/test/s3-local-backend/seaweedfs-init-job.yaml" | \
     sed "s|kserve/storage-initializer:latest|${STORAGE_INITIALIZER_IMAGE}|" | \
-    oc replace --force -n ${KSERVE_NAMESPACE} -f -
+    oc replace --force -n "${KSERVE_NAMESPACE}" -f -

   echo "Waiting for S3 init job to complete..."
-  if ! oc wait --for=condition=complete job/s3-init -n ${KSERVE_NAMESPACE} --timeout=300s; then
+  if ! oc wait --for=condition=complete job/s3-init -n "${KSERVE_NAMESPACE}" --timeout=300s; then
     echo "S3 init job failed. Pod status and logs:"
-    oc get pods -l job-name=s3-init -n ${KSERVE_NAMESPACE}
-    oc logs -l job-name=s3-init -n ${KSERVE_NAMESPACE} --tail=50 || true
+    oc get pods -l job-name=s3-init -n "${KSERVE_NAMESPACE}"
+    oc logs -l job-name=s3-init -n "${KSERVE_NAMESPACE}" --tail=50 || true
     exit 1
   fi
 fi
@@
-$SCRIPT_DIR/setup-ci-namespace.sh "$1"
+"$SCRIPT_DIR/setup-ci-namespace.sh" "$1"

As per coding guidelines: "SHELL SCRIPT SECURITY (CWE-78): … Quote all variables to prevent injection".

Also applies to: 114-139, 152-152

🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 82-82: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 87-87: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 91-91: Double quote to prevent globbing and word splitting.

(SC2086)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/scripts/openshift-ci/setup-e2e-tests.sh` around lines 82 - 91, The
script contains unquoted variable expansions (notably PROJECT_ROOT, SCRIPT_DIR
and KSERVE_NAMESPACE) causing potential word-splitting/globbing; update all
occurrences to use quoted expansions (e.g. "$PROJECT_ROOT", "$SCRIPT_DIR",
"${KSERVE_NAMESPACE}") including in commands like make -C "$PROJECT_ROOT", pushd
"$PROJECT_ROOT", export PATH="${PROJECT_ROOT}/bin:${PATH}", and any other uses
in the same file (around the referenced ranges 82–91, 114–139, 152). Ensure
every instance of these variables is wrapped in double quotes so pushd/popd,
make, export and other shell commands receive a single, safe argument.

@jlost jlost closed this May 13, 2026
@github-project-automation github-project-automation Bot moved this from New/Backlog to Done in ODH Model Serving Planning May 13, 2026
@jlost jlost reopened this May 14, 2026
@github-project-automation github-project-automation Bot moved this from Done to New/Backlog in ODH Model Serving Planning May 14, 2026
jlost and others added 2 commits May 14, 2026 15:43
Move ODH-specific SeaweedFS resources (init job with opt-125m download,
s3-creds secret) into test/overlays/openshift-ci/ and parametrize the
model URI via OPT_125M_MODEL_URI env var (defaults to hf://) so upstream
xKS tests continue working without SeaweedFS model caching.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: James Ostrander <jostrand@redhat.com>
Signed-off-by: James Ostrander <jostrand@redhat.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
test/scripts/openshift-ci/setup-e2e-tests.sh (1)

86-86: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Quote all variable expansions to prevent word-splitting/globbing (CWE-78).

Multiple unquoted expansions of $PROJECT_ROOT, ${KSERVE_NAMESPACE}, and ${STORAGE_INITIALIZER_IMAGE} remain. The coding guidelines require unconditional quoting to prevent injection if a future path contains whitespace or an external value is provided. Shellcheck SC2086 flags lines 86, 91, 95.

Quote all variable expansions throughout the script.

🔒 Suggested quoting
 # Install kustomize and yq (also needed for SeaweedFS kustomize build below)
-$PROJECT_ROOT/hack/setup/cli/install-kustomize.sh
+"$PROJECT_ROOT/hack/setup/cli/install-kustomize.sh"
 make -C "$PROJECT_ROOT" yq
 export PATH="${PROJECT_ROOT}/bin:${PATH}"
 
 echo "Installing KServe Python SDK ..."
-pushd $PROJECT_ROOT >/dev/null
+pushd "$PROJECT_ROOT" >/dev/null
   ./test/scripts/gh-actions/setup-uv.sh
   export PATH="${PROJECT_ROOT}/bin:${PATH}"
 popd
-pushd $PROJECT_ROOT/python/kserve >/dev/null
+pushd "$PROJECT_ROOT/python/kserve" >/dev/null
   uv sync --active --group test
   uv pip install timeout-sampler
 popd
 
@@
 if [[ "$SEAWEEDFS_BUNDLED" == "false" ]]; then
   echo "Deploying SeaweedFS S3 backend for tests..."
   kustomize build "$PROJECT_ROOT/config/overlays/test/s3-local-backend" | \
     sed "s/namespace: kserve/namespace: ${KSERVE_NAMESPACE}/" | \
-    oc apply -n ${KSERVE_NAMESPACE} -f -
+    oc apply -n "${KSERVE_NAMESPACE}" -f -
 fi
 
 echo "Waiting for SeaweedFS deployment to be ready..."
-oc rollout status deployment/seaweedfs -n ${KSERVE_NAMESPACE} --timeout=300s
+oc rollout status deployment/seaweedfs -n "${KSERVE_NAMESPACE}" --timeout=300s
 
 # The s3-init job is already created by the kustomize build above.
 # It may have failed if SeaweedFS wasn't ready yet, so check and re-create if needed.
-if oc wait --for=condition=complete job/s3-init -n ${KSERVE_NAMESPACE} --timeout=60s 2>/dev/null; then
+if oc wait --for=condition=complete job/s3-init -n "${KSERVE_NAMESPACE}" --timeout=60s 2>/dev/null; then
   echo "S3 init job already completed successfully"
 else
   echo "S3 init job not completed, re-creating..."
   sed "s/s3-service.kserve/s3-service.${KSERVE_NAMESPACE}/" \
     "$PROJECT_ROOT/test/overlays/openshift-ci/seaweedfs-init-job-odh.yaml" | \
     sed "s|kserve/storage-initializer:latest|${STORAGE_INITIALIZER_IMAGE}|" | \
-    oc replace --force -n ${KSERVE_NAMESPACE} -f -
+    oc replace --force -n "${KSERVE_NAMESPACE}" -f -
 
   echo "Waiting for S3 init job to complete..."
-  if ! oc wait --for=condition=complete job/s3-init -n ${KSERVE_NAMESPACE} --timeout=300s; then
+  if ! oc wait --for=condition=complete job/s3-init -n "${KSERVE_NAMESPACE}" --timeout=300s; then
     echo "S3 init job failed. Pod status and logs:"
-    oc get pods -l job-name=s3-init -n ${KSERVE_NAMESPACE}
-    oc logs -l job-name=s3-init -n ${KSERVE_NAMESPACE} --tail=50 || true
+    oc get pods -l job-name=s3-init -n "${KSERVE_NAMESPACE}"
+    oc logs -l job-name=s3-init -n "${KSERVE_NAMESPACE}" --tail=50 || true
     exit 1
   fi
 fi
@@
 echo "Prepare CI namespace and install ServingRuntimes"
-$SCRIPT_DIR/setup-ci-namespace.sh "$1"
+"$SCRIPT_DIR/setup-ci-namespace.sh" "$1"

As per coding guidelines: "SHELL SCRIPT SECURITY (CWE-78): … Quote all variables to prevent injection".

Also applies to: 91-91, 95-95, 122-122, 126-126, 130-130, 137-137, 140-140, 142-143, 156-156

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/scripts/openshift-ci/setup-e2e-tests.sh` at line 86, Quote every
variable expansion in the script to prevent word-splitting/globbing: replace
unquoted $PROJECT_ROOT, ${KSERVE_NAMESPACE}, ${STORAGE_INITIALIZER_IMAGE} (and
any other bare expansions) with "$PROJECT_ROOT", "${KSERVE_NAMESPACE}",
"${STORAGE_INITIALIZER_IMAGE}" in all commands, assignments, and command
substitutions (including the call to install-kustomize.sh and kubectl/oc
invocations); ensure concatenations use quoted expansions (e.g., "${VAR}/path")
and update all other occurrences flagged (e.g., other ${...} usages) so no
unquoted variable remains.
🧹 Nitpick comments (1)
test/overlays/openshift-ci/seaweedfs-init-job-odh.yaml (1)

10-10: 💤 Low value

Non-reproducible image tag in CI manifest.

The kserve/storage-initializer:latest tag is mutable and defeats reproducibility. The AI summary notes this image is rewritten via STORAGE_INITIALIZER_IMAGE by setup-e2e-tests.sh, but the manifest itself should document the substitution pattern or use a concrete default.

Consider adding a comment explaining the image substitution, or use a concrete tag as the default.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/overlays/openshift-ci/seaweedfs-init-job-odh.yaml` at line 10, The
manifest currently pins a mutable image tag "kserve/storage-initializer:latest";
update the manifest entry that sets image: kserve/storage-initializer:latest to
either a concrete, reproducible tag (e.g., kserve/storage-initializer:<version>)
or add an inline comment next to that image line documenting that the image is
overwritten by the setup-e2e-tests.sh script using the STORAGE_INITIALIZER_IMAGE
environment variable, so reviewers and CI users understand the substitution
pattern; ensure the chosen fix references the same image field and the
setup-e2e-tests.sh / STORAGE_INITIALIZER_IMAGE mechanism.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/controller/v1alpha2/llmisvc/workload_storage.go`:
- Around line 258-260: The HF_HUB_VERBOSITY env var is being added
unconditionally which triggers pod restarts during operator upgrades; update the
code so the utils.AddEnvVars(initContainer, []corev1.EnvVar{{Name:
"HF_HUB_VERBOSITY", Value: "debug"}}) call is moved inside the same conditional
block that handles skipping HF default env vars during upgrades (the block that
guards adding HF defaults), so HF_HUB_VERBOSITY is only added when defaults are
applied and not during upgrade-only reconciliations; locate the conditional
around the HF defaults and place the HF_HUB_VERBOSITY AddEnvVars call there,
preserving the initContainer and env var name/value.

In `@test/overlays/openshift-ci/seaweedfs-init-job-odh.yaml`:
- Around line 8-27: The init container "download-hf-model" is missing a
securityContext allowing it to run as root; add a securityContext under the
initContainer definition for download-hf-model to enforce non-root and minimal
privileges (e.g., set runAsNonRoot: true and runAsUser to a non-zero UID like
1000), disallow privilege escalation (allowPrivilegeEscalation: false), drop all
Linux capabilities (capabilities: drop: ["ALL"]), and enforce a read-only root
filesystem if possible (readOnlyRootFilesystem: true) so the init container
cannot escalate privileges.
- Around line 29-57: The s3-init main container is missing a securityContext
allowing it to run as non-root; update the container named "s3-init" to include
a securityContext that enforces non-root and least privilege (e.g.,
runAsNonRoot: true and runAsUser: 1000), disallows privilege escalation
(allowPrivilegeEscalation: false), drops capabilities (capabilities.drop:
["ALL"]), and enables readOnlyRootFilesystem: true (and set runAsGroup if
needed); apply these settings to the container spec for s3-init to prevent
running as root and limit privileges.

---

Duplicate comments:
In `@test/scripts/openshift-ci/setup-e2e-tests.sh`:
- Line 86: Quote every variable expansion in the script to prevent
word-splitting/globbing: replace unquoted $PROJECT_ROOT, ${KSERVE_NAMESPACE},
${STORAGE_INITIALIZER_IMAGE} (and any other bare expansions) with
"$PROJECT_ROOT", "${KSERVE_NAMESPACE}", "${STORAGE_INITIALIZER_IMAGE}" in all
commands, assignments, and command substitutions (including the call to
install-kustomize.sh and kubectl/oc invocations); ensure concatenations use
quoted expansions (e.g., "${VAR}/path") and update all other occurrences flagged
(e.g., other ${...} usages) so no unquoted variable remains.

---

Nitpick comments:
In `@test/overlays/openshift-ci/seaweedfs-init-job-odh.yaml`:
- Line 10: The manifest currently pins a mutable image tag
"kserve/storage-initializer:latest"; update the manifest entry that sets image:
kserve/storage-initializer:latest to either a concrete, reproducible tag (e.g.,
kserve/storage-initializer:<version>) or add an inline comment next to that
image line documenting that the image is overwritten by the setup-e2e-tests.sh
script using the STORAGE_INITIALIZER_IMAGE environment variable, so reviewers
and CI users understand the substitution pattern; ensure the chosen fix
references the same image field and the setup-e2e-tests.sh /
STORAGE_INITIALIZER_IMAGE mechanism.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 418296c6-b8e3-421b-955a-0494893f939c

📥 Commits

Reviewing files that changed from the base of the PR and between e49ac26 and 8442c3d.

📒 Files selected for processing (11)
  • pkg/controller/v1alpha2/llmisvc/workload_storage.go
  • python/storage-initializer/scripts/initializer-entrypoint
  • test/e2e/llmisvc/fixtures.py
  • test/e2e/llmisvc/test_llm_inference_service_conversion.py
  • test/e2e/llmisvc/test_pod_watch.py
  • test/e2e/llmisvc/test_storage_version_migration.py
  • test/overlays/openshift-ci/seaweedfs-init-job-odh.yaml
  • test/overlays/openshift-ci/seaweedfs-s3-creds-secret.yaml
  • test/scripts/openshift-ci/setup-ci-namespace.sh
  • test/scripts/openshift-ci/setup-e2e-tests.sh
  • test/scripts/openshift-ci/tls/setup-s3-tls.sh
✅ Files skipped from review due to trivial changes (1)
  • test/overlays/openshift-ci/seaweedfs-s3-creds-secret.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
  • test/scripts/openshift-ci/tls/setup-s3-tls.sh
  • test/e2e/llmisvc/test_llm_inference_service_conversion.py
  • test/scripts/openshift-ci/setup-ci-namespace.sh
  • test/e2e/llmisvc/fixtures.py

Comment on lines +258 to +260
utils.AddEnvVars(initContainer, []corev1.EnvVar{
{Name: "HF_HUB_VERBOSITY", Value: "debug"},
})

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Inconsistent upgrade behavior: HF_HUB_VERBOSITY forces pod restart during operator upgrades.

Lines 252-256 conditionally skip adding HF default env vars when upgrading from an older operator (to avoid unnecessary pod restarts), but lines 258-260 add HF_HUB_VERBOSITY unconditionally. This forces a pod restart during upgrades even when the model hasn't changed.

Move the HF_HUB_VERBOSITY assignment inside the same conditional block (after line 255) to maintain consistency with the upgrade-avoidance pattern.

🔧 Proposed fix
 		// Add HF default env vars when the init container is new or already has HF_ env vars
 		// from a previous reconciliation. Skip only when upgrading from an older operator version
 		// that didn't set these vars, to avoid unnecessary pod restarts.
 		if currentInitContainer == nil || slices.ContainsFunc(currentInitContainer.Env, func(e corev1.EnvVar) bool {
 			return strings.HasPrefix(e.Name, "HF_")
 		}) {
 			utils.AddDefaultHuggingFaceEnvVars(initContainer)
+			utils.AddEnvVars(initContainer, []corev1.EnvVar{
+				{Name: "HF_HUB_VERBOSITY", Value: "debug"},
+			})
 		}

-		utils.AddEnvVars(initContainer, []corev1.EnvVar{
-			{Name: "HF_HUB_VERBOSITY", Value: "debug"},
-		})
-
 		if containerName == tokenizerContainerName {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/v1alpha2/llmisvc/workload_storage.go` around lines 258 - 260,
The HF_HUB_VERBOSITY env var is being added unconditionally which triggers pod
restarts during operator upgrades; update the code so the
utils.AddEnvVars(initContainer, []corev1.EnvVar{{Name: "HF_HUB_VERBOSITY",
Value: "debug"}}) call is moved inside the same conditional block that handles
skipping HF default env vars during upgrades (the block that guards adding HF
defaults), so HF_HUB_VERBOSITY is only added when defaults are applied and not
during upgrade-only reconciliations; locate the conditional around the HF
defaults and place the HF_HUB_VERBOSITY AddEnvVars call there, preserving the
initContainer and env var name/value.

Comment on lines +8 to +27
initContainers:
- name: download-hf-model
image: kserve/storage-initializer:latest
command:
- python3
- -c
- |
from huggingface_hub import snapshot_download
snapshot_download(
"facebook/opt-125m",
local_dir="/shared/facebook/opt-125m",
ignore_patterns=["*.msgpack", "*.h5", "flax_model*", "tf_model*", "rust_model*"],
)
print("Model download complete")
env:
- name: HF_HUB_DISABLE_PROGRESS_BARS
value: "1"
volumeMounts:
- name: model-cache
mountPath: /shared

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing security context on init container allows privilege escalation.

The download-hf-model init container lacks a securityContext. This allows it to run as root and escalate privileges (CWE-250). Even in a CI-only test job, containers should run as non-root with minimal privileges.

Add a security context to the init container.

🔒 Proposed security hardening
       - name: download-hf-model
         image: kserve/storage-initializer:latest
+        securityContext:
+          runAsNonRoot: true
+          allowPrivilegeEscalation: false
+          capabilities:
+            drop:
+              - ALL
         command:
         - python3
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/overlays/openshift-ci/seaweedfs-init-job-odh.yaml` around lines 8 - 27,
The init container "download-hf-model" is missing a securityContext allowing it
to run as root; add a securityContext under the initContainer definition for
download-hf-model to enforce non-root and minimal privileges (e.g., set
runAsNonRoot: true and runAsUser to a non-zero UID like 1000), disallow
privilege escalation (allowPrivilegeEscalation: false), drop all Linux
capabilities (capabilities: drop: ["ALL"]), and enforce a read-only root
filesystem if possible (readOnlyRootFilesystem: true) so the init container
cannot escalate privileges.

Comment on lines +29 to +57
- name: s3-init
image: docker.io/amazon/aws-cli:2.33.11@sha256:88ea087837da1b31a3afeeb4538888244a3087a867ddb1909848643fcc4b324e
command:
- /bin/sh
- -c
- |
aws s3 mb s3://logger-output || true
aws s3 mb s3://example-models || true
curl -L https://storage.googleapis.com/kfserving-examples/models/sklearn/1.0/model/model.joblib -o /tmp/sklearn-model.joblib
aws s3 cp /tmp/sklearn-model.joblib s3://example-models/sklearn/model.joblib --no-verify-ssl
aws s3 sync /shared/facebook/opt-125m s3://example-models/facebook/opt-125m/ --no-verify-ssl
env:
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: mlpipeline-s3-artifact
key: accesskey
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: mlpipeline-s3-artifact
key: secretkey
- name: AWS_DEFAULT_REGION
value: "us-east-1"
- name: AWS_ENDPOINT_URL_S3
value: "http://s3-service.kserve:8333"
volumeMounts:
- name: model-cache
mountPath: /shared

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing security context on main container allows privilege escalation.

The s3-init container lacks a securityContext. This allows it to run as root and escalate privileges (CWE-250). Even in a CI-only test job, containers should run as non-root with minimal privileges.

Add a security context to the main container.

🔒 Proposed security hardening
       - name: s3-init
         image: docker.io/amazon/aws-cli:2.33.11@sha256:88ea087837da1b31a3afeeb4538888244a3087a867ddb1909848643fcc4b324e
+        securityContext:
+          runAsNonRoot: true
+          allowPrivilegeEscalation: false
+          capabilities:
+            drop:
+              - ALL
         command:
         - /bin/sh
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/overlays/openshift-ci/seaweedfs-init-job-odh.yaml` around lines 29 - 57,
The s3-init main container is missing a securityContext allowing it to run as
non-root; update the container named "s3-init" to include a securityContext that
enforces non-root and least privilege (e.g., runAsNonRoot: true and runAsUser:
1000), disallows privilege escalation (allowPrivilegeEscalation: false), drops
capabilities (capabilities.drop: ["ALL"]), and enables readOnlyRootFilesystem:
true (and set runAsGroup if needed); apply these settings to the container spec
for s3-init to prevent running as root and limit privileges.

Set HF_HUB_VERBOSITY=debug so that huggingface_hub emits detailed
download progress, HTTP request info, and transfer diagnostics. This
helps diagnose hf:// download timeouts that currently produce no
actionable logs.

Two injection points cover all deployment scenarios:
- Go controller (llmisvc): injects the env var into the init container
  pod spec, effective even when the storage-initializer image is not
  rebuilt from source (ODH CI).
- Python entrypoint: sets the env var before huggingface_hub is
  imported, effective when the image IS rebuilt (upstream CI).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: James Ostrander <jostrand@redhat.com>
@jlost jlost force-pushed the debug/collect-storage-init-logs branch from 8442c3d to c016d6d Compare May 14, 2026 20:02
Increase XET logging verbosity from info to debug to capture per-term
download details (xorb hashes, byte ranges) that reveal where silent
hangs occur. Add a background thread to the storage-initializer
entrypoint that logs `du -sh` every 10s so we can see how far a
download got before the container is killed.

Also move OPT_125M_MODEL_URI from setup-e2e-tests.sh (subprocess) to
run-e2e-tests.sh (parent), fixing the env var propagation bug.
Commented out for now to keep reproducing the XET hang.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: James Ostrander <jostrand@redhat.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
test/scripts/openshift-ci/setup-e2e-tests.sh (1)

82-91: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Quote the remaining variable expansions.

Lines 82, 87, 91, 118, 122, 126, 133, 136, 138, 139, and 152 still pass unquoted expansions into commands. That leaves word-splitting/globbing on repo paths and namespace/image inputs, and it is exactly what the shell rule in this repo forbids.

Representative fix
-$PROJECT_ROOT/hack/setup/cli/install-kustomize.sh
+"$PROJECT_ROOT/hack/setup/cli/install-kustomize.sh"
@@
-pushd $PROJECT_ROOT >/dev/null
+pushd "$PROJECT_ROOT" >/dev/null
@@
-pushd $PROJECT_ROOT/python/kserve >/dev/null
+pushd "$PROJECT_ROOT/python/kserve" >/dev/null
@@
-    oc apply -n ${KSERVE_NAMESPACE} -f -
+    oc apply -n "${KSERVE_NAMESPACE}" -f -
@@
-oc rollout status deployment/seaweedfs -n ${KSERVE_NAMESPACE} --timeout=300s
+oc rollout status deployment/seaweedfs -n "${KSERVE_NAMESPACE}" --timeout=300s
@@
-if oc wait --for=condition=complete job/s3-init -n ${KSERVE_NAMESPACE} --timeout=60s 2>/dev/null; then
+if oc wait --for=condition=complete job/s3-init -n "${KSERVE_NAMESPACE}" --timeout=60s 2>/dev/null; then
@@
-    oc replace --force -n ${KSERVE_NAMESPACE} -f -
+    oc replace --force -n "${KSERVE_NAMESPACE}" -f -
@@
-  if ! oc wait --for=condition=complete job/s3-init -n ${KSERVE_NAMESPACE} --timeout=300s; then
+  if ! oc wait --for=condition=complete job/s3-init -n "${KSERVE_NAMESPACE}" --timeout=300s; then
@@
-    oc get pods -l job-name=s3-init -n ${KSERVE_NAMESPACE}
-    oc logs -l job-name=s3-init -n ${KSERVE_NAMESPACE} --tail=50 || true
+    oc get pods -l job-name=s3-init -n "${KSERVE_NAMESPACE}"
+    oc logs -l job-name=s3-init -n "${KSERVE_NAMESPACE}" --tail=50 || true
@@
-$SCRIPT_DIR/setup-ci-namespace.sh "$1"
+"$SCRIPT_DIR/setup-ci-namespace.sh" "$1"
#!/bin/bash
set -euo pipefail
shellcheck -f gcc test/scripts/openshift-ci/setup-e2e-tests.sh | rg 'SC2086'

As per coding guidelines, "SHELL SCRIPT SECURITY (CWE-78): ... Quote all variables to prevent injection".

Also applies to: 114-152

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/scripts/openshift-ci/setup-e2e-tests.sh` around lines 82 - 91, The
script passes several unquoted variable expansions which can cause
word-splitting and globbing; update all occurrences of $PROJECT_ROOT and any
other variable expansions used in commands (e.g., in the calls to the
install-kustomize script, make -C "$PROJECT_ROOT", export PATH assignments,
pushd/popd invocations, and ./test/scripts/gh-actions/setup-uv.sh and pushd
$PROJECT_ROOT/python/kserve) to be quoted (e.g., "$PROJECT_ROOT",
"${PROJECT_ROOT}/bin", and similar) throughout the file (including the remaining
occurrences around lines 114–152) so every command argument and PATH export uses
quoted expansions and prevents shell word-splitting/globbing.
pkg/controller/v1alpha2/llmisvc/workload_storage.go (1)

258-263: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the new debug env vars inside the existing upgrade guard.

These additions sit outside the conditional on Lines 249-256, so reconciling an older deployment still mutates the init container env and rolls the workload even when the model has not changed. That defeats the upgrade-stability logic already in this function.

Possible fix
 		if currentInitContainer == nil || slices.ContainsFunc(currentInitContainer.Env, func(e corev1.EnvVar) bool {
 			return strings.HasPrefix(e.Name, "HF_")
 		}) {
 			utils.AddDefaultHuggingFaceEnvVars(initContainer)
+			utils.AddEnvVars(initContainer, []corev1.EnvVar{
+				{Name: "HF_HUB_VERBOSITY", Value: "debug"},
+				{Name: "HF_DEBUG", Value: "1"},
+				{Name: "RUST_LOG", Value: "debug"},
+				{Name: "RUST_BACKTRACE", Value: "1"},
+			})
 		}
-
-		utils.AddEnvVars(initContainer, []corev1.EnvVar{
-			{Name: "HF_HUB_VERBOSITY", Value: "debug"},
-			{Name: "HF_DEBUG", Value: "1"},
-			{Name: "RUST_LOG", Value: "debug"},
-			{Name: "RUST_BACKTRACE", Value: "1"},
-		})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/v1alpha2/llmisvc/workload_storage.go` around lines 258 - 263,
The new debug env vars added via utils.AddEnvVars(initContainer, ...) are placed
outside the existing upgrade guard and will trigger unnecessary rollouts; move
the four EnvVar entries ("HF_HUB_VERBOSITY", "HF_DEBUG", "RUST_LOG",
"RUST_BACKTRACE") so they are added only inside the existing conditional block
that protects initContainer updates (the upgrade guard surrounding the init
container mutation logic), ensuring the env mutation occurs only when that
guard's condition is met and leaving all other logic unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/storage-initializer/scripts/initializer-entrypoint`:
- Around line 27-44: The recursive progress scanner _log_download_progress is
always started (progress_thread using dest_paths), causing du -sh to scan large
trees every 10s; gate this behavior behind an opt-in flag (e.g. environment
variable or config like ENABLE_DOWNLOAD_PROGRESS) and only create/start
progress_thread when that flag is truthy. Update the startup path to check the
flag before instantiating threading.Thread(target=_log_download_progress,
args=(dest_paths,), daemon=True) so the scanner never runs in normal runs unless
explicitly enabled.

---

Duplicate comments:
In `@pkg/controller/v1alpha2/llmisvc/workload_storage.go`:
- Around line 258-263: The new debug env vars added via
utils.AddEnvVars(initContainer, ...) are placed outside the existing upgrade
guard and will trigger unnecessary rollouts; move the four EnvVar entries
("HF_HUB_VERBOSITY", "HF_DEBUG", "RUST_LOG", "RUST_BACKTRACE") so they are added
only inside the existing conditional block that protects initContainer updates
(the upgrade guard surrounding the init container mutation logic), ensuring the
env mutation occurs only when that guard's condition is met and leaving all
other logic unchanged.

In `@test/scripts/openshift-ci/setup-e2e-tests.sh`:
- Around line 82-91: The script passes several unquoted variable expansions
which can cause word-splitting and globbing; update all occurrences of
$PROJECT_ROOT and any other variable expansions used in commands (e.g., in the
calls to the install-kustomize script, make -C "$PROJECT_ROOT", export PATH
assignments, pushd/popd invocations, and ./test/scripts/gh-actions/setup-uv.sh
and pushd $PROJECT_ROOT/python/kserve) to be quoted (e.g., "$PROJECT_ROOT",
"${PROJECT_ROOT}/bin", and similar) throughout the file (including the remaining
occurrences around lines 114–152) so every command argument and PATH export uses
quoted expansions and prevents shell word-splitting/globbing.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4d993c91-dba9-455a-a4fc-122304b97513

📥 Commits

Reviewing files that changed from the base of the PR and between 8442c3d and 8c262f0.

📒 Files selected for processing (4)
  • pkg/controller/v1alpha2/llmisvc/workload_storage.go
  • python/storage-initializer/scripts/initializer-entrypoint
  • test/scripts/openshift-ci/run-e2e-tests.sh
  • test/scripts/openshift-ci/setup-e2e-tests.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/scripts/openshift-ci/run-e2e-tests.sh

Comment on lines +27 to +44
def _log_download_progress(paths, interval=10):
while True:
time.sleep(interval)
for p in paths:
if not os.path.isdir(p):
continue
result = subprocess.run(
["du", "-sh", p], capture_output=True, text=True
)
if result.returncode == 0:
ts = datetime.datetime.now(datetime.timezone.utc).isoformat()
logger.info("download progress [%s]: %s", ts, result.stdout.strip())


progress_thread = threading.Thread(
target=_log_download_progress, args=(dest_paths,), daemon=True
)
progress_thread.start()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Gate the recursive progress scanner behind an opt-in flag.

du -sh walks the full model tree every 10s while the download is writing into the same directory. On large pulls this adds avoidable I/O contention to the startup path for every storage init, not just debug runs.

Possible fix
-def _log_download_progress(paths, interval=10):
+def _log_download_progress(paths, interval=60):
     while True:
         time.sleep(interval)
         for p in paths:
             if not os.path.isdir(p):
                 continue
@@
-progress_thread = threading.Thread(
-    target=_log_download_progress, args=(dest_paths,), daemon=True
-)
-progress_thread.start()
+if os.getenv("STORAGE_INITIALIZER_PROGRESS_LOGGING") == "1":
+    progress_thread = threading.Thread(
+        target=_log_download_progress, args=(dest_paths,), daemon=True
+    )
+    progress_thread.start()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/storage-initializer/scripts/initializer-entrypoint` around lines 27 -
44, The recursive progress scanner _log_download_progress is always started
(progress_thread using dest_paths), causing du -sh to scan large trees every
10s; gate this behavior behind an opt-in flag (e.g. environment variable or
config like ENABLE_DOWNLOAD_PROGRESS) and only create/start progress_thread when
that flag is truthy. Update the startup path to check the flag before
instantiating threading.Thread(target=_log_download_progress,
args=(dest_paths,), daemon=True) so the scanner never runs in normal runs unless
explicitly enabled.

andresllh and others added 6 commits May 15, 2026 09:17
… issues

hf-xet has upstream bugs that cause intermittent model download failures.
Pinning hf-xet>=1.5.0 as an explicit dependency in kserve-storage to
ensure all packages resolve to a stable version.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Andres Llausas <allausas@redhat.com>
The test-llmisvc kustomize overlay was not patching the inferenceservice-config
configmap, causing it to fall back to the base kserve/storage-initializer:latest
image instead of the freshly built image from the PR. This meant xet or other
storage-initializer fixes were never exercised in the llmisvc CI jobs.

Add a minimal configmap patch to the test-llmisvc overlay that sets the
storage-initializer image, and update update-test-overlays.sh and
manage.kserve-kustomize.sh to replace `latest` with the built TAG — mirroring
the pattern already used by the test overlay.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Andres Llausas <allausas@redhat.com>
…Face tests

hf-xet 1.5.0 (introduced via kserve-storage dependency) is slower for
direct HuggingFace Hub downloads than 1.1.3, causing model downloads via
--model_id to consistently exceed the default 600s wait_isvc_ready timeout.

Increase per-ISVC readiness timeout to 1200s in test_huggingface.py and
test_huggingface_vllm_cpu.py. Also raise GHA step timeouts to match:
test-llm (2 workers, 12 tests) to 60 min, test-huggingface-server-vllm
(1 worker, 5 tests) to 90 min.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Andres Llausas <allausas@redhat.com>
… HuggingFace tests"

This reverts commit 58df838.

Signed-off-by: Andres Llausas <allausas@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: James Ostrander <jostrand@redhat.com>
The cherry-pick of the hf-xet 1.5.0 upgrade removed the hf-transfer
package block but left behind the huggingface-hub optional-dependencies
section referencing it, causing uv to fail with a missing source error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: James Ostrander <jostrand@redhat.com>
@jlost

jlost commented May 16, 2026

Copy link
Copy Markdown
Author

/retest
/group-test

Signed-off-by: James Ostrander <jostrand@redhat.com>
@jlost

jlost commented May 17, 2026

Copy link
Copy Markdown
Author

/retest
/group-test

@rhods-ci-bot

Copy link
Copy Markdown

@jlost: The following test has Failed:

OCI Artifact Browser URL

View in Artifact Browser

Inspecting Test Artifacts Manually

To inspect your test artifacts manually, follow these steps:

  1. Install ORAS (see the ORAS installation guide).
  2. Download artifacts with the following commands:
mkdir -p oras-artifacts
cd oras-artifacts
oras pull quay.io/opendatahub/odh-ci-artifacts:kserve-group-test-fp5gp

@jlost jlost closed this May 18, 2026
@github-project-automation github-project-automation Bot moved this from New/Backlog to Done in ODH Model Serving Planning May 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants