Skip to content

feat: improve deploy and test script robustness#650

Open
somya-bhatnagar wants to merge 11 commits intoopendatahub-io:mainfrom
somya-bhatnagar:somya-harden-deploy-and-test-scripts
Open

feat: improve deploy and test script robustness#650
somya-bhatnagar wants to merge 11 commits intoopendatahub-io:mainfrom
somya-bhatnagar:somya-harden-deploy-and-test-scripts

Conversation

@somya-bhatnagar
Copy link
Copy Markdown
Contributor

@somya-bhatnagar somya-bhatnagar commented Mar 31, 2026

Related to - https://redhat.atlassian.net/browse/RHOAIENG-51671

Summary

Improve robustness of the deploy and test scripts with consistent error handling (set -euo pipefail, clear error messages, non-zero exit on failure), configurable or documented timeouts for wait loops (DataScienceCluster, LLMIS, MaaSModels, AuthPolicies), and idempotency where possible (e.g. create namespace only if not exists, skip install if already installed) so re-runs and partial runs are predictable.

Story

As a developer or CI, I want deploy and test scripts to have clear error handling, sensible timeouts, and idempotent steps where appropriate, so that failures are easy to diagnose and re-runs do not leave the system in an inconsistent state.

Acceptance Criteria

  • Deploy and test entry points use consistent shell options (e.g. set -euo pipefail) and exit with a non-zero code and clear message on failure
  • Wait loops (for operators, CRs, AuthPolicies) use configurable or documented timeouts and do not hang indefinitely
  • Idempotent operations (e.g. create namespace if not exists, skip cert-manager/ODH if already present) are preserved or added where safe
  • No regression in existing e2e pass rate

Changes

Error Handling

  • ✅ All scripts use `set -euo pipefail` for robust error handling
  • ✅ 91 error exit points with clear messages including timeout context
  • ✅ Example: `"DataScienceCluster CRD not available (timeout: ${CRD_TIMEOUT}s)"`

Timeout Configuration

Added 16 configurable timeout environment variables (defined in `scripts/deployment-helpers.sh`):

Core Timeouts:

- \`CUSTOM_RESOURCE_TIMEOUT=600\` - DataScienceCluster wait
- \`LLMIS_TIMEOUT=300\` - LLMInferenceService ready
- \`MAASMODELREF_TIMEOUT=300\` - MaaSModelRef ready
- \`AUTHPOLICY_TIMEOUT=180\` - AuthPolicy enforcement

Infrastructure Timeouts:

- \`CRD_TIMEOUT=180\` - CRD establishment
- \`CSV_TIMEOUT=180\` - CSV installation
- \`SUBSCRIPTION_TIMEOUT=300\` - Subscription install
- \`ROLLOUT_TIMEOUT=120\` - Deployment rollout
- \`AUTHORINO_TIMEOUT=120\` - Authorino ready
- \`NAMESPACE_TIMEOUT=300\` - Namespace creation

Additional:
\RESOURCE_TIMEOUT`, `POD_TIMEOUT`, `WEBHOOK_TIMEOUT`, `CUSTOM_CHECK_TIMEOUT`, `KUBECONFIG_WAIT_TIMEOUT`, `CATALOGSOURCE_TIMEOUT``

Usage Example:

# Deploy with extended timeouts for slow cluster
CUSTOM_RESOURCE_TIMEOUT=900 LLMIS_TIMEOUT=600 ./scripts/deploy.sh --operator-type odh

# Run E2E tests with custom timeouts
ROLLOUT_TIMEOUT=60 AUTHORINO_TIMEOUT=60 ./test/e2e/scripts/prow_run_smoke_test.sh

Idempotent Operations

  • ✅ Namespace creation (skip if exists)
  • ✅ GatewayClass (skip if exists)
  • ✅ Gateway (update if exists)
  • ✅ DSCInitialization (skip if auto-created by operator)

Files Modified

  • `scripts/deploy.sh` - Timeout variables, error messages, idempotency checks
  • `scripts/deployment-helpers.sh` - Timeout defaults, improved wait functions
  • `scripts/validate-deployment.sh` - Error handling mode (set -uo pipefail)
  • `test/e2e/scripts/prow_run_smoke_test.sh` - Timeout configuration with improved MaaSModelRef wait logic

Testability / Verification

Manual Testing

# 1. Idempotency test - run deploy twice
./scripts/deploy.sh --operator-type odh
./scripts/deploy.sh --operator-type odh  # Should complete without errors

# 2. Failure test - introduce error and verify clear message
MAAS_API_IMAGE=invalid:image ./scripts/deploy.sh --operator-type odh
# Should exit with clear error and non-zero code

# 3. Timeout test - fast failure with custom timeout
ROLLOUT_TIMEOUT=5 ./scripts/deploy.sh --operator-type odh
# Should timeout quickly with clear message

# 4. E2E tests - verify no regressions
./test/e2e/scripts/prow_run_smoke_test.sh

Verification Summary

Cluster: api.ci-ln-rxc9qqb-76ef8.aws-4.ci.openshift.org

✅ Tests Passed (6/6)

  1. Timeout Validation - Invalid values (invalid, -5) correctly rejected with exit code 1
  2. TOCTOU-Safe Namespace Creation - Idempotent namespace creation works correctly (create-then-check pattern)
  3. wait_for_namespace Timeout Calculation - Uses remaining time instead of hardcoded 60s (completed in 2s vs expected <10s)
  4. Timeout Error Handling - All timeout paths return exit code 1 (tested with non-existent deployment/CRD)
  5. Webhook Readiness Fatal Checks - Code review confirms fatal errors instead of warnings
  6. wait_authorino_ready Return Codes - Correctly returns 0 when gateway URL unavailable (skip verification), returns 1 on actual timeout

Key Fixes Verified

┌────────────────────┬────────────────────────────────────────┬──────────────────────────────────────────────┐
│ Fix │ Location │ Behavior │
├────────────────────┼────────────────────────────────────────┼──────────────────────────────────────────────┤
│ Timeout validation │ deployment-helpers.sh:110-156 │ Validates all 16 timeout vars at startup │
├────────────────────┼────────────────────────────────────────┼──────────────────────────────────────────────┤
│ TOCTOU race │ deploy.sh:636-648 │ Create-then-check pattern │
├────────────────────┼────────────────────────────────────────┼──────────────────────────────────────────────┤
│ Hardcoded timeouts │ deployment-helpers.sh:511-512, 525-526 │ Remaining time calculation │
├────────────────────┼────────────────────────────────────────┼──────────────────────────────────────────────┤
│ Webhook checks │ deploy.sh:1120-1132 │ Fatal errors with return 1 │
├────────────────────┼────────────────────────────────────────┼──────────────────────────────────────────────┤
│ Auth timeout │ deployment-helpers.sh:1430-1431 │ Returns 1 on timeout │
├────────────────────┼────────────────────────────────────────┼──────────────────────────────────────────────┤
│ Gateway URL │ deployment-helpers.sh:1376-1378 │ Returns 0 to skip verification (intentional) │
└────────────────────┴────────────────────────────────────────┴──────────────────────────────────────────────┘

All scripts properly handle errors with non-zero exit codes and include timeout context in error messages.

Automated Validation

  • ✅ All scripts pass `bash -n` syntax check
  • ✅ Error handling verified (`set -euo pipefail` in all entry points)
  • ✅ Timeout variables properly sourced and used
  • ✅ No hardcoded timeouts found
  • ✅ Idempotency checks in place

Summary by CodeRabbit

  • New Features

    • Timeouts are now configurable via environment variables across deployments, validation, and tests; defaults are provided and validated at startup.
  • Bug Fixes

    • Stronger fail-fast behavior and clearer timeout-aware error messages for stalled waits, rollouts, webhooks, and related checks.
  • Chores

    • Namespace/Gateway/GatewayClass creation made idempotent to support safe re-runs; some rollout failures now stop the process.
  • Documentation

    • Help text and validation output updated to document env-var timeouts and connection timeout notes; stricter validation enabled.

@openshift-ci openshift-ci bot requested review from israel-hdez and nerdalert March 31, 2026 15:05
@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Mar 31, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: somya-bhatnagar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 31, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Replaces many hardcoded wait/timeout values across deployment and test scripts with environment-overridable readonly timeout variables declared in scripts/deployment-helpers.sh. Propagates those timeouts into rollout, kubectl wait, and custom readiness helpers (CRD, CSV, webhook, pod, Gateway, Authorino, LLMInference, MaaSModelRef). Changes several helpers and callers to return non-zero on timeout/error instead of continuing, makes namespace and some resource creations idempotent, adjusts Gateway/GatewayClass existence handling, and enables stricter shell behavior (set -uo pipefail) in scripts/validate-deployment.sh. E2E test scripts updated to consume the new timeout variables.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Actionable Issues

  • CWE-20 (Improper Input Validation): _validate_timeout() must strictly enforce positive-integer-only values and reject empty, non-numeric, negative, or zero inputs. Add bounds checks (e.g., max sensible value) and explicit error messages. Fail fast on invalid env values.

  • CWE-248 / CWE-703 (Unchecked Return Value / Improper Error Handling): Many callers now return non-zero on failures. Audit all call sites to ensure return codes are checked and propagated (use || return 1 or explicit if-checks). Missing propagation can silently hide failures.

  • CWE-125 (Out-of-bounds Read) / CWE-188 (Reliance on Time-of-Day): Deadline arithmetic using SECONDS and incremental sleeps (MAASMODELREF_TIMEOUT, AUTHPOLICY_TIMEOUT, etc.) can produce off-by-one or overshoot windows. Verify loop termination conditions and that sleep intervals cannot cumulatively exceed configured timeouts.

  • CWE-250 (Execution with Unnecessary Privileges): validate-deployment.sh enabled set -u and pipefail but left -e disabled. For critical validation steps, either enable set -e or add explicit error checks to avoid continuing after failed commands.

  • CWE-200 (Information Exposure): Timeout values and timing diagnostics are logged verbatim. Review logging contexts where these might be exported; avoid printing environment variables that could contain sensitive prefixes or reveal internal timing behavior in publicly collected logs.

  • Idempotence edge-cases: Namespace and GatewayClass creation now treat "already exists" as success. Confirm error-path differentiation—only treat exact conflict-or-exists responses as success; other server errors should return failure and not be swallowed.

  • Behavioral regression risk: Some warnings were converted to hard failures (return 1). Add coordinated test assertions or gating in CI to ensure higher-failure behavior is intended and does not break deployment flows that depended on previous “proceed anyway” semantics.

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main objective of the PR—improving robustness of deploy and test scripts through consistent error handling, configurable timeouts, and idempotent operations.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
scripts/validate-deployment.sh (3)

205-228: ⚠️ Potential issue | 🟠 Major

print_fail and print_warning access optional parameters without defaults.

With set -u enabled, these functions will fail when called with fewer than 4 arguments (for print_fail) or 3 arguments (for print_warning). Use parameter expansion with defaults.

 print_fail() {
     echo -e "${RED}❌ FAIL: $1${NC}"
-    if [ -n "$2" ]; then
-        echo -e "${RED}   Reason: $2${NC}"
+    if [ -n "${2:-}" ]; then
+        echo -e "${RED}   Reason: ${2}${NC}"
     fi
-    if [ -n "$3" ]; then
-        echo -e "${YELLOW}   Suggestion: $3${NC}"
+    if [ -n "${3:-}" ]; then
+        echo -e "${YELLOW}   Suggestion: ${3}${NC}"
     fi
-    if [ -n "$4" ]; then
-        echo -e "${YELLOW}   Suggestion: $4${NC}"
+    if [ -n "${4:-}" ]; then
+        echo -e "${YELLOW}   Suggestion: ${4}${NC}"
     fi
     ((PASSED++))
 }

 print_warning() {
     echo -e "${YELLOW}⚠️  WARNING: $1${NC}"
-    if [ -n "$2" ]; then
-        echo -e "${YELLOW}   Note: $2${NC}"
+    if [ -n "${2:-}" ]; then
+        echo -e "${YELLOW}   Note: ${2}${NC}"
     fi
-    if [ -n "$3" ]; then
-        echo -e "${YELLOW}   $3${NC}"
+    if [ -n "${3:-}" ]; then
+        echo -e "${YELLOW}   ${3}${NC}"
     fi
     ((WARNINGS++))
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/validate-deployment.sh` around lines 205 - 228, The functions
print_fail and print_warning currently reference positional params directly
which breaks under set -u when callers pass fewer args; update both to use safe
parameter expansion (e.g. ${1:-}, ${2:-}, ${3:-}, ${4:-}) when checking and
echoing optional values so missing parameters default to empty strings, and keep
their increments of FAILED and WARNINGS intact; locate the print_fail and
print_warning function bodies and replace direct "$2"/"$3"/"$4" and "$2"/"$3"
usages with parameter-expansion forms to avoid unbound variable errors.

31-31: ⚠️ Potential issue | 🟡 Minor

Unbound variable error when no arguments provided.

With set -u enabled (line 6), accessing $1 directly when no arguments are passed causes the script to fail with "unbound variable". The check should guard against this.

-if [ "$1" = "--help" ] || [ "$1" = "-h" ]; then
+if [ "${1:-}" = "--help" ] || [ "${1:-}" = "-h" ]; then
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/validate-deployment.sh` at line 31, The conditional that reads if [
"$1" = "--help" ] || [ "$1" = "-h" ] directly references $1 while set -u is
enabled, causing an unbound variable error when no args are passed; fix it by
guarding the check — either test the argument count first (e.g., check "$#" -gt
0) or use a safe parameter expansion for $1 (e.g., ${1-}) in that if condition
so the script never reads an unset variable.

645-645: ⚠️ Potential issue | 🔴 Critical

Unbound variable: $TIER causes script failure with set -u.

$TIER is used at line 645 within a conditional check (if [ -n "$TIER" ]), but it is never assigned or initialized anywhere in the script. The script explicitly enables set -u (line 6) to treat unset variables as errors. Reference to $TIER will cause an "unbound variable" error before the conditional can evaluate, halting execution.

Either define $TIER (extract from JWT token or accept as environment variable), or check if it should be sourced from deployment-helpers.sh. If $TIER is optional, initialize it with a default value before line 645.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/validate-deployment.sh` at line 645, The script fails under set -u
because the variable TIER is referenced without being initialized; fix by
ensuring TIER is defined before use: either initialize TIER to a safe default
(e.g., empty) at script start or explicitly populate it from the JWT parsing
logic or from deployment-helpers.sh (whichever intended), and/or replace the
direct reference with a guarded check that tolerates unset variables; update
references to the TIER variable (the conditional using [ -n "$TIER" ] and any
other uses) so they rely on the initialized variable or the source that provides
it.
test/e2e/scripts/prow_run_smoke_test.sh (1)

290-365: ⚠️ Potential issue | 🔴 Critical

Unresolved merge conflict markers break script execution.

The file contains git merge conflict markers (<<<<<<< HEAD, =======, >>>>>>> 82764cc) that must be resolved. This causes the SC1073/SC1072 shellcheck errors and will fail at runtime.

Resolve the conflict by choosing either the deadline-based approach (HEAD) or the retry-count-based approach (feature branch), then remove all conflict markers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/scripts/prow_run_smoke_test.sh` around lines 290 - 365, The file
contains unresolved git conflict markers around the MaaSModelRefs readiness
loop; remove the conflict markers and choose one implementation (either the
deadline/SECONDS approach using timeout/deadline and found_any/all_ready or the
retry-count approach using MAASMODELREF_TIMEOUT, max_retries, retries and the
controller bounce logic). Ensure you delete the lines starting with <<<<<<<,
=======, and >>>>>>>, keep a single coherent loop (with variables
all_ready/found_any or all_ready/retries as appropriate), and retain related
calls like oc get maasmodelrefs -n "$MODEL_NAMESPACE" and any kubectl rollout
restart/deployment/maas-controller handling if you pick the retry-bounce variant
so the script is syntactically valid and free of merge markers.
🧹 Nitpick comments (4)
scripts/deployment-helpers.sh (2)

1155-1158: Declare and assign separately to avoid masking return values (SC2155).

Same issue - split local declaration from command substitution assignment.

-    local state=$(kubectl get catalogsource "$name" -n "$namespace" \
+    local state
+    state=$(kubectl get catalogsource "$name" -n "$namespace" \
       -o jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/deployment-helpers.sh` around lines 1155 - 1158, The code declares
and assigns local state with command substitution (triggering SC2155); split the
declaration and assignment in the block that checks CatalogSource readiness:
first do local state, then assign state=$(kubectl get catalogsource "$name" -n
"$namespace" -o jsonpath='{.status.connectionState.lastObservedState}'
2>/dev/null); keep the surrounding logic using CATALOGSOURCE_TIMEOUT, name and
namespace the same so the echo uses the populated state value.

967-967: Declare and assign separately to avoid masking return values (SC2155).

When local and command substitution are combined, the exit status of the command is masked by local (which always succeeds). Split declaration and assignment.

-    local csv_name=$(find_csv_with_min_version "$operator_prefix" "$min_version" "$namespace")
+    local csv_name
+    csv_name=$(find_csv_with_min_version "$operator_prefix" "$min_version" "$namespace")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/deployment-helpers.sh` at line 967, The statement "local
csv_name=$(find_csv_with_min_version "$operator_prefix" "$min_version"
"$namespace")" masks the command's exit status; change it to declare the local
variable first (local csv_name) and then assign via separate command
substitution (csv_name=$(find_csv_with_min_version ...)) so the exit code from
find_csv_with_min_version is preserved; update the code around the
find_csv_with_min_version call to perform declaration and assignment on separate
lines.
scripts/validate-deployment.sh (1)

3-6: Redundant comment block.

Lines 18-19 duplicate the explanation already provided in the comment block at lines 3-5. Remove the redundancy.

Also applies to: 18-19

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/validate-deployment.sh` around lines 3 - 6, The script contains a
duplicated explanatory comment about "Bash strict mode" that repeats the same
text already provided above the set -uo pipefail line; remove the redundant
comment block later in the file so the single explanatory comment remains
directly adjacent to the set -uo pipefail directive (leave the original comment
explaining -u and -o pipefail intact and delete the later duplicate).
scripts/deploy.sh (1)

1400-1400: Integer arithmetic with potentially non-integer variable.

$((CUSTOM_CHECK_TIMEOUT / 2)) assumes CUSTOM_CHECK_TIMEOUT is a valid integer. If someone sets CUSTOM_CHECK_TIMEOUT=abc, this will fail. Consider adding validation.

+# Validate timeout is numeric
+if ! [[ "$CUSTOM_CHECK_TIMEOUT" =~ ^[0-9]+$ ]]; then
+  log_error "CUSTOM_CHECK_TIMEOUT must be a positive integer, got: $CUSTOM_CHECK_TIMEOUT"
+  return 1
+fi
 local kuadrant_initial_timeout=$((CUSTOM_CHECK_TIMEOUT / 2))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/deploy.sh` at line 1400, The assignment to kuadrant_initial_timeout
uses arithmetic expansion on CUSTOM_CHECK_TIMEOUT without validation; add a
validation step for CUSTOM_CHECK_TIMEOUT (e.g., check it matches a non-negative
integer via regex or use a safe numeric test) before computing
kuadrant_initial_timeout, and if validation fails either exit with a clear error
or fall back to a safe default value; update the code around
kuadrant_initial_timeout and any callers that rely on it (references to
kuadrant_initial_timeout and CUSTOM_CHECK_TIMEOUT) to use the
validated/normalized integer.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/scripts/prow_run_smoke_test.sh`:
- Line 351: Replace the hardcoded namespace string "llm" in the oc get
maasmodelrefs command with the script's namespace variable so the command uses
$MODEL_NAMESPACE (e.g., change the oc get maasmodelrefs -n llm ... invocation to
use -n "${MODEL_NAMESPACE}" or -n $MODEL_NAMESPACE); update the command that
reads from oc get maasmodelrefs so it consistently uses the MODEL_NAMESPACE
variable to match the earlier usage at line 317 and avoid namespace mismatch
when MODEL_NAMESPACE is overridden.

---

Outside diff comments:
In `@scripts/validate-deployment.sh`:
- Around line 205-228: The functions print_fail and print_warning currently
reference positional params directly which breaks under set -u when callers pass
fewer args; update both to use safe parameter expansion (e.g. ${1:-}, ${2:-},
${3:-}, ${4:-}) when checking and echoing optional values so missing parameters
default to empty strings, and keep their increments of FAILED and WARNINGS
intact; locate the print_fail and print_warning function bodies and replace
direct "$2"/"$3"/"$4" and "$2"/"$3" usages with parameter-expansion forms to
avoid unbound variable errors.
- Line 31: The conditional that reads if [ "$1" = "--help" ] || [ "$1" = "-h" ]
directly references $1 while set -u is enabled, causing an unbound variable
error when no args are passed; fix it by guarding the check — either test the
argument count first (e.g., check "$#" -gt 0) or use a safe parameter expansion
for $1 (e.g., ${1-}) in that if condition so the script never reads an unset
variable.
- Line 645: The script fails under set -u because the variable TIER is
referenced without being initialized; fix by ensuring TIER is defined before
use: either initialize TIER to a safe default (e.g., empty) at script start or
explicitly populate it from the JWT parsing logic or from deployment-helpers.sh
(whichever intended), and/or replace the direct reference with a guarded check
that tolerates unset variables; update references to the TIER variable (the
conditional using [ -n "$TIER" ] and any other uses) so they rely on the
initialized variable or the source that provides it.

In `@test/e2e/scripts/prow_run_smoke_test.sh`:
- Around line 290-365: The file contains unresolved git conflict markers around
the MaaSModelRefs readiness loop; remove the conflict markers and choose one
implementation (either the deadline/SECONDS approach using timeout/deadline and
found_any/all_ready or the retry-count approach using MAASMODELREF_TIMEOUT,
max_retries, retries and the controller bounce logic). Ensure you delete the
lines starting with <<<<<<<, =======, and >>>>>>>, keep a single coherent loop
(with variables all_ready/found_any or all_ready/retries as appropriate), and
retain related calls like oc get maasmodelrefs -n "$MODEL_NAMESPACE" and any
kubectl rollout restart/deployment/maas-controller handling if you pick the
retry-bounce variant so the script is syntactically valid and free of merge
markers.

---

Nitpick comments:
In `@scripts/deploy.sh`:
- Line 1400: The assignment to kuadrant_initial_timeout uses arithmetic
expansion on CUSTOM_CHECK_TIMEOUT without validation; add a validation step for
CUSTOM_CHECK_TIMEOUT (e.g., check it matches a non-negative integer via regex or
use a safe numeric test) before computing kuadrant_initial_timeout, and if
validation fails either exit with a clear error or fall back to a safe default
value; update the code around kuadrant_initial_timeout and any callers that rely
on it (references to kuadrant_initial_timeout and CUSTOM_CHECK_TIMEOUT) to use
the validated/normalized integer.

In `@scripts/deployment-helpers.sh`:
- Around line 1155-1158: The code declares and assigns local state with command
substitution (triggering SC2155); split the declaration and assignment in the
block that checks CatalogSource readiness: first do local state, then assign
state=$(kubectl get catalogsource "$name" -n "$namespace" -o
jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null); keep the
surrounding logic using CATALOGSOURCE_TIMEOUT, name and namespace the same so
the echo uses the populated state value.
- Line 967: The statement "local csv_name=$(find_csv_with_min_version
"$operator_prefix" "$min_version" "$namespace")" masks the command's exit
status; change it to declare the local variable first (local csv_name) and then
assign via separate command substitution (csv_name=$(find_csv_with_min_version
...)) so the exit code from find_csv_with_min_version is preserved; update the
code around the find_csv_with_min_version call to perform declaration and
assignment on separate lines.

In `@scripts/validate-deployment.sh`:
- Around line 3-6: The script contains a duplicated explanatory comment about
"Bash strict mode" that repeats the same text already provided above the set -uo
pipefail line; remove the redundant comment block later in the file so the
single explanatory comment remains directly adjacent to the set -uo pipefail
directive (leave the original comment explaining -u and -o pipefail intact and
delete the later duplicate).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: e467181c-8556-4454-b603-038477caaa3c

📥 Commits

Reviewing files that changed from the base of the PR and between d4a15d8 and 1c8cd94.

📒 Files selected for processing (4)
  • scripts/deploy.sh
  • scripts/deployment-helpers.sh
  • scripts/validate-deployment.sh
  • test/e2e/scripts/prow_run_smoke_test.sh

Add configurable timeout constants, consistent error handling, and
idempotency improvements across deployment and test scripts to make
deployments more predictable and debuggable.

- Add environment-variable-overridable timeout constants to deployment-helpers.sh
- Replace all hardcoded timeouts with configurable constants
- Improve error messages to include actual timeout values
- Add idempotency checks for gateway and namespace creation
- Add set -uo pipefail to validate-deployment.sh for stricter error handling
- Document timeout configuration in script headers
- Use improved MaaSModelRef wait logic from main with configurable timeout

Acceptance Criteria:
- [x] Consistent shell options (set -euo pipefail) across all scripts
- [x] Clear error messages with timeout context
- [x] Configurable timeouts (16 environment variables)
- [x] Idempotent operations for namespace, gateway, gatewayclass
- [x] No hardcoded timeouts

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@somya-bhatnagar somya-bhatnagar force-pushed the somya-harden-deploy-and-test-scripts branch from 1c8cd94 to 1bf660c Compare March 31, 2026 15:15
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
scripts/deployment-helpers.sh (2)

200-215: ⚠️ Potential issue | 🟠 Major

Bound the CSV creation poll; this still hangs indefinitely.

After status.currentCSV is populated, Lines 208-210 loop without any deadline. If OLM points at a CSV that never materializes, this bypasses both SUBSCRIPTION_TIMEOUT and CSV_TIMEOUT and the job can stall forever.

Suggested fix
   local csv
   csv=$(kubectl get subscription.operators.coreos.com -n "$ns" "$name" -o jsonpath='{.status.currentCSV}')
 
   # Because, sometimes, the CSV is not there immediately.
+  local csv_deadline=$((SECONDS + CSV_TIMEOUT))
   while ! kubectl get -n "$ns" csv "$csv" > /dev/null 2>&1; do
+    if (( SECONDS >= csv_deadline )); then
+      echo "    * ERROR: CSV $csv was not created within ${CSV_TIMEOUT}s"
+      return 1
+    fi
     sleep 1
   done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/deployment-helpers.sh` around lines 200 - 215, The unbounded while
loop waiting for the CSV (variable csv) can hang indefinitely; modify that loop
to enforce a deadline by recording a start time (e.g. start_ts=$(date +%s)) and
on each iteration compute elapsed and compare to a timeout (reuse CSV_TIMEOUT or
introduce CSV_POLL_TIMEOUT) so if elapsed >= timeout you echo an error and
return 1; keep the same kubectl existence check (kubectl get -n "$ns" csv "$csv"
> /dev/null 2>&1) and sleep 1 between attempts, and ensure downstream kubectl
wait -n "$ns" --for=jsonpath=... csv "$csv" still runs if the CSV becomes
available before the timeout.

440-468: ⚠️ Potential issue | 🔴 Critical

Remove eval from wait_for_custom_check.

Severity: critical (CWE-78). This executes shell source, not just a probe command. Any caller that interpolates env or CLI data into check_command can turn a readiness check into arbitrary command execution. Accept a function name or argv array instead of a string.

As per coding guidelines, **/*.sh: Avoid eval and dynamic command execution.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/deployment-helpers.sh` around lines 440 - 468, The
wait_for_custom_check function currently uses eval on a string and must be
changed to avoid command injection: stop accepting a single command string and
instead accept either (A) a function name (e.g., a shell function the caller
defines) or (B) a command + argv array passed as separate positional parameters;
remove eval and invoke the check safely by calling the function name directly
(e.g., "$check_fn") or by executing the command array via "$@"/"${cmd[@]}" so no
shell interpolation occurs. Update the parameter parsing in
wait_for_custom_check to treat the remainder of arguments as the command/args
(or a single function name), adjust callers to pass a function name or split
args, and keep the existing timeout/interval behavior and return values
unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/deploy.sh`:
- Around line 1115-1125: The readiness checks for the webhook currently only log
warnings and continue, which delays a known prerequisite failure into later
steps; update the two timeout branches (the wait_for_resource failure and the
kubectl wait || block) to treat timeouts as fatal: replace the log_warn calls
with log_error (or similar) and exit with a non-zero status (e.g., exit 1) so
the script stops instead of proceeding to apply_dsci/apply_dsc; ensure you
reference the existing symbols webhook_deployment, webhook_namespace,
ROLLOUT_TIMEOUT, wait_for_resource, and the kubectl wait block when making the
change.
- Around line 636-643: The current get/create split is vulnerable to TOCTOU
races; instead, call kubectl create namespace "$NAMESPACE" directly and, if it
fails, check whether the namespace now exists and treat that as success—i.e.,
after a failed kubectl create, run kubectl get namespace "$NAMESPACE" and only
call log_error/return 1 if that check also fails. Update the block using the
NAMESPACE variable and existing logging helpers (log_info, log_error, log_debug)
so creation errors that are actually AlreadyExists are treated as a successful
idempotent outcome.

In `@scripts/deployment-helpers.sh`:
- Around line 1251-1258: The timeout path in the wait_authorino_ready function
currently logs a warning but returns 0, which misreports a timeout as success;
change the final timeout handling in wait_authorino_ready to return 1 (non-zero)
on failure instead of 0, and make the same change in the analogous helper later
in the file (the function covering the block around 1352-1398) so both helpers
exit non-zero when auth probing times out; keep the warning log but ensure the
function's return code signals failure.
- Around line 110-127: The new timeout environment variables (e.g.,
CUSTOM_RESOURCE_TIMEOUT, CSV_TIMEOUT, ROLLOUT_TIMEOUT, etc.) must be validated
at script startup so non-integer or empty values (like "3m" or "") produce a
clear error; add a validation block after the readonly declarations that checks
each *_TIMEOUT is set and matches a positive-integer regex (or is unset and
defaults are applied), and if a value is invalid print a descriptive error and
exit non-zero; ensure this validation is used before any arithmetic/tests or
calls that pass values to kubectl --timeout so all uses of
CUSTOM_RESOURCE_TIMEOUT, NAMESPACE_TIMEOUT, RESOURCE_TIMEOUT, CRD_TIMEOUT,
CSV_TIMEOUT, SUBSCRIPTION_TIMEOUT, POD_TIMEOUT, WEBHOOK_TIMEOUT,
CUSTOM_CHECK_TIMEOUT, AUTHORINO_TIMEOUT, ROLLOUT_TIMEOUT,
KUBECONFIG_WAIT_TIMEOUT, CATALOGSOURCE_TIMEOUT, LLMIS_TIMEOUT,
MAASMODELREF_TIMEOUT, and AUTHPOLICY_TIMEOUT are safe.
- Around line 479-497: The inner kubectl wait calls currently use a hardcoded
60s; replace them so they use the configured timeout budget instead by computing
the remaining time (e.g., remaining=$((timeout - elapsed)) with a floor of 1)
and passing --timeout="${remaining}s" to kubectl wait; update both occurrences
that reference kubectl wait and ensure you adjust elapsed/interval logic so
remaining never becomes non-positive before calling wait.

---

Outside diff comments:
In `@scripts/deployment-helpers.sh`:
- Around line 200-215: The unbounded while loop waiting for the CSV (variable
csv) can hang indefinitely; modify that loop to enforce a deadline by recording
a start time (e.g. start_ts=$(date +%s)) and on each iteration compute elapsed
and compare to a timeout (reuse CSV_TIMEOUT or introduce CSV_POLL_TIMEOUT) so if
elapsed >= timeout you echo an error and return 1; keep the same kubectl
existence check (kubectl get -n "$ns" csv "$csv" > /dev/null 2>&1) and sleep 1
between attempts, and ensure downstream kubectl wait -n "$ns" --for=jsonpath=...
csv "$csv" still runs if the CSV becomes available before the timeout.
- Around line 440-468: The wait_for_custom_check function currently uses eval on
a string and must be changed to avoid command injection: stop accepting a single
command string and instead accept either (A) a function name (e.g., a shell
function the caller defines) or (B) a command + argv array passed as separate
positional parameters; remove eval and invoke the check safely by calling the
function name directly (e.g., "$check_fn") or by executing the command array via
"$@"/"${cmd[@]}" so no shell interpolation occurs. Update the parameter parsing
in wait_for_custom_check to treat the remainder of arguments as the command/args
(or a single function name), adjust callers to pass a function name or split
args, and keep the existing timeout/interval behavior and return values
unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: cfd527f5-2500-4815-a0a8-600b789c2922

📥 Commits

Reviewing files that changed from the base of the PR and between 1c8cd94 and 1bf660c.

📒 Files selected for processing (4)
  • scripts/deploy.sh
  • scripts/deployment-helpers.sh
  • scripts/validate-deployment.sh
  • test/e2e/scripts/prow_run_smoke_test.sh
✅ Files skipped from review due to trivial changes (1)
  • scripts/validate-deployment.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/scripts/prow_run_smoke_test.sh

somya-bhatnagar and others added 4 commits March 31, 2026 13:17
- Fix TOCTOU race in namespace creation (use create-then-check pattern)
- Make webhook readiness checks fatal errors instead of warnings
- Add timeout variable validation at script startup
- Fix hardcoded kubectl wait timeouts in wait_for_namespace
- Fix wait_authorino_ready to return 1 on timeout/failure instead of 0

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@somya-bhatnagar somya-bhatnagar added the ready-for-review All changes are complete and CI checks have passed. label Mar 31, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
scripts/deployment-helpers.sh (2)

1000-1000: Declare and assign separately to capture return value (SC2155).

local always returns 0, masking the exit status of find_csv_with_min_version. If the function fails, $csv_name will be empty but the failure won't propagate correctly for set -e contexts.

-    local csv_name=$(find_csv_with_min_version "$operator_prefix" "$min_version" "$namespace")
+    local csv_name
+    csv_name=$(find_csv_with_min_version "$operator_prefix" "$min_version" "$namespace")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/deployment-helpers.sh` at line 1000, The current combined
declaration/assignment masks the exit status (SC2155) — change the line to
declare the variable first and then assign the output of
find_csv_with_min_version to it, e.g. declare or local csv_name on its own, then
run csv_name=$(find_csv_with_min_version "$operator_prefix" "$min_version"
"$namespace") and propagate failure (e.g. check the command's exit code or use
"|| return" / "|| exit" as appropriate) so failures from
find_csv_with_min_version are not swallowed in set -e contexts.

1189-1191: Declare and assign separately to capture return value (SC2155).

Same issue—local masks kubectl's exit status.

-    local state=$(kubectl get catalogsource "$name" -n "$namespace" \
-      -o jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null)
+    local state
+    state=$(kubectl get catalogsource "$name" -n "$namespace" \
+      -o jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/deployment-helpers.sh` around lines 1189 - 1191, The kubectl exit
status is being masked by declaring and assigning `state` in one line; split the
declaration and assignment so you can capture kubectl's return code: first
`local state`, then run the `kubectl get catalogsource "$name" -n "$namespace"
-o jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null` into
`state` and capture its exit status (e.g., `rc=$?`) so failures aren't hidden;
keep the existing error echo that uses `state` and use the captured `rc` where
needed (references: variable `state`, command `kubectl get catalogsource`, and
`CATALOGSOURCE_TIMEOUT`).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@scripts/deployment-helpers.sh`:
- Line 1000: The current combined declaration/assignment masks the exit status
(SC2155) — change the line to declare the variable first and then assign the
output of find_csv_with_min_version to it, e.g. declare or local csv_name on its
own, then run csv_name=$(find_csv_with_min_version "$operator_prefix"
"$min_version" "$namespace") and propagate failure (e.g. check the command's
exit code or use "|| return" / "|| exit" as appropriate) so failures from
find_csv_with_min_version are not swallowed in set -e contexts.
- Around line 1189-1191: The kubectl exit status is being masked by declaring
and assigning `state` in one line; split the declaration and assignment so you
can capture kubectl's return code: first `local state`, then run the `kubectl
get catalogsource "$name" -n "$namespace" -o
jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null` into `state`
and capture its exit status (e.g., `rc=$?`) so failures aren't hidden; keep the
existing error echo that uses `state` and use the captured `rc` where needed
(references: variable `state`, command `kubectl get catalogsource`, and
`CATALOGSOURCE_TIMEOUT`).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 58777453-33bb-49e8-9d92-8c5033ef6ae0

📥 Commits

Reviewing files that changed from the base of the PR and between f6b1437 and ee60900.

📒 Files selected for processing (2)
  • scripts/deploy.sh
  • scripts/deployment-helpers.sh

The gateway URL determination is not inside a retry loop, so making it
a fatal error would cause premature failures when the gateway is still
configuring. The original behavior (skip auth verification and return
success) is intentional and correct.

The actual issue was the timeout path in the auth verification loop,
which is still fixed (returns 1 on timeout).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
scripts/deployment-helpers.sh (2)

229-244: ⚠️ Potential issue | 🟠 Major

Bound the CSV existence polling loop to prevent indefinite hangs.

Line 237 to Line 239 loops forever if currentCSV is set but the CSV object never materializes. That bypasses the timeout guarantees added in this PR and can deadlock CI/deploy flows.

Suggested fix
   local csv
   csv=$(kubectl get subscription.operators.coreos.com -n "$ns" "$name" -o jsonpath='{.status.currentCSV}')

   # Because, sometimes, the CSV is not there immediately.
-  while ! kubectl get -n "$ns" csv "$csv" > /dev/null 2>&1; do
-    sleep 1
-  done
+  if ! wait_for_resource "csv" "$csv" "$ns" "$CSV_TIMEOUT"; then
+    echo "    * ERROR: CSV $csv for Subscription $ns/$name was not found within ${CSV_TIMEOUT}s" >&2
+    return 1
+  fi

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/deployment-helpers.sh` around lines 229 - 244, The while loop that
polls for the CSV object (the kubectl get -n "$ns" csv "$csv" > /dev/null 2>&1
loop) can hang indefinitely; bound it by using a timeout/deadline derived from
CSV_TIMEOUT (or a new CSV_POLL_TIMEOUT) by recording the start time and breaking
with an error/return 1 if elapsed seconds exceed the timeout, and keep the
existing sleep 1 between retries; ensure the error message mentions CSV and
timeout (same style as the later kubectl wait error) so callers see why the
function failed.

477-488: ⚠️ Potential issue | 🟠 Major

Remove eval from custom checks to eliminate command-injection risk (CWE-78).

Severity: major.
Exploit scenario: if check_command is composed from env/CLI/user-controlled input, eval can execute arbitrary shell payloads (e.g., command chaining), leading to unintended cluster actions.

Suggested fix
-# wait_for_custom_check description check_command timeout interval
+# wait_for_custom_check description timeout interval -- command [args...]
 wait_for_custom_check() {
   local description=${1?description is required}; shift
-  local check_command=${1?check command is required}; shift
-  local timeout=${1:-$CUSTOM_CHECK_TIMEOUT}; shift || true
-  local interval=${1:-5}; shift || true
+  local timeout=${1:-$CUSTOM_CHECK_TIMEOUT}; shift || true
+  local interval=${1:-5}; shift || true
+  [[ "${1:-}" == "--" ]] && shift
+  local -a check_cmd=( "$@" )
+  if [[ ${`#check_cmd`[@]} -eq 0 ]]; then
+    log_error "wait_for_custom_check requires a command after --"
+    return 1
+  fi

   log_info "Waiting for: $description (timeout: ${timeout}s)"

   local elapsed=0
   while [ $elapsed -lt $timeout ]; do
-    if eval "$check_command"; then
+    if "${check_cmd[@]}"; then
       log_info "$description - Ready"
       return 0
     fi

As per coding guidelines, "SHELL SCRIPT SECURITY (CWE-78): ... Avoid eval and dynamic command execution".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/deployment-helpers.sh` around lines 477 - 488, The
wait_for_custom_check function currently uses eval on the check_command variable
which enables command injection; change the implementation to stop using eval
and instead execute the check command without string evaluation by treating it
as an argument array (accept the command as positional arguments or as an
explicit array) and running it directly (e.g., call the command via "$@" or
"${check_command[@]}") in the if test; update the function signature/usage so
callers pass the command and its args as separate tokens rather than a single
concatenated string, leaving other locals (timeout, interval, elapsed) intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@scripts/deployment-helpers.sh`:
- Around line 229-244: The while loop that polls for the CSV object (the kubectl
get -n "$ns" csv "$csv" > /dev/null 2>&1 loop) can hang indefinitely; bound it
by using a timeout/deadline derived from CSV_TIMEOUT (or a new CSV_POLL_TIMEOUT)
by recording the start time and breaking with an error/return 1 if elapsed
seconds exceed the timeout, and keep the existing sleep 1 between retries;
ensure the error message mentions CSV and timeout (same style as the later
kubectl wait error) so callers see why the function failed.
- Around line 477-488: The wait_for_custom_check function currently uses eval on
the check_command variable which enables command injection; change the
implementation to stop using eval and instead execute the check command without
string evaluation by treating it as an argument array (accept the command as
positional arguments or as an explicit array) and running it directly (e.g.,
call the command via "$@" or "${check_command[@]}") in the if test; update the
function signature/usage so callers pass the command and its args as separate
tokens rather than a single concatenated string, leaving other locals (timeout,
interval, elapsed) intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 05f59d6a-e150-4da6-9977-1f5db571684e

📥 Commits

Reviewing files that changed from the base of the PR and between ee60900 and cc774ff.

📒 Files selected for processing (1)
  • scripts/deployment-helpers.sh

echo " WARNING: Auth request verification timed out, continuing anyway"
return 0
echo " ERROR: Auth request verification timed out after ${timeout}s"
return 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would revert return 1 here as well. Let's continue anyway like we did previously with return 0.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Its here now

Auth request verification is a supplementary check that happens after
Authorino CR status is already confirmed ready. Timing out on the auth
verification shouldn't be fatal - it should continue with a warning,
similar to the gateway URL check.

This reverts the timeout behavior to match the original intentional
design: proceed with deployment even if auth verification times out.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@somya-bhatnagar somya-bhatnagar requested a review from yu-teo March 31, 2026 18:56
somya-bhatnagar and others added 2 commits March 31, 2026 15:18
The kubectl wait for Available condition may not succeed if the webhook
is still starting up. The original behavior (warn and proceed) is correct
since we already verified the deployment exists. Making this fatal is too
aggressive and can cause deployment failures when the webhook is legitimately
still initializing.

The first check (deployment exists) remains fatal as intended.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@jland-redhat
Copy link
Copy Markdown
Contributor

Looks like the validation script got an exception

 Deployment Validation
  Using namespace: opendatahub
  �[0;31m❌ FAIL: Not running on OpenShift�[0m
  �[0;31m   Reason: This validation script is designed for OpenShift clusters�[0m
  �[1;33m   Suggestion: Use a different validation approach for vanilla Kubernetes�[0m
  /workspace/source/scripts/validate-deployment.sh: line 213: $4: unbound variable
  ⚠️  First validation attempt failed, waiting 30 seconds and retrying...
  �[0;31m❌ FAIL: Not running on OpenShift�[0m
  �[0;31m   Reason: This validation script is designed for OpenShift clusters�[0m
  �[1;33m   Suggestion: Use a different validation approach for vanilla Kubernetes�[0m
  /workspace/source/scripts/validate-deployment.sh: line 213: $4: unbound variable
  ❌ ERROR: Deployment validation failed after retry

The print_fail and print_warning functions check optional parameters $2,
$3, and $4, but with set -u (nounset), accessing undefined parameters
causes the script to fail with "unbound variable" error.

Fixed by using ${N:-} syntax to provide empty defaults for optional
parameters, making the checks safe with set -u.

Error: /workspace/source/scripts/validate-deployment.sh: line 213: $4: unbound variable

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Apr 1, 2026

@somya-bhatnagar: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-test-maas 1dcd523 link false /test e2e-test-maas

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

The OpenShift check was failing in Konflux when kubectl api-resources
timed out or errored. Changed to:
1. Capture kubectl output separately
2. Check command exit code first
3. Warn and continue if kubectl fails (transient timeout)
4. Only fail if kubectl succeeds but we're not on OpenShift

This prevents validation failures due to transient kubectl timeouts
while still ensuring we're on OpenShift when the API is responsive.

Tested on live cluster with simulated kubectl failures.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved ready-for-review All changes are complete and CI checks have passed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants