feat: improve deploy and test script robustness#650
feat: improve deploy and test script robustness#650somya-bhatnagar wants to merge 11 commits intoopendatahub-io:mainfrom
Conversation
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: somya-bhatnagar The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughReplaces many hardcoded wait/timeout values across deployment and test scripts with environment-overridable readonly timeout variables declared in scripts/deployment-helpers.sh. Propagates those timeouts into rollout, kubectl wait, and custom readiness helpers (CRD, CSV, webhook, pod, Gateway, Authorino, LLMInference, MaaSModelRef). Changes several helpers and callers to return non-zero on timeout/error instead of continuing, makes namespace and some resource creations idempotent, adjusts Gateway/GatewayClass existence handling, and enables stricter shell behavior ( Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Actionable Issues
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
scripts/validate-deployment.sh (3)
205-228:⚠️ Potential issue | 🟠 Major
print_failandprint_warningaccess optional parameters without defaults.With
set -uenabled, these functions will fail when called with fewer than 4 arguments (forprint_fail) or 3 arguments (forprint_warning). Use parameter expansion with defaults.print_fail() { echo -e "${RED}❌ FAIL: $1${NC}" - if [ -n "$2" ]; then - echo -e "${RED} Reason: $2${NC}" + if [ -n "${2:-}" ]; then + echo -e "${RED} Reason: ${2}${NC}" fi - if [ -n "$3" ]; then - echo -e "${YELLOW} Suggestion: $3${NC}" + if [ -n "${3:-}" ]; then + echo -e "${YELLOW} Suggestion: ${3}${NC}" fi - if [ -n "$4" ]; then - echo -e "${YELLOW} Suggestion: $4${NC}" + if [ -n "${4:-}" ]; then + echo -e "${YELLOW} Suggestion: ${4}${NC}" fi ((PASSED++)) } print_warning() { echo -e "${YELLOW}⚠️ WARNING: $1${NC}" - if [ -n "$2" ]; then - echo -e "${YELLOW} Note: $2${NC}" + if [ -n "${2:-}" ]; then + echo -e "${YELLOW} Note: ${2}${NC}" fi - if [ -n "$3" ]; then - echo -e "${YELLOW} $3${NC}" + if [ -n "${3:-}" ]; then + echo -e "${YELLOW} ${3}${NC}" fi ((WARNINGS++)) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/validate-deployment.sh` around lines 205 - 228, The functions print_fail and print_warning currently reference positional params directly which breaks under set -u when callers pass fewer args; update both to use safe parameter expansion (e.g. ${1:-}, ${2:-}, ${3:-}, ${4:-}) when checking and echoing optional values so missing parameters default to empty strings, and keep their increments of FAILED and WARNINGS intact; locate the print_fail and print_warning function bodies and replace direct "$2"/"$3"/"$4" and "$2"/"$3" usages with parameter-expansion forms to avoid unbound variable errors.
31-31:⚠️ Potential issue | 🟡 MinorUnbound variable error when no arguments provided.
With
set -uenabled (line 6), accessing$1directly when no arguments are passed causes the script to fail with "unbound variable". The check should guard against this.-if [ "$1" = "--help" ] || [ "$1" = "-h" ]; then +if [ "${1:-}" = "--help" ] || [ "${1:-}" = "-h" ]; then🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/validate-deployment.sh` at line 31, The conditional that reads if [ "$1" = "--help" ] || [ "$1" = "-h" ] directly references $1 while set -u is enabled, causing an unbound variable error when no args are passed; fix it by guarding the check — either test the argument count first (e.g., check "$#" -gt 0) or use a safe parameter expansion for $1 (e.g., ${1-}) in that if condition so the script never reads an unset variable.
645-645:⚠️ Potential issue | 🔴 CriticalUnbound variable:
$TIERcauses script failure withset -u.
$TIERis used at line 645 within a conditional check (if [ -n "$TIER" ]), but it is never assigned or initialized anywhere in the script. The script explicitly enablesset -u(line 6) to treat unset variables as errors. Reference to$TIERwill cause an "unbound variable" error before the conditional can evaluate, halting execution.Either define
$TIER(extract from JWT token or accept as environment variable), or check if it should be sourced fromdeployment-helpers.sh. If$TIERis optional, initialize it with a default value before line 645.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/validate-deployment.sh` at line 645, The script fails under set -u because the variable TIER is referenced without being initialized; fix by ensuring TIER is defined before use: either initialize TIER to a safe default (e.g., empty) at script start or explicitly populate it from the JWT parsing logic or from deployment-helpers.sh (whichever intended), and/or replace the direct reference with a guarded check that tolerates unset variables; update references to the TIER variable (the conditional using [ -n "$TIER" ] and any other uses) so they rely on the initialized variable or the source that provides it.test/e2e/scripts/prow_run_smoke_test.sh (1)
290-365:⚠️ Potential issue | 🔴 CriticalUnresolved merge conflict markers break script execution.
The file contains git merge conflict markers (
<<<<<<< HEAD,=======,>>>>>>> 82764cc) that must be resolved. This causes the SC1073/SC1072 shellcheck errors and will fail at runtime.Resolve the conflict by choosing either the deadline-based approach (HEAD) or the retry-count-based approach (feature branch), then remove all conflict markers.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/scripts/prow_run_smoke_test.sh` around lines 290 - 365, The file contains unresolved git conflict markers around the MaaSModelRefs readiness loop; remove the conflict markers and choose one implementation (either the deadline/SECONDS approach using timeout/deadline and found_any/all_ready or the retry-count approach using MAASMODELREF_TIMEOUT, max_retries, retries and the controller bounce logic). Ensure you delete the lines starting with <<<<<<<, =======, and >>>>>>>, keep a single coherent loop (with variables all_ready/found_any or all_ready/retries as appropriate), and retain related calls like oc get maasmodelrefs -n "$MODEL_NAMESPACE" and any kubectl rollout restart/deployment/maas-controller handling if you pick the retry-bounce variant so the script is syntactically valid and free of merge markers.
🧹 Nitpick comments (4)
scripts/deployment-helpers.sh (2)
1155-1158: Declare and assign separately to avoid masking return values (SC2155).Same issue - split
localdeclaration from command substitution assignment.- local state=$(kubectl get catalogsource "$name" -n "$namespace" \ + local state + state=$(kubectl get catalogsource "$name" -n "$namespace" \ -o jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/deployment-helpers.sh` around lines 1155 - 1158, The code declares and assigns local state with command substitution (triggering SC2155); split the declaration and assignment in the block that checks CatalogSource readiness: first do local state, then assign state=$(kubectl get catalogsource "$name" -n "$namespace" -o jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null); keep the surrounding logic using CATALOGSOURCE_TIMEOUT, name and namespace the same so the echo uses the populated state value.
967-967: Declare and assign separately to avoid masking return values (SC2155).When
localand command substitution are combined, the exit status of the command is masked bylocal(which always succeeds). Split declaration and assignment.- local csv_name=$(find_csv_with_min_version "$operator_prefix" "$min_version" "$namespace") + local csv_name + csv_name=$(find_csv_with_min_version "$operator_prefix" "$min_version" "$namespace")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/deployment-helpers.sh` at line 967, The statement "local csv_name=$(find_csv_with_min_version "$operator_prefix" "$min_version" "$namespace")" masks the command's exit status; change it to declare the local variable first (local csv_name) and then assign via separate command substitution (csv_name=$(find_csv_with_min_version ...)) so the exit code from find_csv_with_min_version is preserved; update the code around the find_csv_with_min_version call to perform declaration and assignment on separate lines.scripts/validate-deployment.sh (1)
3-6: Redundant comment block.Lines 18-19 duplicate the explanation already provided in the comment block at lines 3-5. Remove the redundancy.
Also applies to: 18-19
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/validate-deployment.sh` around lines 3 - 6, The script contains a duplicated explanatory comment about "Bash strict mode" that repeats the same text already provided above the set -uo pipefail line; remove the redundant comment block later in the file so the single explanatory comment remains directly adjacent to the set -uo pipefail directive (leave the original comment explaining -u and -o pipefail intact and delete the later duplicate).scripts/deploy.sh (1)
1400-1400: Integer arithmetic with potentially non-integer variable.
$((CUSTOM_CHECK_TIMEOUT / 2))assumesCUSTOM_CHECK_TIMEOUTis a valid integer. If someone setsCUSTOM_CHECK_TIMEOUT=abc, this will fail. Consider adding validation.+# Validate timeout is numeric +if ! [[ "$CUSTOM_CHECK_TIMEOUT" =~ ^[0-9]+$ ]]; then + log_error "CUSTOM_CHECK_TIMEOUT must be a positive integer, got: $CUSTOM_CHECK_TIMEOUT" + return 1 +fi local kuadrant_initial_timeout=$((CUSTOM_CHECK_TIMEOUT / 2))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/deploy.sh` at line 1400, The assignment to kuadrant_initial_timeout uses arithmetic expansion on CUSTOM_CHECK_TIMEOUT without validation; add a validation step for CUSTOM_CHECK_TIMEOUT (e.g., check it matches a non-negative integer via regex or use a safe numeric test) before computing kuadrant_initial_timeout, and if validation fails either exit with a clear error or fall back to a safe default value; update the code around kuadrant_initial_timeout and any callers that rely on it (references to kuadrant_initial_timeout and CUSTOM_CHECK_TIMEOUT) to use the validated/normalized integer.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/scripts/prow_run_smoke_test.sh`:
- Line 351: Replace the hardcoded namespace string "llm" in the oc get
maasmodelrefs command with the script's namespace variable so the command uses
$MODEL_NAMESPACE (e.g., change the oc get maasmodelrefs -n llm ... invocation to
use -n "${MODEL_NAMESPACE}" or -n $MODEL_NAMESPACE); update the command that
reads from oc get maasmodelrefs so it consistently uses the MODEL_NAMESPACE
variable to match the earlier usage at line 317 and avoid namespace mismatch
when MODEL_NAMESPACE is overridden.
---
Outside diff comments:
In `@scripts/validate-deployment.sh`:
- Around line 205-228: The functions print_fail and print_warning currently
reference positional params directly which breaks under set -u when callers pass
fewer args; update both to use safe parameter expansion (e.g. ${1:-}, ${2:-},
${3:-}, ${4:-}) when checking and echoing optional values so missing parameters
default to empty strings, and keep their increments of FAILED and WARNINGS
intact; locate the print_fail and print_warning function bodies and replace
direct "$2"/"$3"/"$4" and "$2"/"$3" usages with parameter-expansion forms to
avoid unbound variable errors.
- Line 31: The conditional that reads if [ "$1" = "--help" ] || [ "$1" = "-h" ]
directly references $1 while set -u is enabled, causing an unbound variable
error when no args are passed; fix it by guarding the check — either test the
argument count first (e.g., check "$#" -gt 0) or use a safe parameter expansion
for $1 (e.g., ${1-}) in that if condition so the script never reads an unset
variable.
- Line 645: The script fails under set -u because the variable TIER is
referenced without being initialized; fix by ensuring TIER is defined before
use: either initialize TIER to a safe default (e.g., empty) at script start or
explicitly populate it from the JWT parsing logic or from deployment-helpers.sh
(whichever intended), and/or replace the direct reference with a guarded check
that tolerates unset variables; update references to the TIER variable (the
conditional using [ -n "$TIER" ] and any other uses) so they rely on the
initialized variable or the source that provides it.
In `@test/e2e/scripts/prow_run_smoke_test.sh`:
- Around line 290-365: The file contains unresolved git conflict markers around
the MaaSModelRefs readiness loop; remove the conflict markers and choose one
implementation (either the deadline/SECONDS approach using timeout/deadline and
found_any/all_ready or the retry-count approach using MAASMODELREF_TIMEOUT,
max_retries, retries and the controller bounce logic). Ensure you delete the
lines starting with <<<<<<<, =======, and >>>>>>>, keep a single coherent loop
(with variables all_ready/found_any or all_ready/retries as appropriate), and
retain related calls like oc get maasmodelrefs -n "$MODEL_NAMESPACE" and any
kubectl rollout restart/deployment/maas-controller handling if you pick the
retry-bounce variant so the script is syntactically valid and free of merge
markers.
---
Nitpick comments:
In `@scripts/deploy.sh`:
- Line 1400: The assignment to kuadrant_initial_timeout uses arithmetic
expansion on CUSTOM_CHECK_TIMEOUT without validation; add a validation step for
CUSTOM_CHECK_TIMEOUT (e.g., check it matches a non-negative integer via regex or
use a safe numeric test) before computing kuadrant_initial_timeout, and if
validation fails either exit with a clear error or fall back to a safe default
value; update the code around kuadrant_initial_timeout and any callers that rely
on it (references to kuadrant_initial_timeout and CUSTOM_CHECK_TIMEOUT) to use
the validated/normalized integer.
In `@scripts/deployment-helpers.sh`:
- Around line 1155-1158: The code declares and assigns local state with command
substitution (triggering SC2155); split the declaration and assignment in the
block that checks CatalogSource readiness: first do local state, then assign
state=$(kubectl get catalogsource "$name" -n "$namespace" -o
jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null); keep the
surrounding logic using CATALOGSOURCE_TIMEOUT, name and namespace the same so
the echo uses the populated state value.
- Line 967: The statement "local csv_name=$(find_csv_with_min_version
"$operator_prefix" "$min_version" "$namespace")" masks the command's exit
status; change it to declare the local variable first (local csv_name) and then
assign via separate command substitution (csv_name=$(find_csv_with_min_version
...)) so the exit code from find_csv_with_min_version is preserved; update the
code around the find_csv_with_min_version call to perform declaration and
assignment on separate lines.
In `@scripts/validate-deployment.sh`:
- Around line 3-6: The script contains a duplicated explanatory comment about
"Bash strict mode" that repeats the same text already provided above the set -uo
pipefail line; remove the redundant comment block later in the file so the
single explanatory comment remains directly adjacent to the set -uo pipefail
directive (leave the original comment explaining -u and -o pipefail intact and
delete the later duplicate).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: e467181c-8556-4454-b603-038477caaa3c
📒 Files selected for processing (4)
scripts/deploy.shscripts/deployment-helpers.shscripts/validate-deployment.shtest/e2e/scripts/prow_run_smoke_test.sh
Add configurable timeout constants, consistent error handling, and idempotency improvements across deployment and test scripts to make deployments more predictable and debuggable. - Add environment-variable-overridable timeout constants to deployment-helpers.sh - Replace all hardcoded timeouts with configurable constants - Improve error messages to include actual timeout values - Add idempotency checks for gateway and namespace creation - Add set -uo pipefail to validate-deployment.sh for stricter error handling - Document timeout configuration in script headers - Use improved MaaSModelRef wait logic from main with configurable timeout Acceptance Criteria: - [x] Consistent shell options (set -euo pipefail) across all scripts - [x] Clear error messages with timeout context - [x] Configurable timeouts (16 environment variables) - [x] Idempotent operations for namespace, gateway, gatewayclass - [x] No hardcoded timeouts Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1c8cd94 to
1bf660c
Compare
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
scripts/deployment-helpers.sh (2)
200-215:⚠️ Potential issue | 🟠 MajorBound the CSV creation poll; this still hangs indefinitely.
After
status.currentCSVis populated, Lines 208-210 loop without any deadline. If OLM points at a CSV that never materializes, this bypasses bothSUBSCRIPTION_TIMEOUTandCSV_TIMEOUTand the job can stall forever.Suggested fix
local csv csv=$(kubectl get subscription.operators.coreos.com -n "$ns" "$name" -o jsonpath='{.status.currentCSV}') # Because, sometimes, the CSV is not there immediately. + local csv_deadline=$((SECONDS + CSV_TIMEOUT)) while ! kubectl get -n "$ns" csv "$csv" > /dev/null 2>&1; do + if (( SECONDS >= csv_deadline )); then + echo " * ERROR: CSV $csv was not created within ${CSV_TIMEOUT}s" + return 1 + fi sleep 1 done🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/deployment-helpers.sh` around lines 200 - 215, The unbounded while loop waiting for the CSV (variable csv) can hang indefinitely; modify that loop to enforce a deadline by recording a start time (e.g. start_ts=$(date +%s)) and on each iteration compute elapsed and compare to a timeout (reuse CSV_TIMEOUT or introduce CSV_POLL_TIMEOUT) so if elapsed >= timeout you echo an error and return 1; keep the same kubectl existence check (kubectl get -n "$ns" csv "$csv" > /dev/null 2>&1) and sleep 1 between attempts, and ensure downstream kubectl wait -n "$ns" --for=jsonpath=... csv "$csv" still runs if the CSV becomes available before the timeout.
440-468:⚠️ Potential issue | 🔴 CriticalRemove
evalfromwait_for_custom_check.Severity: critical (CWE-78). This executes shell source, not just a probe command. Any caller that interpolates env or CLI data into
check_commandcan turn a readiness check into arbitrary command execution. Accept a function name or argv array instead of a string.As per coding guidelines,
**/*.sh:Avoid eval and dynamic command execution.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/deployment-helpers.sh` around lines 440 - 468, The wait_for_custom_check function currently uses eval on a string and must be changed to avoid command injection: stop accepting a single command string and instead accept either (A) a function name (e.g., a shell function the caller defines) or (B) a command + argv array passed as separate positional parameters; remove eval and invoke the check safely by calling the function name directly (e.g., "$check_fn") or by executing the command array via "$@"/"${cmd[@]}" so no shell interpolation occurs. Update the parameter parsing in wait_for_custom_check to treat the remainder of arguments as the command/args (or a single function name), adjust callers to pass a function name or split args, and keep the existing timeout/interval behavior and return values unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/deploy.sh`:
- Around line 1115-1125: The readiness checks for the webhook currently only log
warnings and continue, which delays a known prerequisite failure into later
steps; update the two timeout branches (the wait_for_resource failure and the
kubectl wait || block) to treat timeouts as fatal: replace the log_warn calls
with log_error (or similar) and exit with a non-zero status (e.g., exit 1) so
the script stops instead of proceeding to apply_dsci/apply_dsc; ensure you
reference the existing symbols webhook_deployment, webhook_namespace,
ROLLOUT_TIMEOUT, wait_for_resource, and the kubectl wait block when making the
change.
- Around line 636-643: The current get/create split is vulnerable to TOCTOU
races; instead, call kubectl create namespace "$NAMESPACE" directly and, if it
fails, check whether the namespace now exists and treat that as success—i.e.,
after a failed kubectl create, run kubectl get namespace "$NAMESPACE" and only
call log_error/return 1 if that check also fails. Update the block using the
NAMESPACE variable and existing logging helpers (log_info, log_error, log_debug)
so creation errors that are actually AlreadyExists are treated as a successful
idempotent outcome.
In `@scripts/deployment-helpers.sh`:
- Around line 1251-1258: The timeout path in the wait_authorino_ready function
currently logs a warning but returns 0, which misreports a timeout as success;
change the final timeout handling in wait_authorino_ready to return 1 (non-zero)
on failure instead of 0, and make the same change in the analogous helper later
in the file (the function covering the block around 1352-1398) so both helpers
exit non-zero when auth probing times out; keep the warning log but ensure the
function's return code signals failure.
- Around line 110-127: The new timeout environment variables (e.g.,
CUSTOM_RESOURCE_TIMEOUT, CSV_TIMEOUT, ROLLOUT_TIMEOUT, etc.) must be validated
at script startup so non-integer or empty values (like "3m" or "") produce a
clear error; add a validation block after the readonly declarations that checks
each *_TIMEOUT is set and matches a positive-integer regex (or is unset and
defaults are applied), and if a value is invalid print a descriptive error and
exit non-zero; ensure this validation is used before any arithmetic/tests or
calls that pass values to kubectl --timeout so all uses of
CUSTOM_RESOURCE_TIMEOUT, NAMESPACE_TIMEOUT, RESOURCE_TIMEOUT, CRD_TIMEOUT,
CSV_TIMEOUT, SUBSCRIPTION_TIMEOUT, POD_TIMEOUT, WEBHOOK_TIMEOUT,
CUSTOM_CHECK_TIMEOUT, AUTHORINO_TIMEOUT, ROLLOUT_TIMEOUT,
KUBECONFIG_WAIT_TIMEOUT, CATALOGSOURCE_TIMEOUT, LLMIS_TIMEOUT,
MAASMODELREF_TIMEOUT, and AUTHPOLICY_TIMEOUT are safe.
- Around line 479-497: The inner kubectl wait calls currently use a hardcoded
60s; replace them so they use the configured timeout budget instead by computing
the remaining time (e.g., remaining=$((timeout - elapsed)) with a floor of 1)
and passing --timeout="${remaining}s" to kubectl wait; update both occurrences
that reference kubectl wait and ensure you adjust elapsed/interval logic so
remaining never becomes non-positive before calling wait.
---
Outside diff comments:
In `@scripts/deployment-helpers.sh`:
- Around line 200-215: The unbounded while loop waiting for the CSV (variable
csv) can hang indefinitely; modify that loop to enforce a deadline by recording
a start time (e.g. start_ts=$(date +%s)) and on each iteration compute elapsed
and compare to a timeout (reuse CSV_TIMEOUT or introduce CSV_POLL_TIMEOUT) so if
elapsed >= timeout you echo an error and return 1; keep the same kubectl
existence check (kubectl get -n "$ns" csv "$csv" > /dev/null 2>&1) and sleep 1
between attempts, and ensure downstream kubectl wait -n "$ns" --for=jsonpath=...
csv "$csv" still runs if the CSV becomes available before the timeout.
- Around line 440-468: The wait_for_custom_check function currently uses eval on
a string and must be changed to avoid command injection: stop accepting a single
command string and instead accept either (A) a function name (e.g., a shell
function the caller defines) or (B) a command + argv array passed as separate
positional parameters; remove eval and invoke the check safely by calling the
function name directly (e.g., "$check_fn") or by executing the command array via
"$@"/"${cmd[@]}" so no shell interpolation occurs. Update the parameter parsing
in wait_for_custom_check to treat the remainder of arguments as the command/args
(or a single function name), adjust callers to pass a function name or split
args, and keep the existing timeout/interval behavior and return values
unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: cfd527f5-2500-4815-a0a8-600b789c2922
📒 Files selected for processing (4)
scripts/deploy.shscripts/deployment-helpers.shscripts/validate-deployment.shtest/e2e/scripts/prow_run_smoke_test.sh
✅ Files skipped from review due to trivial changes (1)
- scripts/validate-deployment.sh
🚧 Files skipped from review as they are similar to previous changes (1)
- test/e2e/scripts/prow_run_smoke_test.sh
…-service into somya-harden-deploy-and-test-scripts
….com/somya-bhatnagar/models-as-a-service into somya-harden-deploy-and-test-scripts
- Fix TOCTOU race in namespace creation (use create-then-check pattern) - Make webhook readiness checks fatal errors instead of warnings - Add timeout variable validation at script startup - Fix hardcoded kubectl wait timeouts in wait_for_namespace - Fix wait_authorino_ready to return 1 on timeout/failure instead of 0 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
🧹 Nitpick comments (2)
scripts/deployment-helpers.sh (2)
1000-1000: Declare and assign separately to capture return value (SC2155).
localalways returns 0, masking the exit status offind_csv_with_min_version. If the function fails,$csv_namewill be empty but the failure won't propagate correctly forset -econtexts.- local csv_name=$(find_csv_with_min_version "$operator_prefix" "$min_version" "$namespace") + local csv_name + csv_name=$(find_csv_with_min_version "$operator_prefix" "$min_version" "$namespace")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/deployment-helpers.sh` at line 1000, The current combined declaration/assignment masks the exit status (SC2155) — change the line to declare the variable first and then assign the output of find_csv_with_min_version to it, e.g. declare or local csv_name on its own, then run csv_name=$(find_csv_with_min_version "$operator_prefix" "$min_version" "$namespace") and propagate failure (e.g. check the command's exit code or use "|| return" / "|| exit" as appropriate) so failures from find_csv_with_min_version are not swallowed in set -e contexts.
1189-1191: Declare and assign separately to capture return value (SC2155).Same issue—
localmasks kubectl's exit status.- local state=$(kubectl get catalogsource "$name" -n "$namespace" \ - -o jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null) + local state + state=$(kubectl get catalogsource "$name" -n "$namespace" \ + -o jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/deployment-helpers.sh` around lines 1189 - 1191, The kubectl exit status is being masked by declaring and assigning `state` in one line; split the declaration and assignment so you can capture kubectl's return code: first `local state`, then run the `kubectl get catalogsource "$name" -n "$namespace" -o jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null` into `state` and capture its exit status (e.g., `rc=$?`) so failures aren't hidden; keep the existing error echo that uses `state` and use the captured `rc` where needed (references: variable `state`, command `kubectl get catalogsource`, and `CATALOGSOURCE_TIMEOUT`).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@scripts/deployment-helpers.sh`:
- Line 1000: The current combined declaration/assignment masks the exit status
(SC2155) — change the line to declare the variable first and then assign the
output of find_csv_with_min_version to it, e.g. declare or local csv_name on its
own, then run csv_name=$(find_csv_with_min_version "$operator_prefix"
"$min_version" "$namespace") and propagate failure (e.g. check the command's
exit code or use "|| return" / "|| exit" as appropriate) so failures from
find_csv_with_min_version are not swallowed in set -e contexts.
- Around line 1189-1191: The kubectl exit status is being masked by declaring
and assigning `state` in one line; split the declaration and assignment so you
can capture kubectl's return code: first `local state`, then run the `kubectl
get catalogsource "$name" -n "$namespace" -o
jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null` into `state`
and capture its exit status (e.g., `rc=$?`) so failures aren't hidden; keep the
existing error echo that uses `state` and use the captured `rc` where needed
(references: variable `state`, command `kubectl get catalogsource`, and
`CATALOGSOURCE_TIMEOUT`).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 58777453-33bb-49e8-9d92-8c5033ef6ae0
📒 Files selected for processing (2)
scripts/deploy.shscripts/deployment-helpers.sh
The gateway URL determination is not inside a retry loop, so making it a fatal error would cause premature failures when the gateway is still configuring. The original behavior (skip auth verification and return success) is intentional and correct. The actual issue was the timeout path in the auth verification loop, which is still fixed (returns 1 on timeout). Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
scripts/deployment-helpers.sh (2)
229-244:⚠️ Potential issue | 🟠 MajorBound the CSV existence polling loop to prevent indefinite hangs.
Line 237 to Line 239 loops forever if
currentCSVis set but the CSV object never materializes. That bypasses the timeout guarantees added in this PR and can deadlock CI/deploy flows.Suggested fix
local csv csv=$(kubectl get subscription.operators.coreos.com -n "$ns" "$name" -o jsonpath='{.status.currentCSV}') # Because, sometimes, the CSV is not there immediately. - while ! kubectl get -n "$ns" csv "$csv" > /dev/null 2>&1; do - sleep 1 - done + if ! wait_for_resource "csv" "$csv" "$ns" "$CSV_TIMEOUT"; then + echo " * ERROR: CSV $csv for Subscription $ns/$name was not found within ${CSV_TIMEOUT}s" >&2 + return 1 + fiAs per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/deployment-helpers.sh` around lines 229 - 244, The while loop that polls for the CSV object (the kubectl get -n "$ns" csv "$csv" > /dev/null 2>&1 loop) can hang indefinitely; bound it by using a timeout/deadline derived from CSV_TIMEOUT (or a new CSV_POLL_TIMEOUT) by recording the start time and breaking with an error/return 1 if elapsed seconds exceed the timeout, and keep the existing sleep 1 between retries; ensure the error message mentions CSV and timeout (same style as the later kubectl wait error) so callers see why the function failed.
477-488:⚠️ Potential issue | 🟠 MajorRemove
evalfrom custom checks to eliminate command-injection risk (CWE-78).Severity: major.
Exploit scenario: ifcheck_commandis composed from env/CLI/user-controlled input,evalcan execute arbitrary shell payloads (e.g., command chaining), leading to unintended cluster actions.Suggested fix
-# wait_for_custom_check description check_command timeout interval +# wait_for_custom_check description timeout interval -- command [args...] wait_for_custom_check() { local description=${1?description is required}; shift - local check_command=${1?check command is required}; shift - local timeout=${1:-$CUSTOM_CHECK_TIMEOUT}; shift || true - local interval=${1:-5}; shift || true + local timeout=${1:-$CUSTOM_CHECK_TIMEOUT}; shift || true + local interval=${1:-5}; shift || true + [[ "${1:-}" == "--" ]] && shift + local -a check_cmd=( "$@" ) + if [[ ${`#check_cmd`[@]} -eq 0 ]]; then + log_error "wait_for_custom_check requires a command after --" + return 1 + fi log_info "Waiting for: $description (timeout: ${timeout}s)" local elapsed=0 while [ $elapsed -lt $timeout ]; do - if eval "$check_command"; then + if "${check_cmd[@]}"; then log_info "$description - Ready" return 0 fiAs per coding guidelines, "SHELL SCRIPT SECURITY (CWE-78): ... Avoid eval and dynamic command execution".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/deployment-helpers.sh` around lines 477 - 488, The wait_for_custom_check function currently uses eval on the check_command variable which enables command injection; change the implementation to stop using eval and instead execute the check command without string evaluation by treating it as an argument array (accept the command as positional arguments or as an explicit array) and running it directly (e.g., call the command via "$@" or "${check_command[@]}") in the if test; update the function signature/usage so callers pass the command and its args as separate tokens rather than a single concatenated string, leaving other locals (timeout, interval, elapsed) intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@scripts/deployment-helpers.sh`:
- Around line 229-244: The while loop that polls for the CSV object (the kubectl
get -n "$ns" csv "$csv" > /dev/null 2>&1 loop) can hang indefinitely; bound it
by using a timeout/deadline derived from CSV_TIMEOUT (or a new CSV_POLL_TIMEOUT)
by recording the start time and breaking with an error/return 1 if elapsed
seconds exceed the timeout, and keep the existing sleep 1 between retries;
ensure the error message mentions CSV and timeout (same style as the later
kubectl wait error) so callers see why the function failed.
- Around line 477-488: The wait_for_custom_check function currently uses eval on
the check_command variable which enables command injection; change the
implementation to stop using eval and instead execute the check command without
string evaluation by treating it as an argument array (accept the command as
positional arguments or as an explicit array) and running it directly (e.g.,
call the command via "$@" or "${check_command[@]}") in the if test; update the
function signature/usage so callers pass the command and its args as separate
tokens rather than a single concatenated string, leaving other locals (timeout,
interval, elapsed) intact.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 05f59d6a-e150-4da6-9977-1f5db571684e
📒 Files selected for processing (1)
scripts/deployment-helpers.sh
scripts/deployment-helpers.sh
Outdated
| echo " WARNING: Auth request verification timed out, continuing anyway" | ||
| return 0 | ||
| echo " ERROR: Auth request verification timed out after ${timeout}s" | ||
| return 1 |
There was a problem hiding this comment.
I would revert return 1 here as well. Let's continue anyway like we did previously with return 0.
There was a problem hiding this comment.
Yes, Its here now
Auth request verification is a supplementary check that happens after Authorino CR status is already confirmed ready. Timing out on the auth verification shouldn't be fatal - it should continue with a warning, similar to the gateway URL check. This reverts the timeout behavior to match the original intentional design: proceed with deployment even if auth verification times out. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The kubectl wait for Available condition may not succeed if the webhook is still starting up. The original behavior (warn and proceed) is correct since we already verified the deployment exists. Making this fatal is too aggressive and can cause deployment failures when the webhook is legitimately still initializing. The first check (deployment exists) remains fatal as intended. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…al error" This reverts commit ac4c78c.
|
Looks like the validation script got an exception |
The print_fail and print_warning functions check optional parameters $2,
$3, and $4, but with set -u (nounset), accessing undefined parameters
causes the script to fail with "unbound variable" error.
Fixed by using ${N:-} syntax to provide empty defaults for optional
parameters, making the checks safe with set -u.
Error: /workspace/source/scripts/validate-deployment.sh: line 213: $4: unbound variable
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
@somya-bhatnagar: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
The OpenShift check was failing in Konflux when kubectl api-resources timed out or errored. Changed to: 1. Capture kubectl output separately 2. Check command exit code first 3. Warn and continue if kubectl fails (transient timeout) 4. Only fail if kubectl succeeds but we're not on OpenShift This prevents validation failures due to transient kubectl timeouts while still ensuring we're on OpenShift when the API is responsive. Tested on live cluster with simulated kubectl failures. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Related to - https://redhat.atlassian.net/browse/RHOAIENG-51671
Summary
Improve robustness of the deploy and test scripts with consistent error handling (set -euo pipefail, clear error messages, non-zero exit on failure), configurable or documented timeouts for wait loops (DataScienceCluster, LLMIS, MaaSModels, AuthPolicies), and idempotency where possible (e.g. create namespace only if not exists, skip install if already installed) so re-runs and partial runs are predictable.
Story
As a developer or CI, I want deploy and test scripts to have clear error handling, sensible timeouts, and idempotent steps where appropriate, so that failures are easy to diagnose and re-runs do not leave the system in an inconsistent state.
Acceptance Criteria
Changes
Error Handling
Timeout Configuration
Added 16 configurable timeout environment variables (defined in `scripts/deployment-helpers.sh`):
Core Timeouts:
Infrastructure Timeouts:
Additional:
\RESOURCE_TIMEOUT`, `POD_TIMEOUT`, `WEBHOOK_TIMEOUT`, `CUSTOM_CHECK_TIMEOUT`, `KUBECONFIG_WAIT_TIMEOUT`, `CATALOGSOURCE_TIMEOUT``Usage Example:
Idempotent Operations
Files Modified
Testability / Verification
Manual Testing
Verification Summary
Cluster: api.ci-ln-rxc9qqb-76ef8.aws-4.ci.openshift.org
✅ Tests Passed (6/6)
Key Fixes Verified
┌────────────────────┬────────────────────────────────────────┬──────────────────────────────────────────────┐
│ Fix │ Location │ Behavior │
├────────────────────┼────────────────────────────────────────┼──────────────────────────────────────────────┤
│ Timeout validation │ deployment-helpers.sh:110-156 │ Validates all 16 timeout vars at startup │
├────────────────────┼────────────────────────────────────────┼──────────────────────────────────────────────┤
│ TOCTOU race │ deploy.sh:636-648 │ Create-then-check pattern │
├────────────────────┼────────────────────────────────────────┼──────────────────────────────────────────────┤
│ Hardcoded timeouts │ deployment-helpers.sh:511-512, 525-526 │ Remaining time calculation │
├────────────────────┼────────────────────────────────────────┼──────────────────────────────────────────────┤
│ Webhook checks │ deploy.sh:1120-1132 │ Fatal errors with return 1 │
├────────────────────┼────────────────────────────────────────┼──────────────────────────────────────────────┤
│ Auth timeout │ deployment-helpers.sh:1430-1431 │ Returns 1 on timeout │
├────────────────────┼────────────────────────────────────────┼──────────────────────────────────────────────┤
│ Gateway URL │ deployment-helpers.sh:1376-1378 │ Returns 0 to skip verification (intentional) │
└────────────────────┴────────────────────────────────────────┴──────────────────────────────────────────────┘
All scripts properly handle errors with non-zero exit codes and include timeout context in error messages.
Automated Validation
Summary by CodeRabbit
New Features
Bug Fixes
Chores
Documentation