
feat: improve deploy and test script robustness#647

Closed
somya-bhatnagar wants to merge 10 commits into opendatahub-io:main from somya-bhatnagar:somya-harden-deploy-and-test-scripts

Conversation

@somya-bhatnagar
Contributor

@somya-bhatnagar somya-bhatnagar commented Mar 31, 2026

Summary

Improve robustness of deploy and test scripts with consistent error handling, configurable timeouts, and idempotent operations to make failures easier to diagnose and re-runs predictable.

Story

Improve robustness of the deploy and test scripts: consistent error handling (set -euo pipefail, clear error messages, non-zero exit on failure), configurable or documented timeouts for wait loops (DataScienceCluster, LLMIS, MaaSModels, AuthPolicies), and idempotency where possible (e.g. create namespace only if not exists, skip install if already installed) so re-runs and partial runs are predictable.

Acceptance Criteria

  • Deploy and test entry points use consistent shell options (e.g. set -euo pipefail) and exit with a non-zero code and clear message on failure
  • Wait loops (for operators, CRs, AuthPolicies) use configurable or documented timeouts and do not hang indefinitely
  • Idempotent operations (e.g. create namespace if not exists, skip cert-manager/ODH if already present) are preserved or added where safe
  • No regression in existing e2e pass rate

Changes

Error Handling

  • ✅ All scripts use `set -euo pipefail` for robust error handling
  • ✅ 91 error exit points with clear messages including timeout context
  • ✅ Example: `"DataScienceCluster CRD not available (timeout: ${CRD_TIMEOUT}s)"`
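The wait-loop pattern behind these messages can be sketched as follows (a minimal sketch; `wait_with_timeout` and the check-function convention are illustrative names, not the PR's actual helpers):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Timeout is overridable from the environment, defaulting as in the PR.
CRD_TIMEOUT="${CRD_TIMEOUT:-180}"

# Poll a check function until it succeeds or the timeout elapses, and
# fail with a non-zero exit and a message that includes the timeout.
wait_with_timeout() {
  local description=$1 timeout=$2 check=$3 interval=${4:-5}
  local elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if "$check"; then
      echo "$description - Ready"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "ERROR: $description not available (timeout: ${timeout}s)" >&2
  return 1
}
```

A caller would pass a function wrapping a `kubectl get` probe; under `set -e` the non-zero return propagates and aborts the script.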

Timeout Configuration

Added 16 configurable timeout environment variables (defined in `scripts/deployment-helpers.sh`):

Core Timeouts:

  • `CUSTOM_RESOURCE_TIMEOUT=600` - DataScienceCluster wait
  • `LLMIS_TIMEOUT=300` - LLMInferenceService ready
  • `MAASMODELREF_TIMEOUT=300` - MaaSModelRef ready
  • `AUTHPOLICY_TIMEOUT=180` - AuthPolicy enforcement

Infrastructure Timeouts:

  • `CRD_TIMEOUT=180` - CRD establishment
  • `CSV_TIMEOUT=180` - CSV installation
  • `SUBSCRIPTION_TIMEOUT=300` - Subscription install
  • `ROLLOUT_TIMEOUT=120` - Deployment rollout
  • `AUTHORINO_TIMEOUT=120` - Authorino ready
  • `NAMESPACE_TIMEOUT=300` - Namespace creation

Additional: `RESOURCE_TIMEOUT`, `POD_TIMEOUT`, `WEBHOOK_TIMEOUT`, `CUSTOM_CHECK_TIMEOUT`, `KUBECONFIG_WAIT_TIMEOUT`, `CATALOGSOURCE_TIMEOUT`
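A plausible shape for these definitions in `scripts/deployment-helpers.sh` (sketch only; the PR may use a different idiom):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Each timeout takes its default only when the caller has not already
# exported an override, then is frozen against accidental reassignment.
: "${CUSTOM_RESOURCE_TIMEOUT:=600}"  # DataScienceCluster wait
: "${LLMIS_TIMEOUT:=300}"            # LLMInferenceService ready
: "${CRD_TIMEOUT:=180}"              # CRD establishment
readonly CUSTOM_RESOURCE_TIMEOUT LLMIS_TIMEOUT CRD_TIMEOUT
```

With this pattern, `CRD_TIMEOUT=300 ./scripts/deploy.sh ...` overrides a single knob while the rest keep their defaults.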

Usage Example:
```bash
# Deploy with extended timeouts for slow cluster
CUSTOM_RESOURCE_TIMEOUT=900 LLMIS_TIMEOUT=600 ./scripts/deploy.sh --operator-type odh

# Run E2E tests with custom timeouts
ROLLOUT_TIMEOUT=60 AUTHORINO_TIMEOUT=60 ./test/e2e/scripts/prow_run_smoke_test.sh
```

Idempotent Operations

  • ✅ Namespace creation (skip if exists)
  • ✅ GatewayClass (skip if exists)
  • ✅ Gateway (update if exists)
  • ✅ DSCInitialization (skip if auto-created by operator)
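The skip-if-exists idiom behind these bullets looks roughly like this (sketch; `ensure_namespace` is an illustrative name, not necessarily the PR's helper):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Create a namespace only when it does not already exist, so re-runs
# and partial runs of the deploy script stay predictable.
ensure_namespace() {
  local ns=$1
  if kubectl get namespace "$ns" >/dev/null 2>&1; then
    echo "Namespace $ns already exists - skipping"
  else
    kubectl create namespace "$ns"
  fi
}
```

An alternative is `kubectl create namespace "$ns" --dry-run=client -o yaml | kubectl apply -f -`, which also covers the update-if-exists behavior used for the Gateway.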

Files Modified

  • `scripts/deploy.sh` - Timeout variables, error messages, idempotency checks
  • `scripts/deployment-helpers.sh` - Timeout defaults, improved wait functions
  • `scripts/validate-deployment.sh` - Error handling mode
  • `test/e2e/scripts/prow_run_smoke_test.sh` - Timeout configuration, improved MaaSModelRef wait logic

Testing

Validation Performed

  • ✅ All scripts pass `bash -n` syntax check
  • ✅ Error handling verified (`set -euo pipefail` in all entry points)
  • ✅ Timeout variables properly sourced and used
  • ✅ No hardcoded timeouts found
  • ✅ Idempotency checks in place
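The syntax check can be reproduced locally with a small wrapper like this (sketch; the glob list is assumed from the Files Modified section):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Run bash's parse-only mode over a list of scripts and fail on the
# first syntax error.
check_syntax() {
  local f
  for f in "$@"; do
    bash -n "$f" || { echo "Syntax error in $f" >&2; return 1; }
  done
  echo "All $# scripts passed bash -n"
}

# e.g. check_syntax scripts/*.sh test/e2e/scripts/*.sh
```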

Related Issues

Addresses requirements for robust CI/CD execution with clear diagnostics and configurable timeouts for different cluster speeds.

Summary by CodeRabbit

  • New Features

    • Extensive deployment timeouts made configurable via environment variables for finer control.
  • Bug Fixes

    • Improved idempotent behavior and error handling in deployment and readiness flows; waits/rollouts now honor and report configured timeouts.
    • CI trigger logic refined to better skip irrelevant file changes.
    • Validation script tightened to fail on unset variables and pipeline errors.
  • Documentation

    • Docs and release notes updated for v3.4.0 and schema changes requiring per-model rate limits.
  • Tests

    • Test harness updated to index subscriptions for improved reconciliation testing.

somya-bhatnagar and others added 2 commits March 24, 2026 11:32
Add configurable timeout constants, consistent error handling, and
idempotency improvements across deployment and test scripts to make
deployments more predictable and debuggable.

- Add environment-variable-overridable timeout constants to deployment-helpers.sh
- Replace all hardcoded timeouts with configurable constants
- Improve error messages to include actual timeout values
- Add idempotency checks for gateway and namespace creation
- Add set -uo pipefail to validate-deployment.sh for stricter error handling
- Document timeout configuration in script headers

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Resolve conflicts in test/e2e/scripts/prow_run_smoke_test.sh:
- Keep timeout configuration documentation from feature branch
- Add MODEL_NAMESPACE documentation from main
- Use improved MaaSModelRefs wait logic from main (race condition fix)
- Keep configurable MAASMODELREF_TIMEOUT from feature branch
- Remove controller bounce workaround (no longer needed per main)
@openshift-ci openshift-ci bot requested review from SB159 and usize March 31, 2026 08:07
@openshift-ci

openshift-ci bot commented Mar 31, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: somya-bhatnagar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Contributor

coderabbitai bot commented Mar 31, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Centralized and exposed many deployment timeouts as environment-overridable readonly variables and replaced hard-coded wait/rollout durations across scripts and tests. Made namespace/Gateway/GatewayClass/ingress creation idempotent with existence checks and added explicit error handling for wait timeouts (functions now return non-zero on timeout and log timeout values). Added stricter shell flags to validate-deployment.sh. Updated Tekton CEL skip patterns. CRD/API change: removed tokenRateLimitRef, made tokenRateLimits required with minItems: 1 and limit minimum 1; updated deepcopy code and tests. Docs updated and a release notes entry added.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Actionable Issues

  • Validate timeout env vars as numeric positive integers and enforce bounds (e.g., 1 ≤ value ≤ 86400). Action: add parsing and fail-fast on invalid values. (CWE-20: Improper Input Validation)

  • Prevent shell injection when using env-derived values in command args. Action: strictly validate timeouts and use safe argument forms (e.g., --timeout="${timeout}s" only after numeric validation). (CWE-77: Improper Neutralization of Special Elements used in a Command)
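One way to implement the validation suggested in the two items above (a sketch under the reviewer's suggested bounds; `validate_timeout` is not an existing helper in the PR):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Reject anything that is not a positive integer within bounds before
# it is ever interpolated into a command line such as --timeout="${t}s".
validate_timeout() {
  local name=$1 value=$2
  if ! [[ "$value" =~ ^[1-9][0-9]*$ ]] || [ "$value" -gt 86400 ]; then
    echo "ERROR: $name must be an integer in 1..86400 (got: '$value')" >&2
    return 1
  fi
}

# Fail fast at script start, e.g.:
# for var in CRD_TIMEOUT LLMIS_TIMEOUT; do validate_timeout "$var" "${!var}"; done
```

Because the regex rejects anything non-numeric, shell metacharacters in an env-derived value can never reach a command line.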

  • Standardize shell strictness across scripts. Action: apply set -o pipefail -u consistently or centralize invocation to avoid differing failure semantics and unexpected silent failures.

  • Ensure idempotency checks verify resource health, not only existence. Action: when detecting existing namespace/Gateway/GatewayClass/ingress, also check relevant status/conditions and reconcile or fail when unhealthy.

  • Make wait helpers return consistent non-zero exit codes on timeouts. Action: convert remaining warnings to errors and ensure callers propagate failures; document these semantics.

  • Centralize timeout documentation and variable mapping. Action: add a single authoritative table in deployment-helpers.sh listing each timeout var, default, and the exact resources/commands it governs.

  • Enforce upper bounds or CI-side caps to prevent prolonged/stalled pipelines. Action: reject or clamp overly large timeout values and document recommended ranges. (CWE-400: Uncontrolled Resource Consumption)

  • Avoid logging sensitive content. Action: audit and redact logs to ensure manifests, tokens, or secrets are not printed. (CWE-200: Exposure of Sensitive Information)

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
  • Description check: ✅ Passed (check skipped because CodeRabbit's high-level summary is enabled)
  • Title check: ✅ Passed (the PR title accurately summarizes the primary change: improving robustness of deploy and test scripts through enhanced error handling and timeout management, which is reflected across all modified files)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
scripts/validate-deployment.sh (2)

205-227: ⚠️ Potential issue | 🔴 Critical

Unset variable access in helper functions with set -u.

print_fail and print_warning access $2, $3, $4 without defaults. With set -u, callers must always provide all arguments or the script errors.

Fix using default empty values:

```diff
 print_fail() {
     echo -e "${RED}❌ FAIL: $1${NC}"
-    if [ -n "$2" ]; then
-        echo -e "${RED}   Reason: $2${NC}"
+    if [ -n "${2:-}" ]; then
+        echo -e "${RED}   Reason: ${2:-}${NC}"
     fi
-    if [ -n "$3" ]; then
-        echo -e "${YELLOW}   Suggestion: $3${NC}"
+    if [ -n "${3:-}" ]; then
+        echo -e "${YELLOW}   Suggestion: ${3:-}${NC}"
     fi
-    if [ -n "$4" ]; then
-        echo -e "${YELLOW}   Suggestion: $4${NC}"
+    if [ -n "${4:-}" ]; then
+        echo -e "${YELLOW}   Suggestion: ${4:-}${NC}"
     fi
     ((FAILED++))
 }
```

Apply similar pattern to print_warning.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/validate-deployment.sh` around lines 205 - 227, print_fail and
print_warning read optional positional params ($2,$3,$4) which will cause errors
under set -u; update both functions to use default-empty parameter expansion
(e.g. use ${2:-} / ${3:-} / ${4:-}) or assign local variables with defaults at
the top of each function before testing them, then use those locals in the echo
checks so missing args do not trigger unbound variable errors.

30-31: ⚠️ Potential issue | 🔴 Critical

Unset variable access with set -u enabled.

Line 31 accesses $1 unconditionally. With set -u (line 6), this errors when no arguments are provided:

./validate-deployment.sh: line 31: $1: unbound variable

Fix:

```diff
-if [ "$1" = "--help" ] || [ "$1" = "-h" ]; then
+if [ "${1:-}" = "--help" ] || [ "${1:-}" = "-h" ]; then
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/validate-deployment.sh` around lines 30 - 31, The script
validate-deployment.sh enables set -u but then reads $1 directly in the help
check (if [ "$1" = "--help" ] || [ "$1" = "-h" ];), causing an unbound variable
error when no arguments are passed; modify the help check to safely handle an
empty $1 by either first testing argument count (e.g., if [ "$#" -gt 0 ] && (
... )) or by using a safe expansion like "${1:-}" in the comparisons (if [
"${1:-}" = "--help" ] || [ "${1:-}" = "-h" ];), keeping the rest of the help
logic unchanged.
scripts/deployment-helpers.sh (1)

458-458: ⚠️ Potential issue | 🟠 Major

CWE-78: Command injection via eval.

eval "$check_command" executes arbitrary shell code. If any caller passes user-controlled or externally-derived input, this is exploitable. Current callers appear to use hardcoded commands, but this is fragile.

Consider replacing with a safer pattern using bash -c or restructuring to accept a command array:

Proposed safer pattern:

```diff
-    if eval "$check_command"; then
+    if bash -c "$check_command"; then
```

Or better, refactor callers to pass command arrays and use "$@" execution.
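The command-array alternative could look like this (sketch; `wait_for_check` is a hypothetical replacement, not the PR's current helper):

```shell
#!/usr/bin/env bash
set -euo pipefail

# The check is passed as argv and executed via "$@", so no string is
# ever re-parsed by the shell and eval is not needed.
wait_for_check() {
  local timeout=$1 interval=$2; shift 2
  local elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}

# caller, e.g.: wait_for_check "$CRD_TIMEOUT" 5 kubectl get crd "$crd_name"
```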

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/deployment-helpers.sh` at line 458, Replace the use of eval on the
variable check_command (the conditional using if eval "$check_command"; then) to
eliminate command injection risk: change callers to pass a command as an array
(e.g., check_command_array) and execute it with the safe array expansion
("${check_command_array[@]}") or, if that's not possible, run the string via
bash -c with explicit arguments (bash -c -- "$check_command") so no untrusted
input is evaluated by eval; update the calling sites to construct and pass an
array when possible and adjust the if condition to use the safe execution form.
🧹 Nitpick comments (1)
test/e2e/scripts/prow_run_smoke_test.sh (1)

51-60: Timeout documentation is partial.

Header lists 6 timeout variables but deployment-helpers.sh defines 16. Consider either documenting all or just referencing the helper file without duplicating the subset.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/scripts/prow_run_smoke_test.sh` around lines 51 - 60, The comment
block currently lists only a subset of timeout variables (e.g.,
CUSTOM_RESOURCE_TIMEOUT, LLMIS_TIMEOUT, MAASMODELREF_TIMEOUT,
AUTHPOLICY_TIMEOUT, AUTHORINO_TIMEOUT, ROLLOUT_TIMEOUT) while
deployment-helpers.sh actually defines many more; update the header in
prow_run_smoke_test.sh to either enumerate all timeout variables defined in
deployment-helpers.sh or replace the partial list with a single clear reference
line that points users to deployment-helpers.sh for the authoritative list
(mentioning deployment-helpers.sh by name so readers know where to look).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/deployment-helpers.sh`:
- Line 967: The local assignment to csv_name masks the function return and
triggers SC2155; change the line so you declare csv_name first and assign
afterward (i.e., use a separate local declaration for csv_name and then set
csv_name=$(find_csv_with_min_version "$operator_prefix" "$min_version"
"$namespace")), referring to the symbol find_csv_with_min_version and variable
csv_name to locate the code to update.
- Around line 1155-1158: The use of "local state=$(kubectl get catalogsource
...)" masks the kubectl exit code; declare the variable first and assign on a
separate line so the command's exit status isn't lost. Change the block around
the catalogsource check to first "local state" and then run "state=$(kubectl get
catalogsource \"$name\" -n \"$namespace\" -o
jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null)"; if you
need the kubectl exit code for branching capture "$?" immediately after the
command. This targets the "state" variable and the "kubectl get catalogsource"
invocation in the script.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 0718846f-a8e4-44bd-9d72-b06aad384896

📥 Commits

Reviewing files that changed from the base of the PR and between 6b2d7e4 and 948b4df.

📒 Files selected for processing (4)
  • scripts/deploy.sh
  • scripts/deployment-helpers.sh
  • scripts/validate-deployment.sh
  • test/e2e/scripts/prow_run_smoke_test.sh

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@somya-bhatnagar somya-bhatnagar marked this pull request as ready for review March 31, 2026 12:15
@openshift-ci openshift-ci bot requested a review from ryancham715 March 31, 2026 12:15
Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/deployment-helpers.sh (1)

458-458: ⚠️ Potential issue | 🟠 Major

CWE-78: eval usage violates security guideline.

Coding guidelines explicitly prohibit eval and dynamic command execution. Line 458 uses eval "$check_command", where $check_command contains variable-interpolated kubectl commands from callers. This pattern is fragile and violates the restriction.

Refactor to accept a function name and invoke it directly:

Safer alternative:

```diff
 wait_for_custom_check() {
   local description=${1?description is required}; shift
-  local check_command=${1?check command is required}; shift
+  local check_func=${1?check function is required}; shift
   local timeout=${1:-$CUSTOM_CHECK_TIMEOUT}; shift || true
   local interval=${1:-5}; shift || true

   log_info "Waiting for: $description (timeout: ${timeout}s)"

   local elapsed=0
   while [ $elapsed -lt $timeout ]; do
-    if eval "$check_command"; then
+    if "$check_func"; then
       log_info "$description - Ready"
       return 0
     fi
```

Callers define a function and pass its name:

```bash
_check_kuadrant_ready() { kubectl get kuadrant kuadrant -n "$namespace" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null | grep -q True; }
wait_for_custom_check "Kuadrant ready in $namespace" _check_kuadrant_ready "$kuadrant_initial_timeout" 5
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/deployment-helpers.sh` at line 458, The use of eval on check_command
is a security risk; change the API so wait_for_custom_check accepts a check
function name (not a shell string) and invoke that function directly instead of
eval: update callers to define a function like _check_kuadrant_ready and pass
its name, update wait_for_custom_check (replace check_command and the eval call)
to take a check_fn parameter and call it (e.g., "$check_fn" or call via bash
indirection) and test its exit status, and ensure arguments like
namespace/timeouts are forwarded and all variables are properly quoted; remove
all eval usages and update any call sites that build command strings.
♻️ Duplicate comments (1)
scripts/deployment-helpers.sh (1)

1156-1159: ⚠️ Potential issue | 🟡 Minor

SC2155: Declare and assign separately to avoid masking return values.

Same issue as the prior fix at line 967. The local declaration masks the kubectl exit code.

Fix:

```diff
     local state
-    local state=$(kubectl get catalogsource "$name" -n "$namespace" \
+    state=$(kubectl get catalogsource "$name" -n "$namespace" \
       -o jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/deployment-helpers.sh` around lines 1156 - 1159, The local assignment
for state masks the kubectl exit code; change the two-part declaration so you
first run kubectl to capture its output into a temp variable and check its exit
status, then do a separate local assignment like using kubectl get catalogsource
"$name" -n "$namespace" -o
jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null into a
non-local var, check the command succeeded, and only then set local
state="$that_var" (referencing the variable name state and the kubectl get
catalogsource invocation) so the kubectl exit code is preserved and handled
correctly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 3ae2acfc-d9d9-4869-a359-f644f0d3cbac

📥 Commits

Reviewing files that changed from the base of the PR and between 948b4df and b25b092.

📒 Files selected for processing (1)
  • scripts/deployment-helpers.sh

mynhardtburger and others added 4 commits March 31, 2026 08:53
…hub-io#559)

## Summary
- Remove incorrect `3xx` from the list of response codes that grant
access in the model-listing-flow documentation
- The source code (`discovery.go:256-300`) confirms that 3xx codes fall
into the `default` switch case which returns `authRetry` (not
`authGranted`). Only `2xx` and `405` result in `authGranted`


## Summary by CodeRabbit

## Release Notes

* **Bug Fixes**
* Updated model listing behavior to refine access validation criteria.
The model listing endpoint now filters models more precisely based on
HTTP response codes from endpoint probes.

…ub-io#642)

Extend the CEL expression in Tekton PipelineRun definitions to also skip
Konflux builds when only `.gitignore`, `OWNERS`, `PROJECT`, `LICENSE`,
.`github/`, or `.tekton/` files are changed — matching the existing
skip_if_only_changed regex used by the Prow ci-operator jobs.


An example of the existing prow ci-operator job skip condition:
```yaml
skip_if_only_changed: ^docs/|\.md$|^(?:.*/)?(?:\.gitignore|OWNERS|PROJECT|LICENSE|CONTRIBUTING\.md)$|^\.github/|^\.tekton/
```
Source: https://github.com/openshift/release/blob/01591970ea5008bda7b05a49ce8a0905207182ee/ci-operator/jobs/opendatahub-io/models-as-a-service/opendatahub-io-models-as-a-service-main-presubmits.yaml#L21


## Summary by CodeRabbit

## Release Notes

* **Chores**
* Updated pull request pipeline triggering conditions to exclude
additional configuration files and CI/CD directories from initiating
automated checks.

…pendatahub-io#644)

## Summary

Related to - https://redhat.atlassian.net/browse/RHOAIENG-55976

This PR makes `tokenRateLimits` required on each `MaaSSubscription`
model reference and removes the unused `tokenRateLimitRef` field from
the CRD.

## Changes

### API Changes
- ✅ Removed `tokenRateLimitRef` field from `ModelSubscriptionRef`
- ✅ Made `tokenRateLimits` required with `MinItems=1` validation
- ✅ Regenerated CRD manifests and deepcopy code

### Tests
- ✅ Fixed 4 unit tests (added missing field indexes)
- ✅ All tests passing
- ✅ Live cluster validation complete

### Documentation
- ✅ Updated CRD reference docs
- ✅ Updated architecture overview
- ✅ Created migration guide:
`docs/content/migration/tokenratelimits-required-3.4.md`
- ✅ Added v3.4.0 release notes with breaking change notice

## Live Cluster Testing Results
(https://console-openshift-console.apps.ci-ln-hdzds3t-76ef8.aws-4.ci.openshift.org/)

Deployed custom controller and verified:
- ✅ Subscriptions **without** `tokenRateLimits` → **REJECTED** with
"Required value" error
- ✅ Subscriptions **with** `tokenRateLimits` → **ACCEPTED**
- ✅ Subscriptions with old `tokenRateLimitRef` → **REJECTED** with
"unknown field" error
- ✅ Controller reconciliation working correctly
- ✅ No RBAC errors

## Migration Impact

**Before (3.3 and earlier):**
```yaml
modelRefs:
  - name: my-model
    namespace: llm
    tokenRateLimits:  # Optional
      - limit: 1000
        window: 1m
    tokenRateLimitRef: "some-ref"  # Unused by controller
```

**After (3.4+):**
```yaml
modelRefs:
  - name: my-model
    namespace: llm
    tokenRateLimits:  # REQUIRED - at least one
      - limit: 1000
        window: 1m
    # tokenRateLimitRef removed
```

## Migration Guide

See detailed migration instructions in
`docs/content/migration/tokenratelimits-required-3.4.md`

Users must add inline `tokenRateLimits` to all `MaaSSubscription`
resources before upgrading to 3.4.
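A hypothetical pre-upgrade audit (not part of the PR) could list model refs that still lack inline `tokenRateLimits`; the jq filter below is demonstrated on an inline fixture since it cannot assume a live cluster:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Emit "namespace/name modelRef" for every modelRefs entry whose
# tokenRateLimits list is missing or empty.
FILTER='.items[]
  | .metadata as $m
  | .spec.modelRefs[]?
  | select((.tokenRateLimits // []) | length == 0)
  | "\($m.namespace)/\($m.name) \(.name)"'

# On a live cluster you would run:
#   kubectl get maassubscriptions -A -o json | jq -r "$FILTER"
jq -r "$FILTER" <<'EOF'
{"items": [{"metadata": {"namespace": "llm", "name": "team-a"},
  "spec": {"modelRefs": [
    {"name": "my-model"},
    {"name": "ok-model", "tokenRateLimits": [{"limit": 1000, "window": "1m"}]}]}}]}
EOF
# prints: llm/team-a my-model
```

Any line of output names a subscription that would be rejected after the 3.4 upgrade.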

## Acceptance Criteria

All criteria from the user story met:

- [x] `tokenRateLimitRef` not present in API
- [x] Each `modelRefs` entry must include non-empty `tokenRateLimits`
- [x] Validation rejects non-compliant resources with clear messages
- [x] Tests cover required limits, no `tokenRateLimitRef` references
- [x] Generated CRD YAML matches types
- [x] Documentation updated
- [x] Migration guide provided
- [x] Breaking change documented in release notes

## Files Changed

- `maas-controller/api/maas/v1alpha1/maassubscription_types.go` - API
types
-
`deployment/base/maas-controller/crd/bases/maas.opendatahub.io_maassubscriptions.yaml`
- CRD manifest
- `maas-controller/api/maas/v1alpha1/zz_generated.deepcopy.go` -
Generated code
-
`maas-controller/pkg/controller/maas/maassubscription_controller_test.go`
- Test fixes
-
`docs/content/configuration-and-management/maas-controller-overview.md`
- Architecture docs
- `docs/content/reference/crds/maas-subscription.md` - CRD reference
- `docs/content/migration/tokenratelimits-required-3.4.md` - Migration
guide (new)
- `docs/content/release-notes/index.md` - Release notes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

## Release Notes

* **Breaking Changes**
* Rate limit configuration for each model subscription is now mandatory
and must be specified inline rather than referenced externally
  * At least one rate limit entry is required per model
  * Rate limit values must be greater than zero

* **Documentation**
* Updated configuration and CRD documentation to reflect new rate limit
requirements

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Separate local declaration from command substitution to preserve
kubectl's exit code. While the exit code isn't currently used in
this error path, following this best practice prevents potential
issues and improves code maintainability.

Before:
  local state=$(kubectl get catalogsource ...)

After:
  local state
  state=$(kubectl get catalogsource ...)

This ensures $? reflects kubectl's exit status, not the local builtin.
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
maas-controller/api/maas/v1alpha1/maassubscription_types.go (1)

78-83: ⚠️ Potential issue | 🟠 Major

Close the remaining window validation gap before regenerating the CRD.

Line 82 still accepts 0s/0m/0h/0d, and Window has no MaxLength. That leaves this user-facing field able to carry invalid or oversized values even after tightening limit.

Diff:

```diff
 type TokenRateLimit struct {
 	// Limit is the maximum number of tokens allowed
 	// +kubebuilder:validation:Minimum=1
 	Limit int64 `json:"limit"`

 	// Window is the time window (e.g., "1m", "1h", "24h")
-	// +kubebuilder:validation:Pattern=`^(\d+)(s|m|h|d)$`
+	// +kubebuilder:validation:MaxLength=20
+	// +kubebuilder:validation:Pattern=`^([1-9]\d*)(s|m|h|d)$`
 	Window string `json:"window"`
 }
```
As per coding guidelines, API TYPES SECURITY: Kubebuilder validation markers on all user-facing fields; MaxLength validation on string fields.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@maas-controller/api/maas/v1alpha1/maassubscription_types.go` around lines 78
- 83, The Window field currently allows zero (e.g., "0s") and has no MaxLength;
update the kubebuilder markers on the Window field to enforce a positive integer
and a bounded length: replace the existing Pattern with one that disallows zero
such as ^([1-9]\d*)(s|m|h|d)$ and add a MaxLength marker (e.g.,
+kubebuilder:validation:MaxLength=6) above the Window string definition, then
regenerate the CRD so the new validation is applied; target the Window field in
maassubscription_types.go to make this change.
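The tightened pattern can be spot-checked locally before regenerating the CRD. A minimal bash sketch follows (bash's ERE matching has no `\d`, so `[0-9]` stands in; the sample window values are illustrative):

```shell
#!/usr/bin/env bash
# Spot-check the proposed window pattern: positive integer plus unit,
# rejecting zero-valued and leading-zero durations.
pattern='^([1-9][0-9]*)(s|m|h|d)$'

for w in 1m 24h 10d 0s 00m 5x; do
  if [[ $w =~ $pattern ]]; then
    echo "$w: valid"
  else
    echo "$w: invalid"
  fi
done
```

`0s` and `00m` fail the leading `[1-9]`, and `5x` fails the unit alternation, which matches the validation behavior the review asks for.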
docs/content/reference/crds/maas-subscription.md (1)

27-35: ⚠️ Potential issue | 🟡 Minor

Document the new limit >= 1 constraint here too.

Line 27 updates tokenRateLimits, but Line 34 still reads as if any int64 is valid. The CRD now rejects tokenRateLimits[].limit: 0, so this table will send users into avoidable validation errors.

Diff

```diff
-| limit | int64 | Yes | Maximum number of tokens allowed |
+| limit | int64 | Yes | Maximum number of tokens allowed. Minimum: `1` |
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/content/reference/crds/maas-subscription.md` around lines 27 - 35,
Update the TokenRateLimit documentation to state the minimum value constraint:
change the description for the limit field in the TokenRateLimit table
(referenced as TokenRateLimit and tokenRateLimits) to indicate the integer must
be >= 1 (e.g., "int64 (minimum 1)" or "Maximum number of tokens allowed (must be
>= 1)") so it matches the CRD validation that rejects tokenRateLimits[].limit ==
0.
🧹 Nitpick comments (1)
maas-controller/pkg/controller/maas/maassubscription_controller_test.go (1)

400-400: Use the shared index-key constant instead of repeating the raw field name.

These fake-client registrations have to stay byte-for-byte aligned with the reconciler’s MatchingFields key. Adding more "spec.modelRef" literals here creates extra drift points; use modelRefIndexKey so a future rename stays atomic.

Diff

```diff
-		WithIndex(&maasv1alpha1.MaaSSubscription{}, "spec.modelRef", subscriptionModelRefIndexer).
+		WithIndex(&maasv1alpha1.MaaSSubscription{}, modelRefIndexKey, subscriptionModelRefIndexer).
```

Also applies to: 496-496, 722-722, 816-816

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@maas-controller/pkg/controller/maas/maassubscription_controller_test.go` at
line 400, Replace the repeated raw index key string "spec.modelRef" used in the
fake-client WithIndex registrations with the shared constant modelRefIndexKey so
the tests stay byte-for-byte aligned with the reconciler's MatchingFields;
update each call that currently uses "spec.modelRef" (e.g., the WithIndex call
that uses subscriptionModelRefIndexer for maasv1alpha1.MaaSSubscription) to pass
modelRefIndexKey instead of the literal, and apply the same change at the other
occurrences mentioned (lines around the second, third and fourth hits).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.tekton/odh-maas-controller-pull-request.yaml:
- Around line 12-14: The PR skip pattern currently excludes changes under
".tekton/" (the string literal '.tekton/' in the conditional), so pipeline runs
are suppressed for pipeline config changes; remove '.tekton/' from the OR-list
of matches in the conditional expression (the clause containing
x.matches('^\\.tekton/')) in the PR pipeline YAML files (both the pull-request
YAML shown and the analogous .tekton/odh-maas-api-pull-request.yaml) so that
modifications to .tekton/ will no longer be skipped and the validation pipeline
will run.

In `@docs/content/release-notes/index.md`:
- Around line 7-9: The release notes mention "See the overall migration plan"
but leave a placeholder; update the text under the Version 3.4.0 entry to point
to a real migration document or anchor (e.g., replace the phrase "See the
overall migration plan for detailed upgrade instructions from previous
versions." with a direct link or relative path to the migration guide or an
in-page anchor like "#migration-plan"); ensure the link target exists (create a
migration guide page or add an anchor named "migration-plan" in the docs) so
readers of the release-notes index can follow concrete upgrade instructions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: eaa44915-16ce-4c3d-b0e0-b737ebd849bc

📥 Commits

Reviewing files that changed from the base of the PR and between b25b092 and 3d2208b.

📒 Files selected for processing (11)
  • .tekton/odh-maas-api-pull-request.yaml
  • .tekton/odh-maas-controller-pull-request.yaml
  • deployment/base/maas-controller/crd/bases/maas.opendatahub.io_maassubscriptions.yaml
  • docs/content/configuration-and-management/maas-controller-overview.md
  • docs/content/configuration-and-management/model-listing-flow.md
  • docs/content/reference/crds/maas-subscription.md
  • docs/content/release-notes/index.md
  • maas-controller/api/maas/v1alpha1/maassubscription_types.go
  • maas-controller/api/maas/v1alpha1/zz_generated.deepcopy.go
  • maas-controller/pkg/controller/maas/maassubscription_controller_test.go
  • scripts/deployment-helpers.sh
💤 Files with no reviewable changes (1)
  • maas-controller/api/maas/v1alpha1/zz_generated.deepcopy.go
✅ Files skipped from review due to trivial changes (2)
  • docs/content/configuration-and-management/maas-controller-overview.md
  • docs/content/configuration-and-management/model-listing-flow.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/deployment-helpers.sh

Comment on lines +7 to +9
> Version 3.4.0 introduces new CRDs and API resources that are not compatible with previous versions. All MaaS custom resources (`MaaSModelRef`, `MaaSAuthPolicy`, `MaaSSubscription`) are new in this release.
>
> **Migration:** See the overall migration plan for detailed upgrade instructions from previous versions.
Contributor

@coderabbitai coderabbitai bot Mar 31, 2026


⚠️ Potential issue | 🟡 Minor

Replace the migration placeholder with a real link or path.

Line 9 tells readers to see an “overall migration plan” but doesn’t point to any document or anchor. For a breaking-change entry, that leaves upgrade guidance as a dead end.

🧰 Tools
🪛 LanguageTool

[style] ~7-~7: Consider using “incompatible” to avoid wordiness.
Context: ...ces new CRDs and API resources that are not compatible with previous versions. All MaaS custom...

(NOT_ABLE_PREMIUM)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/content/release-notes/index.md` around lines 7 - 9, The release notes
mention "See the overall migration plan" but leave a placeholder; update the
text under the Version 3.4.0 entry to point to a real migration document or
anchor (e.g., replace the phrase "See the overall migration plan for detailed
upgrade instructions from previous versions." with a direct link or relative
path to the migration guide or an in-page anchor like "#migration-plan"); ensure
the link target exists (create a migration guide page or add an anchor named
"migration-plan" in the docs) so readers of the release-notes index can follow
concrete upgrade instructions.

Contributor Author

Done

Contributor

🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
cat docs/content/release-notes/index.md
```

Length of output: 437


@somya-bhatnagar, the current branch still has the placeholder text with no actual link — the file reads:

**Migration:** See the overall migration plan for detailed upgrade instructions from previous versions.

The fix doesn't appear to have been committed yet. Could you double-check? The line should point to a real document or anchor, for example:

**Migration:** See the [migration guide](../migration/v3.4.0.md) for detailed upgrade instructions from previous versions.

(adjusting the path to wherever the migration document actually lives in your docs tree). If the migration guide doesn't exist yet, you could either create a stub page or add an inline anchor and note that the full guide is forthcoming — just so readers aren't left at a dead end on a breaking-change release.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.tekton/odh-maas-controller-pull-request.yaml:
- Around line 11-14: In the pipelinesascode on-cel expression (the annotation
key pipelinesascode.tekton.dev/on-cel-expression) remove the extra skip pattern
"|| x.matches('^\\.tekton/')" so the CEL boolean expression matches the
controller behavior; locate the expression containing files.all.all(...) and
delete the x.matches('^\\.tekton/') clause (including the preceding ||) so
.tekton/ changes are not excluded from validation.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 5ec8ad56-5f96-4093-98ac-51e447cb5ade

📥 Commits

Reviewing files that changed from the base of the PR and between 3d2208b and f6a5815.

📒 Files selected for processing (1)
  • .tekton/odh-maas-controller-pull-request.yaml

@somya-bhatnagar
Contributor Author

/retest

…ub-io#642)

Extend the CEL expression in Tekton PipelineRun definitions to also skip
Konflux builds when only `.gitignore`, `OWNERS`, `PROJECT`, `LICENSE`,
`.github/`, or `.tekton/` files are changed — matching the existing
`skip_if_only_changed` regex used by the Prow ci-operator jobs.

An example of the existing prow ci-operator job skip condition:
```yaml
skip_if_only_changed: ^docs/|\.md$|^(?:.*/)?(?:\.gitignore|OWNERS|PROJECT|LICENSE|CONTRIBUTING\.md)$|^\.github/|^\.tekton/
```
- https://github.com/openshift/release/blob/01591970ea5008bda7b05a49ce8a0905207182ee/ci-operator/jobs/opendatahub-io/models-as-a-service/opendatahub-io-models-as-a-service-main-presubmits.yaml#L21
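For context, a Pipelines-as-Code skip condition of this kind lives in the `on-cel-expression` annotation on the PipelineRun. The sketch below is illustrative of the shape only; the repository's actual event, branch, and pattern list may differ:

```yaml
metadata:
  annotations:
    pipelinesascode.tekton.dev/on-cel-expression: |
      event == "pull_request" && target_branch == "main" &&
      !files.all.all(x,
        x.matches('^docs/') || x.matches('\\.md$') ||
        x.matches('^\\.gitignore$') || x.matches('^OWNERS$') ||
        x.matches('^\\.github/'))
```

The build triggers only when at least one changed file falls outside the skip patterns; note that `^\\.tekton/` is deliberately absent here, since the later review comment asks for pipeline-definition changes to remain validated.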

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

* **Chores**
  * Updated pull request pipeline triggering conditions to exclude additional configuration files and CI/CD directories from initiating automated checks.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Remove the extra skip pattern || x.matches('^\.tekton/') from the
CEL expression so that changes to .tekton/ pipeline files are validated
by the Konflux builds, matching the expected controller behavior.
@openshift-ci

openshift-ci bot commented Mar 31, 2026

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci

openshift-ci bot commented Mar 31, 2026

@somya-bhatnagar: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/pr-image-mirror-maas-controller | 30a66e0 | link | false | /test pr-image-mirror-maas-controller |
| ci/prow/pr-image-mirror-maas-api | 30a66e0 | link | false | /test pr-image-mirror-maas-api |
| ci/prow/images | 30a66e0 | link | true | /test images |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@somya-bhatnagar
Contributor Author

Closing the ticket. Will create a new one.
