
fix(ci): restart kuadrant operator pod on MissingDependency retry#1425

Open
jlost wants to merge 1 commit into opendatahub-io:master from jlost:fix-kuadrant-limitador-race

Conversation

@jlost

@jlost jlost commented Apr 16, 2026

Summary

The Kuadrant operator checks dependencies (Limitador, Authorino, DNS) only at startup. If it starts before a dependency CSV is fully ready, it caches MissingDependency and never re-checks -- even after the dependency finishes installing.

The existing retry loop (added in #1301) deleted and recreated the Kuadrant CR, which helps operator versions that only subscribe to Create events. However, it does not clear the stale dependency cache inside the running operator pod.

This adds an operator pod restart between retry attempts so the new pod performs a fresh dependency check against the now-ready CSVs. This matches what the operator itself requests in its status message: "please restart Kuadrant Operator pod once dependency has been installed".
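The stale condition is visible on the CR status. A hedged sketch of how one might surface it for debugging (the CR name `kuadrant` and namespace `kuadrant-system` are assumptions, not taken from the script):

```shell
# Hypothetical helper: print the Ready condition message from the Kuadrant CR.
# CR name "kuadrant" and namespace "kuadrant-system" are assumptions.
check_kuadrant_ready_message() {
  oc get kuadrant kuadrant -n kuadrant-system \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'
}
```

If the operator hit the race, the printed message contains the MissingDependency text quoted above.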

Changes

  • Delete the kuadrant operator pod (app=kuadrant,control-plane=controller-manager) after deleting the CR
  • Wait for the restarted pod to become ready before recreating the CR
  • Both new commands use || true to preserve the diagnostic dump on final failure
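Taken together, the retry step might look like the following sketch. The helper names `wait_for_pod_ready` and `create_kuadrant_cr` and the `KUADRANT_NS` variable are assumed from the review excerpts below; this is not the exact script:

```shell
# Sketch of the remediation step between retry attempts.
# KUADRANT_NS, wait_for_pod_ready, and create_kuadrant_cr are assumed to be
# defined elsewhere in deploy.kuadrant.sh.
restart_kuadrant_operator_and_recreate_cr() {
  local selector="app=kuadrant,control-plane=controller-manager"

  echo "Restarting operator pod (clears stale dependency cache)..."
  # "|| true" keeps the loop going so the diagnostic dump still runs on
  # final failure, per the Changes list above.
  oc delete pod -n "${KUADRANT_NS}" -l "${selector}" --wait=true --timeout=120s || true

  # Wait for the replacement pod before recreating the CR, so the fresh
  # dependency check runs against the now-ready CSVs.
  wait_for_pod_ready "${KUADRANT_NS}" "${selector}" 120s || true

  create_kuadrant_cr || true
}
```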

Test plan

  • Trigger e2e-llm-inference-service Konflux group test and verify Kuadrant installs successfully
  • Verify the retry path works by inspecting logs for the "Restarting operator pod" message

Follow-up to #1301.

Made with Cursor

Summary by CodeRabbit

  • Tests
    • Enhanced deployment remediation logic for Kuadrant in OpenShift CI environments, improving operator recovery and readiness verification procedures when deployment issues occur.

The Kuadrant operator checks its dependencies (Limitador, Authorino,
DNS) only at startup. If it starts before a dependency CSV is fully
ready, it caches a MissingDependency status and never re-checks --
even after the dependency finishes installing.

The existing retry loop deleted and recreated the Kuadrant CR, but
that only helps operator versions subscribing to Create events. It
does not clear the stale dependency cache inside the running operator
pod.

Add an operator pod restart between attempts so the new pod performs
a fresh dependency check against the now-ready CSVs.

Follow-up to kserve#1301.

Signed-off-by: James Ostrander <jostrand@redhat.com>
@openshift-ci

openshift-ci Bot commented Apr 16, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci

openshift-ci Bot commented Apr 16, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlost

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Apr 16, 2026

📝 Walkthrough

Walkthrough

The script's Kuadrant failure remediation logic was expanded to include operator pod restart. When Kuadrant fails to reach Ready state, the script now logs the remediation attempt, deletes the Kuadrant CR, forcibly deletes the operator controller-manager pod(s) selected by app=kuadrant,control-plane=controller-manager with error suppression, waits for pod readiness, then recreates the Kuadrant CR. Previously, only CR deletion and recreation occurred.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Actionable Issues

Pod deletion without safety verification: The script deletes operator controller-manager pods using --wait and || true, which suppresses deletion failures silently. If pod deletion hangs or fails due to stuck finalizers or disruption budgets, the script proceeds unaware, potentially leading to orphaned or inconsistent state. Consider explicit timeout handling and error logging instead of silent suppression.

Kuadrant CR deletion without dependency checks: Deleting the Kuadrant CR does not verify whether it has finalizers, dependent resources, or active reconciliations. This can leave orphaned resources or trigger cascading deletions. Validate CR deletion completion before proceeding to pod restart.
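One way to address that suggestion is a small polling guard that confirms the CR is actually gone (finalizers resolved) before restarting the pod. A minimal sketch under the assumption that the CR kind is `kuadrant` and the namespace/name are passed in:

```shell
# Hypothetical guard: poll until the Kuadrant CR is fully deleted, so the
# operator restart never races a CR still held open by finalizers.
wait_for_cr_deleted() {
  local ns="$1" name="$2" timeout="${3:-120}" waited=0
  while oc get kuadrant "$name" -n "$ns" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "ERROR: Kuadrant CR ${name} still present after ${timeout}s" >&2
      return 1
    fi
    sleep 5
    waited=$((waited + 5))
  done
}
```

The script could then gate the pod restart on `wait_for_cr_deleted "${KUADRANT_NS}" kuadrant` succeeding instead of proceeding blindly.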

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main fix: restarting the Kuadrant operator pod during retry attempts when dependencies are missing, which directly matches the primary change in the script.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@jlost jlost marked this pull request as ready for review April 16, 2026 13:50
@jlost
Author

jlost commented Apr 16, 2026

/hold
I'll rerun a few times to try to repro.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/scripts/openshift-ci/infra/deploy.kuadrant.sh`:
- Around line 100-105: The current sequence suppresses failures by using "||
true" after the oc delete pod command and wait_for_pod_ready, then always calls
create_kuadrant_cr; change this so that the pod delete (oc delete pod -n
"${KUADRANT_NS}" -l app=kuadrant,control-plane=controller-manager --wait=true
--timeout=120s) and the readiness check (wait_for_pod_ready "${KUADRANT_NS}"
"app=kuadrant,control-plane=controller-manager" 120s) are allowed to fail
(remove "|| true") and you only call create_kuadrant_cr when both succeed; if
delete or readiness fails, propagate the error (exit non-zero or return failure)
instead of recreating the CR.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 1dfac98d-b6c9-4c06-bd98-92970b2d5796

📥 Commits

Reviewing files that changed from the base of the PR and between 05dbf51 and 802c48f.

📒 Files selected for processing (1)
  • test/scripts/openshift-ci/infra/deploy.kuadrant.sh

Comment on lines +100 to 105
echo " Restarting operator pod (clears stale dependency cache)…"
oc delete pod -n "${KUADRANT_NS}" -l app=kuadrant,control-plane=controller-manager --wait=true --timeout=120s || true
echo "⏳ sleeping ${KUADRANT_POST_DELETE_SLEEP}s before recreating Kuadrant…"
sleep "${KUADRANT_POST_DELETE_SLEEP}"
wait_for_pod_ready "${KUADRANT_NS}" "app=kuadrant,control-plane=controller-manager" 120s || true
create_kuadrant_cr || true

⚠️ Potential issue | 🟠 Major

Do not suppress operator restart failures before CR recreation.

Line 101 and Line 104 swallow failures with || true, then Line 105 always recreates the CR. That can make the remediation a no-op (same operator pod, same stale cache) and reintroduce the race this PR is fixing. Gate CR recreation on successful pod delete + readiness, and fail/continue the attempt otherwise.

Suggested fix
   echo "  Restarting operator pod (clears stale dependency cache)…"
-  oc delete pod -n "${KUADRANT_NS}" -l app=kuadrant,control-plane=controller-manager --wait=true --timeout=120s || true
+  if ! oc delete pod -n "${KUADRANT_NS}" -l app=kuadrant,control-plane=controller-manager --wait=true --timeout=120s; then
+    echo "ERROR: failed to delete kuadrant operator pod(s); skipping CR recreation for this attempt."
+    kuadrant_ready_attempt=$((kuadrant_ready_attempt + 1))
+    continue
+  fi
   echo "⏳ sleeping ${KUADRANT_POST_DELETE_SLEEP}s before recreating Kuadrant…"
   sleep "${KUADRANT_POST_DELETE_SLEEP}"
-  wait_for_pod_ready "${KUADRANT_NS}" "app=kuadrant,control-plane=controller-manager" 120s || true
+  if ! wait_for_pod_ready "${KUADRANT_NS}" "app=kuadrant,control-plane=controller-manager" 120s; then
+    echo "ERROR: operator pod(s) did not become ready after restart; skipping CR recreation for this attempt."
+    kuadrant_ready_attempt=$((kuadrant_ready_attempt + 1))
+    continue
+  fi
   create_kuadrant_cr || true

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/scripts/openshift-ci/infra/deploy.kuadrant.sh` around lines 100 - 105,
The current sequence suppresses failures by using "|| true" after the oc delete
pod command and wait_for_pod_ready, then always calls create_kuadrant_cr; change
this so that the pod delete (oc delete pod -n "${KUADRANT_NS}" -l
app=kuadrant,control-plane=controller-manager --wait=true --timeout=120s) and
the readiness check (wait_for_pod_ready "${KUADRANT_NS}"
"app=kuadrant,control-plane=controller-manager" 120s) are allowed to fail
(remove "|| true") and you only call create_kuadrant_cr when both succeed; if
delete or readiness fails, propagate the error (exit non-zero or return failure)
instead of recreating the CR.

@rhods-ci-bot

/group-test
