
fix(ci): restart kuadrant operator pod on MissingDependency retry#1425

Open
jlost wants to merge 1 commit into opendatahub-io:master from jlost:fix-kuadrant-limitador-race

Conversation

@jlost

@jlost jlost commented Apr 16, 2026

Summary

The Kuadrant operator checks dependencies (Limitador, Authorino, DNS) only at startup. If it starts before a dependency CSV is fully ready, it caches MissingDependency and never re-checks -- even after the dependency finishes installing.

The existing retry loop (added in #1301) deleted and recreated the Kuadrant CR, which helps operator versions that only subscribe to Create events. However, it does not clear the stale dependency cache inside the running operator pod.

This adds an operator pod restart between retry attempts so the new pod performs a fresh dependency check against the now-ready CSVs. This matches what the operator itself requests in its status message: "please restart Kuadrant Operator pod once dependency has been installed".
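The stale condition is visible on the CR status. A hedged sketch of how one might surface it for debugging (the CR name `kuadrant` and namespace `kuadrant-system` are assumptions, not taken from the script):

```shell
# Hypothetical helper: print the Ready condition message from the Kuadrant CR.
# CR name "kuadrant" and namespace "kuadrant-system" are assumptions.
check_kuadrant_ready_message() {
  oc get kuadrant kuadrant -n kuadrant-system \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'
}
```

If the operator hit the race, the printed message contains the MissingDependency text quoted above.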

Changes

  • Delete the kuadrant operator pod (app=kuadrant,control-plane=controller-manager) after deleting the CR
  • Wait for the restarted pod to become ready before recreating the CR
  • Both new commands use || true to preserve the diagnostic dump on final failure
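Taken together, the retry step might look like the following sketch. The helper names `wait_for_pod_ready` and `create_kuadrant_cr` and the `KUADRANT_NS` variable are assumed from the review excerpts below; this is not the exact script:

```shell
# Sketch of the remediation step between retry attempts.
# KUADRANT_NS, wait_for_pod_ready, and create_kuadrant_cr are assumed to be
# defined elsewhere in deploy.kuadrant.sh.
restart_kuadrant_operator_and_recreate_cr() {
  local selector="app=kuadrant,control-plane=controller-manager"

  echo "Restarting operator pod (clears stale dependency cache)..."
  # "|| true" keeps the loop going so the diagnostic dump still runs on
  # final failure, per the Changes list above.
  oc delete pod -n "${KUADRANT_NS}" -l "${selector}" --wait=true --timeout=120s || true

  # Wait for the replacement pod before recreating the CR, so the fresh
  # dependency check runs against the now-ready CSVs.
  wait_for_pod_ready "${KUADRANT_NS}" "${selector}" 120s || true

  create_kuadrant_cr || true
}
```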

Test plan

  • Trigger e2e-llm-inference-service Konflux group test and verify Kuadrant installs successfully
  • Verify the retry path works by inspecting logs for the "Restarting operator pod" message

Follow-up to #1301.

Made with Cursor

Summary by CodeRabbit

  • Tests
    • Enhanced deployment remediation logic for Kuadrant in OpenShift CI environments, improving operator recovery and readiness verification procedures when deployment issues occur.

The Kuadrant operator checks its dependencies (Limitador, Authorino,
DNS) only at startup. If it starts before a dependency CSV is fully
ready, it caches a MissingDependency status and never re-checks --
even after the dependency finishes installing.

The existing retry loop deleted and recreated the Kuadrant CR, but
that only helps operator versions subscribing to Create events. It
does not clear the stale dependency cache inside the running operator
pod.

Add an operator pod restart between attempts so the new pod performs
a fresh dependency check against the now-ready CSVs.

Follow-up to kserve#1301.

Signed-off-by: James Ostrander <jostrand@redhat.com>
@openshift-ci

openshift-ci Bot commented Apr 16, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci

openshift-ci Bot commented Apr 16, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlost

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Apr 16, 2026

📝 Walkthrough

Walkthrough

The script's Kuadrant failure remediation logic was expanded to include operator pod restart. When Kuadrant fails to reach Ready state, the script now logs the remediation attempt, deletes the Kuadrant CR, forcibly deletes the operator controller-manager pod(s) selected by app=kuadrant,control-plane=controller-manager with error suppression, waits for pod readiness, then recreates the Kuadrant CR. Previously, only CR deletion and recreation occurred.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Actionable Issues

Pod deletion without safety verification: The script deletes operator controller-manager pods using --wait and || true, which suppresses deletion failures silently. If pod deletion hangs or fails due to stuck finalizers or disruption budgets, the script proceeds unaware, potentially leading to orphaned or inconsistent state. Consider explicit timeout handling and error logging instead of silent suppression.

Kuadrant CR deletion without dependency checks: Deleting the Kuadrant CR does not verify whether it has finalizers, dependent resources, or active reconciliations. This can leave orphaned resources or trigger cascading deletions. Validate CR deletion completion before proceeding to pod restart.
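One way to address that suggestion is a small polling guard that confirms the CR is actually gone (finalizers resolved) before restarting the pod. A minimal sketch under the assumption that the CR kind is `kuadrant` and the namespace/name are passed in:

```shell
# Hypothetical guard: poll until the Kuadrant CR is fully deleted, so the
# operator restart never races a CR still held open by finalizers.
wait_for_cr_deleted() {
  local ns="$1" name="$2" timeout="${3:-120}" waited=0
  while oc get kuadrant "$name" -n "$ns" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "ERROR: Kuadrant CR ${name} still present after ${timeout}s" >&2
      return 1
    fi
    sleep 5
    waited=$((waited + 5))
  done
}
```

The script could then gate the pod restart on `wait_for_cr_deleted "${KUADRANT_NS}" kuadrant` succeeding instead of proceeding blindly.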

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main fix: restarting the Kuadrant operator pod during retry attempts when dependencies are missing, which directly matches the primary change in the script.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@jlost jlost marked this pull request as ready for review April 16, 2026 13:50
@jlost
Author

jlost commented Apr 16, 2026

/hold
I'll rerun a few times to try to repro.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/scripts/openshift-ci/infra/deploy.kuadrant.sh`:
- Around line 100-105: The current sequence suppresses failures by using "||
true" after the oc delete pod command and wait_for_pod_ready, then always calls
create_kuadrant_cr; change this so that the pod delete (oc delete pod -n
"${KUADRANT_NS}" -l app=kuadrant,control-plane=controller-manager --wait=true
--timeout=120s) and the readiness check (wait_for_pod_ready "${KUADRANT_NS}"
"app=kuadrant,control-plane=controller-manager" 120s) are allowed to fail
(remove "|| true") and you only call create_kuadrant_cr when both succeed; if
delete or readiness fails, propagate the error (exit non-zero or return failure)
instead of recreating the CR.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 1dfac98d-b6c9-4c06-bd98-92970b2d5796

📥 Commits

Reviewing files that changed from the base of the PR and between 05dbf51 and 802c48f.

📒 Files selected for processing (1)
  • test/scripts/openshift-ci/infra/deploy.kuadrant.sh

Comment on lines +100 to 105
echo " Restarting operator pod (clears stale dependency cache)…"
oc delete pod -n "${KUADRANT_NS}" -l app=kuadrant,control-plane=controller-manager --wait=true --timeout=120s || true
echo "⏳ sleeping ${KUADRANT_POST_DELETE_SLEEP}s before recreating Kuadrant…"
sleep "${KUADRANT_POST_DELETE_SLEEP}"
wait_for_pod_ready "${KUADRANT_NS}" "app=kuadrant,control-plane=controller-manager" 120s || true
create_kuadrant_cr || true

⚠️ Potential issue | 🟠 Major

Do not suppress operator restart failures before CR recreation.

Line 101 and Line 104 swallow failures with || true, then Line 105 always recreates the CR. That can make the remediation a no-op (same operator pod, same stale cache) and reintroduce the race this PR is fixing. Gate CR recreation on successful pod delete + readiness, and fail/continue the attempt otherwise.

Suggested fix
   echo "  Restarting operator pod (clears stale dependency cache)…"
-  oc delete pod -n "${KUADRANT_NS}" -l app=kuadrant,control-plane=controller-manager --wait=true --timeout=120s || true
+  if ! oc delete pod -n "${KUADRANT_NS}" -l app=kuadrant,control-plane=controller-manager --wait=true --timeout=120s; then
+    echo "ERROR: failed to delete kuadrant operator pod(s); skipping CR recreation for this attempt."
+    kuadrant_ready_attempt=$((kuadrant_ready_attempt + 1))
+    continue
+  fi
   echo "⏳ sleeping ${KUADRANT_POST_DELETE_SLEEP}s before recreating Kuadrant…"
   sleep "${KUADRANT_POST_DELETE_SLEEP}"
-  wait_for_pod_ready "${KUADRANT_NS}" "app=kuadrant,control-plane=controller-manager" 120s || true
+  if ! wait_for_pod_ready "${KUADRANT_NS}" "app=kuadrant,control-plane=controller-manager" 120s; then
+    echo "ERROR: operator pod(s) did not become ready after restart; skipping CR recreation for this attempt."
+    kuadrant_ready_attempt=$((kuadrant_ready_attempt + 1))
+    continue
+  fi
   create_kuadrant_cr || true

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/scripts/openshift-ci/infra/deploy.kuadrant.sh` around lines 100 - 105,
The current sequence suppresses failures by using "|| true" after the oc delete
pod command and wait_for_pod_ready, then always calls create_kuadrant_cr; change
this so that the pod delete (oc delete pod -n "${KUADRANT_NS}" -l
app=kuadrant,control-plane=controller-manager --wait=true --timeout=120s) and
the readiness check (wait_for_pod_ready "${KUADRANT_NS}"
"app=kuadrant,control-plane=controller-manager" 120s) are allowed to fail
(remove "|| true") and you only call create_kuadrant_cr when both succeed; if
delete or readiness fails, propagate the error (exit non-zero or return failure)
instead of recreating the CR.

@rhods-ci-bot

/group-test
