fix(ci): restart kuadrant operator pod on MissingDependency retry#1425
fix(ci): restart kuadrant operator pod on MissingDependency retry#1425jlost wants to merge 1 commit intoopendatahub-io:masterfrom
Conversation
The Kuadrant operator checks its dependencies (Limitador, Authorino, DNS) only at startup. If it starts before a dependency CSV is fully ready, it caches a MissingDependency status and never re-checks -- even after the dependency finishes installing. The existing retry loop deleted and recreated the Kuadrant CR, but that only helps operator versions subscribing to Create events. It does not clear the stale dependency cache inside the running operator pod. Add an operator pod restart between attempts so the new pod performs a fresh dependency check against the now-ready CSVs. Follow-up to kserve#1301. Signed-off-by: James Ostrander <jostrand@redhat.com>
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jlost The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
📝 WalkthroughWalkthroughThe script's Kuadrant failure remediation logic was expanded to include operator pod restart. When Kuadrant fails to reach Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Actionable IssuesPod deletion without safety verification: The script deletes operator controller-manager pods using Kuadrant CR deletion without dependency checks: Deleting the Kuadrant CR does not verify whether it has finalizers, dependent resources, or active reconciliations. This can leave orphaned resources or trigger cascading deletions. Validate CR deletion completion before proceeding to pod restart. 🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
/hold |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/scripts/openshift-ci/infra/deploy.kuadrant.sh`:
- Around line 100-105: The current sequence suppresses failures by using "||
true" after the oc delete pod command and wait_for_pod_ready, then always calls
create_kuadrant_cr; change this so that the pod delete (oc delete pod -n
"${KUADRANT_NS}" -l app=kuadrant,control-plane=controller-manager --wait=true
--timeout=120s) and the readiness check (wait_for_pod_ready "${KUADRANT_NS}"
"app=kuadrant,control-plane=controller-manager" 120s) are allowed to fail
(remove "|| true") and you only call create_kuadrant_cr when both succeed; if
delete or readiness fails, propagate the error (exit non-zero or return failure)
instead of recreating the CR.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 1dfac98d-b6c9-4c06-bd98-92970b2d5796
📒 Files selected for processing (1)
test/scripts/openshift-ci/infra/deploy.kuadrant.sh
| echo " Restarting operator pod (clears stale dependency cache)…" | ||
| oc delete pod -n "${KUADRANT_NS}" -l app=kuadrant,control-plane=controller-manager --wait=true --timeout=120s || true | ||
| echo "⏳ sleeping ${KUADRANT_POST_DELETE_SLEEP}s before recreating Kuadrant…" | ||
| sleep "${KUADRANT_POST_DELETE_SLEEP}" | ||
| wait_for_pod_ready "${KUADRANT_NS}" "app=kuadrant,control-plane=controller-manager" 120s || true | ||
| create_kuadrant_cr || true |
There was a problem hiding this comment.
Do not suppress operator restart failures before CR recreation.
Line 101 and Line 104 swallow failures with || true, then Line 105 always recreates the CR. That can make the remediation a no-op (same operator pod, same stale cache) and reintroduce the race this PR is fixing. Gate CR recreation on successful pod delete + readiness, and fail/continue the attempt otherwise.
Suggested fix
echo " Restarting operator pod (clears stale dependency cache)…"
- oc delete pod -n "${KUADRANT_NS}" -l app=kuadrant,control-plane=controller-manager --wait=true --timeout=120s || true
+ if ! oc delete pod -n "${KUADRANT_NS}" -l app=kuadrant,control-plane=controller-manager --wait=true --timeout=120s; then
+ echo "ERROR: failed to delete kuadrant operator pod(s); skipping CR recreation for this attempt."
+ kuadrant_ready_attempt=$((kuadrant_ready_attempt + 1))
+ continue
+ fi
echo "⏳ sleeping ${KUADRANT_POST_DELETE_SLEEP}s before recreating Kuadrant…"
sleep "${KUADRANT_POST_DELETE_SLEEP}"
- wait_for_pod_ready "${KUADRANT_NS}" "app=kuadrant,control-plane=controller-manager" 120s || true
+ if ! wait_for_pod_ready "${KUADRANT_NS}" "app=kuadrant,control-plane=controller-manager" 120s; then
+ echo "ERROR: operator pod(s) did not become ready after restart; skipping CR recreation for this attempt."
+ kuadrant_ready_attempt=$((kuadrant_ready_attempt + 1))
+ continue
+ fi
create_kuadrant_cr || trueAs per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/scripts/openshift-ci/infra/deploy.kuadrant.sh` around lines 100 - 105,
The current sequence suppresses failures by using "|| true" after the oc delete
pod command and wait_for_pod_ready, then always calls create_kuadrant_cr; change
this so that the pod delete (oc delete pod -n "${KUADRANT_NS}" -l
app=kuadrant,control-plane=controller-manager --wait=true --timeout=120s) and
the readiness check (wait_for_pod_ready "${KUADRANT_NS}"
"app=kuadrant,control-plane=controller-manager" 120s) are allowed to fail
(remove "|| true") and you only call create_kuadrant_cr when both succeed; if
delete or readiness fails, propagate the error (exit non-zero or return failure)
instead of recreating the CR.
|
/group-test |
Summary
The Kuadrant operator checks dependencies (Limitador, Authorino, DNS) only at startup. If it starts before a dependency CSV is fully ready, it caches
MissingDependencyand never re-checks -- even after the dependency finishes installing.The existing retry loop (added in #1301) deleted and recreated the Kuadrant CR, which helps operator versions that only subscribe to Create events. However, it does not clear the stale dependency cache inside the running operator pod.
This adds an operator pod restart between retry attempts so the new pod performs a fresh dependency check against the now-ready CSVs. This matches what the operator itself requests in its status message: "please restart Kuadrant Operator pod once dependency has been installed".
Changes
app=kuadrant,control-plane=controller-manager) after deleting the CR|| trueto preserve the diagnostic dump on final failureTest plan
e2e-llm-inference-serviceKonflux group test and verify Kuadrant installs successfullyFollow-up to #1301.
Made with Cursor
Summary by CodeRabbit