6 changes: 5 additions & 1 deletion test/scripts/openshift-ci/infra/deploy.kuadrant.sh
@@ -94,10 +94,14 @@ while (( kuadrant_ready_attempt <= KUADRANT_READY_MAX_ATTEMPTS )); do
 oc logs -n "${KUADRANT_NS}" deployment/kuadrant-operator-controller-manager --tail=200 || true
 exit 1
 fi
-echo "Kuadrant not Ready; deleting and recreating CR to trigger a new Create reconcile (helps operator versions that only subscribe to Create)…"
+echo "Kuadrant not Ready; attempting to fix…"
+echo " Recreating CR (triggers fresh Create reconcile for operator versions that only subscribe to Create)…"
 oc delete kuadrant kuadrant -n "${KUADRANT_NS}" --ignore-not-found=true --wait=true --timeout=300s
+echo " Restarting operator pod (clears stale dependency cache)…"
+oc delete pod -n "${KUADRANT_NS}" -l app=kuadrant,control-plane=controller-manager --wait=true --timeout=120s || true
 echo "⏳ sleeping ${KUADRANT_POST_DELETE_SLEEP}s before recreating Kuadrant…"
 sleep "${KUADRANT_POST_DELETE_SLEEP}"
+wait_for_pod_ready "${KUADRANT_NS}" "app=kuadrant,control-plane=controller-manager" 120s || true
 create_kuadrant_cr || true
Comment on lines +100 to 105

⚠️ Potential issue | 🟠 Major

Do not suppress operator restart failures before CR recreation.

Lines 101 and 104 swallow failures with || true, and line 105 then recreates the CR unconditionally. If the pod delete or the readiness wait fails, the remediation can become a no-op (same operator pod, same stale cache) and reintroduce the race this PR is fixing. Gate CR recreation on a successful pod delete and a successful readiness check; if either fails, skip recreation and move on to the next attempt.

Suggested fix
   echo "  Restarting operator pod (clears stale dependency cache)…"
-  oc delete pod -n "${KUADRANT_NS}" -l app=kuadrant,control-plane=controller-manager --wait=true --timeout=120s || true
+  if ! oc delete pod -n "${KUADRANT_NS}" -l app=kuadrant,control-plane=controller-manager --wait=true --timeout=120s; then
+    echo "ERROR: failed to delete kuadrant operator pod(s); skipping CR recreation for this attempt."
+    kuadrant_ready_attempt=$((kuadrant_ready_attempt + 1))
+    continue
+  fi
   echo "⏳ sleeping ${KUADRANT_POST_DELETE_SLEEP}s before recreating Kuadrant…"
   sleep "${KUADRANT_POST_DELETE_SLEEP}"
-  wait_for_pod_ready "${KUADRANT_NS}" "app=kuadrant,control-plane=controller-manager" 120s || true
+  if ! wait_for_pod_ready "${KUADRANT_NS}" "app=kuadrant,control-plane=controller-manager" 120s; then
+    echo "ERROR: operator pod(s) did not become ready after restart; skipping CR recreation for this attempt."
+    kuadrant_ready_attempt=$((kuadrant_ready_attempt + 1))
+    continue
+  fi
   create_kuadrant_cr || true

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/scripts/openshift-ci/infra/deploy.kuadrant.sh` around lines 100 - 105,
The current sequence suppresses failures by using "|| true" after the oc delete
pod command and wait_for_pod_ready, then always calls create_kuadrant_cr; change
this so that the pod delete (oc delete pod -n "${KUADRANT_NS}" -l
app=kuadrant,control-plane=controller-manager --wait=true --timeout=120s) and
the readiness check (wait_for_pod_ready "${KUADRANT_NS}"
"app=kuadrant,control-plane=controller-manager" 120s) are allowed to fail
(remove "|| true") and you only call create_kuadrant_cr when both succeed; if
delete or readiness fails, propagate the error (exit non-zero or return failure)
instead of recreating the CR.
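
The gating described above could also be factored into a helper that returns non-zero instead of calling continue inline. The following is only a sketch under assumptions: remediate_kuadrant is a hypothetical name that does not appear in the PR, and it assumes that wait_for_pod_ready and create_kuadrant_cr (both defined elsewhere in the script) return non-zero on failure and that KUADRANT_NS and KUADRANT_POST_DELETE_SLEEP are already set.

```shell
# Hypothetical helper (not part of the PR): restart the operator pod and
# recreate the Kuadrant CR only if both the delete and the readiness check
# succeed. Returns non-zero without recreating the CR otherwise.
remediate_kuadrant() {
  local selector="app=kuadrant,control-plane=controller-manager"

  echo "  Restarting operator pod (clears stale dependency cache)"
  if ! oc delete pod -n "${KUADRANT_NS}" -l "${selector}" --wait=true --timeout=120s; then
    echo "ERROR: failed to delete kuadrant operator pod(s); not recreating CR." >&2
    return 1
  fi

  echo "  Sleeping ${KUADRANT_POST_DELETE_SLEEP}s before recreating Kuadrant"
  sleep "${KUADRANT_POST_DELETE_SLEEP}"

  # Assumption: wait_for_pod_ready signals a timeout via its exit status.
  if ! wait_for_pod_ready "${KUADRANT_NS}" "${selector}" 120s; then
    echo "ERROR: operator pod(s) not ready after restart; not recreating CR." >&2
    return 1
  fi

  # Only reached when the operator was actually restarted and is ready.
  create_kuadrant_cr
}
```

Inside the retry loop, a call such as `remediate_kuadrant || echo "remediation incomplete for this attempt"` would let the existing increment at the end of the loop body run on every path, so the counter would not need to be bumped before each continue.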

kuadrant_ready_attempt=$((kuadrant_ready_attempt + 1))
done