fix: prevent infinite loop when HostedCluster status not populated#10

Merged
linoyaslan merged 1 commit into rh-ecosystem-edge:main from linoyaslan:fix-infinite-loop-when-hc-status-not-populated on Jan 25, 2026

@linoyaslan linoyaslan commented Jan 25, 2026

Moves the hostedClusterRef assignment outside the phase-gated block and changes StatusSync to skip, rather than requeue, when the HostedCluster (HC) status is empty.

NOTE: This bug was discovered in testing (minikube with mock CRs), but it represents a real timing issue during HostedCluster initialization that could also occur in a real environment.

Summary by CodeRabbit

  • Refactor
    • Restructured HostedCluster reconciliation to use event-driven triggering instead of fixed-delay retries, improving responsiveness and efficiency.
    • Enhanced HostedCluster reference management by validating ownership relationships before setting references.

coderabbitai Bot commented Jan 25, 2026

Walkthrough

The changes refactor the DPFHCPBridge controller to adopt ownership-based reconciliation for HostedCluster resources instead of label-based mapping, and modify status synchronization to rely on event-driven watches rather than timed requeues when HostedCluster status is unavailable.
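
To illustrate the ownership-based mapping described above, here is a minimal sketch in plain Go. It is not the controller's code: `OwnerReference`, `Object`, `Request`, and `requestsForOwner` are hypothetical stand-ins for the Kubernetes and controller-runtime types, and the controller-only filter is an assumption. The real owner-based handler also matches the API group and resolves scope through a REST mapper, which this sketch leaves out.

```go
package main

import "fmt"

// OwnerReference is a trimmed, hypothetical stand-in for metav1.OwnerReference.
type OwnerReference struct {
	Kind       string
	Name       string
	Controller bool
}

// Object is a hypothetical stand-in for the metadata of a watched HostedCluster.
type Object struct {
	Name      string
	Namespace string
	Owners    []OwnerReference
}

// Request is a hypothetical stand-in for reconcile.Request.
type Request struct {
	Namespace string
	Name      string
}

// requestsForOwner returns one request per controlling owner of the given kind.
// An object with no matching owner produces no requests, so it never triggers
// a reconcile of an unrelated custom resource.
func requestsForOwner(obj Object, ownerKind string) []Request {
	var reqs []Request
	for _, ref := range obj.Owners {
		if ref.Kind == ownerKind && ref.Controller {
			// Assumption: the owner lives in the same namespace as the owned object.
			reqs = append(reqs, Request{Namespace: obj.Namespace, Name: ref.Name})
		}
	}
	return reqs
}

func main() {
	owned := Object{
		Name:      "bridge-a",
		Namespace: "clusters",
		Owners:    []OwnerReference{{Kind: "DPFHCPBridge", Name: "bridge-a", Controller: true}},
	}
	orphan := Object{Name: "unrelated", Namespace: "clusters"}

	fmt.Println(requestsForOwner(owned, "DPFHCPBridge"))
	fmt.Println(len(requestsForOwner(orphan, "DPFHCPBridge")))
}
```

Compared with a label-based lookup, the mapping depends only on the owner reference already recorded on the HostedCluster, so no separate label convention has to be kept in sync.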

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| HostedClusterRef and Watch Management | internal/controller/dpfhcpbridge_controller.go | Relocated HostedClusterRef assignment from unconditional post-creation to conditional (nil-check with fetch and ownership validation). Removed hostedClusterToRequests helper function. Replaced label-based HostedCluster watch mapping with owner-based EnqueueRequestForOwner approach tied to DPFHCPBridge ownership. |
| Status Synchronization | internal/controller/hostedcluster/status.go | Changed behavior when HostedCluster status is unavailable: removed fixed-delay requeue (RequeueAfter 10s) and replaced with no-op immediate return. Updated log message and removed RequeueDelayStatusPending constant and time import. Reconciliation now triggered by watch events instead of timed retry. |
| Test Updates | internal/controller/hostedcluster/status_test.go | Renamed test case from "should requeue when..." to "should skip sync when...". Changed assertions to verify no requeue (Requeue = false, RequeueAfter = 0) instead of fixed delay. Added explanatory comment on watch-driven reconciliation. |
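
The status synchronization change summarized in the table can be sketched as follows. This is an illustration under stated assumptions, not the code in status.go: `Result` stands in for `ctrl.Result`, `HostedClusterStatus` and `syncStatus` are hypothetical, and treating "no conditions" as "status not populated" is an assumed definition of an empty status.

```go
package main

import (
	"fmt"
	"time"
)

// Result is a hypothetical stand-in for ctrl.Result.
type Result struct {
	Requeue      bool
	RequeueAfter time.Duration
}

// HostedClusterStatus is a trimmed, hypothetical stand-in for the real status type.
type HostedClusterStatus struct {
	Conditions []string
}

// syncStatus copies conditions from the HostedCluster status when they exist.
// When the status is not populated it returns immediately with a zero Result,
// so no timed retry is scheduled; the next reconcile is expected to come from
// a watch event on the HostedCluster.
func syncStatus(hc *HostedClusterStatus) ([]string, Result) {
	if hc == nil || len(hc.Conditions) == 0 {
		// Assumption: "empty" means no conditions have been reported yet.
		return nil, Result{}
	}
	mirrored := append([]string(nil), hc.Conditions...)
	return mirrored, Result{}
}

func main() {
	_, res := syncStatus(&HostedClusterStatus{})
	fmt.Println(res.Requeue, res.RequeueAfter)

	conds, _ := syncStatus(&HostedClusterStatus{Conditions: []string{"Available"}})
	fmt.Println(conds)
}
```

The zero `Result` corresponds to the updated test assertions (Requeue = false, RequeueAfter = 0): the function no longer asks to be called again on a timer.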

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: prevent infinite loop when HostedCluster status not populated' directly and accurately describes the main change: fixing an infinite loop issue by modifying status sync behavior to skip requeuing when HostedCluster status is empty. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/controller/dpfhcpbridge_controller.go (1)

285-295: Fix goconst lint failure: extract repeated ConfigMap name/namespace.

Lint reports ocp-bluefield-images repeated; extract constants and reuse.

♻️ Proposed fix
+const (
+    ocpBluefieldImagesConfigMap = "ocp-bluefield-images"
+    dpfHcpBridgeSystemNamespace = "dpf-hcp-bridge-system"
+)
- return e.Object.GetName() == "ocp-bluefield-images" &&
-     e.Object.GetNamespace() == "dpf-hcp-bridge-system"
+ return e.Object.GetName() == ocpBluefieldImagesConfigMap &&
+     e.Object.GetNamespace() == dpfHcpBridgeSystemNamespace
...
- return e.ObjectNew.GetName() == "ocp-bluefield-images" &&
-     e.ObjectNew.GetNamespace() == "dpf-hcp-bridge-system"
+ return e.ObjectNew.GetName() == ocpBluefieldImagesConfigMap &&
+     e.ObjectNew.GetNamespace() == dpfHcpBridgeSystemNamespace
...
- return e.Object.GetName() == "ocp-bluefield-images" &&
-     e.Object.GetNamespace() == "dpf-hcp-bridge-system"
+ return e.Object.GetName() == ocpBluefieldImagesConfigMap &&
+     e.Object.GetNamespace() == dpfHcpBridgeSystemNamespace
🤖 Fix all issues with AI agents
In `@internal/controller/dpfhcpbridge_controller.go`:
- Around line 204-221: The code silently ignores any non-nil error returned by
r.Get when trying to fetch the HostedCluster (hc, hcKey, r.Get) which can drop
reconciliation; update the logic in the Reconcile handler (where cr, hc and
hcKey are used) to distinguish errors.IsNotFound from other errors and handle
transient failures by logging the error and returning it (or returning a requeue
result) instead of ignoring it, so that RBAC/API/network errors cause a retry
while still proceeding to set cr.Status.HostedClusterRef when the HostedCluster
is found and owned.

Comment on lines +204 to +221
// Set hostedClusterRef if HostedCluster exists and is owned by this CR
// This ensures the ref is always set when the HostedCluster exists, regardless of phase
if cr.Status.HostedClusterRef == nil {
    hc := &hyperv1.HostedCluster{}
    hcKey := types.NamespacedName{Name: cr.Name, Namespace: cr.Namespace}
    if err := r.Get(ctx, hcKey, hc); err == nil {
        // HostedCluster exists - verify ownership and set ref
        if metav1.IsControlledBy(hc, &cr) {
            log.V(1).Info("Setting hostedClusterRef for existing HostedCluster")
            cr.Status.HostedClusterRef = &corev1.ObjectReference{
                Name:       cr.Name,
                Namespace:  cr.Namespace,
                Kind:       "HostedCluster",
                APIVersion: "hypershift.openshift.io/v1beta1",
            }
        }
    }
}

⚠️ Potential issue | 🟠 Major

Handle HostedCluster Get errors to avoid dropping reconciliation.

Right now any non‑nil error from r.Get is silently ignored, which can leave HostedClusterRef unset indefinitely if no further events fire. Consider returning the error (or at least logging + requeue) for transient API/RBAC/network failures.

🐛 Proposed fix
+    if err := r.Get(ctx, hcKey, hc); err != nil {
+        if apierrors.IsNotFound(err) {
+            // HostedCluster not found yet; skip
+        } else {
+            log.Error(err, "Failed to get HostedCluster for hostedClusterRef", "hostedCluster", hcKey.String())
+            return ctrl.Result{}, err
+        }
+    } else {
+        // HostedCluster exists - verify ownership and set ref
+        if metav1.IsControlledBy(hc, &cr) {
+            log.V(1).Info("Setting hostedClusterRef for existing HostedCluster")
+            cr.Status.HostedClusterRef = &corev1.ObjectReference{
+                Name:       cr.Name,
+                Namespace:  cr.Namespace,
+                Kind:       "HostedCluster",
+                APIVersion: "hypershift.openshift.io/v1beta1",
+            }
+        }
+    }
📝 Committable suggestion

// Set hostedClusterRef if HostedCluster exists and is owned by this CR
// This ensures the ref is always set when the HostedCluster exists, regardless of phase
if cr.Status.HostedClusterRef == nil {
    hc := &hyperv1.HostedCluster{}
    hcKey := types.NamespacedName{Name: cr.Name, Namespace: cr.Namespace}
    if err := r.Get(ctx, hcKey, hc); err != nil {
        if apierrors.IsNotFound(err) {
            // HostedCluster not found yet; skip
        } else {
            log.Error(err, "Failed to get HostedCluster for hostedClusterRef", "hostedCluster", hcKey.String())
            return ctrl.Result{}, err
        }
    } else {
        // HostedCluster exists - verify ownership and set ref
        if metav1.IsControlledBy(hc, &cr) {
            log.V(1).Info("Setting hostedClusterRef for existing HostedCluster")
            cr.Status.HostedClusterRef = &corev1.ObjectReference{
                Name:       cr.Name,
                Namespace:  cr.Namespace,
                Kind:       "HostedCluster",
                APIVersion: "hypershift.openshift.io/v1beta1",
            }
        }
    }
}

@linoyaslan (Collaborator, Author) replied:

will be handled in NVIDIA-493

@linoyaslan linoyaslan merged commit f1c25b6 into rh-ecosystem-edge:main Jan 25, 2026
3 of 4 checks passed