MGMT-22278: Skip deleting spoke resources for an uninstalled agent #8608
base: master
Conversation
Skipping CI for Draft Pull Request.
@CrystalChun: This pull request references MGMT-22278, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
CodeRabbit: Estimated code review effort 🎯 4 (Complex) | ⏱️ ~45 minutes.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: CrystalChun. The full list of commands accepted by this bot can be found here; the pull request process is described here.
Previously, when an Agent started installation but had not finished installing, and a user then tried to delete it, the Agent would get stuck deleting because spoke resource deletion would fail. This change gates the spoke cleanup process on the Agent's current status: the Agent must either be installed or have spoke resources that need to be removed. If the Agent is not installed and has no spoke resources, the spoke cleanup process is skipped.
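The gating described above can be sketched as a small predicate. This is a minimal sketch with hypothetical simplified types; the real check (`spokeResourcesExist`) lives in `internal/controller/controllers/agent_controller.go` and inspects the Agent CR's status fields:

```go
package main

import "fmt"

// Agent is a simplified stand-in for the Agent status fields involved in the
// gating decision (hypothetical names; the real types live in api/v1beta1).
type Agent struct {
	CurrentStage string // e.g. "Done" once installation completes
	ApprovedCSRs int    // count of approved CSRs for the node
	SpokeBMH     bool   // true if the BMH carries the spoke-created annotation
}

// spokeResourcesExist mirrors the idea described above: spoke cleanup only
// runs when the agent finished installing or provably created spoke
// resources (spoke BMH annotation, or approved CSRs).
func spokeResourcesExist(a Agent) bool {
	return a.CurrentStage == "Done" || a.SpokeBMH || a.ApprovedCSRs > 0
}

func main() {
	midInstall := Agent{CurrentStage: "Installing"}
	installed := Agent{CurrentStage: "Done"}
	fmt.Println(spokeResourcesExist(midInstall)) // false: cleanup is skipped
	fmt.Println(spokeResourcesExist(installed))  // true: cleanup proceeds
}
```

The key design point is that all three signals are read from the Agent's own status, so the finalizer never has to contact the spoke cluster just to decide whether cleanup is needed.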
Force-pushed 44ca0ae to 1554411.
/test ?
@CrystalChun: The following commands are available to trigger required jobs, and the following commands are available to trigger optional jobs.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/test edge-subsystem-kubeapi-aws
Codecov Report: ❌ Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## master #8608 +/- ##
==========================================
+ Coverage 43.49% 43.50% +0.01%
==========================================
Files 411 411
Lines 71244 71271 +27
==========================================
+ Hits 30987 31010 +23
- Misses 37497 37501 +4
Partials 2760 2760
Actionable comments posted: 1
🧹 Nitpick comments (1)
internal/controller/controllers/agent_controller.go (1)
929-949: Error handling already mitigates transient failures appropriately. The concern about getBMH errors blocking deletion is valid. However, the code already provides safeguards: client.IgnoreNotFound() filters out NotFound errors (lines 675-676), errors are logged with context (line 513), and the requeue mechanism enables automatic retry. This conservative error-handling pattern is consistent across all pre-deletion checks and intentionally prioritizes preventing incomplete cleanup over avoiding temporary deletion delays. If persistent transient errors become a production issue, metrics on error frequency would be a useful optional enhancement, but the current logging provides adequate visibility.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (1)
internal/controller/controllers/agent_controller.go
🧰 Additional context used
📓 Path-based instructions (1)
⚙️ CodeRabbit configuration file
- Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
internal/controller/controllers/agent_controller.go
🧬 Code graph analysis (1)
internal/controller/controllers/agent_controller.go (3)
api/v1beta1/agent_types.go (1): CSRStatus (324-330)
internal/controller/controllers/bmh_agent_controller.go (2): AGENT_BMH_LABEL (81-81), BMH_SPOKE_CREATED_ANNOTATION (99-99)
models/host_stage.go (1): HostStageDone (67-67)
🔇 Additional comments (2)
internal/controller/controllers/agent_controller.go (2)
665-687: LGTM! Clean helper extraction. The getBMH helper properly handles the case where no BMH label exists and uses client.IgnoreNotFound to treat "not found" as a non-error. Refactoring bmhExists to delegate to getBMH eliminates duplication.
511-517: LGTM! Cleanup gating aligns with PR objectives. The spoke cleanup is now correctly gated on three conditions:
- No skip annotation
- Agent bound to a cluster
- Spoke resources actually exist (per spokeResourcesExist)
This prevents the deletion flow from attempting cleanup when the agent hasn't progressed far enough to create spoke resources, addressing the stuck-deletion scenario described in the PR.
Force-pushed baac52b to 2ba0d2b.
Actionable comments posted: 0
♻️ Duplicate comments (1)
internal/controller/controllers/agent_controller.go (1)
963-974: Deferred patch error won't propagate to caller. The err = patchErr assignment inside the defer won't affect the returned error, because Go evaluates return values before executing deferred functions and the return values here are unnamed. The patch error is logged but silently discarded. If patch errors should propagate, use named return values:
🔎 Use named return values to propagate patch errors
```diff
-func (r *AgentReconciler) updateStatus(ctx context.Context, log logrus.FieldLogger, agent, origAgent *aiv1beta1.Agent, h *models.Host, clusterId *strfmt.UUID, syncErr error, internal bool) (ctrl.Result, error) {
+func (r *AgentReconciler) updateStatus(ctx context.Context, log logrus.FieldLogger, agent, origAgent *aiv1beta1.Agent, h *models.Host, clusterId *strfmt.UUID, syncErr error, internal bool) (ret ctrl.Result, err error) {
 	var (
-		err                   error
 		shouldAutoApproveCSRs bool
 		spokeClient           spoke_k8s_client.SpokeK8sClient
 		node                  *corev1.Node
-		ret                   ctrl.Result = ctrl.Result{}
 	)
```
Alternatively, if patch errors should only be logged without affecting reconciliation, remove the err = patchErr assignment to avoid confusion about intent.
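The underlying Go behavior is easy to demonstrate in isolation: a defer can only change what the caller sees when the result is named, because unnamed results are copied out before deferred functions run. This is a self-contained sketch, unrelated to the controller code:

```go
package main

import (
	"errors"
	"fmt"
)

// unnamed has an unnamed result: the assignment inside the defer mutates a
// local variable after the return value was already copied out, so the
// "patch failed" error is silently discarded.
func unnamed() error {
	var err error
	defer func() { err = errors.New("patch failed") }() // lost
	return err
}

// named has a named result: the defer mutates the actual return slot, so the
// "patch failed" error propagates to the caller.
func named() (err error) {
	defer func() { err = errors.New("patch failed") }() // propagates
	return nil
}

func main() {
	fmt.Println(unnamed()) // <nil>
	fmt.Println(named())   // patch failed
}
```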
🧹 Nitpick comments (1)
internal/controller/controllers/agent_controller.go (1)
1057-1059: Setting err = nil before return is redundant. At line 1058, err = nil is set explicitly, but line 1059 then returns ret, nil directly anyway; the assignment has no effect.
🔎 Simplify by removing the redundant assignment
```diff
 if err != nil {
 	ret = ctrl.Result{Requeue: true}
-	err = nil
 	return ret, nil
 }
```
🔇 Additional comments (3)
internal/controller/controllers/agent_controller.go (3)
511-517: LGTM! Proper gating of spoke resource cleanup. The new spokeResourcesExist check correctly gates the cleanup logic, ensuring spoke resources are only removed when they actually exist. This addresses the PR objective of preventing stuck agent deletions when installation hasn't completed.
665-687: Clean helper extraction. Good refactor extracting getBMH as a reusable helper that bmhExists and spokeResourcesExist can both leverage. The error handling with client.IgnoreNotFound is appropriate.
929-949: Logic correctly identifies when spoke resources exist. The three conditions comprehensively cover scenarios where spoke resources would exist:
- Agent completed installation (Done stage)
- BMH with spoke-created annotation indicates spoke resources were provisioned
- Approved CSRs indicate the node is joining/has joined the spoke cluster
This aligns well with the PR objective to skip cleanup for agents that haven't progressed far enough in installation.
Force-pushed 2ba0d2b to fc1268f.
Actionable comments posted: 0
🧹 Nitpick comments (1)
internal/controller/controllers/agent_controller_test.go (1)
5024-5042: Consider testing both CSR types for comprehensive coverage. The test correctly verifies that node removal proceeds during installation when CSRs are approved (indicating spoke resources exist). However, it only tests with a serving CSR (CSRTypeServing). For more comprehensive coverage, consider adding a test that verifies behavior with both client and serving CSRs, since both types are tracked in agent.Status.CSRStatus.ApprovedCSRs.
🔎 Optional: Test with both CSR types
Consider adding a similar test case that includes both CSR types:
```go
agent.Status.CSRStatus.ApprovedCSRs = []v1beta1.CSRInfo{
	{
		Name:       "csr-host-client",
		Type:       v1beta1.CSRTypeClient,
		ApprovedAt: metav1.Now(),
	},
	{
		Name:       "csr-host-serving",
		Type:       v1beta1.CSRTypeServing,
		ApprovedAt: metav1.Now(),
	},
}
```
This would verify the cleanup logic handles both CSR types correctly, matching the real-world scenario where both are typically approved during installation.
📒 Files selected for processing (2): internal/controller/controllers/agent_controller.go, internal/controller/controllers/agent_controller_test.go
🔇 Additional comments (10)
internal/controller/controllers/agent_controller.go (5)
665-679: LGTM! Clean helper extraction. The getBMH helper properly handles the lookup logic with correct error handling: it returns nil when no BMH label exists or the BMH is not found (both expected scenarios), and propagates other errors appropriately. This extraction improves code readability and reusability.
929-949: Excellent design decision to gate cleanup on resource existence. This helper correctly determines spoke resource presence by checking three indicators:
- Agent installation completed (HostStageDone)
- BMH has the spoke creation annotation
- CSRs have been approved (node joining flow initiated)
This directly addresses the PR objective: preventing stuck Agent deletions when installation started but hasn't progressed far enough to create spoke resources. The logic properly handles uninitialized states (empty stage string ≠ HostStageDone).
511-517: LGTM! Core fix correctly gates spoke cleanup. The finalizer now properly checks spokeResourcesExist before attempting cleanup, preventing the stuck-deletion scenario when an agent started installation but hasn't created spoke resources yet. The condition correctly requires all three: no skip annotation, bound to a cluster, and spoke resources exist.
Note: The spoke client initialization on lines 535-543 is wisely deferred until cleanup is confirmed necessary, avoiding unnecessary spoke cluster connections.
961-974: LGTM! Defer refactoring addresses previous review concern. The updateStatus refactoring consolidates return paths through a single ret variable and ensures status patching happens via defer. The past review concern about defer return signatures has been properly addressed: the defer now assigns to the err variable in the enclosing scope, ensuring patch errors propagate to the caller.
Minor note: if both the function body and the patch operation fail, the patch error will override the original error on line 968. However, both errors are logged, so this trade-off is acceptable for cleaner status-patching logic.
681-687: LGTM! Clean refactoring. The bmhExists method now correctly delegates to the getBMH helper, improving maintainability by centralizing the BMH lookup logic while preserving error propagation.
internal/controller/controllers/agent_controller_test.go (5)
3442-3443: LGTM: Test expectations correctly updated. The updated expectations now verify that the agent maintains its status (HostStatusInstalling) and stage (HostStageRebooting) even when encountering node retrieval errors, rather than clearing these fields. This aligns with the test setup and reflects proper error handling in the production code.
4786-4786: LGTM: Proper test setup for terminal state. Setting the agent's current stage to Done in the BeforeEach establishes the baseline that the agent has completed installation before finalization. This is appropriate for most finalizer test scenarios and allows individual tests to override it when testing mid-installation cleanup.
4921-4952: LGTM: Test correctly verifies cleanup gating. This test properly validates that spoke resource cleanup is skipped when an agent is in the middle of installation (HostStageInstalling) without any spoke resources created (no approved CSRs or BMH). The test correctly:
- Overrides the BeforeEach stage to simulate mid-installation
- Expects no spoke client initialization or node removal operations
- Verifies the finalizer completes without attempting spoke cleanup
This aligns with the PR objective to prevent stuck deletions when agents haven't completed installation.
5005-5005: LGTM: Consistent baseline setup. Setting the agent stage to Done in the nested BeforeEach establishes the baseline for tests using the fake spoke client. This is appropriate since these tests typically verify spoke resource cleanup for completed installations, with individual tests able to override it when testing edge cases.
5044-5071: LGTM: Test correctly verifies BMH cleanup logic. This test properly validates that BMH removal proceeds during installation when the BMH has the spoke-created annotation (BMH_SPOKE_CREATED_ANNOTATION: "true"). The test correctly:
- Sets up the agent with the BMH label linking to the BMH
- Mocks the BMH with the annotation indicating it was created on the spoke cluster
- Verifies the BMH is deleted during finalization
- Uses consistent verification patterns (IsNotFound check)
The presence of the annotation indicates the BMH was created on the spoke cluster and should be cleaned up, even during mid-installation deletion.
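The finalizer-gating cases these tests exercise can be summarized as a table-driven sketch (hypothetical simplified predicate; the real tests use Ginkgo/Gomega against the controller):

```go
package main

import "fmt"

// cleanupSkipped mirrors the gating rule under test: spoke cleanup is
// skipped only when the agent neither finished installing nor created any
// spoke resources (no spoke BMH annotation, no approved CSRs).
func cleanupSkipped(stage string, spokeBMH bool, approvedCSRs int) bool {
	return stage != "Done" && !spokeBMH && approvedCSRs == 0
}

func main() {
	cases := []struct {
		name     string
		stage    string
		spokeBMH bool
		csrs     int
		wantSkip bool
	}{
		{"mid-install, no spoke resources", "Installing", false, 0, true},
		{"installation completed", "Done", false, 0, false},
		{"CSRs approved during install", "Installing", false, 1, false},
		{"spoke BMH created during install", "Installing", true, 0, false},
	}
	for _, c := range cases {
		got := cleanupSkipped(c.stage, c.spokeBMH, c.csrs)
		fmt.Printf("%-34s skip=%v (want %v)\n", c.name, got, c.wantSkip)
	}
}
```

Only the first row skips cleanup; every other row has at least one signal that spoke resources exist, so the finalizer proceeds with removal.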
Force-pushed 2fa8d45 to 274af1a.
The updateStatus function could exit early without patching the agent's status, since the patch happened at the end. This change ensures the status is always patched.
Force-pushed 274af1a to 7183461.
/retest-required
/override ci/prow/okd-scos-images
@gamli75: Overrode contexts on behalf of gamli75: ci/prow/okd-scos-images.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@CrystalChun: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
List all the issues related to this PR
What environments does this code impact?
How was this code tested?
Manual Testing
Recreate customer scenario
Additional functionality testing
Regression testing
Checklist
docs, README, etc.)
Reviewers Checklist