
Conversation

Contributor

@CrystalChun CrystalChun commented Dec 22, 2025

Previously, when an Agent started installation but had not finished installing, and a user tried to delete it, the Agent would get stuck deleting because spoke resource deletion would fail.

This change gates the spoke cleanup process on the Agent's current status: the Agent must either have spoke resources that need to be removed or be installed.

If the Agent has no spoke resources and is not installed, the spoke cleanup process is skipped.
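The gate amounts to a single boolean predicate. A minimal sketch, assuming invented type and field names (`agentState`, `shouldCleanSpoke`) rather than the actual controller code:

```go
package main

import "fmt"

// agentState is an invented stand-in for the lookups the controller
// performs against the Agent, its BMH, the Cluster, and the InfraEnv.
type agentState struct {
	spokeResourcesExist  bool // Done stage, spoke-annotated BMH, or approved CSRs
	skipCleanupAnnotated bool // user opted out of spoke cleanup
	boundToCluster       bool
	clusterExists        bool
	infraEnvExists       bool
}

// shouldCleanSpoke mirrors the gating: spoke cleanup (and spoke client
// initialization) only happens when every condition holds.
func shouldCleanSpoke(a agentState) bool {
	return a.spokeResourcesExist &&
		!a.skipCleanupAnnotated &&
		a.boundToCluster &&
		a.clusterExists &&
		a.infraEnvExists
}

func main() {
	// Mid-install Agent with no spoke resources yet: cleanup is skipped,
	// so deletion no longer hangs on a failing spoke call.
	midInstall := agentState{boundToCluster: true, clusterExists: true, infraEnvExists: true}
	fmt.Println(shouldCleanSpoke(midInstall)) // false

	// Installed Agent with everything still present: cleanup runs.
	installed := agentState{spokeResourcesExist: true, boundToCluster: true, clusterExists: true, infraEnvExists: true}
	fmt.Println(shouldCleanSpoke(installed)) // true
}
```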

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Manual Testing

Recreate customer scenario

  1. Infraenv w/ 1 BMH and 1 agent
  2. Start installing agent
  3. Delete BMH, Agent, and InfraEnv
  4. Should delete all resources successfully

Additional functionality testing

  • An Agent that is installing with a spoke-created BMH should have that BMH (and its Node) deleted from the spoke cluster when the Agent is deleted, and the Agent itself should then be deleted successfully
  • An Agent that is installing with approved CSRs should have its Node deleted from the spoke cluster, and the Agent itself should then be deleted successfully

Regression testing

  • Agent that has not been bound to a cluster and has not started installation should be deleted successfully
  • Agent that has installed (with InfraEnv and Cluster not deleted) should have its spoke resources removed and be deleted successfully
  • Agent that has installed and needs spoke resource removal, but assisted does not have spoke client access (fails to delete spoke resources) should not be deleted
  • Agent that has installed (with InfraEnv deleted, but Cluster not deleted) should not have its spoke resources removed and be deleted successfully
  • Agent that has installed (with InfraEnv not deleted, but Cluster deleted) should not have its spoke resources removed and be deleted successfully
  • Agent that has installed (with InfraEnv and Cluster not deleted) with skip spoke cleanup annotation should not have its spoke resources removed and be deleted successfully

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

@openshift-ci

openshift-ci bot commented Dec 22, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Dec 22, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 22, 2025
@openshift-ci-robot

openshift-ci-robot commented Dec 22, 2025

@CrystalChun: This pull request references MGMT-22278 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.


In response to this:

Previously, when an Agent starts installation and hasn't completed installing, and a user comes in and tries to delete it, the Agent will be stuck deleting because spoke resource deletion will fail.

This gates the spoke cleanup process with the Agent's current status. The Agent must have spoke resources that need to be removed or is installed.

If the Agent does not have any spoke resources or is not installed, then the spoke cleanup process will be skipped.

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Dec 22, 2025

Walkthrough

Adds getBMH and spokeResourcesExist helpers; refactors status handling to use a local return value with deferred status patching; finalizer now conditions spoke cleanup on spoke/cluster/infra presence and initializes spoke client only when cleanup will run; tests extended for mid-install, CSR, and BMH cleanup paths.

Changes

Cohort / File(s) Summary
Agent controller (production)
internal/controller/controllers/agent_controller.go
Added getBMH(ctx, agent) (*bmh_v1alpha1.BareMetalHost, error) and spokeResourcesExist(ctx, agent) (bool, error); updated bmhExists to delegate to getBMH; spokeResourcesExist returns true for Done stage, spoke-annotated BMH, or approved CSRs. Refactored updateStatus to use a local ret with deferred status patching, integrated CSR approval handling, node retrieval, day-2 flows, event URL population, and uses presence checks (spokeResourcesExist, clusterExists, infraEnvExists) to gate spoke cleanup and spoke client initialization in finalization.
Agent controller tests
internal/controller/controllers/agent_controller_test.go
Added and adjusted tests for finalizer/cleanup behavior: not removing node when agent not installed; removal during mid-install when CSRs are approved; removal of BMH when annotated as spoke-created; set up HostStageDone in terminal cleanup flows and updated failure-path expectations.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes



📥 Commits

Reviewing files that changed from the base of the PR and between 274af1a and 7183461.

📒 Files selected for processing (2)
  • internal/controller/controllers/agent_controller.go
  • internal/controller/controllers/agent_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/controller/controllers/agent_controller_test.go
🔇 Additional comments (5)
internal/controller/controllers/agent_controller.go (5)

511-547: LGTM! The spoke cleanup gating logic is well-structured.

The conditional checks correctly gate spoke resource cleanup on:

  1. spokeResourcesExist being true
  2. Skip annotation not present
  3. Agent bound to a cluster
  4. Both cluster and infraEnv still existing

Moving the spoke client initialization inside the conditional block is a good optimization.


665-687: Clean refactor of BMH lookup logic.

The separation into getBMH returning the object and bmhExists as a convenience wrapper is well-designed. Returning (nil, nil) for missing label or not-found BMH correctly distinguishes "no BMH" from actual errors.
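That convention can be sketched without the Kubernetes client; `bmh`, `store`, and the label-keyed lookup below are hypothetical stand-ins for the real BareMetalHost API call:

```go
package main

import (
	"errors"
	"fmt"
)

// bmh and store are hypothetical stand-ins for BareMetalHost objects
// and the cluster they live in.
type bmh struct{ name string }

var store = map[string]*bmh{"worker-0": {name: "worker-0"}}

var errAPIDown = errors.New("api unavailable")

// getBMH returns (nil, nil) when the agent has no BMH label or the BMH
// is gone, so "absent" is distinguishable from a real lookup failure.
func getBMH(label string, apiDown bool) (*bmh, error) {
	if label == "" {
		return nil, nil // no BMH associated with this agent: not an error
	}
	if apiDown {
		return nil, errAPIDown // actual failure: propagate
	}
	return store[label], nil // missing key yields (nil, nil): absent, not an error
}

// bmhExists is the thin convenience wrapper over getBMH.
func bmhExists(label string, apiDown bool) (bool, error) {
	b, err := getBMH(label, apiDown)
	return b != nil, err
}

func main() {
	ok, err := bmhExists("worker-0", false)
	fmt.Println(ok, err) // true <nil>
	ok, err = bmhExists("", false)
	fmt.Println(ok, err) // false <nil>
}
```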


929-949: Sound logic for detecting spoke resource presence.

The three-pronged check (Done stage, BMH spoke annotation, approved CSRs) correctly identifies scenarios where spoke resources may exist. The len() check on ApprovedCSRs safely handles nil slices.
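A sketch of that presence check, with invented field names standing in for the Agent's install stage, the BMH annotation, and the approved-CSR list:

```go
package main

import "fmt"

// agentView is an invented flattening of what the controller actually
// derives from the Agent status, its BMH, and the CSR API.
type agentView struct {
	stage             string   // install progress stage, e.g. "Done"
	bmhSpokeAnnotated bool     // BMH carries the spoke-created annotation
	approvedCSRs      []string // len() is safe on a nil slice
}

// spokeResourcesExist: any one prong means spoke resources may exist
// and cleanup must be attempted.
func spokeResourcesExist(a agentView) bool {
	return a.stage == "Done" || a.bmhSpokeAnnotated || len(a.approvedCSRs) > 0
}

func main() {
	fmt.Println(spokeResourcesExist(agentView{}))                                  // false
	fmt.Println(spokeResourcesExist(agentView{approvedCSRs: []string{"csr-abc"}})) // true
}
```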


964-970: Defer pattern correctly propagates patch errors.

The defer properly patches status on function exit and assigns any patch error to the named return err, which will be returned to the caller. This addresses the previous review feedback.
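The pattern can be reproduced in miniature; `patchStatus` here is a stand-in for the real status patch call:

```go
package main

import (
	"errors"
	"fmt"
)

// patchStatus stands in for the real status patch; failures must reach
// the caller even when the body of updateStatus succeeded.
func patchStatus(fail bool) error {
	if fail {
		return errors.New("status patch failed")
	}
	return nil
}

// updateStatus uses a named return so the deferred patch can overwrite
// a nil err; a body error still takes precedence.
func updateStatus(patchFails bool) (err error) {
	defer func() {
		if patchErr := patchStatus(patchFails); patchErr != nil && err == nil {
			err = patchErr
		}
	}()
	// ... reconcile body that may set err ...
	return nil
}

func main() {
	fmt.Println(updateStatus(false)) // <nil>
	fmt.Println(updateStatus(true))  // status patch failed
}
```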


1006-1034: Day-2 flow error handling is appropriately nuanced.

The differentiation between hard errors (propagated via err) and soft errors (returned with nil to trigger requeue without failing reconciliation) is well thought out. CSR approval check failures at lines 1006-1007 correctly allow reconciliation to proceed with a requeue rather than failing.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 22, 2025
@openshift-ci

openshift-ci bot commented Dec 22, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CrystalChun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 22, 2025
Previously, when an Agent starts installation and hasn't completed
installing, and a user comes in and tries to delete it, the Agent
will be stuck deleting because spoke resource deletion will fail.

This gates the spoke cleanup process with the Agent's current
status. The Agent must have spoke resources that need to be removed
or is installed.

If the Agent does not have any spoke resources or is not installed,
then the spoke cleanup process will be skipped.
@CrystalChun
Contributor Author

/test ?

@openshift-ci

openshift-ci bot commented Dec 22, 2025

@CrystalChun: The following commands are available to trigger required jobs:

/test e2e-agent-compact-ipv4
/test edge-assisted-operator-catalog-publish-verify
/test edge-ci-index
/test edge-e2e-ai-operator-disconnected-capi
/test edge-e2e-ai-operator-ztp
/test edge-e2e-ai-operator-ztp-3masters
/test edge-e2e-ai-operator-ztp-capi
/test edge-e2e-ai-operator-ztp-disconnected
/test edge-e2e-metal-assisted-4-16
/test edge-e2e-metal-assisted-4-17
/test edge-e2e-metal-assisted-4-20
/test edge-e2e-metal-assisted-5-control-planes-4-20
/test edge-e2e-metal-assisted-external-4-20
/test edge-e2e-metal-assisted-lvm-4-20
/test edge-e2e-metal-assisted-none-4-20
/test edge-e2e-metal-assisted-openshift-ai-4-18
/test edge-e2e-metal-assisted-osc-4-20
/test edge-e2e-metal-assisted-virtualization-4-19
/test edge-e2e-metal-assisted-vlan-4-20
/test edge-e2e-vsphere-assisted-4-20
/test edge-images
/test edge-lint
/test edge-operator-publish-verify
/test edge-subsystem-aws
/test edge-subsystem-kubeapi-aws
/test edge-unit-test
/test edge-verify-generated-code
/test images
/test mce-images
/test okd-scos-images
/test verify-deps

The following commands are available to trigger optional jobs:

/test e2e-agent-4control-ipv4
/test e2e-agent-5control-ipv4
/test e2e-agent-ha-dualstack
/test e2e-agent-sno-ipv6
/test edge-e2e-ai-operator-ztp-4masters
/test edge-e2e-ai-operator-ztp-5masters
/test edge-e2e-ai-operator-ztp-compact-day2-masters
/test edge-e2e-ai-operator-ztp-compact-day2-workers
/test edge-e2e-ai-operator-ztp-multiarch-3masters-ocp
/test edge-e2e-ai-operator-ztp-multiarch-sno-ocp
/test edge-e2e-ai-operator-ztp-node-labels
/test edge-e2e-ai-operator-ztp-remove-node
/test edge-e2e-ai-operator-ztp-sno-day2-masters
/test edge-e2e-ai-operator-ztp-sno-day2-workers
/test edge-e2e-ai-operator-ztp-sno-day2-workers-ignitionoverride
/test edge-e2e-ai-operator-ztp-sno-day2-workers-late-binding
/test edge-e2e-metal-assisted-4-control-planes-4-20
/test edge-e2e-metal-assisted-4-masters-none-4-20
/test edge-e2e-metal-assisted-bond-4-20
/test edge-e2e-metal-assisted-day2-4-20
/test edge-e2e-metal-assisted-day2-arm-workers-4-20
/test edge-e2e-metal-assisted-day2-sno-4-20
/test edge-e2e-metal-assisted-dual-primary-v6-compact-4-20
/test edge-e2e-metal-assisted-dual-stack-primary-ipv6-4-20
/test edge-e2e-metal-assisted-ha-kube-api-ipv4-4-20
/test edge-e2e-metal-assisted-ha-kube-api-ipv6-4-20
/test edge-e2e-metal-assisted-ipv4v6-4-20
/test edge-e2e-metal-assisted-ipv6-4-20
/test edge-e2e-metal-assisted-kube-api-late-binding-sno-4-20
/test edge-e2e-metal-assisted-kube-api-late-unbinding-sno-4-20
/test edge-e2e-metal-assisted-kube-api-umlb-4-20
/test edge-e2e-metal-assisted-onprem-4-20
/test edge-e2e-metal-assisted-osc-sno-4-20
/test edge-e2e-metal-assisted-sno-4-20
/test edge-e2e-metal-assisted-static-ip-suite-4-20
/test edge-e2e-metal-assisted-tang-4-20
/test edge-e2e-metal-assisted-tpmv2-4-20
/test edge-e2e-metal-assisted-umlb-4-20
/test edge-e2e-metal-assisted-upgrade-agent-4-20
/test edge-e2e-oci-assisted-4-20
/test edge-e2e-oci-assisted-bm-iscsi-4-20
/test edge-e2e-vsphere-assisted-umlb-4-20
/test edge-e2e-vsphere-assisted-umn-4-20
/test okd-scos-e2e-aws-ovn
/test push-pr-image

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-assisted-service-master-e2e-agent-compact-ipv4
pull-ci-openshift-assisted-service-master-edge-ci-index
pull-ci-openshift-assisted-service-master-edge-e2e-ai-operator-disconnected-capi
pull-ci-openshift-assisted-service-master-edge-e2e-ai-operator-ztp
pull-ci-openshift-assisted-service-master-edge-e2e-ai-operator-ztp-capi
pull-ci-openshift-assisted-service-master-edge-e2e-metal-assisted-4-20
pull-ci-openshift-assisted-service-master-edge-images
pull-ci-openshift-assisted-service-master-edge-lint
pull-ci-openshift-assisted-service-master-edge-subsystem-aws
pull-ci-openshift-assisted-service-master-edge-subsystem-kubeapi-aws
pull-ci-openshift-assisted-service-master-edge-unit-test
pull-ci-openshift-assisted-service-master-edge-verify-generated-code
pull-ci-openshift-assisted-service-master-images
pull-ci-openshift-assisted-service-master-mce-images
pull-ci-openshift-assisted-service-master-okd-scos-images
pull-ci-openshift-assisted-service-master-verify-deps

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@CrystalChun
Contributor Author

/test edge-subsystem-kubeapi-aws

@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 22, 2025

@CrystalChun: This pull request references MGMT-22278 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Previously, when an Agent starts installation and hasn't completed installing, and a user comes in and tries to delete it, the Agent will be stuck deleting because spoke resource deletion will fail.

This gates the spoke cleanup process with the Agent's current status. The Agent must have spoke resources that need to be removed or is installed.

If the Agent does not have any spoke resources or is not installed, then the spoke cleanup process will be skipped.

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Manual Testing

Recreate customer scenario

  1. Infraenv w/ 1 BMH and 1 agent
  2. Start installing agent
  3. Delete BMH, Agent, and InfraEnv
  4. Should delete all resources successfully

Additional functionality testing

  • Agent that is installing with a BMH created should have its BMH (and Node) deleted from the spoke cluster when it's deleted and should be deleted successfully afterwards
  • Agent that is installing with CSRs approved should have its Node deleted from the spoke cluster and should be deleted successfully afterwards

Regression testing

  • Agent that has not been bound to a cluster and has not started installation should be deleted successfully
  • Agent that has installed should have its spoke resources removed and be deleted successfully
  • Agent that has installed and needs spoke resource removal, but assisted does not have spoke client access (fails to delete spoke resources) should not be deleted

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 22, 2025

@CrystalChun: This pull request references MGMT-22278 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Previously, when an Agent starts installation and hasn't completed installing, and a user comes in and tries to delete it, the Agent will be stuck deleting because spoke resource deletion will fail.

This gates the spoke cleanup process with the Agent's current status. The Agent must have spoke resources that need to be removed or is installed.

If the Agent does not have any spoke resources or is not installed, then the spoke cleanup process will be skipped.

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Manual Testing

Recreate customer scenario

  1. Infraenv w/ 1 BMH and 1 agent
  2. Start installing agent
  3. Delete BMH, Agent, and InfraEnv
  4. Should delete all resources successfully

Additional functionality testing

  • Agent that is installing with a BMH created should have its BMH (and Node) deleted from the spoke cluster when it's deleted and should be deleted successfully afterwards
  • Agent that is installing with CSRs approved should have its Node deleted from the spoke cluster and should be deleted successfully afterwards

Regression testing

  • Agent that has not been bound to a cluster and has not started installation should be deleted successfully
  • Agent that has installed (with InfraEnv and Cluster not deleted) should have its spoke resources removed and be deleted successfully
  • Agent that has installed and needs spoke resource removal, but assisted does not have spoke client access (fails to delete spoke resources) should not be deleted
  • Agent that has installed (with InfraEnv deleted, but Cluster not deleted) should not have its spoke resources removed and be deleted successfully
  • Agent that has installed (with InfraEnv not deleted, but Cluster deleted) should not have its spoke resources removed and be deleted successfully
  • Agent that has installed (with InfraEnv and Cluster not deleted) with skip spoke cleanup annotation should not have its spoke resources removed and be deleted successfully

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 22, 2025

@CrystalChun: This pull request references MGMT-22278 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Previously, when an Agent starts installation and hasn't completed installing, and a user comes in and tries to delete it, the Agent will be stuck deleting because spoke resource deletion will fail.

This gates the spoke cleanup process with the Agent's current status. The Agent must have spoke resources that need to be removed or is installed.

If the Agent does not have any spoke resources or is not installed, then the spoke cleanup process will be skipped.

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Manual Testing

Recreate customer scenario

  1. Infraenv w/ 1 BMH and 1 agent
  2. Start installing agent
  3. Delete BMH, Agent, and InfraEnv
  4. Should delete all resources successfully

Additional functionality testing

  • Agent that is installing with a BMH created should have its BMH (and Node) deleted from the spoke cluster when it's deleted and should be deleted successfully afterwards
  • Agent that is installing with CSRs approved should have its Node deleted from the spoke cluster and should be deleted successfully afterwards

Regression testing

  • Agent that has not been bound to a cluster and has not started installation should be deleted successfully
  • Agent that has installed (with InfraEnv and Cluster not deleted) should have its spoke resources removed and be deleted successfully
  • Agent that has installed and needs spoke resource removal, but assisted does not have spoke client access (fails to delete spoke resources) should not be deleted
  • Agent that has installed (with InfraEnv deleted, but Cluster not deleted) should not have its spoke resources removed and be deleted successfully
  • Agent that has installed (with InfraEnv not deleted, but Cluster deleted) should not have its spoke resources removed and be deleted successfully
  • Agent that has installed (with InfraEnv and Cluster not deleted) with skip spoke cleanup annotation should not have its spoke resources removed and be deleted successfully

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 23, 2025

@CrystalChun: This pull request references MGMT-22278 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Previously, when an Agent starts installation and hasn't completed installing, and a user comes in and tries to delete it, the Agent will be stuck deleting because spoke resource deletion will fail.

This gates the spoke cleanup process with the Agent's current status. The Agent must have spoke resources that need to be removed or is installed.

If the Agent does not have any spoke resources or is not installed, then the spoke cleanup process will be skipped.

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Manual Testing

Recreate customer scenario

  1. Infraenv w/ 1 BMH and 1 agent
  2. Start installing agent
  3. Delete BMH, Agent, and InfraEnv
  4. Should delete all resources successfully

Additional functionality testing

  • Agent that is installing with a BMH created should have its BMH (and Node) deleted from the spoke cluster when it's deleted and should be deleted successfully afterwards
  • Agent that is installing with CSRs approved should have its Node deleted from the spoke cluster and should be deleted successfully afterwards

Regression testing

  • Agent that has not been bound to a cluster and has not started installation should be deleted successfully
  • Agent that has installed (with InfraEnv and Cluster not deleted) should have its spoke resources removed and be deleted successfully
  • Agent that has installed and needs spoke resource removal, but assisted does not have spoke client access (fails to delete spoke resources) should not be deleted
  • Agent that has installed (with InfraEnv deleted, but Cluster not deleted) should not have its spoke resources removed and be deleted successfully
  • Agent that has installed (with InfraEnv not deleted, but Cluster deleted) should not have its spoke resources removed and be deleted successfully
  • Agent that has installed (with InfraEnv and Cluster not deleted) with skip spoke cleanup annotation should not have its spoke resources removed and be deleted successfully

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 23, 2025

@CrystalChun: This pull request references MGMT-22278 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Previously, when an Agent starts installation and hasn't completed installing, and a user comes in and tries to delete it, the Agent will be stuck deleting because spoke resource deletion will fail.

This gates the spoke cleanup process with the Agent's current status. The Agent must have spoke resources that need to be removed or is installed.

If the Agent does not have any spoke resources or is not installed, then the spoke cleanup process will be skipped.

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Manual Testing

Recreate customer scenario

  1. Infraenv w/ 1 BMH and 1 agent
  2. Start installing agent
  3. Delete BMH, Agent, and InfraEnv
  4. Should delete all resources successfully

Additional functionality testing

  • Agent that is installing with a BMH created should have its BMH (and Node) deleted from the spoke cluster when it's deleted and should be deleted successfully afterwards
  • Agent that is installing with CSRs approved should have its Node deleted from the spoke cluster and should be deleted successfully afterwards

Regression testing

  • Agent that has not been bound to a cluster and has not started installation should be deleted successfully
  • Agent that has installed (with InfraEnv and Cluster not deleted) should have its spoke resources removed and be deleted successfully
  • Agent that has installed and needs spoke resource removal, but assisted does not have spoke client access (fails to delete spoke resources) should not be deleted
  • Agent that has installed (with InfraEnv deleted, but Cluster not deleted) should not have its spoke resources removed and be deleted successfully
  • Agent that has installed (with InfraEnv not deleted, but Cluster deleted) should not have its spoke resources removed and be deleted successfully
  • Agent that has installed (with InfraEnv and Cluster not deleted) with skip spoke cleanup annotation should not have its spoke resources removed and be deleted successfully

Checklist

  • Title and description added to both commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests? (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

openshift-ci-robot commented Dec 23, 2025

@CrystalChun: This pull request references MGMT-22278 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.

@CrystalChun CrystalChun marked this pull request as ready for review December 23, 2025 19:12
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 23, 2025
@openshift-ci openshift-ci bot requested review from gamli75 and romfreiman December 23, 2025 19:12
@codecov

codecov bot commented Dec 23, 2025

Codecov Report

❌ Patch coverage is 53.19149% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.50%. Comparing base (c6921fa) to head (7183461).
⚠️ Report is 4 commits behind head on master.

Files with missing lines | Patch % | Lines
...nternal/controller/controllers/agent_controller.go | 53.19% | 17 Missing and 5 partials ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8608      +/-   ##
==========================================
+ Coverage   43.49%   43.50%   +0.01%     
==========================================
  Files         411      411              
  Lines       71244    71271      +27     
==========================================
+ Hits        30987    31010      +23     
- Misses      37497    37501       +4     
  Partials     2760     2760              
Files with missing lines | Coverage Δ
...nternal/controller/controllers/agent_controller.go | 76.59% <53.19%> (-0.26%) ⬇️

... and 2 files with indirect coverage changes


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
internal/controller/controllers/agent_controller.go (1)

929-949: Error handling already mitigates transient failures appropriately.

The concern about getBMH errors blocking deletion is valid. However, the code already provides safeguards: client.IgnoreNotFound() filters out NotFound errors (lines 675-676), errors are logged with context (line 513), and the requeue mechanism enables automatic retry. This conservative error-handling pattern is consistent across all pre-deletion checks and intentionally prioritizes preventing incomplete cleanup over avoiding temporary deletion delays. If persistent transient errors become a production issue, metrics on error frequency would be a useful optional enhancement, but the current logging provides adequate visibility.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 1554411 and baac52b.

📒 Files selected for processing (1)
  • internal/controller/controllers/agent_controller.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • internal/controller/controllers/agent_controller.go
🧬 Code graph analysis (1)
internal/controller/controllers/agent_controller.go (3)
api/v1beta1/agent_types.go (1)
  • CSRStatus (324-330)
internal/controller/controllers/bmh_agent_controller.go (2)
  • AGENT_BMH_LABEL (81-81)
  • BMH_SPOKE_CREATED_ANNOTATION (99-99)
models/host_stage.go (1)
  • HostStageDone (67-67)
🔇 Additional comments (2)
internal/controller/controllers/agent_controller.go (2)

665-687: LGTM! Clean helper extraction.

The getBMH helper properly handles the case where no BMH label exists and uses client.IgnoreNotFound to treat "not found" as a non-error. Refactoring bmhExists to delegate to getBMH eliminates duplication.


511-517: LGTM! Cleanup gating aligns with PR objectives.

The spoke cleanup is now correctly gated on three conditions:

  1. No skip annotation
  2. Agent bound to a cluster
  3. Spoke resources actually exist (per spokeResourcesExist)

This prevents the deletion flow from attempting cleanup when the agent hasn't progressed far enough to create spoke resources, addressing the stuck-deletion scenario described in the PR.

Copy link

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
internal/controller/controllers/agent_controller.go (1)

963-974: Deferred patch error won't propagate to caller.

The err = patchErr assignment inside the defer won't affect the returned error because Go evaluates return values before executing deferred functions, and the return values here are unnamed. The patch error is logged but silently discarded.

If patch errors should propagate, use named return values:

🔎 Use named return values to propagate patch errors
-func (r *AgentReconciler) updateStatus(ctx context.Context, log logrus.FieldLogger, agent, origAgent *aiv1beta1.Agent, h *models.Host, clusterId *strfmt.UUID, syncErr error, internal bool) (ctrl.Result, error) {
+func (r *AgentReconciler) updateStatus(ctx context.Context, log logrus.FieldLogger, agent, origAgent *aiv1beta1.Agent, h *models.Host, clusterId *strfmt.UUID, syncErr error, internal bool) (ret ctrl.Result, err error) {

 	var (
-		err                   error
 		shouldAutoApproveCSRs bool
 		spokeClient           spoke_k8s_client.SpokeK8sClient
 		node                  *corev1.Node
-		ret                   ctrl.Result = ctrl.Result{}
 	)

Alternatively, if patch errors should only be logged without affecting reconciliation, remove the err = patchErr assignment to avoid confusion about intent.
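The underlying Go semantics are easy to demonstrate in isolation: with unnamed results, the return values are copied before deferred functions run, so a deferred assignment to a plain local variable is invisible to the caller; a named result, by contrast, is writable from the defer. A stdlib-only sketch (function names are illustrative, not from the PR):

```go
package main

import "fmt"

// unnamedReturn shows that a deferred assignment to a local err does
// NOT change what the caller receives: the nil return value is copied
// at the return statement, before the defer executes.
func unnamedReturn() error {
	var err error
	defer func() {
		err = fmt.Errorf("patch failed") // discarded by the caller
	}()
	return err // nil is captured here
}

// namedReturn shows the fix: with a named result parameter, the
// deferred assignment writes directly into the returned slot.
func namedReturn() (err error) {
	defer func() {
		err = fmt.Errorf("patch failed") // propagates to the caller
	}()
	return nil
}

func main() {
	fmt.Println(unnamedReturn()) // <nil>
	fmt.Println(namedReturn())   // patch failed
}
```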

🧹 Nitpick comments (1)
internal/controller/controllers/agent_controller.go (1)

1057-1059: Setting err = nil before return is redundant.

At line 1058, err = nil is set explicitly, but then line 1059 returns ret, nil directly anyway. The assignment has no effect.

🔎 Simplify by removing redundant assignment
 		if err != nil {
 			ret = ctrl.Result{Requeue: true}
-			err = nil
 			return ret, nil
 		}
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between baac52b and 2ba0d2b.

📒 Files selected for processing (1)
  • internal/controller/controllers/agent_controller.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • internal/controller/controllers/agent_controller.go
🧬 Code graph analysis (1)
internal/controller/controllers/agent_controller.go (3)
api/v1beta1/agent_types.go (2)
  • Agent (342-348)
  • CSRStatus (324-330)
internal/controller/controllers/bmh_agent_controller.go (2)
  • AGENT_BMH_LABEL (81-81)
  • BMH_SPOKE_CREATED_ANNOTATION (99-99)
models/host_stage.go (1)
  • HostStageDone (67-67)
🔇 Additional comments (3)
internal/controller/controllers/agent_controller.go (3)

511-517: LGTM! Proper gating of spoke resource cleanup.

The new spokeResourcesExist check correctly gates the cleanup logic, ensuring spoke resources are only removed when they actually exist. This addresses the PR objective of preventing stuck agent deletions when installation hasn't completed.


665-687: Clean helper extraction.

Good refactor extracting getBMH as a reusable helper that bmhExists and spokeResourcesExist can both leverage. The error handling with client.IgnoreNotFound is appropriate.


929-949: Logic correctly identifies when spoke resources exist.

The three conditions comprehensively cover scenarios where spoke resources would exist:

  1. Agent completed installation (Done stage)
  2. BMH with spoke-created annotation indicates spoke resources were provisioned
  3. Approved CSRs indicate the node is joining/has joined the spoke cluster

This aligns well with the PR objective to skip cleanup for agents that haven't progressed far enough in installation.
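The three-condition check described above can be sketched in a few lines. This is a simplified, self-contained approximation, not the actual controller code: the `Agent`/`BMH` types, field names, and the annotation key are assumptions standing in for the real CRD types and the `BMH_SPOKE_CREATED_ANNOTATION` constant.

```go
package main

import "fmt"

// BMH is a hypothetical stand-in for a linked BareMetalHost.
type BMH struct {
	Annotations map[string]string
}

// Agent is a hypothetical, flattened stand-in for the Agent CR.
type Agent struct {
	Stage        string   // current host stage, e.g. "Done"
	ApprovedCSRs []string // CSRs already approved for this host
	BMH          *BMH     // nil when no BareMetalHost is linked
}

const (
	hostStageDone             = "Done"
	bmhSpokeCreatedAnnotation = "bmac.agent-install.openshift.io/spoke-created" // assumed key
)

// spokeResourcesExist mirrors the gating logic the review describes:
// spoke cleanup is needed only if installation finished, the BMH was
// created on the spoke cluster, or CSRs were already approved (the
// node has started joining). Note an empty stage string correctly
// fails the HostStageDone comparison.
func spokeResourcesExist(a *Agent) bool {
	if a.Stage == hostStageDone {
		return true
	}
	if a.BMH != nil && a.BMH.Annotations[bmhSpokeCreatedAnnotation] == "true" {
		return true
	}
	return len(a.ApprovedCSRs) > 0
}

func main() {
	installing := &Agent{Stage: "Installing"}
	fmt.Println(spokeResourcesExist(installing)) // false: skip cleanup, delete proceeds

	joined := &Agent{Stage: "Installing", ApprovedCSRs: []string{"csr-1"}}
	fmt.Println(spokeResourcesExist(joined)) // true: a Node may exist on the spoke
}
```

When this predicate returns false, the finalizer skips spoke cleanup entirely, which is exactly what unblocks deletion of a mid-installation agent.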


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
internal/controller/controllers/agent_controller_test.go (1)

5024-5042: Consider testing both CSR types for comprehensive coverage.

The test correctly verifies that node removal proceeds during installation when CSRs are approved (indicating spoke resources exist). However, it only tests with a serving CSR (CSRTypeServing). For more comprehensive coverage, consider adding a test that verifies behavior with both client and serving CSRs, since both types are tracked in agent.Status.CSRStatus.ApprovedCSRs.

🔎 Optional: Test with both CSR types

Consider adding a similar test case that includes both CSR types:

agent.Status.CSRStatus.ApprovedCSRs = []v1beta1.CSRInfo{
    {
        Name:       "csr-host-client",
        Type:       v1beta1.CSRTypeClient,
        ApprovedAt: metav1.Now(),
    },
    {
        Name:       "csr-host-serving",
        Type:       v1beta1.CSRTypeServing,
        ApprovedAt: metav1.Now(),
    },
}

This would verify the cleanup logic handles both CSR types correctly, matching the real-world scenario where both are typically approved during installation.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 2ba0d2b and fc1268f.

📒 Files selected for processing (2)
  • internal/controller/controllers/agent_controller.go
  • internal/controller/controllers/agent_controller_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • internal/controller/controllers/agent_controller.go
  • internal/controller/controllers/agent_controller_test.go
🧬 Code graph analysis (2)
internal/controller/controllers/agent_controller.go (3)
api/v1beta1/agent_types.go (2)
  • Agent (342-348)
  • CSRStatus (324-330)
internal/controller/controllers/bmh_agent_controller.go (2)
  • AGENT_BMH_LABEL (81-81)
  • BMH_SPOKE_CREATED_ANNOTATION (99-99)
models/host_stage.go (1)
  • HostStageDone (67-67)
internal/controller/controllers/agent_controller_test.go (4)
models/host.go (1)
  • HostStatusInstalling (635-635)
models/host_stage.go (3)
  • HostStageRebooting (55-55)
  • HostStageDone (67-67)
  • HostStageInstalling (46-46)
api/v1beta1/agent_types.go (3)
  • CSRStatus (324-330)
  • CSRInfo (317-321)
  • CSRTypeServing (313-313)
internal/controller/controllers/bmh_agent_controller.go (2)
  • AGENT_BMH_LABEL (81-81)
  • BMH_SPOKE_CREATED_ANNOTATION (99-99)
🔇 Additional comments (10)
internal/controller/controllers/agent_controller.go (5)

665-679: LGTM! Clean helper extraction.

The getBMH helper properly handles the lookup logic with correct error handling: returns nil when no BMH label exists or BMH is not found (both expected scenarios), and propagates other errors appropriately. This extraction improves code readability and reusability.


929-949: Excellent design decision to gate cleanup on resource existence.

This helper correctly determines spoke resource presence by checking three indicators:

  1. Agent installation completed (HostStageDone)
  2. BMH has spoke creation annotation
  3. CSRs have been approved (node joining flow initiated)

This directly addresses the PR objective: preventing stuck Agent deletions when installation started but hasn't progressed far enough to create spoke resources. The logic properly handles uninitialized states (empty stage string ≠ HostStageDone).


511-517: LGTM! Core fix correctly gates spoke cleanup.

The finalizer now properly checks spokeResourcesExist before attempting cleanup, preventing the stuck deletion scenario when an agent started installation but hasn't created spoke resources yet. The condition correctly requires all three: no skip annotation, bound to cluster, and spoke resources exist.

Note: The spoke client initialization on lines 535-543 is wisely deferred until cleanup is confirmed necessary, avoiding unnecessary spoke cluster connections.


961-974: LGTM! Defer refactoring addresses previous review concern.

The updateStatus refactoring consolidates return paths through a single ret variable and ensures status patching happens via defer. The past review concern about defer return signatures has been properly addressed—the defer now assigns to the err variable in the enclosing scope, ensuring patch errors propagate to the caller.

Minor note: If both the function body and the patch operation fail, the patch error will override the original error on line 968. However, both errors are logged, so this trade-off is acceptable for cleaner status patching logic.


681-687: LGTM! Clean refactoring.

The bmhExists method now correctly delegates to the getBMH helper, improving maintainability by centralizing the BMH lookup logic while preserving error propagation.

internal/controller/controllers/agent_controller_test.go (5)

3442-3443: LGTM: Test expectations correctly updated.

The updated expectations now verify that the agent maintains its status (HostStatusInstalling) and stage (HostStageRebooting) even when encountering node retrieval errors, rather than clearing these fields. This aligns with the test setup and reflects proper error handling in the production code.


4786-4786: LGTM: Proper test setup for terminal state.

Setting the agent's current stage to Done in the BeforeEach establishes the baseline that the agent has completed installation before finalization. This is appropriate for most finalizer test scenarios and allows individual tests to override when testing mid-installation cleanup.


4921-4952: LGTM: Test correctly verifies cleanup gating.

This test properly validates that spoke resource cleanup is skipped when an agent is in the middle of installation (HostStageInstalling) without any spoke resources created (no approved CSRs or BMH). The test correctly:

  1. Overrides the BeforeEach stage to simulate mid-installation
  2. Expects no spoke client initialization or node removal operations
  3. Verifies the finalizer completes without attempting spoke cleanup

This aligns with the PR objective to prevent stuck deletions when agents haven't completed installation.


5005-5005: LGTM: Consistent baseline setup.

Setting the agent stage to Done in the nested BeforeEach establishes the baseline for tests using the fake spoke client. This is appropriate since these tests typically verify spoke resource cleanup for completed installations, with individual tests able to override when testing edge cases.


5044-5071: LGTM: Test correctly verifies BMH cleanup logic.

This test properly validates that BMH removal proceeds during installation when the BMH has the spoke-created annotation (BMH_SPOKE_CREATED_ANNOTATION: "true"). The test correctly:

  1. Sets up the agent with the BMH label linking to the BMH
  2. Mocks the BMH with the annotation indicating it was created on the spoke cluster
  3. Verifies the BMH is deleted during finalization
  4. Uses consistent verification patterns (IsNotFound check)

The presence of the annotation indicates the BMH was created on the spoke cluster and should be cleaned up, even during mid-installation deletion.

@CrystalChun CrystalChun force-pushed the agent-deletion branch 3 times, most recently from 2fa8d45 to 274af1a on December 24, 2025 at 00:47
The updateStatus function could exit early and never patch the
agent's status, since the patch call sat at the end of the function.
Moving the patch into a defer ensures it always runs.
@CrystalChun
Copy link
Contributor Author

/retest-required

@gamli75
Copy link
Contributor

gamli75 commented Dec 24, 2025

/override ci/prow/okd-scos-images

@openshift-ci
Copy link

openshift-ci bot commented Dec 24, 2025

@gamli75: Overrode contexts on behalf of gamli75: ci/prow/okd-scos-images


In response to this:

/override ci/prow/okd-scos-images

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci
Copy link

openshift-ci bot commented Dec 24, 2025

@CrystalChun: all tests passed!

Full PR test history. Your PR dashboard.



Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
