MGMT-20179: Fix IBIO status conditions #270

Merged

merged 2 commits from conditions_refactoring into openshift:main on May 28, 2025

Conversation

zszabo-rh
Contributor

@zszabo-rh zszabo-rh commented May 5, 2025

Refactoring Reconcile() in relation to MGMT-20179

Reconcile() is now separated into multiple phases, each with clear exit criteria and outcomes. Each phase sets the corresponding reason(s) on failure to explain why the requirements have not been met (a minimal sketch of the resulting flow follows the list):

  1. Config validation phase:
  • ConfigurationPending: either the user still needs to complete the ImageClusterInstall definition, or some of the referenced resources (CD or BMH) are not available yet. In both cases the reconcile ends and is triggered again once the problem is resolved.
  • ConfigurationFailed: set when AutomatedCleaningMode cannot be modified on the BMH. Reconcile stops in this case.
  2. Host validation phase:
  • HostValidationPending: if BMH provisioning or hardware inspection is not ready yet, the reconcile is requeued for 30s later.
  • HostValidationFailed: any error or invalid BMH configuration ends the reconcile here.
  3. Image creation phase:
  • ImageCreationPending: when the lock cannot be acquired, the reconcile is requeued to try again 5s later.
  • ImageCreationFailed: any other unexpected error stops the reconcile loop with this reason.
  4. Host configuration phase:
  • HostConfigurationPending: set in the following scenarios:
    • an earlier DataImage instance is still being deleted (requeue after 30s)
    • the current DataImage was created less than a second ago, so BMO might not have been notified yet (requeue after 1s)
    • the image-based-install-managed annotation is not set on the BMH yet (no requeue)
  • HostConfigurationFailed: any unexpected error during this phase leads to this reason and ends the reconcile.
  • HostConfigurationSucceeded: all the finishing steps went fine (like setting the boot time to now), and the RequirementsMet condition can finally be set to true.
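A minimal, self-contained sketch of the phase flow above. The helper names, return shapes, and requeue plumbing here are illustrative stand-ins for how the phases hand control back to Reconcile(), not the exact code in this PR:

package main

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Stand-in phase helpers; each reports whether reconciliation may continue and,
// when it should stop, how long to wait before requeueing (0 = no requeue).
func validateConfiguration(ctx context.Context) (ok bool, err error) { return true, nil }

func validateHost(ctx context.Context) (ok bool, requeue time.Duration, err error) {
	inspected := true // pretend BMH hardware inspection already finished
	if !inspected {
		return false, 30 * time.Second, nil // HostValidationPending
	}
	return true, 0, nil
}

func createImage(ctx context.Context) (ok bool, requeue time.Duration, err error) {
	return true, 0, nil
}

func configureHost(ctx context.Context) (ok bool, requeue time.Duration, err error) {
	return true, 0, nil
}

func reconcileSketch(ctx context.Context) (ctrl.Result, error) {
	// 1. Config validation: ConfigurationPending/ConfigurationFailed stop the
	//    reconcile; it runs again on the next watch event rather than on a timer.
	if ok, err := validateConfiguration(ctx); !ok || err != nil {
		return ctrl.Result{}, err
	}
	// 2. Host validation: HostValidationPending requeues after 30s.
	if ok, requeue, err := validateHost(ctx); !ok || err != nil {
		return ctrl.Result{RequeueAfter: requeue}, err
	}
	// 3. Image creation: ImageCreationPending requeues after 5s.
	if ok, requeue, err := createImage(ctx); !ok || err != nil {
		return ctrl.Result{RequeueAfter: requeue}, err
	}
	// 4. Host configuration: Pending either requeues (30s/1s) or waits for the
	//    BMH annotation; unexpected errors end with HostConfigurationFailed.
	if ok, requeue, err := configureHost(ctx); !ok || err != nil {
		return ctrl.Result{RequeueAfter: requeue}, err
	}
	// All phases passed: RequirementsMet can be set to true with reason
	// HostConfigurationSucceeded.
	return ctrl.Result{}, nil
}

func main() { _, _ = reconcileSketch(context.Background()) }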

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 5, 2025
@openshift-ci-robot

openshift-ci-robot commented May 5, 2025

@zszabo-rh: This pull request references MGMT-20179 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.20.0" version, but no target version was set.

In response to this:

Refactoring Reconcile() in relation with MGMT-20179

Reconcile sets the following reasons when requirements are not met due to some failure:

  • ConfigurationNotReadyReason: BMH or CD reference missing or the resource is not available
  • HostValidationPending: BMH provisioning is not ready yet
  • HostValidationFailedReason:
  • ImageNotReadyReason:
  • HostConfigurationPendingReason
  • HostConfigurationFailedReason

And finally when everything goes fine during reconcile and RequirementsMet condition is set:

  • HostConfigurationSucceededReason

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2025

openshift-ci bot commented May 5, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot

openshift-ci-robot commented May 6, 2025

@zszabo-rh: This pull request references MGMT-20179 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.20.0" version, but no target version was set.

In response to this:

Refactoring Reconcile() in relation with MGMT-20179

Reconcile is now separated in multiple phases, each with clear exit criteria and outcome, also setting the corresponding reason(s) when failed to represent why the requirements have not been met:

  1. Config validation phase:
  • ConfigurationNotReady: it's either the user needs to complete the ImageClusterInstall definition, or some of referenced resources (CD or BMH) are not available yet. In both cases the reconcile ends, and will be triggered again when the problem is resolved.
  2. Host validation phase:
  • HostValidationPending: if BMH provisioning or hardware inspection is not ready yet, reconcile is requeued for 30s later.
  • HostValidationFailed: in case of any errors or invalid BMH configuration the reconcile ends here.
  3. Image creation phase:
  • ImageCreationPending: when lock cannot be acquired, reconcile gets requeued for 5s later to try again.
  • ImageCreationFailed: any other unexpected error stops the reconcile loop with this reason.
  4. Host configuration phase:
  • HostConfigurationPending: sets this reason in following scenarios:
    • earlier DataImage instance is still being deleted for some reason (requeue after 30s)
    • current DataImage was just created less than a second ago so BMO might not be notified yet (requeue after 1s)
    • 'image-based-install-managed' annotation is not set yet in BMH (no requeue)
  • HostConfigurationFailed: any unexpected errors during this phase will lead to this reason and finish reconcile.
  5. Requirements met phase:
  • HostConfigurationSucceeded: if all the finishing steps go fine (like setting the boot time to "now"), then RequirementsMet condition is set to true with reason HostConfigurationSucceeded. In case of any errors it falls back to HostConfigurationFailed and stops reconcile.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

openshift-ci-robot commented May 6, 2025

@zszabo-rh: This pull request references MGMT-20179 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.20.0" version, but no target version was set.

In response to this:

Refactoring Reconcile() in relation with MGMT-20179

Reconcile is now separated in multiple phases, each with clear exit criteria and outcome, also setting the corresponding reason(s) when failed to represent why the requirements have not been met:

  1. Config validation phase:
  • ConfigurationPending: it's either the user needs to complete the ImageClusterInstall definition, or some of referenced resources (CD or BMH) are not available yet. In both cases the reconcile ends, and will be triggered again when the problem is resolved.
  • ConfigurationFailed: sets this reason when AutomatedCleaningMode cannot be modified in BMH. Reconcile is stopped in this case.
  2. Host validation phase:
  • HostValidationPending: if BMH provisioning or hardware inspection is not ready yet, reconcile is requeued for 30s later.
  • HostValidationFailed: in case of any errors or invalid BMH configuration the reconcile ends here.
  3. Image creation phase:
  • ImageCreationPending: when lock cannot be acquired, reconcile gets requeued for 5s later to try again.
  • ImageCreationFailed: any other unexpected error stops the reconcile loop with this reason.
  4. Host configuration phase:
  • HostConfigurationPending: sets this reason in following scenarios:
    • earlier DataImage instance is still being deleted for some reason (requeue after 30s)
    • current DataImage was just created less than a second ago so BMO might not be notified yet (requeue after 1s)
    • image-based-install-managed annotation is not set yet in BMH (no requeue)
  • HostConfigurationFailed: any unexpected errors during this phase will lead to this reason and finish reconcile.
  • HostConfigurationSucceeded: all the finishing steps went fine (like setting the boot time to now), and RequirementsMet condition can finally be true.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@zszabo-rh zszabo-rh force-pushed the conditions_refactoring branch from bf7040d to dc8e95a Compare May 6, 2025 07:28
func (r *ImageClusterInstallReconciler) setRequirementsMetCondition(ctx context.Context, ici *v1alpha1.ImageClusterInstall,
	status corev1.ConditionStatus, reason, msg string) error {
Contributor

👍

log.WithError(updateErr).Error("failed to create DataImage")
cond.Message = "failed to create BareMetalHost DataImage"
if !res.IsZero() {
cond.Reason = v1alpha1.HostConfigurationPendingReason
Contributor

Why pending?

Contributor Author

@zszabo-rh zszabo-rh May 8, 2025

Since it's the infamous "previous DataImage is still being deleted" scenario: in this case we're going to requeue with a 30s timer, so I assumed "pending" fits better.
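For context, a sketch of the branch being discussed, assuming the existing cond variable and the repo's HostConfigurationPendingReason constant (the message text is illustrative):

// The DataImage left over from a previous attempt is still terminating:
// report Pending rather than Failed and come back in 30 seconds.
if dataImage.GetDeletionTimestamp() != nil {
	cond.Reason = v1alpha1.HostConfigurationPendingReason
	cond.Message = "previous DataImage is still being deleted"
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}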

return ctrl.Result{}, err
}
if !annotationExists(&bmh.ObjectMeta, ibioManagedBMH) {
// TODO: consider replacing this condition with `dataDisk.Status.AttachedImage`
log.Infof("Nothing to do, waiting for BMH to get %s annotation", ibioManagedBMH)
cond.Reason = v1alpha1.HostConfigurationPendingReason
cond.Message = fmt.Sprintf("Waiting for BMH to get %s annotation", ibioManagedBMH)
Contributor

This is a bit misleading; this controller is adding the annotation in the function above, updateBMHProvisioningState.
The message should say why the BMH is not getting the annotation.
Looking at the code, it seems that it will not annotate the BMH if:

bmh.Status.Provisioning.State != bmh_v1alpha1.StateAvailable &&
bmh.Status.Provisioning.State != bmh_v1alpha1.StateExternallyProvisioned 

Contributor Author

Changed to:
cond.Message = fmt.Sprintf("Waiting for BMH provisioning state to be StateAvailable or StateExternallyProvisioned, current state is: %s", bmh.Status.Provisioning.State)
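Putting the pieces together, the waiting branch then reads roughly like this (a sketch built from the metal3 provisioning-state constants quoted above; the surrounding code may differ):

// The controller only annotates the BMH once it is available or externally
// provisioned, so the Pending message points at the current provisioning state.
if bmh.Status.Provisioning.State != bmh_v1alpha1.StateAvailable &&
	bmh.Status.Provisioning.State != bmh_v1alpha1.StateExternallyProvisioned {
	cond.Reason = v1alpha1.HostConfigurationPendingReason
	cond.Message = fmt.Sprintf("Waiting for BMH provisioning state to be StateAvailable or StateExternallyProvisioned, current state is: %s",
		bmh.Status.Provisioning.State)
	return ctrl.Result{}, nil
}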

@zszabo-rh zszabo-rh force-pushed the conditions_refactoring branch 2 times, most recently from dff224a to 6388d25 Compare May 8, 2025 08:47
@zszabo-rh
Contributor Author

/test all

Member

@carbonin carbonin left a comment

Let's keep messages formatted consistently. Things like whether or not the first word is capitalized should always be the same.


}

if err := r.setClusterInstallMetadata(ctx, log, ici, cd); err != nil {
Member

This is a bit odd to have in a function called "validateHost". I wouldn't expect these kinds of side effects from a validation. Maybe it doesn't belong here or maybe we can come up with a better name for the function.

Contributor Author

Yes, it's just a mid-phase step that doesn't really belong to either host validation or image creation.
I've moved it back to the caller, but do you think it's still OK to use HostValidationFailedReason for this? I could even introduce a dedicated phase and condition reason just for this single step (maybe also bundled together with labelReferencedObjectsForBackup); not sure which is better.

Member

I think it's fine either way

}

r.labelReferencedObjectsForBackup(ctx, log, ici, clusterDeployment)
r.labelReferencedObjectsForBackup(ctx, log, ici, cd)
Member

This isn't really part of the image creation; can this maybe just live in the caller, since it's never going to cause a condition failure?

@zszabo-rh zszabo-rh force-pushed the conditions_refactoring branch 2 times, most recently from 795a1ec to b40e051 Compare May 12, 2025 11:32
@zszabo-rh
Contributor Author

/test all

@zszabo-rh zszabo-rh requested review from eranco74 and carbonin May 13, 2025 05:47
@zszabo-rh zszabo-rh marked this pull request as ready for review May 13, 2025 07:38
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 13, 2025
@openshift-ci openshift-ci bot requested a review from tsorya May 13, 2025 07:39
@zszabo-rh
Contributor Author

/retest

cond *hivev1.ClusterInstallCondition,
log logrus.FieldLogger,
) (*hivev1.ClusterDeployment, *bmh_v1alpha1.BareMetalHost, error) {
cond.Reason = v1alpha1.ConfigurationPendingReason
Member

Given that this is acting as a kind of state transition, and that the caller is responsible for actually saving the condition, can we move these to the caller as well? I think it would make it a bit more clear what's going on.

Actually, if all of the condition changes could be done in the caller I think that would be better, but I understand if you don't want to create a ton of return values, given that (at least this function) changes both the message and reason in some error cases.

Contributor Author

OK, I moved the state defaults to the caller, but yes, it would require a lot of extra work to identify each and every error case and make the corresponding cond change there as well; not sure it's worth the effort.
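The resulting split looks roughly like this (a sketch; the ConfigurationFailedReason constant name is assumed from the PR description rather than taken from the diff):

// The caller owns the default "pending" reason and persists the condition:
cond.Reason = v1alpha1.ConfigurationPendingReason
cd, bmh, err := r.validateConfiguration(ctx, ici, &cond, log)

// ...while validateConfiguration only overrides it for the specific failure it
// hits, e.g. when AutomatedCleaningMode cannot be patched on the BMH:
cond.Reason = v1alpha1.ConfigurationFailedReason
cond.Message = "failed to set automatedCleaningMode on the BareMetalHost"
return nil, nil, err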

@zszabo-rh zszabo-rh force-pushed the conditions_refactoring branch from b40e051 to c297f35 Compare May 14, 2025 06:02
@zszabo-rh
Contributor Author

/hold
analyzing e2e logs

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 14, 2025
Member

@carbonin carbonin left a comment

Looking at the tests, I'm a bit worried about the number of new error cases coming from Reconcile. Returning an error incurs more reconcile calls, and I'm not sure that's what we want in all these cases.

@@ -1959,7 +1960,8 @@ var _ = Describe("Reconcile with DataImageCoolDownPeriod set to 1 second", func(
}
installerSuccess()
res, err := r.Reconcile(ctx, req)
Expect(err).NotTo(HaveOccurred())
Expect(err).To(HaveOccurred())
Member

Should this really be an error?

@@ -1518,7 +1519,7 @@ var _ = Describe("Reconcile", func() {
Expect(infoOut.MachineNetwork[0].CIDR.String()).To(Equal(clusterInstall.Spec.MachineNetwork))
})

It("in case there is no actual bmh under the reference we should not return error", func() {
It("in case there is no actual bmh under the reference we should return error", func() {
Member

Do we really need both this test and "Set ClusterInstallRequirementsMet to false in case there is not actual bmh under the reference"?

clusterInstall.Spec.ClusterDeploymentRef = nil
Expect(c.Create(ctx, clusterInstall)).To(Succeed())
key := types.NamespacedName{
Namespace: clusterInstallNamespace,
Name: clusterInstallName,
}
res, err := r.Reconcile(ctx, ctrl.Request{NamespacedName: key})
Expect(err).NotTo(HaveOccurred())
Expect(err).To(HaveOccurred())
Member

This is a strange behavior change, why would we want an error in this case? An error will cause an additional reconcile where we would expect to only reconcile once the resource was changed.

Contributor Author

TBH this rather crucial information was new to me; somehow I assumed setting an error was more like easily accessible debug info for the user that otherwise does not affect the reconcile loop.

Thanks for clarifying this. I refactored the errors to limit them to real issues only.

What I don't like now is that each "main phase method" called by Reconcile() returns values by its own unique logic. Do you think I should make them uniform, e.g. have them all return a clear statement of whether reconcile should continue or not (even if this can easily be figured out by checking the rest of the values)? That approach feels a bit forced to me without much benefit, but I'll reconsider if you think otherwise.
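For reference, the behavior that drove this change: controller-runtime requeues with exponential backoff whenever Reconcile returns a non-nil error, so errors are reserved for genuinely unexpected failures. A sketch with illustrative messages:

// Spec is simply incomplete: record the condition, return no error, and let
// the next watch event on the resource trigger the next reconcile.
if ici.Spec.ClusterDeploymentRef == nil {
	cond.Reason = v1alpha1.ConfigurationPendingReason
	cond.Message = "ClusterDeploymentRef is unset"
	return ctrl.Result{}, nil
}
// A real failure: return the error so controller-runtime retries with backoff.
if err := r.Update(ctx, bmh); err != nil {
	return ctrl.Result{}, err
}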

@@ -997,7 +997,8 @@ var _ = Describe("Reconcile", func() {
}
installerSuccess()
res, err := r.Reconcile(ctx, req)
Expect(err).NotTo(HaveOccurred())
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("current state is: registering"))
Member

Why error here?

@zszabo-rh zszabo-rh force-pushed the conditions_refactoring branch from c297f35 to 6cf9553 Compare May 15, 2025 08:46
Member

@carbonin carbonin left a comment

I see what you mean about the error handling. If you want to try for a generic approach, we did something like this in the bmh agent controller in assisted-service, but I'm still not sure I like it.

Maybe it would be useful here, though, with some tweaks (like not using a type called reconcileComplete where you're not stopping the reconcile 😫).
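A rough idea of what that generic approach could look like here, with a name that says what it signals (purely illustrative, not the assisted-service code; ctrl is sigs.k8s.io/controller-runtime):

// phaseOutcome is returned by every phase helper so the caller decides, in one
// place, whether to persist the condition, requeue, or keep going.
type phaseOutcome struct {
	stop   bool        // write the condition and return from Reconcile
	result ctrl.Result // optional requeue when stopping
	reason string      // condition reason to record when stopping
	err    error       // non-nil only for unexpected failures
}

func (o phaseOutcome) stopReconcile() bool { return o.stop || o.err != nil }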

// - ConfigurationFailed: sets this reason when AutomatedCleaningMode cannot be modified in BMH.
cond.Reason = v1alpha1.ConfigurationPendingReason
cd, bmh, err := r.validateConfiguration(ctx, ici, &cond, log)
if cd == nil || bmh == nil {
Member

I guess this is what you were referencing in your previous comment...

I'd say just change this to also check for err != nil. I could imagine someone changing this function somehow such that it returns values for cd and bmh but also runs into some validation issue. Generally, if an error is returned we should probably be checking it.
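i.e. something like this for the hunk above (a sketch of the suggested change):

cd, bmh, err := r.validateConfiguration(ctx, ici, &cond, log)
if err != nil || cd == nil || bmh == nil {
	return ctrl.Result{}, err
}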

// - HostConfigurationFailed (default): any unexpected errors during this phase will lead to this reason and finish reconcile.
cond.Reason = v1alpha1.HostConfigurationFailedReason
continueReconcile, res, err := r.configureHost(ctx, ici, imageUrl, bmh, &cond, log)
if !continueReconcile {
Member

This one too. To be safe I'd also check err here even though we might not currently have a case where continueReconcile == true and err != nil.
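Likewise for this hunk (a sketch of the suggested change):

continueReconcile, res, err := r.configureHost(ctx, ici, imageUrl, bmh, &cond, log)
if !continueReconcile || err != nil {
	return res, err
}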

@zszabo-rh zszabo-rh force-pushed the conditions_refactoring branch from 6cf9553 to a6f05f9 Compare May 16, 2025 04:39

openshift-ci bot commented May 16, 2025

@zszabo-rh: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 16, 2025

openshift-ci bot commented May 16, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: carbonin, zszabo-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 16, 2025
@carbonin
Member

Is there a reason this is still on hold?

Feature freeze is today for 2.9 if we want to get this in ...

@zszabo-rh
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 28, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 9efb3e9 into openshift:main May 28, 2025
7 checks passed