MGMT-20179: Fix IBIO status conditions #270

Merged

merged 2 commits from conditions_refactoring into openshift:main on May 28, 2025

Conversation

zszabo-rh
Contributor

@zszabo-rh zszabo-rh commented May 5, 2025

Refactoring Reconcile() in relation to MGMT-20179

Reconcile() is now separated into multiple phases, each with clear exit criteria and outcomes. Each phase sets the corresponding reason(s) on failure to explain why the requirements have not been met (a minimal sketch of the resulting flow follows the list):

  1. Config validation phase:
  • ConfigurationPending: either the user still needs to complete the ImageClusterInstall definition, or some of the referenced resources (CD or BMH) are not available yet. In both cases the reconcile ends and is triggered again once the problem is resolved.
  • ConfigurationFailed: set when AutomatedCleaningMode cannot be modified on the BMH. Reconcile stops in this case.
  2. Host validation phase:
  • HostValidationPending: if BMH provisioning or hardware inspection is not ready yet, the reconcile is requeued for 30s later.
  • HostValidationFailed: any error or invalid BMH configuration ends the reconcile here.
  3. Image creation phase:
  • ImageCreationPending: when the lock cannot be acquired, the reconcile is requeued to try again 5s later.
  • ImageCreationFailed: any other unexpected error stops the reconcile loop with this reason.
  4. Host configuration phase:
  • HostConfigurationPending: set in the following scenarios:
    • an earlier DataImage instance is still being deleted (requeue after 30s)
    • the current DataImage was created less than a second ago, so BMO might not have been notified yet (requeue after 1s)
    • the image-based-install-managed annotation is not set on the BMH yet (no requeue)
  • HostConfigurationFailed: any unexpected error during this phase leads to this reason and ends the reconcile.
  • HostConfigurationSucceeded: all the finishing steps went fine (like setting the boot time to now), and the RequirementsMet condition can finally be set to true.
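A minimal, self-contained sketch of the phase flow above. The helper names, return shapes, and requeue plumbing here are illustrative stand-ins for how the phases hand control back to Reconcile(), not the exact code in this PR:

package main

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Stand-in phase helpers; each reports whether reconciliation may continue and,
// when it should stop, how long to wait before requeueing (0 = no requeue).
func validateConfiguration(ctx context.Context) (ok bool, err error) { return true, nil }

func validateHost(ctx context.Context) (ok bool, requeue time.Duration, err error) {
	inspected := true // pretend BMH hardware inspection already finished
	if !inspected {
		return false, 30 * time.Second, nil // HostValidationPending
	}
	return true, 0, nil
}

func createImage(ctx context.Context) (ok bool, requeue time.Duration, err error) {
	return true, 0, nil
}

func configureHost(ctx context.Context) (ok bool, requeue time.Duration, err error) {
	return true, 0, nil
}

func reconcileSketch(ctx context.Context) (ctrl.Result, error) {
	// 1. Config validation: ConfigurationPending/ConfigurationFailed stop the
	//    reconcile; it runs again on the next watch event rather than on a timer.
	if ok, err := validateConfiguration(ctx); !ok || err != nil {
		return ctrl.Result{}, err
	}
	// 2. Host validation: HostValidationPending requeues after 30s.
	if ok, requeue, err := validateHost(ctx); !ok || err != nil {
		return ctrl.Result{RequeueAfter: requeue}, err
	}
	// 3. Image creation: ImageCreationPending requeues after 5s.
	if ok, requeue, err := createImage(ctx); !ok || err != nil {
		return ctrl.Result{RequeueAfter: requeue}, err
	}
	// 4. Host configuration: Pending either requeues (30s/1s) or waits for the
	//    BMH annotation; unexpected errors end with HostConfigurationFailed.
	if ok, requeue, err := configureHost(ctx); !ok || err != nil {
		return ctrl.Result{RequeueAfter: requeue}, err
	}
	// All phases passed: RequirementsMet can be set to true with reason
	// HostConfigurationSucceeded.
	return ctrl.Result{}, nil
}

func main() { _, _ = reconcileSketch(context.Background()) }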

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 5, 2025
@openshift-ci-robot

openshift-ci-robot commented May 5, 2025

@zszabo-rh: This pull request references MGMT-20179 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.20.0" version, but no target version was set.

In response to this:

Refactoring Reconcile() in relation with MGMT-20179

Reconcile sets the following reasons when requirements are not met due to some failure:

  • ConfigurationNotReadyReason: BMH or CD reference missing or the resource is not available
  • HostValidationPending: BMH provisioning is not ready yet
  • HostValidationFailedReason:
  • ImageNotReadyReason:
  • HostConfigurationPendingReason
  • HostConfigurationFailedReason

And finally when everything goes fine during reconcile and RequirementsMet condition is set:

  • HostConfigurationSucceededReason

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2025

openshift-ci bot commented May 5, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot

openshift-ci-robot commented May 6, 2025

@zszabo-rh: This pull request references MGMT-20179 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.20.0" version, but no target version was set.

In response to this:

Refactoring Reconcile() in relation with MGMT-20179

Reconcile is now separated in multiple phases, each with clear exit criteria and outcome, also setting the corresponding reason(s) when failed to represent why the requirements have not been met:

  1. Config validation phase:
  • ConfigurationNotReady: it's either the user needs to complete the ImageClusterInstall definition, or some of referenced resources (CD or BMH) are not available yet. In both cases the reconcile ends, and will be triggered again when the problem is resolved.
  2. Host validation phase:
  • HostValidationPending: if BMH provisioning or hardware inspection is not ready yet, reconcile is requeued for 30s later.
  • HostValidationFailed: in case of any errors or invalid BMH configuration the reconcile ends here.
  3. Image creation phase:
  • ImageCreationPending: when lock cannot be acquired, reconcile gets requeued for 5s later to try again.
  • ImageCreationFailed: any other unexpected error stops the reconcile loop with this reason.
  4. Host configuration phase:
  • HostConfigurationPending: sets this reason in following scenarios:
    • earlier DataImage instance is still being deleted for some reason (requeue after 30s)
    • current DataImage was just created less than a second ago so BMO might not be notified yet (requeue after 1s)
    • 'image-based-install-managed' annotation is not set yet in BMH (no requeue)
  • HostConfigurationFailed: any unexpected errors during this phase will lead to this reason and finish reconcile.
  5. Requirements met phase:
  • HostConfigurationSucceeded: if all the finishing steps go fine (like setting the boot time to "now"), then RequirementsMet condition is set to true with reason HostConfigurationSucceeded. In case of any errors it falls back to HostConfigurationFailed and stops reconcile.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

openshift-ci-robot commented May 6, 2025

@zszabo-rh: This pull request references MGMT-20179 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.20.0" version, but no target version was set.

In response to this:

Refactoring Reconcile() in relation with MGMT-20179

Reconcile is now separated in multiple phases, each with clear exit criteria and outcome, also setting the corresponding reason(s) when failed to represent why the requirements have not been met:

  1. Config validation phase:
  • ConfigurationPending: it's either the user needs to complete the ImageClusterInstall definition, or some of referenced resources (CD or BMH) are not available yet. In both cases the reconcile ends, and will be triggered again when the problem is resolved.
  • ConfigurationFailed: sets this reason when AutomatedCleaningMode cannot be modified in BMH. Reconcile is stopped in this case.
  2. Host validation phase:
  • HostValidationPending: if BMH provisioning or hardware inspection is not ready yet, reconcile is requeued for 30s later.
  • HostValidationFailed: in case of any errors or invalid BMH configuration the reconcile ends here.
  3. Image creation phase:
  • ImageCreationPending: when lock cannot be acquired, reconcile gets requeued for 5s later to try again.
  • ImageCreationFailed: any other unexpected error stops the reconcile loop with this reason.
  4. Host configuration phase:
  • HostConfigurationPending: sets this reason in following scenarios:
    • earlier DataImage instance is still being deleted for some reason (requeue after 30s)
    • current DataImage was just created less than a second ago so BMO might not be notified yet (requeue after 1s)
    • image-based-install-managed annotation is not set yet in BMH (no requeue)
  • HostConfigurationFailed: any unexpected errors during this phase will lead to this reason and finish reconcile.
  • HostConfigurationSucceeded: all the finishing steps went fine (like setting the boot time to now), and RequirementsMet condition can finally be true.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@zszabo-rh zszabo-rh force-pushed the conditions_refactoring branch from bf7040d to dc8e95a Compare May 6, 2025 07:28
func (r *ImageClusterInstallReconciler) setRequirementsMetCondition(ctx context.Context, ici *v1alpha1.ImageClusterInstall,
	status corev1.ConditionStatus, reason, msg string) error {
Contributor

👍

log.WithError(updateErr).Error("failed to create DataImage")
cond.Message = "failed to create BareMetalHost DataImage"
if !res.IsZero() {
cond.Reason = v1alpha1.HostConfigurationPendingReason
Contributor

Why pending?

Contributor Author

@zszabo-rh zszabo-rh May 8, 2025

Since it's the infamous "previous DataImage is still being deleted" scenario: in this case we're going to requeue with a 30s timer, so I assumed "pending" fits better.
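For context, a sketch of the branch being discussed, assuming the existing cond variable and the repo's HostConfigurationPendingReason constant (the message text is illustrative):

// The DataImage left over from a previous attempt is still terminating:
// report Pending rather than Failed and come back in 30 seconds.
if dataImage.GetDeletionTimestamp() != nil {
	cond.Reason = v1alpha1.HostConfigurationPendingReason
	cond.Message = "previous DataImage is still being deleted"
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}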

return ctrl.Result{}, err
}
if !annotationExists(&bmh.ObjectMeta, ibioManagedBMH) {
// TODO: consider replacing this condition with `dataDisk.Status.AttachedImage`
log.Infof("Nothing to do, waiting for BMH to get %s annotation", ibioManagedBMH)
cond.Reason = v1alpha1.HostConfigurationPendingReason
cond.Message = fmt.Sprintf("Waiting for BMH to get %s annotation", ibioManagedBMH)
Contributor

This is a bit misleading; this controller is adding the annotation in the function above, updateBMHProvisioningState.
The message should say why the BMH is not getting the annotation.
Looking at the code, it seems that it will not annotate the BMH if:

bmh.Status.Provisioning.State != bmh_v1alpha1.StateAvailable &&
bmh.Status.Provisioning.State != bmh_v1alpha1.StateExternallyProvisioned 

Contributor Author

Changed to:
cond.Message = fmt.Sprintf("Waiting for BMH provisioning state to be StateAvailable or StateExternallyProvisioned, current state is: %s", bmh.Status.Provisioning.State)
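Putting the pieces together, the waiting branch then reads roughly like this (a sketch built from the metal3 provisioning-state constants quoted above; the surrounding code may differ):

// The controller only annotates the BMH once it is available or externally
// provisioned, so the Pending message points at the current provisioning state.
if bmh.Status.Provisioning.State != bmh_v1alpha1.StateAvailable &&
	bmh.Status.Provisioning.State != bmh_v1alpha1.StateExternallyProvisioned {
	cond.Reason = v1alpha1.HostConfigurationPendingReason
	cond.Message = fmt.Sprintf("Waiting for BMH provisioning state to be StateAvailable or StateExternallyProvisioned, current state is: %s",
		bmh.Status.Provisioning.State)
	return ctrl.Result{}, nil
}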

@zszabo-rh zszabo-rh force-pushed the conditions_refactoring branch 2 times, most recently from dff224a to 6388d25 Compare May 8, 2025 08:47
@zszabo-rh
Contributor Author

/test all

Member

@carbonin carbonin left a comment

Let's keep messages formatted consistently. Things like whether or not the first word is capitalized should always be the same.


}

if err := r.setClusterInstallMetadata(ctx, log, ici, cd); err != nil {
Member

This is a bit odd to have in a function called "validateHost". I wouldn't expect these kinds of side effects from a validation. Maybe it doesn't belong here or maybe we can come up with a better name for the function.

Contributor Author

Yes, it's just a mid-phase step that doesn't really belong to either host validation or image creation.
I've moved it back to the caller, but do you think it's still OK to use HostValidationFailedReason for this? I could even introduce a dedicated phase and condition reason just for this single step (maybe also bundled together with labelReferencedObjectsForBackup); not sure which is better.

Member

I think it's fine either way

}

r.labelReferencedObjectsForBackup(ctx, log, ici, clusterDeployment)
r.labelReferencedObjectsForBackup(ctx, log, ici, cd)
Member

This isn't really part of the image creation; can this maybe just live in the caller, since it's never going to cause a condition failure?

@zszabo-rh zszabo-rh force-pushed the conditions_refactoring branch 2 times, most recently from 795a1ec to b40e051 Compare May 12, 2025 11:32
@zszabo-rh
Contributor Author

/test all

@zszabo-rh zszabo-rh requested review from eranco74 and carbonin May 13, 2025 05:47
@zszabo-rh zszabo-rh marked this pull request as ready for review May 13, 2025 07:38
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 13, 2025
@openshift-ci openshift-ci bot requested a review from tsorya May 13, 2025 07:39
@zszabo-rh
Contributor Author

/retest

cond *hivev1.ClusterInstallCondition,
log logrus.FieldLogger,
) (*hivev1.ClusterDeployment, *bmh_v1alpha1.BareMetalHost, error) {
cond.Reason = v1alpha1.ConfigurationPendingReason
Member

Given that this is acting as a kind of state transition, and that the caller is responsible for actually saving the condition, can we move these to the caller as well? I think it would make it a bit more clear what's going on.

Actually, if all of the condition changes could be done in the caller I think that would be better, but I understand if you don't want to create a ton of return values, given that (at least this function) changes both the message and reason in some error cases.

Contributor Author

OK, I moved the state defaults to the caller, but yes, it would require a lot of extra work to identify each and every error case and make the corresponding cond change there as well; not sure it's worth the effort.
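The resulting split looks roughly like this (a sketch; the ConfigurationFailedReason constant name is assumed from the PR description rather than taken from the diff):

// The caller owns the default "pending" reason and persists the condition:
cond.Reason = v1alpha1.ConfigurationPendingReason
cd, bmh, err := r.validateConfiguration(ctx, ici, &cond, log)

// ...while validateConfiguration only overrides it for the specific failure it
// hits, e.g. when AutomatedCleaningMode cannot be patched on the BMH:
cond.Reason = v1alpha1.ConfigurationFailedReason
cond.Message = "failed to set automatedCleaningMode on the BareMetalHost"
return nil, nil, err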

@zszabo-rh zszabo-rh force-pushed the conditions_refactoring branch from b40e051 to c297f35 Compare May 14, 2025 06:02
@zszabo-rh
Contributor Author

/hold
analyzing e2e logs

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 14, 2025
Member

@carbonin carbonin left a comment

Looking at the tests, I'm a bit worried about the number of new error cases coming from Reconcile. Returning an error incurs more reconcile calls, and I'm not sure that's what we want in all these cases.

@@ -1959,7 +1960,8 @@ var _ = Describe("Reconcile with DataImageCoolDownPeriod set to 1 second", func(
}
installerSuccess()
res, err := r.Reconcile(ctx, req)
Expect(err).NotTo(HaveOccurred())
Expect(err).To(HaveOccurred())
Member

Should this really be an error?

@@ -1518,7 +1519,7 @@ var _ = Describe("Reconcile", func() {
Expect(infoOut.MachineNetwork[0].CIDR.String()).To(Equal(clusterInstall.Spec.MachineNetwork))
})

It("in case there is no actual bmh under the reference we should not return error", func() {
It("in case there is no actual bmh under the reference we should return error", func() {
Member

Do we really need both this test and "Set ClusterInstallRequirementsMet to false in case there is not actual bmh under the reference"?

clusterInstall.Spec.ClusterDeploymentRef = nil
Expect(c.Create(ctx, clusterInstall)).To(Succeed())
key := types.NamespacedName{
Namespace: clusterInstallNamespace,
Name: clusterInstallName,
}
res, err := r.Reconcile(ctx, ctrl.Request{NamespacedName: key})
Expect(err).NotTo(HaveOccurred())
Expect(err).To(HaveOccurred())
Member

This is a strange behavior change, why would we want an error in this case? An error will cause an additional reconcile where we would expect to only reconcile once the resource was changed.

Contributor Author

TBH this rather crucial information was new to me; somehow I assumed setting an error was more like easily accessible debug info for the user that otherwise does not affect the reconcile loop.

Thanks for clarifying this. I refactored the errors to limit them to real issues only.

What I don't like now is that each "main phase method" called by Reconcile() returns values by its own unique logic. Do you think I should make them uniform, e.g. have them all return a clear statement of whether reconcile should continue or not (even if this can easily be figured out by checking the rest of the values)? That approach feels a bit forced to me without much benefit, but I'll reconsider if you think otherwise.
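For reference, the behavior that drove this change: controller-runtime requeues with exponential backoff whenever Reconcile returns a non-nil error, so errors are reserved for genuinely unexpected failures. A sketch with illustrative messages:

// Spec is simply incomplete: record the condition, return no error, and let
// the next watch event on the resource trigger the next reconcile.
if ici.Spec.ClusterDeploymentRef == nil {
	cond.Reason = v1alpha1.ConfigurationPendingReason
	cond.Message = "ClusterDeploymentRef is unset"
	return ctrl.Result{}, nil
}
// A real failure: return the error so controller-runtime retries with backoff.
if err := r.Update(ctx, bmh); err != nil {
	return ctrl.Result{}, err
}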

@@ -997,7 +997,8 @@ var _ = Describe("Reconcile", func() {
}
installerSuccess()
res, err := r.Reconcile(ctx, req)
Expect(err).NotTo(HaveOccurred())
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("current state is: registering"))
Member

Why error here?

@zszabo-rh zszabo-rh force-pushed the conditions_refactoring branch from c297f35 to 6cf9553 Compare May 15, 2025 08:46
Member

@carbonin carbonin left a comment

I see what you mean about the error handling. If you want to try for a generic approach, we did something like this in the bmh agent controller in assisted-service, but I'm still not sure I like it.

Maybe it would be useful here, though, with some tweaks (like not using a type called reconcileComplete where you're not stopping the reconcile 😫).
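A rough idea of what that generic approach could look like here, with a name that says what it signals (purely illustrative, not the assisted-service code; ctrl is sigs.k8s.io/controller-runtime):

// phaseOutcome is returned by every phase helper so the caller decides, in one
// place, whether to persist the condition, requeue, or keep going.
type phaseOutcome struct {
	stop   bool        // write the condition and return from Reconcile
	result ctrl.Result // optional requeue when stopping
	reason string      // condition reason to record when stopping
	err    error       // non-nil only for unexpected failures
}

func (o phaseOutcome) stopReconcile() bool { return o.stop || o.err != nil }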

// - ConfigurationFailed: sets this reason when AutomatedCleaningMode cannot be modified in BMH.
cond.Reason = v1alpha1.ConfigurationPendingReason
cd, bmh, err := r.validateConfiguration(ctx, ici, &cond, log)
if cd == nil || bmh == nil {
Member

I guess this is what you were referencing in your previous comment...

I'd say just change this to also check for err != nil. I could imagine someone changing this function somehow such that it returns values for cd and bmh but also runs into some validation issue. Generally, if an error is returned we should probably be checking it.
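i.e. something like this for the hunk above (a sketch of the suggested change):

cd, bmh, err := r.validateConfiguration(ctx, ici, &cond, log)
if err != nil || cd == nil || bmh == nil {
	return ctrl.Result{}, err
}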

// - HostConfigurationFailed (default): any unexpected errors during this phase will lead to this reason and finish reconcile.
cond.Reason = v1alpha1.HostConfigurationFailedReason
continueReconcile, res, err := r.configureHost(ctx, ici, imageUrl, bmh, &cond, log)
if !continueReconcile {
Member

This one too. To be safe I'd also check err here even though we might not currently have a case where continueReconcile == true and err != nil.
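Likewise for this hunk (a sketch of the suggested change):

continueReconcile, res, err := r.configureHost(ctx, ici, imageUrl, bmh, &cond, log)
if !continueReconcile || err != nil {
	return res, err
}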

@zszabo-rh zszabo-rh force-pushed the conditions_refactoring branch from 6cf9553 to a6f05f9 Compare May 16, 2025 04:39

openshift-ci bot commented May 16, 2025

@zszabo-rh: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 16, 2025

openshift-ci bot commented May 16, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: carbonin, zszabo-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 16, 2025
@carbonin
Member

Is there a reason this is still on hold?

Feature freeze is today for 2.9 if we want to get this in ...

@zszabo-rh
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 28, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 9efb3e9 into openshift:main May 28, 2025
7 checks passed