MGMT-20179: Fix IBIO status conditions #270
Conversation
@zszabo-rh: This pull request references MGMT-20179, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.20.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Skipping CI for Draft Pull Request.
Force-pushed from bf7040d to dc8e95a
func (r *ImageClusterInstallReconciler) setRequirementsMetCondition(ctx context.Context, ici *v1alpha1.ImageClusterInstall,
	status corev1.ConditionStatus, reason, msg string) error {
👍
log.WithError(updateErr).Error("failed to create DataImage")
cond.Message = "failed to create BareMetalHost DataImage"
if !res.IsZero() {
	cond.Reason = v1alpha1.HostConfigurationPendingReason
Why pending?
Since it's the infamous "previous DataImage is being deleted" scenario: in this case we requeue with a 30s timer, so I assumed "pending" fits better.
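For illustration, a minimal sketch of how that requeue path might look (the 30-second interval and the surrounding control flow are assumptions based on this thread, not the actual diff):

```go
// Sketch only: the previous DataImage is still being deleted, so report a
// pending (not failed) condition and requeue after a fixed delay. Assumes
// "time" and ctrl "sigs.k8s.io/controller-runtime" are imported.
if !res.IsZero() {
	cond.Reason = v1alpha1.HostConfigurationPendingReason
	cond.Message = "failed to create BareMetalHost DataImage"
	// A fixed-delay requeue; returning an error here would instead trigger
	// exponential-backoff retries.
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
```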
	return ctrl.Result{}, err
}
if !annotationExists(&bmh.ObjectMeta, ibioManagedBMH) {
	// TODO: consider replacing this condition with `dataDisk.Status.AttachedImage`
	log.Infof("Nothing to do, waiting for BMH to get %s annotation", ibioManagedBMH)
	cond.Reason = v1alpha1.HostConfigurationPendingReason
	cond.Message = fmt.Sprintf("Waiting for BMH to get %s annotation", ibioManagedBMH)
This is a bit misleading; this controller is the one adding the annotation, in the function above, updateBMHProvisioningState. The message should say why the BMH is not getting the annotation. Looking at the code, it will not annotate the BMH if:

bmh.Status.Provisioning.State != bmh_v1alpha1.StateAvailable &&
	bmh.Status.Provisioning.State != bmh_v1alpha1.StateExternallyProvisioned
Changed to:
cond.Message = fmt.Sprintf("Waiting for BMH provisioning state to be StateAvailable or StateExternallyProvisioned, current state is: %s", bmh.Status.Provisioning.State)
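For context, an assumed sketch of the gate being discussed, built only from the two state checks quoted above (the surrounding function is not shown in this thread):

```go
// The controller only annotates the BMH once its provisioning state allows
// IBIO to manage it; until then, the condition reports the current state.
state := bmh.Status.Provisioning.State
if state != bmh_v1alpha1.StateAvailable && state != bmh_v1alpha1.StateExternallyProvisioned {
	cond.Reason = v1alpha1.HostConfigurationPendingReason
	cond.Message = fmt.Sprintf(
		"Waiting for BMH provisioning state to be StateAvailable or StateExternallyProvisioned, current state is: %s",
		state)
	return ctrl.Result{}, nil
}
```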
Force-pushed from dff224a to 6388d25
/test all
Let's keep messages formatted consistently. Things like whether or not to capitalize the first word should always be the same.
}

if err := r.setClusterInstallMetadata(ctx, log, ici, cd); err != nil {
This is a bit odd to have in a function called "validateHost". I wouldn't expect these kinds of side effects from a validation. Maybe it doesn't belong here or maybe we can come up with a better name for the function.
Yes, it's just a mid-phase step that doesn't really belong to host validation or image creation. I've moved it back to the caller, but do you think it's still OK to use the HostValidationFailedReason for this? I could even introduce a dedicated phase and condition reason just for this single step (maybe also bundled together with labelReferencedObjectsForBackup); not sure which is better.
I think it's fine either way
}

-	r.labelReferencedObjectsForBackup(ctx, log, ici, clusterDeployment)
+	r.labelReferencedObjectsForBackup(ctx, log, ici, cd)
This isn't really part of the image creation, can this maybe just live in the caller since it's never going to cause a condition failure?
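A minimal sketch of the suggestion, assuming the caller is Reconcile() and that cd has already been resolved there:

```go
// Backup labeling is best-effort and never causes a condition failure, so it
// can live in the caller rather than inside the image-creation phase.
r.labelReferencedObjectsForBackup(ctx, log, ici, cd)
```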
Force-pushed from 795a1ec to b40e051
/test all
/retest
	cond *hivev1.ClusterInstallCondition,
	log logrus.FieldLogger,
) (*hivev1.ClusterDeployment, *bmh_v1alpha1.BareMetalHost, error) {
	cond.Reason = v1alpha1.ConfigurationPendingReason
Given that this is acting as a kind of state transition, and that the caller is responsible for actually saving the condition can we move these to the caller as well? I think it would make it a bit more clear what's going on.
Actually if all of the condition changes could be done in the caller I think that would be better, but I understand if you don't want to create a ton of return values given that (at least this function) changes both the message and reason in some error cases.
OK, I moved the state defaults to the caller, but yes, it would require a lot of extra work to identify each and every error case and make the corresponding cond change there as well; I'm not sure it's worth the effort.
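A sketch of the resulting split, with names taken from the snippets in this thread (the exact error paths inside validateConfiguration are assumed):

```go
// The caller owns the default "pending" reason; the helper overrides
// Reason/Message only on the specific error paths it recognizes.
cond.Reason = v1alpha1.ConfigurationPendingReason
cd, bmh, err := r.validateConfiguration(ctx, ici, &cond, log)
if err != nil {
	// validateConfiguration may have replaced the defaults with a more
	// specific failure reason before returning.
	return ctrl.Result{}, err
}
```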
Force-pushed from b40e051 to c297f35
/hold
Looking at the tests I'm a bit worried about the number of new error cases coming from Reconcile. Returning an error incurs more reconcile calls, and I'm not sure that's what we want in all these cases.
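For readers following along, a hypothetical illustration of the distinction (none of these variable names are from the PR): controller-runtime retries returned errors with exponential backoff, while a clean result simply waits for the next watch event.

```go
// Hypothetical, not code from this PR.
if stillCoolingDown {
	// A transient, expected state: requeue after a fixed delay, no error.
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
if err != nil {
	// A real failure: returning it makes controller-runtime retry with
	// exponential backoff.
	return ctrl.Result{}, err
}
// Done: no requeue; the next reconcile comes from a watch event.
return ctrl.Result{}, nil
```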
@@ -1959,7 +1960,8 @@ var _ = Describe("Reconcile with DataImageCoolDownPeriod set to 1 second", func(
 	}
 	installerSuccess()
 	res, err := r.Reconcile(ctx, req)
-	Expect(err).NotTo(HaveOccurred())
+	Expect(err).To(HaveOccurred())
Should this really be an error?
@@ -1518,7 +1519,7 @@ var _ = Describe("Reconcile", func() {
 	Expect(infoOut.MachineNetwork[0].CIDR.String()).To(Equal(clusterInstall.Spec.MachineNetwork))
 })

-It("in case there is no actual bmh under the reference we should not return error", func() {
+It("in case there is no actual bmh under the reference we should return error", func() {
Do we really need both this test and "Set ClusterInstallRequirementsMet to false in case there is not actual bmh under the reference"?
clusterInstall.Spec.ClusterDeploymentRef = nil
Expect(c.Create(ctx, clusterInstall)).To(Succeed())
key := types.NamespacedName{
	Namespace: clusterInstallNamespace,
	Name:      clusterInstallName,
}
res, err := r.Reconcile(ctx, ctrl.Request{NamespacedName: key})
-Expect(err).NotTo(HaveOccurred())
+Expect(err).To(HaveOccurred())
This is a strange behavior change, why would we want an error in this case? An error will cause an additional reconcile where we would expect to only reconcile once the resource was changed.
TBH this rather crucial information was new to me; somehow I assumed returning an error was more like easily accessible debug info for the user and otherwise didn't affect the reconcile loop. Thanks for clarifying; I refactored the errors to limit them to real issues.
What I don't like now is that basically each "main phase method" called by Reconcile() returns values by its own unique logic. Do you think I should make them uniform, e.g. have them all return a clear statement of whether reconcile should continue (even if this can easily be figured out by checking the rest of the values)? That approach feels a bit forced without much benefit, but I'll reconsider if you think otherwise.
@@ -997,7 +997,8 @@ var _ = Describe("Reconcile", func() {
 	}
 	installerSuccess()
 	res, err := r.Reconcile(ctx, req)
-	Expect(err).NotTo(HaveOccurred())
+	Expect(err).To(HaveOccurred())
+	Expect(err.Error()).To(ContainSubstring("current state is: registering"))
Why error here?
Force-pushed from c297f35 to 6cf9553
I see what you mean about the error handling. If you want to try for a generic approach, we did something like this in the bmh agent controller in assisted-service, but I'm still not sure I like it. Maybe it would be useful here with some tweaks (like not using a type called reconcileComplete where you're not stopping reconcile 😫).
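A hypothetical sketch of what such a result type could look like here (this is not the assisted-service code, and the field names are invented):

```go
// Each phase returns one value that says whether to stop this pass,
// requeue, or surface a real error. Assumes
// ctrl "sigs.k8s.io/controller-runtime" is imported.
type phaseResult struct {
	stopReconcile bool        // stop this pass; the condition explains why
	result        ctrl.Result // optional requeue instructions
	err           error       // real failures only; incurs backoff retries
}
```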
// - ConfigurationFailed: sets this reason when AutomatedCleaningMode cannot be modified in BMH.
cond.Reason = v1alpha1.ConfigurationPendingReason
cd, bmh, err := r.validateConfiguration(ctx, ici, &cond, log)
if cd == nil || bmh == nil {
I guess this is what you were referencing in your previous comment... I'd say just change this to also check for err != nil. I could imagine someone changing this function so that it returns values for cd and bmh but also hits some validation issue. Generally, if an error is returned we should be checking it.
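In other words, something like this sketch:

```go
// Defensive check: honor a returned error even if cd and bmh happen to be
// non-nil.
cd, bmh, err := r.validateConfiguration(ctx, ici, &cond, log)
if err != nil || cd == nil || bmh == nil {
	return ctrl.Result{}, err
}
```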
// - HostConfigurationFailed (default): any unexpected errors during this phase will lead to this reason and finish reconcile.
cond.Reason = v1alpha1.HostConfigurationFailedReason
continueReconcile, res, err := r.configureHost(ctx, ici, imageUrl, bmh, &cond, log)
if !continueReconcile {
This one too. To be safe I'd also check err here, even though we might not currently have a case where continueReconcile == true and err != nil.
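I.e., a sketch of the combined check:

```go
// Same defensive pattern: check err even though no current path returns
// continueReconcile == true together with a non-nil err.
continueReconcile, res, err := r.configureHost(ctx, ici, imageUrl, bmh, &cond, log)
if err != nil || !continueReconcile {
	return res, err
}
```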
Force-pushed from 6cf9553 to a6f05f9
@zszabo-rh: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: carbonin, zszabo-rh. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Is there a reason this is still on hold? Feature freeze is today for 2.9 if we want to get this in ...
/unhold
Refactoring Reconcile() in relation to MGMT-20179
Reconcile is now separated into multiple phases, each with clear exit criteria and outcomes, setting the corresponding condition reason(s) on failure to represent why the requirements have not been met:
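A condensed sketch of how these phases might fit together, assembled only from the snippets quoted in this review (the imageUrl variable and the wiring of cond are assumptions, not the actual implementation):

```go
// Each phase gets a default condition reason before it runs, so any failure
// is attributable to the phase that set it.
cond.Reason = v1alpha1.ConfigurationPendingReason
cd, bmh, err := r.validateConfiguration(ctx, ici, &cond, log)
if err != nil || cd == nil || bmh == nil {
	// validateConfiguration sets a more specific reason on its error paths.
	return ctrl.Result{}, err
}

cond.Reason = v1alpha1.HostConfigurationFailedReason
continueReconcile, res, err := r.configureHost(ctx, ici, imageUrl, bmh, &cond, log)
if err != nil || !continueReconcile {
	return res, err
}
return ctrl.Result{}, nil
```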