-
Notifications
You must be signed in to change notification settings - Fork 538
MCO-1543: Update v1alpha1 MCN API #2201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MCO-1543: Update v1alpha1 MCN API #2201
Conversation
Skipping CI for Draft Pull Request. |
Hello @isabella-janssen! Some important instructions when contributing to openshift/api: |
/test all |
/retest-required |
1 similar comment
/retest-required |
@isabella-janssen: This pull request references MCO-1543 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@isabella-janssen: This pull request references MCO-1543 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@isabella-janssen: This pull request references MCO-1543 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/test integration |
2 similar comments
/test integration |
/test integration |
/retest-required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good! I've left a few questions and tried to recap our discussion regarding StateProgress
at the end.
// of only lowercase alphanumeric characters, hyphens (-), and periods (.), start and end | ||
// with an alphanumeric character, and be at most 253 characters in length. | ||
// +kubebuilder:validation:MaxLength:=253 | ||
// +kubebuilder:validation:XValidation:rule="!format.dns1123Subdomain().validate(self).hasValue()",message="a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character." | ||
// +optional | ||
Current string `json:"current"` | ||
// desired is the MachineConfig the node wants to upgrade to. | ||
// This value gets set in the machine config node status once the machine config has been validated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, to check my understanding, this means that the new MC has been verified by the daemon to be "good" and we're about to start the update? I guess I can see the value for this, but wouldn't the remaining MachineConfigNode
states give us this feedback in another way? Are we gaining anything else by maintaining a Desired
field here? It sort of sounds like an anti-pattern to me, since we're not following k8s conventions for Spec/Status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought a bit more about this, and I feel like we are willing to go down to two fields, (one in Spec
and one in Status
) we could name them something simpler like MachineConfig
(similar to the OCL API)? We could also clarify in godoc that they represent what the node's current/desired annotations used to.
Just my 2 cents and non blocking 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related conversation with insight from @yuqi-zhang can be found here. It seems the ideas are:
- Drop the desired config from Status and only have it in the Spec.
- Potentially simplify the field naming similar to the standard used in OCL.
I’ve stalked through the original PR introducing the MCN API & it looks like this is the context behind the current version design in spec & status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call on the original discussion. I think the way we generally think about it is that the daemon today never "rejects" an incoming config, and the state at which it's processing the desired configuration is encapsulated by all the conditions we have on the object. In this sense, it doesn't really tell us any more information to have the daemon "ack the latest desired" since the only time it would not do so is if it wasn't running.
With that thought I think I lean towards having "desired" as only spec, and "current" as only status, and have the conditions be the main fields for granular progress. I'm don't feel too strongly about that though, so I'm also happy to defer to @JoelSpeed who advocated for the original fields
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you review my previous comments, the expectation is that there would be some kind of desired version spec in the spec, and known version status in the status, but also a way for the user to be able to tell that the daemon has seen that it's been requested to update to the new spec (which should be in status)
So, with that in mind, I'd suggest either keeping the status.desiredVersion style field, or having an observedGeneration
at the minimum, which would allow a user to see that the latest spec had been acknowledged by the daemon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Joel's suggestion here makes sense, so I'm happy to keep as is 👍
As a sidenote, I'm not sure what the current output of oc get machineconfignodes
looks like, but would it make sense to surface status.desiredVersion
there? I wonder if it would throw users off, if they are used to our existing desired/current node annotation system?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Joel's suggestion here makes sense, so I'm happy to keep as is 👍
I'm fine with that as well. I think the actual MCN flow might not be working in line with those expectations, but I will continue to test and (likely) open a bug to track that correction.
Would it make sense to surface
status.desiredVersion
[in the output ofoc get machineconfignodes
]?
Right now oc get machineconfignodes
gives node name, MCP, desired config, current config, and updated, where desired config is pulled from spec.configVersion.desired
(ref). Are you proposing pulling that value from status.configVersion.desired
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think keeping it as spec is fine. Basically it boils down to whether we display:
- what the controller latest set as the desired configuration, or
- what the daemon has acknowledged as the desired configuration.
1 happens first, so from an output perspective I think it makes more sense for the user. In practice the difference should be very small anyways, and noting a diff (most of the time this would indicate a dead daemon) might be more helpful since it aligns better with our existing paradigms in the annotations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you proposing pulling that value from status.configVersion.desired instead?
Ah no sorry, I was proposing that we shouldn't do that. I think spec.configVersion.desired
& status.configVersion.current
makes most sense here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @djoshy and @yuqi-zhang, that make sense!
type StateProgress string | ||
|
||
// TODO: Trim down to only helpful statues. Relevant PR: https://github.com/openshift/api/pull/1596 | ||
const ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just recapping our discussion earlier, it is probably worth determining:
- How are these statuses being used in the current MCO?
- Which ones are parents/children? As you noted, our existing docs seem to be lacking this.
Once we have that, we can then work through removing the unhelpful ones and updating their godocs(the existing ones seem a little vague).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original parent/child relationships for the statuses during an update can be visualized according to the following diagram. My understanding of how the children relate back to the parents is that when all children associated with the parent are complete (status = true), the parent is considered complete. For example, if UpdatePostActionCompleted = True
, it implies that the node cordone, drain, and appliedFilesAndOS stages completed successfully (or were not needed).
block-beta
columns 15
block:parents:15
columns 15
parentTitle("Parent Phases")
space:14
space
updatePrepared["UpdatePrepared"] space:2
updateExecuted["UpdateExecuted"] space:3
updatePostActionComplete["UpdatePostActionComplete"] space
resumed["Resumed"] space
updateComplete["UpdateComplete"] space
updated["Updated"]
space:15
end
%% updatePrepared --> updateExecuted
updatePrepared --> updateExecuted
updateExecuted --> updatePostActionComplete
updatePostActionComplete --> resumed
resumed --> updateComplete
updateComplete --> updated
space:15
block:children:15
columns 15
childTitle("Child Phases")
space:14
space
updateCompatible["UpdateCompatible"]
space cordoned["Cordoned"]
drained["Drained"]
appliedFilesAndOS["AppliedFilesAndOS"] space
rebootedNode["RebootedNode"] space
reloadedCRIO["ReloadedCRIO"] space:2
uncordoned["Uncordoned"] space:2
end
updateCompatible --> updatePrepared
cordoned --> updateExecuted
drained --> updateExecuted
appliedFilesAndOS --> updateExecuted
rebootedNode --> updatePostActionComplete
reloadedCRIO --> updatePostActionComplete
uncordoned --> updateComplete
class parents PhaseGroup
class children PhaseGroup
classDef PhaseGroup fill:#f5f5f5,stroke-dasharray:10,10,stroke-width:2px,stroke:#000
class parentTitle PhaseTitle
class childTitle PhaseTitle
classDef PhaseTitle stroke:transparent,fill:transparent,font-weight:bold,font-size:1.25em,color:#000
class updatePrepared,updateExecuted,updatePostActionComplete,resumed,updateComplete,updated,updateCompatible,cordoned,drained,appliedFilesAndOS,rebootedNode,reloadedCRIO,uncordoned Phase
classDef Phase font-weight:bold,fill:#bbbbbb,stroke:#000,color:#000
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just summing up my thoughts from the above threads:
- +1 on getting rid of the child phase for
MachineConfigNodeUpdatePrepared
MachineConfigNodeUpdateReloaded
does not make sense anymore, since with Node Disruption Policies the users can "script" additional custom actions. We should probably just keepMachineConfigNodeUpdatePostActionComplete
to indicate that the node disruption actions are in progress/complete. Rebooting is also technically a node disruption action, so I can see the argument forMachineConfigNodeUpdateRebooted
to be dropped too. It does feel more "critical" to me than the other actions though, so maybe it is worth keeping around?(maybe lift it out into a parent phase, potentially?)- I am leaning towards keeping the cordon/uncordon states since it could potentially help out with that scenario Jerry mentioned. Might be wishful thinking, but I'm also hoping it may let us clean up the drain annotations we currently use on the node objects at some point in the future 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on getting rid of the child phase for MachineConfigNodeUpdatePrepared
Done 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess my argument is that, from a customer POV (specifically to a less opinionated user who doesn't use node disruption policies) MachineConfigNodeUpdateRebooted
seems more important, or at least just as important as MachineConfigNodeUpdatePostActionComplete
. With the way node disruption policies are structured, if a reboot is going to take place, all other "scripted" node disruption actions are skipped.
So in practice, with the current system, you'd see:
Update type | MCN States that are set to true |
---|---|
Rebootless | MachineConfigNodeUpdatePostActionComplete |
Reboot | MachineConfigNodeUpdatePostActionComplete + MachineConfigNodeUpdateRebooted |
By lifting reboots to a separate parent state:
Update type | MCN States that are set to true |
---|---|
Rebootless | MachineConfigNodeUpdatePostActionComplete |
Reboot | MachineConfigNodeUpdateRebooted |
Not strongly held, I'll happily back off if this kind of granularity isn't worth the effort - I'm just sort of trying to apply our original motivation behind node disruption policies here 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see the MachineConfigNodeUpdateRebooted
phase being useful in some form. Whether we leave it as a child of MachineConfigNodeUpdatePostActionComplete
or lift it to a parent phase, I don't think I have enough background knowledge to have a strong opinion.
For the second scenario, does a reboot required update not have any overlap with the steps in MachineConfigNodeUpdatePostActionComplete
? What steps do you see the MachineConfigNodeUpdatePostActionComplete
phase containing that is distinct from the steps that would be part of MachineConfigNodeUpdateRebooted
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the second scenario, does a reboot required update not have any overlap with the steps in MachineConfigNodeUpdatePostActionComplete? What steps do you see the MachineConfigNodeUpdatePostActionComplete phase containing that is distinct from the steps that would be part of MachineConfigNodeUpdateRebooted?
So a reboot required update would skip all other "post update actions" and go straight to a reboot. It is sort of an either or, the daemon will determine if a node disruption policies apply for an MC update and if none of the policies apply, it'll default to a reboot. I guess it comes down to if we want to consider MachineConfigNodeUpdatePostActionComplete
as representative of the daemon executing the node disruption actions scripted by the user, or if we want to just make it represent "everything that happens after an update has been staged".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it comes down to if we want to consider MachineConfigNodeUpdatePostActionComplete as representative of the daemon executing the node disruption actions scripted by the user, or if we want to just make it represent "everything that happens after an update has been staged".
I can see it being either way. The API is the same whether MachineConfigNodeUpdateRebooted
is a child phase or not, so it's probably not crucial to decide now.
That being said, if it is an either/or situation, I think I would lean towards MachineConfigNodeUpdatePostActionComplete
and MachineConfigNodeUpdateRebooted
being their own individual parent phases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, it is more of an implementation detail! If trying to use both as parents isn't very ergonomic at the daemon level, we can revert to the older path. Thanks for talking it through with me 😄
dd2f54e
to
5779891
Compare
@isabella-janssen: This pull request references MCO-1543 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@isabella-janssen: This pull request references MCO-1543 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
266599b
to
e23cab0
Compare
@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/verify-crd-schema In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: isabella-janssen, JoelSpeed, yuqi-zhang The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest-required |
/label acknowledge-critical-fixes-only This is important work for MCN's 4.19 GA. |
/retest-required |
1 similar comment
/retest-required |
/retest-required |
2 similar comments
/retest-required |
/retest-required |
/retest-required |
/retest-required |
@isabella-janssen: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
@JoelSpeed the |
/override ci/prow/verify-crd-schema based on @JoelSpeed's in #2201 (comment) |
@deads2k: Overrode contexts on behalf of deads2k: ci/prow/verify-crd-schema In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
4521905
into
openshift:master
[ART PR BUILD NOTIFIER] Distgit: ose-cluster-config-api |
This is intended to be the final update to the the v1alpha1 MCN API before the API is graduated to V1.
Notable updates:
Status.PinnedImageSets.LastFailedGenerationErrors
list type field is being removed and a replacementStatus.PinnedImageSets.LastFailedGenerationError
string type field is being added, as a PinnedImageSet error will be a single error string.Status.PinnedImageSets.LastFailedGeneration
field can be found in this Slack thread.MachineConfigNodeNodeDegraded
status was added and will alert the user of a node degraded state.Relevant information: