MCO-1504: Update bootimage management enhancement #1761
Conversation
#### Reactive
1. Have the MCS reject new ignition requests if the aforementioned configmap indicates that the cluster's bootimages are out of date. The MCS could then signal to the cluster admin that scale-up is not available until the configmap has been reconciled.
2. Add a service to be shipped via RHCOS/MCO templates, which will check the incoming OS container image against the currently booted RHCOS version. This runs on firstboot right after the MCD pulls the new image, and will prevent the node from rebasing to the updated image if the drift is too far.
I left this in from #1698, in case I was missing something. How would the daemon know the "acceptable" skew during firstboot? I think we could potentially do this after the pivot and yell at the admin, but IMO the "reject join" approach would probably cover this case and never let the firstboot daemon get to pivot.
I guess we'd have to inject that information into the payload.
Also this would cover cases where the environment doesn't use the MCS
Ahh I might have misunderstood something here then, does the first boot daemon have access to the release payload? I thought all it had was the target MachineConfig when it does the first boot pivot.
Possibly Jerry meant "inject that information into the Ignition config"?
If we're going with the OCP versions to check skew, would this approach be viable? Wouldn't (2) be comparing something that could be checked at the cluster level?
@djoshy and I were discussing this today and we were playing with the idea of also bundling RHCOS versions alongside the OCP version in the skew policy so that the booted node would have enough information to determine whether it's out of compliance (by doing a version comparison). In theory, RHCOS versions can e.g. go backwards (I think it did happen once or twice in the installer where we rolled back a bootimage bump) but we wouldn't have a skew policy so tight for this to really matter.
Replying to my own comment here: note that the bootimage versioning scheme has changed in 4.19+ to be RHEL-based. So e.g. the version here would appear to go from 418.94.z to 9.6.z, which looks older in a naive semver comparison. So this would need to be somewhat smart.
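For illustration only, here is a minimal Go sketch of how such a comparison could be made "smart", assuming a hypothetical helper that treats the legacy RHCOS scheme (e.g. 418.94.z) as always older than the RHEL-based scheme (e.g. 9.6.z):

```go
package skew

import (
	"strconv"
	"strings"
)

// isLegacyRHCOSVersion reports whether a bootimage version string uses the
// pre-4.19 RHCOS scheme (e.g. "418.94.202502100215-0") rather than the
// RHEL-based scheme introduced in 4.19+ (e.g. "9.6.20250210-0").
// This is a hypothetical helper, not part of the MCO today.
func isLegacyRHCOSVersion(v string) bool {
	major, err := strconv.Atoi(strings.SplitN(v, ".", 2)[0])
	// Legacy RHCOS majors encode the OCP version (e.g. 418 for 4.18),
	// so anything that large cannot be a RHEL major.
	return err == nil && major >= 100
}

// olderThan compares two bootimage versions, treating every legacy-format
// version as older than any RHEL-based one, so "418.94.z" is never considered
// newer than "9.6.z" by a naive lexical/semver comparison.
func olderThan(a, b string) bool {
	legacyA, legacyB := isLegacyRHCOSVersion(a), isLegacyRHCOSVersion(b)
	if legacyA != legacyB {
		return legacyA // the legacy-format version is the older one
	}
	// Placeholder: a real implementation would split each version into
	// segments and compare them numerically.
	return a < b
}
```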
I think it's worth adding here what cases (2) covers that (1) doesn't. I.e. Jerry's comment re. environments that don't use the MCS.
Replying to my own comment here: note that the bootimage versioning scheme has changed in 4.19+ to be RHEL-based. So e.g. the version here would appear to go from 418.94.z to 9.6.z which looks older in a naive semver comparison. So this would need to be somewhat smart.
Understood, I was thinking maybe we could optimize this a little bit by starting our skew limits in RHEL format (9.6.z), so anything RHCOS-based could be considered out of skew. Given that it'll be at least a few OCP minors before we get to GA skew enforcement, we should have enough runway, hopefully.
I think it's worth adding here what cases (2) covers that (1) doesn't. I.e. Jerry's comment re. environments that don't use the MCS.
Will do!
2. Add a service to be shipped via RHCOS/MCO templates, which will check the incoming OS container image against the currently booted RHCOS version. This runs on firstboot right after the MCD pulls the new image, and will prevent the node from rebasing to the updated image if the drift is too far.

RHEL major versions will no longer be cross-compatible, i.e. if you wish to have a RHEL10 machineconfigpool, you must use a RHEL10 bootimage.
cc @wking @yuqi-zhang
picking up the conversation from #1698:
From this point on, MCO will target RHEL 10 for new nodes scaling into this MC
I'll let Jerry weigh in here, but my read was that we aren't planning on doing any MCP-specific enforcement. I think Jerry was implying this would result from the aforementioned enforcement methods.
Hmm, so, when initially discussing around RHEL 10, it was around dual-stream, where you'd simultaneously have rhel9 and rhel10 based workers, and each type would have to boot from the same origin major for the bootimage. I think the original intention was to reduce potential 9->10 upgrade issues until RHEL 10 is more stable, but I could be wrong there (cc @sdodson )
When transitioning the cluster base RHCOS nodes from 9->10 then it would be a different problem. I think we'd have to have some cross compatibility there eventually and allow for rhel9 bootimages to work for at least 1 version where the shipped image is RHEL10
// skewEnforcement allows an admin to set behavior of the boot image skew enforcement mechanism.
// Enabled means that the MCO will degrade and prevent upgrades when the boot image skew is too large.
// Disabled means that the MCO will no longer degrade and will permit upgrades when the boot image skew is
// too large. This may also hinder the cluster's scaling ability.
Define too large? What are the potential pitfalls of "too large" of a skew?
By too large, I meant to say that it fails the skew guidance from the release image - I'll clarify the godoc to better describe this. The main pitfall is that scaling would most likely fail, i.e. the pivot to the release OS image isn't possible if your current boot image is below x. If scaling is a non-issue for the cluster in question, they could disable it and the cluster would be able to carry out upgrades again.
// Disabled means that the MCO will no longer degrade and will permit upgrades when the boot image skew is
// too large. This may also hinder the cluster's scaling ability.
// +optional
SkewEnforcement SkewEnforcementSelectorMode `json:"skewEnforcement"`
If you were to make the enum values here represent the actual skew of the images, what might this look like?
Do you mean that we should make the skew configurable? My understanding was that it needed to be something constant for a release (defined in the releaseImage) and it could potentially change between releases, but not something an operator/admin would get to manually set.
I more mean, you have skew enforcement as Enabled or Disabled. What if skew enforcement were more like ReleaseRecommended and Disabled, would that make more sense, and allow for a future expansion where an admin could opt in and say, actually, I want SingleRelease skew, or DualRelease skew? Allowing them to set their own guidelines and override what is recommended by the release image itself
Hmm, what's the use case you're thinking about? I think the most likely scenario is that they want to override the skew check in the release image because they don't care about scaling. I'm not sure if they'd be interested in making the skew check tighter than we require.
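As a purely illustrative sketch of the expanded enum idea (these names are hypothetical, not the draft API, which currently only has Enabled/Disabled):

```go
package skew

// SkewEnforcementSelectorMode selects how boot image skew is enforced.
// The values below only sketch the suggestion above; they are assumptions,
// not what is proposed in the openshift/api draft.
type SkewEnforcementSelectorMode string

const (
	// ReleaseRecommended enforces whatever skew limit the release image declares.
	ReleaseRecommended SkewEnforcementSelectorMode = "ReleaseRecommended"
	// SingleRelease would let an admin tighten the policy to one minor of skew.
	SingleRelease SkewEnforcementSelectorMode = "SingleRelease"
	// DualRelease would allow up to two minors of skew.
	DualRelease SkewEnforcementSelectorMode = "DualRelease"
	// Disabled turns skew enforcement off entirely, at the cost of scaling guarantees.
	Disabled SkewEnforcementSelectorMode = "Disabled"
)
```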
1. Have the MCS reject new ignition requests if the aforementioned configmap indicates that the cluster's bootimages are out of date. The MCS could then signal to the cluster admin that scale-up is not available until the configmap has been reconciled.
2. Add a service to be shipped via RHCOS/MCO templates, which will check the incoming OS container image against the currently booted RHCOS version. This runs on firstboot right after the MCD pulls the new image, and will prevent the node from rebasing to the updated image if the drift is too far.

RHEL major versions will no longer be cross-compatible, i.e. if you wish to have a RHEL10 machineconfigpool, you must use a RHEL10 bootimage.
This feels like a breaking change, why now?
I understand there's lots changing about our boot images, but, is this a one off, or a constant issue going forward?
This specifically is for dual-stream support, where in some version of OCP (likely 4.20?) that we will have a special RHEL-10 pool (design TBD), so your workers in the same OCP version will run different RHEL majors.
We will eventually have to have a RHEL9->10 upgrade path, so dual stream aside generally speaking I think we'd need to have cross compatibility, so we should probably clarify this.
But we would never want a RHEL9->11 upgrade path, I think would be the only breaking case.
Thanks a lot for working on this!
One thing that I think is implied but should probably be spelled out more is how skew comparison actually works. I.e. are we literally parsing RHCOS bootimage version strings and doing comparisons (in that case, recent versioning changes make that trickier)?
Or I think a saner approach is to compare OCP versions instead given that RHCOS bootimage versioning is not super meaningful to the rest of OCP. I.e. the skew policies would reference OCP versions and the coreos-bootimages configmap would reference the OCP version it's for?
@@ -77,7 +85,7 @@ __Overview__
- `ManagedBootImages` feature gate is active
- The cluster and/or the machineset is opted-in to boot image updates. This is done at the operator level, via the `MachineConfiguration` API object.
- The `machineset` does not have a valid owner reference. Having a valid owner reference typically indicates that the `MachineSet` is managed by another workflow, and that updates to it are likely going to cause thrashing.
- The golden configmap is verified to be in sync with the current version of the MCO. The MCO will update("stamp") the golden configmap with version of the new MCO image after atleast 1 master node has succesfully completed an update to the new OCP image. This helps prevent `machinesets` being updated too soon at the end of a cluster upgrade, before the MCO itself has updated and has had a chance to roll out the new OCP image to the cluster.
- The golden configmap is verified to be in sync with the current version of the MCO. The MCO will update("stamp") the golden configmap with version of the new MCO image after at least 1 master node has successfully completed an update to the new OCP image. This helps prevent `machinesets` being updated too soon at the end of a cluster upgrade, before the MCO itself has updated and has had a chance to roll out the new OCP image to the cluster.

If any of the above checks fail, the MSBIC will exit out of the sync.
- Based on platform and architecture type, the MSBIC will check if the boot images referenced in the `providerSpec` field of the `MachineSet` is the same as the one in the ConfigMap. Each platform(gcp, aws...and so on) does this differently, so this part of the implementation will have to be special cased. The ConfigMap is considered to be the golden set of bootimage values, i.e. they will never go out of date. If it is not a match, the `providerSpec` field is cloned and updated with the new boot image reference.
Can't comment lower than this, but: should the MSBIC add an owner reference to itself on the MachineSet after updating it? (And obviously change the precondition checks above to check whether the MachineSet has either no owner, or the MSBIC as owner.)
Otherwise, other controllers might have the same logic and also update without taking ownership and you still get thrashing.
I hesitate to add an owner reference to a MachineSet because the MCO is only really taking ownership of one specific field inside it, the providerSpec. The remaining fields are still managed and updated by the MAPI operator. If someone else is causing thrashing in the providerSpec field, we hope to get it ironed out in the early stages of supporting that platform by socializing this feature and the EP.
cc @JoelSpeed in case you have anything to add here
I don't think adding an owner reference would prevent thrashing. Generally there's nothing that would exist within the cluster that would be updating the boot images, but many users have add-ons that might.
GitOps may be forcing a specific spec, and would revert any change the MSBIC makes.
Hive also forces the MachineSet spec, and would revert any change the MSBIC makes.
In the future, we plan to support MachineDeployments, which would own the MachineSets. In this case we would want the MSBIC to update the MachineDeployment instead of the MachineSet.
We are also working on migrating MAPI to CAPI. In the future, writes to the MAPI resource may be denied, and the writer is then expected to update the CAPI resource instead, we will need to discuss how this is going to work.
Out of interest, what object you would expect the owner reference to point to?
Out of interest, what object you would expect the owner reference to point to?
Not sure... it just feels odd to update an object without clearly claiming ownership of it. But David's point that we're only updating one field of it is a good one. Happy to keep it un-owned if that's the cleanest.
Owner references are typically used for garbage collection, so, if I delete a Deployment, the things that have owner refs pointing to it (ReplicaSets) get deleted, and so on.
I don't think they actually matter for writes at all. People do often use them for that, but I wouldn't expect Hive (for example) to check for an MCO owner ref before stomping the change you just made
The cluster admin may also choose to opt-out of skew management via this configmap, which indicates that they will not require scaling nodes, and thereby opting out of skew enforcement and scaling functionality.

A potential problem here is that the way boot images are stored in the machineset is lossy. In certain platforms, there is no way to recover the boot image metadata from the MachineSet. This is most likely to happen the first time the MCO attempts to do skew enforcement on a cluster that has never had boot image updates. In such cases, the MCO will default to the install time boot image, which can be recovered from the [aleph version](https://github.com/coreos/coreos-assembler/pull/768) of the control plane nodes.
Past the first update, can you clarify how the MSBIC knows which bootimage version is in a MachineSet? Will it add e.g. an annotation on the MachineSet when it patches it?
The way this relates to this line here is that I think rather than using the aleph of the control plane nodes, we could also just make the installer add the necessary annotation when it creates the MachineSet, right?
Clusters born from installers without that patch won't have the annotation which implies it's at least older than the OCP release containing the patch. Cluster born from installers with it will have it available.
Past the first update, can you clarify how the MSBIC knows which bootimage version is in a MachineSet? Will it add e.g. an annotation on the MachineSet when it patches it?
Based on the current discussion from the EP, I think the consensus is leaning towards an API for skew management instead of the configmap, so the "boot image version" of the cluster would have to be a field within there, editable by the user (for the non-managed cases) and the boot image controller. In fact, if we go with the approach of using OCP versions from your comment, we might be able to skip the boot image metadata determination issue altogether? For example:
- When the boot image controller successfully completes a cluster-wide boot image update, store the OCP version of the last update (retrievable from the boot images configmap) in the Skew Enforcement API.
- The skew enforcement mechanism would monitor this value in the API, and compare against the skew limit described by the current release image. If the value in the API is empty, we default to the OCP version at install time (I'm sure this is somewhere in the cluster).
Does that seem plausible, or am I missing something? 🤔
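A minimal sketch of that flow, with the caveat that the struct and field names below are assumptions about the eventual skew enforcement API rather than its final shape:

```go
package skew

import "golang.org/x/mod/semver"

// SkewEnforcementStatus is a stand-in for the eventual skew enforcement API;
// the field name is an assumption for illustration.
type SkewEnforcementStatus struct {
	// LastBootImageOCPVersion is the OCP version recorded by the boot image
	// controller after its last successful cluster-wide boot image update.
	LastBootImageOCPVersion string
}

// effectiveBootImageVersion picks the version to feed into the skew check:
// the last recorded update, or the install-time OCP version if none exists.
func effectiveBootImageVersion(status SkewEnforcementStatus, installVersion string) string {
	if status.LastBootImageOCPVersion != "" {
		return status.LastBootImageOCPVersion
	}
	return installVersion
}

// withinSkewLimit reports whether the boot image's OCP version meets the
// minimum declared by the current release image (x/mod/semver wants a "v" prefix).
func withinSkewLimit(bootImageVersion, minimumVersion string) bool {
	return semver.Compare("v"+bootImageVersion, "v"+minimumVersion) >= 0
}
```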
The skew enforcement mechanism would monitor this value in the API, and compare against the skew limit described by the current release image. If the value in the API is empty, we default to OCP version at install time(I'm sure this is somewhere in the cluster)
When an admin manually has to update the boot image, how do they tell the cluster what version the new boot image is?
Addressed in #1761 (comment), PTAL
### Enforcement of bootimage skew

There should be some mechanism that will alert the user when a cluster's bootimages are out of date. To allow for this, the release payload will gain a new field, which will store the OCP version of the minimum acceptable boot image for that release.
Question: I know this has been discussed before, but it is unclear to me how to add an additional field in the release image. Is there an API? Based on openshift docs, the release image is pulled by the installer from https://github.com/openshift/installer/blob/main/pkg/asset/releaseimage/default.go#L23, which I think is populated from ocp-build-data.
Not sure either where's the cleanest place to put this.
cc @sdodson
I think it may also be worth clarifying here whether:
- the current payload pre-determines upgradeability to the next release, or
- the incoming payload determines if the existing cluster can be upgraded
I think we had discussed 1 being "safer" at the cost of flexibility since we can prepare a user for this change, but based on the description we're aiming for more of 2?
The option (1) is how the upgradeability check currently works in OCP. The respective component is expected to have an Upgradeable=False condition which CVO respects and prevents updates to any higher minor version. The caveat here is that the component cannot consider the exact next version (or even its payload metadata or content), it does not have access to it. So option (2) would be novel. Right now there is no UX for dynamic pre-checks for a specific version (known issues are one, but that serves a different use case).
Right now the only place that could enforce a constraint set by incoming payload is CVO, and it could do it only after the user actually tries to upgrade (poor UX). CVO would download the payload and it would need to be able to evaluate the constraint itself (does not seem bootimage checks should be CVO's domain), or we'd need to invent a new mechanism for delegating this check to components.
I was trying to convey (1) 😅 I'll update the description here to make it more explicit.
My original question here still stands though - is there a well defined API for adding an additional field to the release image? Is there any prior art to this? This would help us figure out the scope of work here.
WDYM by "field"? The payload image has labels, and it has content. The content is a very limited amount of metadata (consumed by CVO), and manifests. Manifests is typically where components like MCO find the data they need.
I think we've yet to determine the exact mechanism, so we're flexible here.
@petr-muller by manifest, do you mean just, say, a CVO-managed configmap that contains a hard coded OCP/RHCOS version that we ship via CVO? Would we be allowed to add something to the Release Metadata instead?
Yeah, CVO-managed configmap, or perhaps an annotation on some CVO-managed object, stuff like that.
Adding something to release metadata is not a big deal (see the struct) in itself but would be quite a novel use case. Something would need to learn to write it (oc adm release new?). How would MCO consume it? Would CVO need to evaluate it?
Would a configmap work @djoshy ? I'm also happy with us keeping it vague for now and re-visiting
Sorry, just catching up on this now. I think a configmap would work! If we could add this as an additional field in the coreos-bootimages configmap, even better! 🤔
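For concreteness, a hedged sketch of how a consumer might read such a field from the `coreos-bootimages` configmap; the key name here is purely an assumption:

```go
package skew

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// minimumBootImageVersion reads a hypothetical key from the coreos-bootimages
// configmap that would carry the minimum acceptable boot image's OCP version.
// The key "minimum-boot-image-ocp-version" is an assumption for illustration.
func minimumBootImageVersion(ctx context.Context, client kubernetes.Interface) (string, error) {
	cm, err := client.CoreV1().ConfigMaps("openshift-machine-config-operator").
		Get(ctx, "coreos-bootimages", metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	v, ok := cm.Data["minimum-boot-image-ocp-version"]
	if !ok {
		return "", fmt.Errorf("coreos-bootimages configmap does not declare a minimum boot image version")
	}
	return v, nil
}
```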
A potential problem here is that the way boot images are stored in the machineset is lossy. In certain platforms, there is no way to recover the boot image metadata from the MachineSet. This is most likely to happen the first time the MCO attempts to do skew enforcement on a cluster that has never had boot image updates. In such cases, the MCO will use the OCP version from install time to determine skew instead.

#### Reactive
1. Have the MCS reject new ignition requests if the aforementioned object indicates that the cluster's bootimages are out of date. The MCS would then signal to the cluster admin that scale-up is not available until the skew has been resolved. Raising the alarm from the MCS at the cluster level will help prevent additional noise for the cluster infra team, and make apparent that the scaling failure was intentional. The MCS will also attempt to serve an Ignition config that writes a message to `/etc/issue` explaining that the bootimage is too old, which will be visible from the node's console.
Should this mention the Ignition User-Agent approach too? It would require hardcoding a table in the MCS of Ignition version to "implied OCP version", but it does provide an additional assurance in case an old bootimage was brought up by accident.
I think that was what we were thinking of initially, but it does come with the issue that in recent versions, there hasn't been that much change to the ignition spec, so would that be sufficient to differentiate exact versions?
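For what it's worth, a rough sketch of the kind of hardcoded table the MCS would need; the mappings are illustrative assumptions only, and as noted above, recent Ignition spec versions span several OCP releases, which limits how precise this can be:

```go
package skew

// ignitionSpecToOCPFloor maps the Ignition spec version reported in a request's
// User-Agent to a rough "newest OCP release whose bootimages might still request
// this spec". All values are hypothetical placeholders, not verified mappings.
var ignitionSpecToOCPFloor = map[string]string{
	"3.2.0": "4.10", // hypothetical
	"3.4.0": "4.16", // hypothetical
	"3.5.0": "4.19", // hypothetical
}
```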
Read through the changes again, and generally lgtm! Very descriptive and clear on the process, great work!
I think we are fairly close to a consensus here. I've updated with the following:
Note: I've also updated the skew enforcement API draft per our discussions. It is still a WIP and needs thorough review, so I will be making a follow-up PR to the EP once we have the final shape merged in openshift/api.
@@ -472,24 +645,112 @@
The goal of this is to provide information about the "lineage" of a machine management resource to the user. The user can then manually restore their machine management resource to an earlier state if they wish to do so by following documentation.

### Implementation Details/Notes/Constraints [optional]
The reconciliation loop below is run on any `MachineSet` that is opted in for updates when any of the following in-cluster resources are added or updated:
1. A `MachineSet`'s providerSpec field. This is where a MachineSet's boot image references reference are stored.
references reference?
1. A `MachineSet`'s providerSpec field. This is where a MachineSet's boot image references reference are stored.
1. A `MachineSet`'s providerSpec field. This is where a MachineSet's boot image references are stored.
woops, thanks!
### Enforcement of bootimage skew

There should be some mechanism that will alert the user when a cluster's bootimages are out of date. To allow for this, the `coreos-bootimages` configmap will gain a new field, which will store the boot image required for upgrading to the next y-stream release.
When in the release cycle do you know what the next boot image will be? Can this be done before the .0 of each release is picked?
Correct me if I'm wrong here @jlebon I think the boot image changes a few times over the development of an OCP release, but we should have a pretty good idea by the release candidates? (and that should let us determine the skew limit before .0 is cut?)
I haven't looked at the full context of this question yet, but for now to answer directly your question: yes, once we get to RCs, bootimages don't churn much. But it's important to understand that bootimages do also get bumped mid-release though they're somewhat infrequent (they only happen when a bugfix needs to land in the bootimage). E.g. for the lifetime of a non-EUS release, it might get updated 5 times let's say.
Sorry, my question wasn't super well worded.
Taking 4.20 as an example, will we be able to get something backported to 4.20.0 so that the skew enforcement for 4.21 is correct in every 4.20 z-stream, or, does the timing of releases mean that we will have to backport into some later 4.20.z, and therefore raise the minimum supported upgrade edge from 4.20 to 4.21 to some non .0 z-stream?
Trying to work out what toil this will create for the folks on the OTA team who will have to manage the upgrade graph
CC @sdodson as this came up in the arch call last week WRT a similar topic in CAPI
From the MCO POV, we're going to try our best to not change the skew limit during a z stream, except for very rare cases where it can't be fixed via the MCO/RHCOS images. Fixing in MCO/RHCOS is what we've historically done, but having a limit helps us reduce the amount of testing required.
To use your example, this means that we'll ensure that boot images satisfying the skew limit defined in 4.20.z will always be able to pivot to 4.21.z's release images without issues.
Ahh gotcha. That does sound like a potentially bad experience though yeah. Because users would initiate their EUS upgrade, and then it would get stuck halfway through on a release that they never intended to run (or possibly is already out of support).
That's actually a good example of what having the policy live in the target payload would fix.
Anyway, doesn't need to block this enhancement. We can rediscuss as we flesh out the skew details.
That's actually a good example of what having the policy live in the target payload would fix.
I don't think that's true. The target payload in EUS would move twice, so would be something like 4.18 -> 4.19 -> 4.20, so you'd still have the issue of ending up on the odd/intermediate release and then getting the upgrade blocked
Hmm, maybe I'm misunderstanding. Upon initiating an EUS upgrade, does the CVO know which z-stream release it'll go to in the target EUS? Or does it first do the odd upgrade, and then "requeries the graph" to know what to target next?
Or does it first do the odd upgrade, and then "requeries the graph" to know what to target next?
This, IIUC
Correct. CVO itself has no concept or knowledge of an EUS update.
- Manually update the boot image and update the skew enforcement object if it is a non machineset backed cluster.
- Opt-out of skew enforcement altogether, giving up scaling ability.

A potential problem here is that the way boot images are stored in the machineset is lossy. In certain platforms, there is no way to recover the boot image metadata from the MachineSet. This is most likely to happen the first time the MCO attempts to do skew enforcement on a cluster that has never had boot image updates. In such cases, the MCO will use the OCP version from install time to determine skew instead.
How does it know what version the cluster was installed at?
I was planning on pulling this from the clusterversion object's status. However I think this history list gets capped out at 50 items, so we'd be limited by that. Could you clarify if this is the case @wking ?
We could also fall back to the control plane node's aleph version as a backup too.
The CVO is supposed to preserve the initial install version in ClusterVersion, see this enhancement for details on how pruning intermediate history entries works. Historically, there was a bug series about pruning the initial install version, see openshift/cincinnati-graph-data#2909 (openshift/cincinnati-graph-data@034fa01) for links to the bugfixes. So we can reliably identify the install version for clusters born in 4.10 or later. And for clusters born in 4.9 or earlier, it depends on how many updates they went through before getting to the bugfix releases.
And for clusters born in 4.9 or earlier, it depends on how many updates they went through before getting to the bugfix releases.
But at worst, they would not report a version later than 4.10, right?
4.10 (RHEL 8.4-based) as a lower bound should be good for our use cases.
We could also fall back to the control plane node's aleph version as a backup too.
Control plane nodes come and go, they may have been created using a newer boot image than the MachineSets, so does the aleph version really show what you need? I don't believe you can confidently say it would match the clusters initial version
That's a fair point, and agreed that it may not be a good source. Based on the responses from this thread, I think we'll be able to leverage clusterversion history here. I'll update the EP to clarify this.
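For reference, a minimal sketch of pulling the install version out of ClusterVersion history, assuming the openshift/client-go typed clientset:

```go
package skew

import (
	"context"
	"fmt"

	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// installVersion returns the OCP version the cluster was originally installed
// at, taken from the oldest entry in ClusterVersion's update history, which the
// CVO preserves even when intermediate entries are pruned.
func installVersion(ctx context.Context, client configclient.Interface) (string, error) {
	cv, err := client.ConfigV1().ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	history := cv.Status.History
	if len(history) == 0 {
		return "", fmt.Errorf("clusterversion has no update history")
	}
	// History is ordered newest-first; the last entry is the install entry.
	return history[len(history)-1].Version, nil
}
```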
A potential problem here is that the way boot images are stored in the machineset is lossy. In certain platforms, there is no way to recover the boot image metadata from the MachineSet. This is most likely to happen the first time the MCO attempts to do skew enforcement on a cluster that has never had boot image updates. In such cases, the MCO will use the OCP version from install time to determine skew instead.

#### Reactive
1. Have the MCS reject new ignition requests if the aforementioned object indicates that the cluster's bootimages are out of date. The MCS would then signal to the cluster admin that scale-up is not available until the skew has been resolved. Raising the alarm from the MCS at the cluster level will help prevent additional noise for the cluster infra team, and make apparent that the scaling failure was intentional. The MCS will also attempt to serve an Ignition config that writes a message to `/etc/issue` explaining that the bootimage is too old, which will be visible from the node's console.
Would this potentially block some scale ups that would otherwise be completely ok?
Yes, it's possible that some % of clusters could be affected depending on how tightly we define the skew limit. While we'll do our best to minimize such cases (with documentation, rolling out opt-out mechanisms etc), in the long term this really helps limit the MCO's testing matrix...which is already getting quite unmanageable, IMO. And of course, potentially avoid yet to be discovered pivot issues.
This LGTM overall from my side, nice work!
As mentioned, there's some fleshing out to do on the skew enforcement side of things, but it doesn't need to block the overall enhancement.
### Enforcement of bootimage skew

There should be some mechanism that will alert the user when a cluster's bootimages are out of date. To allow for this, the `coreos-bootimages` configmap will gain a new field, which will store the boot image required for upgrading to the next y-stream release.
Hmm, is this saying that e.g. the 4.19 coreos-bootimages configmap will describe the minimum bootimage needed for 4.20?
I think it'd be cleaner to have 4.20 describe its minimum bootimage required so it's self-descriptive and not quite coupled to the update graph. But I assume there are issues with that approach?
Hmm, is this saying that e.g. the 4.19 coreos-bootimages configmap will describe the minimum bootimage needed for 4.20?
Correct.
But I assume there are issues with that approach?
Yes, unfortunately. OCP components typically evaluate cluster conditions and prevent upgrades to a future version. If the 4.20 configmap described the skew limit for 4.20 scale up, the CVO would somehow have to evaluate and act on it during the upgrade from 4.19->4.20, which would be rather novel. Some more context in this comment thread.
matchLabels: {}
```

#### Skew Enforcement
If there isn't already an API PR for this, let's get that going
Opened a draft API PR here: openshift/api#2357
Generally speaking, we would like to keep the minimum required bootimage version aligned to the RHEL version we are shipping in the payload and constant for a given release's z stream. For example, a 9.6 bootimage will be allowed until 9.8 is shipped via RHCOS. We would like to keep this customizable, such that any major breaking changes outside of RHEL major/minor can still be enforced as a one-off.

#### Enforcement options
Do we need to make a choice between the options, or are they all going to be implemented?
We're aiming for the methods described here as MVP, with the MCS report-back as a stretch goal (so I chose not to mention that in the EP just yet). Based on what comes out of the skew API & enforcement implementation, I plan to update this section in a follow-up PR.
Ack, and it's all going to be feature gated for now?
Yep, all of it will be behind feature gates, with skew enforcement starting in DevPreview
lgtm from the MCO side, thanks David!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jlebon, yuqi-zhang.
/lgtm /hold Anyone else who needs to ack this update?
/hold cancel Spoke with David and I think we're good to go for now. Will follow up if there's other discussion needed.
@djoshy: all tests passed!
This is a follow-up update for #1496 and proposes a strategy for implementing an opt-out and skew enforcement mechanism for boot image updates. A lot of this is based on #1698 by @yuqi-zhang - Thanks, Jerry!
All comments and questions welcome. I have a few open questions for which I'll be leaving comments below.
cc @jlebon @wking
And sorta unrelated: I've also moved some of the older flowcharts to mermaid diagrams as they are more maintainable.