MCO-1504: Update bootimage management enhancement #1761
Conversation
#### Reactive
1. Have the MCS reject new ignition requests if the aforementioned configmap indicates that the cluster's bootimages are out of date. The MCS could then signal to the cluster admin that scale-up is not available until the configmap has been reconciled.
2. Add a service to be shipped via RHCOS/MCO templates, which will check the incoming OS container image against the currently booted RHCOS version. This runs on firstboot right after the MCD pulls the new image, and will prevent the node from rebasing to the updated image if the drift is too far.
I left this in from #1698, in case I was missing something. How would the daemon know the "acceptable" skew during firstboot? I think we could potentially do this after the pivot and yell at the admin, but IMO the "reject join" approach would probably cover this case and never let the firstboot daemon get to pivot.
I guess we'd have to inject that information into the payload.
Also this would cover cases where the environment doesn't use the MCS
Ahh I might have misunderstood something here then, does the first boot daemon have access to the release payload? I thought all it had was the target MachineConfig when it does the first boot pivot.
Possibly Jerry meant "inject that information into the Ignition config"?
If we're going with the OCP versions to check skew, would this approach be viable? Wouldn't (2) be comparing something that could be checked at the cluster level?
@djoshy and I were discussing this today and we were playing with the idea of also bundling RHCOS versions alongside the OCP version in the skew policy so that the booted node would have enough information to determine whether it's out of compliance (by doing a version comparison). In theory, RHCOS versions can e.g. go backwards (I think it did happen once or twice in the installer where we rolled back a bootimage bump) but we wouldn't have a skew policy so tight for this to really matter.
Replying to my own comment here: note that the bootimage versioning scheme has changed in 4.19+ to be RHEL-based. So e.g. the version here would appear to go from 418.94.z to 9.6.z, which looks older in a naive semver comparison. So this would need to be somewhat smart.
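For illustration only, here is a minimal Go sketch of how such a comparison could be made "smart", assuming a hypothetical helper that treats the legacy RHCOS scheme (e.g. 418.94.z) as always older than the RHEL-based scheme (e.g. 9.6.z):

```go
package skew

import (
	"strconv"
	"strings"
)

// isLegacyRHCOSVersion reports whether a bootimage version string uses the
// pre-4.19 RHCOS scheme (e.g. "418.94.202502100215-0") rather than the
// RHEL-based scheme introduced in 4.19+ (e.g. "9.6.20250210-0").
// This is a hypothetical helper, not part of the MCO today.
func isLegacyRHCOSVersion(v string) bool {
	major, err := strconv.Atoi(strings.SplitN(v, ".", 2)[0])
	// Legacy RHCOS majors encode the OCP version (e.g. 418 for 4.18),
	// so anything that large cannot be a RHEL major.
	return err == nil && major >= 100
}

// olderThan compares two bootimage versions, treating every legacy-format
// version as older than any RHEL-based one, so "418.94.z" is never considered
// newer than "9.6.z" by a naive lexical/semver comparison.
func olderThan(a, b string) bool {
	legacyA, legacyB := isLegacyRHCOSVersion(a), isLegacyRHCOSVersion(b)
	if legacyA != legacyB {
		return legacyA // the legacy-format version is the older one
	}
	// Placeholder: a real implementation would split each version into
	// segments and compare them numerically.
	return a < b
}
```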
I think it's worth adding here what cases (2) covers that (1) doesn't. I.e. Jerry's comment re. environments that don't use the MCS.
Replying to my own comment here: note that the bootimage versioning scheme has changed in 4.19+ to be RHEL-based. So e.g. the version here would appear to go from 418.94.z to 9.6.z which looks older in a naive semver comparison. So this would need to be somewhat smart.
Understood, I was thinking maybe we could optimize this a little bit by starting our skew limits in RHEL format (9.6.z), so anything RHCOS-based could be considered out of skew. Given that it'll be at least a few OCP minors before we get to GA skew enforcement, we should have enough runway, hopefully.
I think it's worth adding here what cases (2) covers that (1) doesn't. I.e. Jerry's comment re. environments that don't use the MCS.
Will do!
2. Add a service to be shipped via RHCOS/MCO templates, which will check the incoming OS container image against the currently booted RHCOS version. This runs on firstboot right after the MCD pulls the new image, and will prevent the node from rebasing to the updated image if the drift is too far.

RHEL major versions will no longer be cross-compatible, i.e. if you wish to have a RHEL10 machineconfigpool, you must use a RHEL10 bootimage.
cc @wking @yuqi-zhang
picking up the conversation from #1698:
From this point on, MCO will target RHEL 10 for new nodes scaling into this MC
I'll let Jerry weigh in here, but my read was that we aren't planning on doing any MCP-specific enforcement. I think Jerry was implying this would result from the aforementioned enforcement methods.
Hmm, so, when initially discussing around RHEL 10, it was around dual-stream, where you'd simultaneously have rhel9 and rhel10 based workers, and each type would have to boot from the same origin major for the bootimage. I think the original intention was to reduce potential 9->10 upgrade issues until RHEL 10 is more stable, but I could be wrong there (cc @sdodson )
When transitioning the cluster base RHCOS nodes from 9->10 then it would be a different problem. I think we'd have to have some cross compatibility there eventually and allow for rhel9 bootimages to work for at least 1 version where the shipped image is RHEL10
// skewEnforcement allows an admin to set behavior of the boot image skew enforcement mechanism.
// Enabled means that the MCO will degrade and prevent upgrades when the boot image skew is too large.
// Disabled means that the MCO will no longer degrade and will permit upgrades when the boot image skew is
// too large. This may also hinder the cluster's scaling ability.
Define too large? What are the potential pitfalls of "too large" of a skew?
By too large, I meant to say that it fails the skew guidance from the release image - I'll clarify the godoc to better describe this. The main pitfall is that scaling would most likely fail, i.e. the pivot to the release OS image isn't possible if your current boot image is below x. If scaling is a non-issue for the cluster in question, they could disable it and the cluster would be able to carry out upgrades again.
// Disabled means that the MCO will no longer degrade and will permit upgrades when the boot image skew is
// too large. This may also hinder the cluster's scaling ability.
// +optional
SkewEnforcement SkewEnforcementSelectorMode `json:"skewEnforcement"`
If you were to make the enum values here represent the actual skew of the images, what might this look like?
Do you mean that we should make the skew configurable? My understanding was that it needed to be something constant for a release (defined in the releaseImage) and it could potentially change between releases, but not something an operator/admin would get to manually set.
I more mean, you have skew enforcement as Enabled or Disabled. What if skew enforcement were more like ReleaseRecommended and Disabled, would that make more sense, and allow for a future expansion where an admin could opt in and say, actually, I want SingleRelease skew, or DualRelease skew? Allowing them to set their own guidelines and override what is recommended by the release image itself
Hmm, what's the use case you're thinking about? I think the most likely scenario is that they want to override the skew check in the release image because they don't care about scaling. I'm not sure if they'd be interested in making the skew check tighter than we require.
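As a purely illustrative sketch of the expanded enum idea (these names are hypothetical, not the draft API, which currently only has Enabled/Disabled):

```go
package skew

// SkewEnforcementSelectorMode selects how boot image skew is enforced.
// The values below only sketch the suggestion above; they are assumptions,
// not what is proposed in the openshift/api draft.
type SkewEnforcementSelectorMode string

const (
	// ReleaseRecommended enforces whatever skew limit the release image declares.
	ReleaseRecommended SkewEnforcementSelectorMode = "ReleaseRecommended"
	// SingleRelease would let an admin tighten the policy to one minor of skew.
	SingleRelease SkewEnforcementSelectorMode = "SingleRelease"
	// DualRelease would allow up to two minors of skew.
	DualRelease SkewEnforcementSelectorMode = "DualRelease"
	// Disabled turns skew enforcement off entirely, at the cost of scaling guarantees.
	Disabled SkewEnforcementSelectorMode = "Disabled"
)
```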
1. Have the MCS reject new ignition requests if the aforementioned configmap indicates that the cluster's bootimages are out of date. The MCS could then signal to the cluster admin that scale-up is not available until the configmap has been reconciled.
2. Add a service to be shipped via RHCOS/MCO templates, which will check the incoming OS container image against the currently booted RHCOS version. This runs on firstboot right after the MCD pulls the new image, and will prevent the node from rebasing to the updated image if the drift is too far.

RHEL major versions will no longer be cross-compatible, i.e. if you wish to have a RHEL10 machineconfigpool, you must use a RHEL10 bootimage.
This feels like a breaking change, why now?
I understand there's lots changing about our boot images, but, is this a one off, or a constant issue going forward?
This specifically is for dual-stream support, where in some version of OCP (likely 4.20?) that we will have a special RHEL-10 pool (design TBD), so your workers in the same OCP version will run different RHEL majors.
We will eventually have to have a RHEL9->10 upgrade path, so dual stream aside generally speaking I think we'd need to have cross compatibility, so we should probably clarify this.
But we would never want a RHEL9->11 upgrade path, I think would be the only breaking case.
Thanks a lot for working on this!
One thing that I think is implied but should probably be spelled out more is how skew comparison actually works. I.e. are we literally parsing RHCOS bootimage version strings and doing comparisons (in that case, recent versioning changes make that trickier)?
Or I think a saner approach is to compare OCP versions instead given that RHCOS bootimage versioning is not super meaningful to the rest of OCP. I.e. the skew policies would reference OCP versions and the coreos-bootimages configmap would reference the OCP version it's for?
@@ -77,7 +85,7 @@ __Overview__
- `ManagedBootImages` feature gate is active
- The cluster and/or the machineset is opted-in to boot image updates. This is done at the operator level, via the `MachineConfiguration` API object.
- The `machineset` does not have a valid owner reference. Having a valid owner reference typically indicates that the `MachineSet` is managed by another workflow, and that updates to it are likely going to cause thrashing.
- The golden configmap is verified to be in sync with the current version of the MCO. The MCO will update("stamp") the golden configmap with version of the new MCO image after atleast 1 master node has succesfully completed an update to the new OCP image. This helps prevent `machinesets` being updated too soon at the end of a cluster upgrade, before the MCO itself has updated and has had a chance to roll out the new OCP image to the cluster.
- The golden configmap is verified to be in sync with the current version of the MCO. The MCO will update("stamp") the golden configmap with version of the new MCO image after at least 1 master node has successfully completed an update to the new OCP image. This helps prevent `machinesets` being updated too soon at the end of a cluster upgrade, before the MCO itself has updated and has had a chance to roll out the new OCP image to the cluster.

If any of the above checks fail, the MSBIC will exit out of the sync.
- Based on platform and architecture type, the MSBIC will check if the boot images referenced in the `providerSpec` field of the `MachineSet` is the same as the one in the ConfigMap. Each platform(gcp, aws...and so on) does this differently, so this part of the implementation will have to be special cased. The ConfigMap is considered to be the golden set of bootimage values, i.e. they will never go out of date. If it is not a match, the `providerSpec` field is cloned and updated with the new boot image reference.
Can't comment lower than this, but: should the MSBIC add an owner reference to itself on the MachineSet after updating it? (And obviously change the precondition checks above to check whether the MachineSet has either no owner, or the MSBIC as owner.)
Otherwise, other controllers might have the same logic and also update without taking ownership and you still get thrashing.
I hesitate to add an owner reference to a MachineSet because the MCO is only really taking ownership of one specific field inside it, the providerSpec. The remaining fields are still managed and updated by the MAPI operator. If someone else is causing thrashing in the providerSpec field, we hope to get it ironed out in the early stages of supporting that platform by socializing this feature and the EP.
cc @JoelSpeed in case you have anything to add here
I don't think adding an owner reference would prevent thrashing. Generally there's nothing that would exist within the cluster that would be updating the boot images, but many users have add-ons that might.
GitOps may be forcing a specific spec, and would revert any change the MSBIC makes.
Hive also forces the MachineSet spec, and would revert any change the MSBIC makes.
In the future, we plan to support MachineDeployments, which would own the MachineSets. In this case we would want the MSBIC to update the MachineDeployment instead of the MachineSet.
We are also working on migrating MAPI to CAPI. In the future, writes to the MAPI resource may be denied, and the writer is then expected to update the CAPI resource instead, we will need to discuss how this is going to work.
Out of interest, what object you would expect the owner reference to point to?
Out of interest, what object you would expect the owner reference to point to?
Not sure... it just feels odd to update an object without clearly claiming ownership of it. But David's point that we're only updating one field of it is a good one. Happy to keep it un-owned if that's the cleanest.
Owner references are typically used for garbage collection, so, if I delete a Deployment, the things that have owner refs pointing to it (ReplicaSets) get deleted, and so on.
I don't think they actually matter for writes at all. People do often use them for that, but I wouldn't expect Hive (for example) to check for an MCO owner ref before stomping the change you just made
The cluster admin may also choose to opt-out of skew management via this configmap, which indicates that they will not require scaling nodes, and thereby opting out of skew enforcement and scaling functionality.

A potential problem here is that the way boot images are stored in the machineset is lossy. In certain platforms, there is no way to recover the boot image metadata from the MachineSet. This is most likely to happen the first time the MCO attempts to do skew enforcement on a cluster that has never had boot image updates. In such cases, the MCO will default to the install time boot image, which can be recovered from the [aleph version](https://github.com/coreos/coreos-assembler/pull/768) of the control plane nodes.
Past the first update, can you clarify how the MSBIC knows which bootimage version is in a MachineSet? Will it add e.g. an annotation on the MachineSet when it patches it?
The way this relates to this line here is that I think rather than using the aleph of the control plane nodes, we could also just make the installer add the necessary annotation when it creates the MachineSet, right?
Clusters born from installers without that patch won't have the annotation which implies it's at least older than the OCP release containing the patch. Cluster born from installers with it will have it available.
Past the first update, can you clarify how the MSBIC knows which bootimage version is in a MachineSet? Will it add e.g. an annotation on the MachineSet when it patches it?
Based on the current discussion from the EP, I think the consensus is leaning towards an API for skew management instead of the configmap, so the "boot image version" of the cluster would have to be a field within there, editable by the user (for the non-managed cases) and the boot image controller. In fact, if we go with the approach of using OCP versions from your comment, we might be able to skip the boot image metadata determination issue altogether? For example:
- When the boot image controller successfully completes a cluster-wide boot image update, store the OCP version of the last update (retrievable from the boot images configmap) in the Skew Enforcement API.
- The skew enforcement mechanism would monitor this value in the API, and compare against the skew limit described by the current release image. If the value in the API is empty, we default to the OCP version at install time (I'm sure this is somewhere in the cluster).
Does that seem plausible, or am I missing something? 🤔
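A minimal sketch of that flow, with the caveat that the struct and field names below are assumptions about the eventual skew enforcement API rather than its final shape:

```go
package skew

import "golang.org/x/mod/semver"

// SkewEnforcementStatus is a stand-in for the eventual skew enforcement API;
// the field name is an assumption for illustration.
type SkewEnforcementStatus struct {
	// LastBootImageOCPVersion is the OCP version recorded by the boot image
	// controller after its last successful cluster-wide boot image update.
	LastBootImageOCPVersion string
}

// effectiveBootImageVersion picks the version to feed into the skew check:
// the last recorded update, or the install-time OCP version if none exists.
func effectiveBootImageVersion(status SkewEnforcementStatus, installVersion string) string {
	if status.LastBootImageOCPVersion != "" {
		return status.LastBootImageOCPVersion
	}
	return installVersion
}

// withinSkewLimit reports whether the boot image's OCP version meets the
// minimum declared by the current release image (x/mod/semver wants a "v" prefix).
func withinSkewLimit(bootImageVersion, minimumVersion string) bool {
	return semver.Compare("v"+bootImageVersion, "v"+minimumVersion) >= 0
}
```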
The skew enforcement mechanism would monitor this value in the API, and compare against the skew limit described by the current release image. If the value in the API is empty, we default to OCP version at install time(I'm sure this is somewhere in the cluster)
When an admin manually has to update the boot image, how do they tell the cluster what version the new boot image is?
Addressed in #1761 (comment), PTAL
### Enforcement of bootimage skew

There should be some mechanism that will alert the user when a cluster's bootimages are out of date. To allow for this, the release payload will gain a new field, which will store the OCP version of the minimum acceptable boot image for that release.
Question: I know this has been discussed before, but it is unclear to me how to add an additional field in the release image. Is there an API? Based on openshift docs, the release image is pulled by the installer from https://github.com/openshift/installer/blob/main/pkg/asset/releaseimage/default.go#L23, which I think is populated from ocp-build-data.
Not sure either where's the cleanest place to put this.
cc @sdodson
I think it may also be worth clarifying here whether:
- the current payload pre-determines upgradeability to the next release, or
- the incoming payload determines if the existing cluster can be upgraded
I think we had discussed 1 being "safer" at the cost of flexibility since we can prepare a user for this change, but based on the description we're aiming for more of 2?
The option (1) is how the upgradeability check currently works in OCP. The respective component is expected to have an Upgradeable=False condition which CVO respects and prevents updates to any higher minor version. The caveat here is that the component cannot consider the exact next version (or even its payload metadata or content), it does not have access to it. So option (2) would be novel. Right now there is no UX for dynamic pre-checks for a specific version (known issues are one, but that serves a different use case).
Right now the only place that could enforce a constraint set by incoming payload is CVO, and it could do it only after the user actually tries to upgrade (poor UX). CVO would download the payload and it would need to be able to evaluate the constraint itself (does not seem bootimage checks should be CVO's domain), or we'd need to invent a new mechanism for delegating this check to components.
I was trying to convey (1) 😅 I'll update the description here to make it more explicit.
My original question here still stands though - is there a well defined API for adding an additional field to the release image? Is there any prior art to this? This would help us figure out the scope of work here.
WDYM by "field"? The payload image has labels, and it has content. The content is a very limited amount of metadata (consumed by CVO), and manifests. Manifests is typically where components like MCO find the data they need.
I think we've yet to determine the exact mechanism, so we're flexible here.
@petr-muller by manifest, do you mean just, say, a CVO-managed configmap that contains a hard coded OCP/RHCOS version that we ship via CVO? Would we be allowed to add something to the Release Metadata instead?
Yeah, CVO-managed configmap, or perhaps an annotation on some CVO-managed object, stuff like that.
Adding something to release metadata is not a big deal (see the struct) in itself but would be quite a novel use case. Something would need to learn to write it (oc adm release new?). How would MCO consume it? Would CVO need to evaluate it?
Would a configmap work @djoshy ? I'm also happy with us keeping it vague for now and re-visiting
Sorry, just catching up on this now. I think a configmap would work! If we could add this as an additional field in the coreos-bootimages configmap, even better! 🤔
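For concreteness, a hedged sketch of how a consumer might read such a field from the `coreos-bootimages` configmap; the key name here is purely an assumption:

```go
package skew

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// minimumBootImageVersion reads a hypothetical key from the coreos-bootimages
// configmap that would carry the minimum acceptable boot image's OCP version.
// The key "minimum-boot-image-ocp-version" is an assumption for illustration.
func minimumBootImageVersion(ctx context.Context, client kubernetes.Interface) (string, error) {
	cm, err := client.CoreV1().ConfigMaps("openshift-machine-config-operator").
		Get(ctx, "coreos-bootimages", metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	v, ok := cm.Data["minimum-boot-image-ocp-version"]
	if !ok {
		return "", fmt.Errorf("coreos-bootimages configmap does not declare a minimum boot image version")
	}
	return v, nil
}
```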
A potential problem here is that the way boot images are stored in the machineset is lossy. In certain platforms, there is no way to recover the boot image metadata from the MachineSet. This is most likely to happen the first time the MCO attempts to do skew enforcement on a cluster that has never had boot image updates. In such cases, the MCO will use the OCP version from install time to determine skew instead.

#### Reactive
1. Have the MCS reject new ignition requests if the aforementioned object indicates that the cluster's bootimages are out of date. The MCS would then signal to the cluster admin that scale-up is not available until the skew has been resolved. Raising the alarm from the MCS at the cluster level will help prevent additional noise for the cluster infra team, and make apparent that the scaling failure was intentional. The MCS will also attempt to serve an Ignition config that writes a message to `/etc/issue` explaining that the bootimage is too old, which will be visible from the node's console.
Should this mention the Ignition User-Agent approach too? It would require hardcoding a table in the MCS of Ignition version to "implied OCP version", but it does provide an additional assurance in case an old bootimage was brought up by accident.
I think that was what we were thinking of initially, but it does come with the issue that in recent versions, there hasn't been that much change to the ignition spec, so would that be sufficient to differentiate exact versions?
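For what it's worth, a rough sketch of the kind of hardcoded table the MCS would need; the mappings are illustrative assumptions only, and as noted above, recent Ignition spec versions span several OCP releases, which limits how precise this can be:

```go
package skew

// ignitionSpecToOCPFloor maps the Ignition spec version reported in a request's
// User-Agent to a rough "newest OCP release whose bootimages might still request
// this spec". All values are hypothetical placeholders, not verified mappings.
var ignitionSpecToOCPFloor = map[string]string{
	"3.2.0": "4.10", // hypothetical
	"3.4.0": "4.16", // hypothetical
	"3.5.0": "4.19", // hypothetical
}
```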
Read through the changes again, and generally lgtm! Very descriptive and clear on the process, great work!
I think we are fairly close to a consensus here. I've updated with the following:
Note: I've also updated the skew enforcement API draft per our discussions. It is still a WIP and needs thorough review, so I will be making a follow-up PR to the EP once we have the final shape merged in openshift/api.
@@ -472,24 +645,112 @@
The goal of this is to provide information about the "lineage" of a machine management resource to the user. The user can then manually restore their machine management resource to an earlier state if they wish to do so by following documentation.

### Implementation Details/Notes/Constraints [optional]
The reconciliation loop below is run on any `MachineSet` that is opted in for updates when any of the following in-cluster resources are added or updated:
1. A `MachineSet`'s providerSpec field. This is where a MachineSet's boot image references reference are stored.
references reference?
1. A `MachineSet`'s providerSpec field. This is where a MachineSet's boot image references reference are stored.
1. A `MachineSet`'s providerSpec field. This is where a MachineSet's boot image references are stored.
woops, thanks!
### Enforcement of bootimage skew

There should be some mechanism that will alert the user when a cluster's bootimages are out of date. To allow for this, the `coreos-bootimages` configmap will gain a new field, which will store the boot image required for upgrading to the next y-stream release.
When in the release cycle do you know what the next boot image will be? Can this be done before the .0 of each release is picked?
Correct me if I'm wrong here @jlebon I think the boot image changes a few times over the development of an OCP release, but we should have a pretty good idea by the release candidates? (and that should let us determine the skew limit before .0 is cut?)
I haven't looked at the full context of this question yet, but for now to answer directly your question: yes, once we get to RCs, bootimages don't churn much. But it's important to understand that bootimages do also get bumped mid-release though they're somewhat infrequent (they only happen when a bugfix needs to land in the bootimage). E.g. for the lifetime of a non-EUS release, it might get updated 5 times let's say.
Sorry, my question wasn't super well worded.
Taking 4.20 as an example, will we be able to get something backported to 4.20.0 so that the skew enforcement for 4.21 is correct in every 4.20 z-stream, or, does the timing of releases mean that we will have to backport into some later 4.20.z, and therefore raise the minimum supported upgrade edge from 4.20 to 4.21 to some non .0 z-stream?
Trying to work out what toil this will create for the folks on the OTA team who will have to manage the upgrade graph
CC @sdodson as this came up in the arch call last week WRT a similar topic in CAPI
From the MCO POV, we're going to try our best to not change the skew limit during a z stream, except for very rare cases where it can't be fixed via the MCO/RHCOS images. Fixing in MCO/RHCOS is what we've historically done, but having a limit helps us reduce the amount of testing required.
To use your example, this means that we'll ensure that boot images satisfying the skew limit defined in 4.20.z will always be able to pivot to 4.21.z's release images without issues.
Ahh gotcha. That does sound like a potentially bad experience though yeah. Because users would initiate their EUS upgrade, and then it would get stuck halfway through on a release that they never intended to run (or possibly is already out of support).
That's actually a good example of what having the policy live in the target payload would fix.
Anyway, doesn't need to block this enhancement. We can rediscuss as we flesh out the skew details.
That's actually a good example of what having the policy live in the target payload would fix.
I don't think that's true. The target payload in EUS would move twice, so would be something like 4.18 -> 4.19 -> 4.20, so you'd still have the issue of ending up on the odd/intermediate release and then getting the upgrade blocked
Hmm, maybe I'm misunderstanding. Upon initiating an EUS upgrade, does the CVO know which z-stream release it'll go to in the target EUS? Or does it first do the odd upgrade, and then "requeries the graph" to know what to target next?
Or does it first do the odd upgrade, and then "requeries the graph" to know what to target next?
This, IIUC
Correct. CVO itself has no concept or knowledge of an EUS update.
- Manually update the boot image and update the skew enforcement object if it is a non machineset backed cluster.
- Opt-out of skew enforcement altogether, giving up scaling ability.

A potential problem here is that the way boot images are stored in the machineset is lossy. In certain platforms, there is no way to recover the boot image metadata from the MachineSet. This is most likely to happen the first time the MCO attempts to do skew enforcement on a cluster that has never had boot image updates. In such cases, the MCO will use the OCP version from install time to determine skew instead.
How does it know what version the cluster was installed at?
I was planning on pulling this from the clusterversion object's status. However I think this history list gets capped out at 50 items, so we'd be limited by that. Could you clarify if this is the case @wking ?
We could also fall back to the control plane node's aleph version as a backup too.
The CVO is supposed to preserve the initial install version in ClusterVersion, see this enhancement for details on how pruning intermediate history entries works. Historically, there was a bug series about pruning the initial install version, see openshift/cincinnati-graph-data#2909 (openshift/cincinnati-graph-data@034fa01) for links to the bugfixes. So we can reliably identify the install version for clusters born in 4.10 or later. And for clusters born in 4.9 or earlier, it depends on how many updates they went through before getting to the bugfix releases.
And for clusters born in 4.9 or earlier, it depends on how many updates they went through before getting to the bugfix releases.
But at worst, they would not report a version later than 4.10, right?
4.10 (RHEL 8.4-based) as a lower bound should be good for our use cases.
We could also fall back to the control plane node's aleph version as a backup too.
Control plane nodes come and go, they may have been created using a newer boot image than the MachineSets, so does the aleph version really show what you need? I don't believe you can confidently say it would match the clusters initial version
That's a fair point, and agreed that it may not be a good source. Based on the responses from this thread, I think we'll be able to leverage clusterversion history here. I'll update the EP to clarify this.
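For reference, a minimal sketch of pulling the install version out of ClusterVersion history, assuming the openshift/client-go typed clientset:

```go
package skew

import (
	"context"
	"fmt"

	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// installVersion returns the OCP version the cluster was originally installed
// at, taken from the oldest entry in ClusterVersion's update history, which the
// CVO preserves even when intermediate entries are pruned.
func installVersion(ctx context.Context, client configclient.Interface) (string, error) {
	cv, err := client.ConfigV1().ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	history := cv.Status.History
	if len(history) == 0 {
		return "", fmt.Errorf("clusterversion has no update history")
	}
	// History is ordered newest-first; the last entry is the install entry.
	return history[len(history)-1].Version, nil
}
```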
A potential problem here is that the way boot images are stored in the machineset is lossy. In certain platforms, there is no way to recover the boot image metadata from the MachineSet. This is most likely to happen the first time the MCO attempts to do skew enforcement on a cluster that has never had boot image updates. In such cases, the MCO will use the OCP version from install time to determine skew instead.

#### Reactive
1. Have the MCS reject new ignition requests if the aforementioned object indicates that the cluster's bootimages are out of date. The MCS would then signal to the cluster admin that scale-up is not available until the skew has been resolved. Raising the alarm from the MCS at the cluster level will help prevent additional noise for the cluster infra team, and make apparent that the scaling failure was intentional. The MCS will also attempt to serve an Ignition config that writes a message to `/etc/issue` explaining that the bootimage is too old, which will be visible from the node's console.
Would this potentially block some scale ups that would otherwise be completely ok?
Yes, it's possible that some % of clusters could be affected depending on how tightly we define the skew limit. While we'll do our best to minimize such cases (with documentation, rolling out opt-out mechanisms etc), in the long term this really helps limit the MCO's testing matrix...which is already getting quite unmanageable, IMO. And of course, potentially avoid yet to be discovered pivot issues.
This LGTM overall from my side, nice work!
As mentioned, there's some fleshing out to do on the skew enforcement side of things, but it doesn't need to block the overall enhancement.
### Enforcement of bootimage skew

There should be some mechanism that will alert the user when a cluster's bootimages are out of date. To allow for this, the `coreos-bootimages` configmap will gain a new field, which will store the boot image required for upgrading to the next y-stream release.
Hmm, is this saying that e.g. the 4.19 coreos-bootimages configmap will describe the minimum bootimage needed for 4.20?
I think it'd be cleaner to have 4.20 describe its minimum bootimage required so it's self-descriptive and not quite coupled to the update graph. But I assume there are issues with that approach?
Hmm, is this saying that e.g. the 4.19 coreos-bootimages configmap will describe the minimum bootimage needed for 4.20?
Correct.
But I assume there are issues with that approach?
Yes, unfortunately. OCP components typically evaluate cluster conditions and prevent upgrades to a future version. If the 4.20 configmap described the skew limit for 4.20 scale up, the CVO would somehow have to evaluate and act on it during the upgrade from 4.19->4.20, which would be rather novel. Some more context in this comment thread.
matchLabels: {}
```

#### Skew Enforcement
If there isn't already an API PR for this, let's get that going
Opened a draft API PR here: openshift/api#2357
Generally speaking, we would like to keep the minimum required bootimage version aligned to the RHEL version we are shipping in the payload and constant for a given release's z stream. For example, a 9.6 bootimage will be allowed until 9.8 is shipped via RHCOS. We would like to keep this customizable, such that any major breaking changes outside of RHEL major/minor can still be enforced as a one-off.

#### Enforcement options
Do we need to make a choice between the options, or are they all going to be implemented?
We're aiming for the methods described here as MVP, with the MCS report-back as a stretch goal (so I chose not to mention that in the EP just yet). Based on what comes out of the skew API & enforcement implementation, I plan to update this section in a follow-up PR.
Ack, and it's all going to be feature gated for now?
Yep, all of it will be behind feature gates, with skew enforcement starting in DevPreview
lgtm from the MCO side, thanks David!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jlebon, yuqi-zhang.
/lgtm /hold Anyone else who needs to ack this update?
/hold cancel Spoke with David and I think we're good to go for now. Will follow up if there's other discussion needed.
@djoshy: all tests passed!
This is a follow-up update for #1496 and proposes a strategy for implementing an opt-out and skew enforcement mechanism for boot image updates. A lot of this is based on #1698 by @yuqi-zhang - Thanks, Jerry!
All comments and questions welcome. I have a few open questions for which I'll be leaving comments below.
cc @jlebon @wking
And sorta unrelated: I've also moved some of the older flowcharts to mermaid diagrams as they are more maintainable.