docs(hep): harvester upgrade v2 #9161
base: master
Conversation
Signed-off-by: Zespre Chang <[email protected]>
Force-pushed: d736a7f to 734d03b, 8232e46 to 9437748, 9437748 to 55e6217
Commit: add upgrade & version CRDs; add test plan
Hey folks,
w13915984028 left a comment:
Thanks for the big and well-written HEP.
A few concerns for open discussion:
(1) The alternative solution for the upgrade-repo.
(2) The decoupling from Rancher; some hidden dependencies might only be observed at the development/testing stage.
(3) Given Harvester's 4-month release plan, as stated in the HEP, the stage-by-stage feature delivery needs more thought.
> #### Stage 1 - Transition Period (version N-1 to N upgrades, where N is the debut minor version of Harvester Upgrade V2)
>
> The inner workings of Upgrade Manager in this stage will be essentially the same as before; the significant difference is that it is built and packaged separately from the primary Harvester artifact. It will also have its own Deployment in contrast to the main Harvester controller manager Deployment. The main concerns will be:
It seems to introduce another requirement: an upgrade matrix between Harvester/Upgrade Shim and Upgrade Manager, to ensure their versions match.
I assume it should just follow the guideline described in https://github.com/harvester/docs/pull/898/files#diff-9b99ca129b131cc50e1c584be79715640745a08c667b20f0fa714558b2d3077aR52-R53. i.e.:
- Allowing upgrade from one minor version to the next
- Allowing upgrade to a later patch version
The overall upgrade policy for Harvester remains intact. What @w13915984028 raised is valid: since the old upgrade controller is being decoupled into two parts, it inevitably introduces a new compatibility requirement between the Upgrade Shim and Manager. The initial plan is to have it the same old way, i.e., Harvester 1.7.0 works with Upgrade Manager v1.7.0. We can bump the version by adding suffixes to the Upgrade Manager version, e.g. v1.7.0-r1, just in case we need to fix anything after GA. The new UpgradePlan API allows the user to specify a different Upgrade Manager image.
For the upgrade policy to remain intact, won't it mean the shim and the manager versions need to follow the same requirements? E.g., if upgrading to Harvester 1.6.1, shim v1.6.0 should work with manager v1.6.1, and shim v1.5.2 should work with manager v1.6.1? Granted, the logic of the shim and manager shouldn't change often between releases. And if possible, we should avoid a separate versioning scheme/matrix for the manager.
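For reference, a minimal sketch of what a shared eligibility check following that policy could look like, assuming plain semantic versions and github.com/hashicorp/go-version; the package and function names are illustrative only, not part of the proposal:

```go
package upgradepolicy

import (
	"fmt"

	version "github.com/hashicorp/go-version"
)

// Eligible reports whether upgrading from "from" to "to" satisfies the
// documented policy: either a later patch of the same minor version, or
// exactly the next minor version.
func Eligible(from, to string) (bool, error) {
	f, err := version.NewVersion(from)
	if err != nil {
		return false, fmt.Errorf("parse from version: %w", err)
	}
	t, err := version.NewVersion(to)
	if err != nil {
		return false, fmt.Errorf("parse to version: %w", err)
	}

	fs, ts := f.Segments(), t.Segments() // [major, minor, patch]
	if fs[0] != ts[0] {
		return false, nil
	}
	// Later patch of the same minor version.
	if fs[1] == ts[1] && ts[2] > fs[2] {
		return true, nil
	}
	// Exactly the next minor version.
	return ts[1] == fs[1]+1, nil
}
```

If both the Upgrade Shim and Upgrade Manager vendored the same helper, they would stay on the same policy without maintaining a separate compatibility matrix.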
> #### Plan 1 - Static Pods with BackingImage Disk Files
>
> The main idea is to leverage static pods, Longhorn BackingImage disk files, and hostPath volumes. The `minNumberOfCopies` field of the BackingImage for the ISO image is set to the number of nodes of the cluster (subject to change) to distribute the disk files to all nodes. By mounting the BackingImage disk file directly to a specific path on the host filesystem for each node, static pods on each node can access the ISO image content using the hostPath volume.
The `minNumberOfCopies` field: not sure it works effectively on a massive cluster with, say, 100 nodes.
It depends on Longhorn's implementation. Probably not the best choice here. See #9217 for a better way to do this.
> 1. (Optional) Preloading the Upgrade Manager container image
> 1. Deploying Upgrade Manager
>
> ### Upgrade Repository
Both Plan 1 and Plan 2 seem not robust enough.
The entire Upgrade Repository section is just a preliminary proposal and some experiments we've explored so far. Please refer to #9217 for the formal PR. The implementation works in a different way and should be more robust. Thank you.
We could update the HEP to simplify the review and follow the design in #9217 for open discussion.
Do we discuss the robustness concerns in #9217?
> 1. Node drain and RKE2 upgrade, corresponds to the **drain** stage
> 1. OS upgrade, corresponds to the **post-drain** stage
>
> Since the hook mechanism in Rancher v2prov is no longer usable, Upgrade Manager relies on Plan CRs backed by System Upgrade Controller to execute node-upgrade tasks. One of the hidden advantages of this change is that it allows the operating system upgrade to be separated from the overall upgrade, thus enabling true zero-downtime (for nodes) upgrades. Another one is that it unblocks the way for users to have granular control over the node-upgrade order through node selectors. The new node-upgrade phase looks like the following:
> it allows the operating system upgrade to be separated from the overall upgrade

How does this work: does Harvester or the user make the decision? Simply skip it when it is not necessary, or selectively do it in another upgrade?
I'd say it provides flexibility, but does not necessarily offer a supported API to the user at this time. In the previous architecture, it was not possible to separate the two because they're both part of the Rancher v2prov framework (pre- and post-hooks). We still need to discuss whether or not it's meaningful and valuable for users to upgrade clusters but not node OSes in the context of Harvester. What I can tell is that I've heard people demand the ability to upgrade clusters without node reboots, which implies no OS upgrades are involved.
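For illustration only, a rough sketch of how a node-scoped SUC Plan could expose that granularity, built with unstructured to avoid pinning exact Go types; the Plan name, namespace, label key, and image are hypothetical, and only commonly documented Plan fields (concurrency, nodeSelector, serviceAccountName, upgrade.image, version) are assumed:

```go
package nodeupgrade

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// osUpgradePlan builds an upgrade.cattle.io/v1 Plan scoped to nodes carrying
// a hypothetical opt-in label, so that the order and scope of node upgrades
// can be controlled by (re)labeling nodes batch by batch.
func osUpgradePlan(targetVersion string) *unstructured.Unstructured {
	plan := &unstructured.Unstructured{}
	plan.SetAPIVersion("upgrade.cattle.io/v1")
	plan.SetKind("Plan")
	plan.SetName("harvester-os-upgrade") // hypothetical name
	plan.SetNamespace("cattle-system")   // hypothetical namespace

	// Only commonly documented Plan fields are used here.
	_ = unstructured.SetNestedMap(plan.Object, map[string]interface{}{
		"concurrency":        int64(1), // upgrade one node at a time
		"serviceAccountName": "system-upgrade-controller",
		"version":            targetVersion,
		"nodeSelector": map[string]interface{}{
			"matchLabels": map[string]interface{}{
				"harvesterhci.io/upgrade-batch": "1", // hypothetical label key
			},
		},
		"upgrade": map[string]interface{}{
			"image": "rancher/harvester-upgrade", // hypothetical image
		},
	}, "spec")
	return plan
}
```

Labeling nodes in batches would then let operators influence the node-upgrade order without touching the Plan itself.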
> The challenge of switching to SUC Plans is that the node-upgrade phase is no longer executed as a control loop, but as a one-time task. Upgrade Manager needs to reconcile Plans instead of machine-plan Secrets, and there will be no other entities, such as the embedded Rancher, which abstracts the lifecycle management of downstream clusters, to organize the upgrade nuances.
>
> > [!IMPORTANT]
> > The infinite-retry vs. fail-fast will be a key topic for discussion.
It seems like no easy/straightforward decision; we might need to copy those hacks or patches from Rancher to cover various cases.
Sorry I'm not quite sure what you mean. Mind sharing some examples?
Rancher's plan-processing code is very complex when you look into it. If we do the related processing in Harvester independently, some or many (edge) scenarios from Rancher's plan handling are also valid for Harvester; we need to summarize those scenarios and/or borrow some of the implementation.
Some of the reconciliation logic will be offloaded to the System Upgrade Controller since we rely on SUC Plans instead. From Harvester's viewpoint (specifically, Upgrade Manager), it only cares about the status of SUC Plans (and probably the underlying Jobs for node upgrade status updates). The main idea is to decouple from Rancher's v2prov framework (the complex plan-processing code you mentioned above) and avoid reimplementing parts of it.
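As a sketch of that viewpoint, assuming Upgrade Manager derives per-node status from the Jobs that System Upgrade Controller creates for its Plans (the state strings below are placeholders, not the proposed API):

```go
package nodeupgrade

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// jobState reduces a node-upgrade Job created by System Upgrade Controller
// to a coarse per-node state that Upgrade Manager can reflect in its status.
func jobState(job *batchv1.Job) string {
	for _, cond := range job.Status.Conditions {
		if cond.Status != corev1.ConditionTrue {
			continue
		}
		switch cond.Type {
		case batchv1.JobComplete:
			return "Succeeded"
		case batchv1.JobFailed:
			return "Failed"
		}
	}
	return "InProgress"
}
```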
> Another significant aspect is installation. Do we rely on Rancher System Agent to bootstrap clusters? What are the impacts if we drop it entirely? Do we need to develop a new installation method to fill the gap left by the decommissioning of the Rancher System Agent?
>
> It appears that we do rely on the Rancher System Agent to bootstrap nodes except for the initial one. If we remove the Rancher System Agent, there might not be issues in day 2 management; however, we cannot live without it when it comes to cluster bootstrapping. If we leave it as is, the Rancher System Agent generates numerous logs containing error messages (because it is no longer able to communicate with the Rancher Manager) and does nothing.
> because it is no longer able to communicate with the Rancher Manager

The embedded Rancher deployment is still inside Harvester. What is the reason it can't communicate? Is it due to the below?

> Decouple the cluster itself with the embedded Rancher Manager by removing the kubernetesVersion and rkeConfig fields from the local provisioning cluster CR
IIRC, if the kubernetesVersion and rkeConfig fields from the local provisioning cluster CR are removed, it implies the embedded Rancher Manager can no longer manage the cluster it runs on. The Machine CR and machine-plan Secrets are gone. The communication between Rancher System Agent and Rancher Manager is broken.
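A hedged sketch of that decoupling step, assuming a JSON merge patch against the local provisioning.cattle.io/v1 Cluster CR (commonly named local in the fleet-local namespace) through a dynamic client; client wiring is omitted and the exact mechanics are not part of the HEP:

```go
package decouple

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

var provisioningClusterGVR = schema.GroupVersionResource{
	Group:    "provisioning.cattle.io",
	Version:  "v1",
	Resource: "clusters",
}

// detachLocalCluster clears spec.kubernetesVersion and spec.rkeConfig on the
// local provisioning cluster CR so the embedded Rancher Manager no longer
// drives its provisioning (namespace/name assumed to be fleet-local/local).
func detachLocalCluster(ctx context.Context, client dynamic.Interface) error {
	patch := []byte(`{"spec":{"kubernetesVersion":null,"rkeConfig":null}}`)
	_, err := client.Resource(provisioningClusterGVR).
		Namespace("fleet-local").
		Patch(ctx, "local", types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```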
ihcsim left a comment:
I'll need to give this another pass. Meanwhile, some preliminary feedback and questions - thanks.
> ```go
> type UpgradePlanSpec struct {
> 	// version refers to the corresponding version resource in the same namespace.
> 	// +required
> 	Version string `json:"version"`
> ```
Maybe consider a Version type (instead of just a string) to allow for version comparison and syntax validation. See, e.g., github.com/hashicorp/go-version. Similarly for ReleaseMetadata.[Harvester|Kubernetes|Rancher].
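For reference, a minimal sketch of the kind of typed wrapper this suggests, assuming github.com/hashicorp/go-version; the names and the validation marker are illustrative, not part of the proposed API:

```go
package api

import (
	goversion "github.com/hashicorp/go-version"
)

// Version is a string that must parse as a semantic version.
// +kubebuilder:validation:Pattern=`^v?\d+\.\d+\.\d+.*$`
type Version string

// Parse returns the parsed semantic version, suitable for use in webhook
// validation of version-like fields.
func (v Version) Parse() (*goversion.Version, error) {
	return goversion.NewVersion(string(v))
}

// Less reports whether v is an older release than other; both must parse.
func (v Version) Less(other Version) (bool, error) {
	a, err := v.Parse()
	if err != nil {
		return false, err
	}
	b, err := other.Parse()
	if err != nil {
		return false, err
	}
	return a.LessThan(b), nil
}
```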
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment at L177 is misleading. I'll update it. The field is used to refer to the other CRD, Version. The name of a Version CR can be anything, not necessarily a semantic version. Also, we don't rely on this field to validate upgrade eligibility because it's entirely freestyle and can be anything specified by the user, e.g., dev, test, etc., depending on what the corresponding Version CRs are named.
> ```go
> const (
> 	NodeStateImagePreloading     string = "ImagePreloading"
> 	NodeStateImagePreloaded      string = "ImagePreloaded"
> 	NodeStateKubernetesUpgrading string = "KubernetesUpgrading"
> ```
Do we need another status for the charts? Or do we unify it under KubernetesUpgrading?
Chart upgrades are conducted in the ClusterUpgrade phase, so they won't appear here, as this is for reflecting node-specific upgrade states.
> ```go
> 	// force indicates the UpgradePlan will be forcibly applied, ignoring any pre-upgrade check failures. Default to "false".
> 	// +optional
> 	Force *bool `json:"force,omitempty"`
> ```
Do we really need a field for this (force upgrade, ignoring all checks)? Or should we just use an annotation?
IMO, if we really want a feature to be used only on rare occasions, we put it in an annotation; otherwise, making it available in the API (CRD) is better because:
- The information about the feature is more organized at the code level.
- Users can utilize tools like `kubectl explain` to understand its functionality.

It really depends. If we want to highlight specific checks that are skippable, annotations are the way to go, because no one wants their CRD to become bloated.
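To make the trade-off concrete, a hedged sketch of the annotation route with a hypothetical key listing individual checks to skip; none of this is an agreed-upon API:

```go
package webhook

import (
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// skipCheckAnnotation is a hypothetical key; individual check names are
// comma-separated, e.g. "node-disk-space,free-memory".
const skipCheckAnnotation = "harvesterhci.io/skip-pre-checks"

// shouldSkip reports whether a named pre-upgrade check is opted out via the
// annotation, as an alternative to a spec.force field that skips everything.
func shouldSkip(obj metav1.Object, check string) bool {
	raw, ok := obj.GetAnnotations()[skipCheckAnnotation]
	if !ok {
		return false
	}
	for _, name := range strings.Split(raw, ",") {
		if strings.TrimSpace(name) == check {
			return true
		}
	}
	return false
}
```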
> Rancher -. reconciles .-> ClusterCR
>
> ### Upgrade Shim
Could we summarize all of our pre-checks (whether currently in the pre-check script or in the webhook) at this stage?
We don't currently have a centralized place to run the whole set of checks.
Some of the checks are not suitable to be conducted in the webhook, e.g., node-level checks, so we put those into the pre-check script that can be run directly on the node; some are impossible to do at that point, so we push them to a later phase, such as the upgrade eligibility check (the trustworthy version information is only packed in the ISO image; the name of the Version CR can be anything, so we shouldn't rely on it).
Do you want to put all the checks in the Upgrade Shim? I'm unsure about the question here.
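As an example of why some checks belong on the node rather than in the admission webhook, a sketch of a free-disk-space pre-check that a per-node script or job could run; the mount point and threshold are illustrative:

```go
package precheck

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// checkFreeDiskSpace fails if the filesystem backing path has less than
// minFreeBytes available. It has to run on the node itself, which is why it
// belongs in the per-node pre-check script rather than the admission webhook.
func checkFreeDiskSpace(path string, minFreeBytes uint64) error {
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		return fmt.Errorf("statfs %s: %w", path, err)
	}
	free := st.Bavail * uint64(st.Bsize)
	if free < minFreeBytes {
		return fmt.Errorf("%s has %d bytes free, need at least %d", path, free, minFreeBytes)
	}
	return nil
}
```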
> Since the hook mechanism in Rancher v2prov is no longer usable, Upgrade Manager relies on Plan CRs backed by System Upgrade Controller to execute node-upgrade tasks. One of the hidden advantages of this change is that it allows the operating system upgrade to be separated from the overall upgrade, thus enabling true zero-downtime (for nodes) upgrades. Another one is that it unblocks the way for users to have granular control over the node-upgrade order through node selectors. The new node-upgrade phase looks like the following:
I am not really sure about that. If we cannot move to SUC Plans, we can only use the hook mechanism like the current upgrade flow, right?
Yes, that's the backup plan if the SUC path is not feasible.
Signed-off-by: Zespre Chang <[email protected]>
Force-pushed: 60ad7c2 to e191880
Due to Harvester's unique use of Rancher (the local cluster is neither fully imported nor provisioned), it is currently impossible to opt out of the Rancher v2prov framework. This will cause the proposed node-upgrade SUC Plans to conflict with the v2prov mechanism. After some discussion, we've pivoted to staying with v2prov for the node-upgrade phase in Upgrade V2. I will update the HEP accordingly.
> 1. Downloading the ISO image
> 1. (?) Checking the eligibility of the upgrade in terms of the to and from versions
> 1. Deploying Upgrade Repository
> 1. Bootstrapping Upgrade Manager
> 1. (Optional) Preloading the Upgrade Manager container image
> 1. Deploying Upgrade Manager
Is it possible to move all of these out of the upgrade shim, e.g., by bundling them into an upgrade Helm chart? In my mind, the upgrade shim should be so thin that its only responsibility is to kickstart the upgrade by, e.g., downloading and installing the suggested upgrade Helm chart, and maybe aggregating and sending upgrade errors somewhere.
Helm also makes it easier to upgrade, run hooks, and clean up the upgrade components. E.g., upgrading Harvester from 1.7 to 1.8 can just be initiated by upgrading the upgrade Helm chart from 1.7 to 1.8.
> Is it possible to move all of these out of the upgrade shim, e.g., by bundling them into an upgrade Helm chart? In my mind, the upgrade shim should be so thin that its only responsibility is to kickstart the upgrade by, e.g., downloading and installing the suggested upgrade Helm chart, and maybe aggregating and sending upgrade errors somewhere.
> Helm also makes it easier to upgrade, run hooks, and clean up the upgrade components. E.g., upgrading Harvester from 1.7 to 1.8 can just be initiated by upgrading the upgrade Helm chart from 1.7 to 1.8.
I like the idea, and I would love to do that. The only question is whether we want to break the existing ways users can kick off upgrades.
The main consideration is the air-gapped environment. As before, users only need to prepare one thing, i.e., the new version ISO image, and they can kick off the upgrade. If we choose to do this, things would be different. Let's say we want to upgrade a cluster from v1.7.0 to v1.8.0. The v1.8.0 Upgrade Manager chart will be installed by the shim first. But the v1.8.0 Upgrade Manager is bundled in the v1.8.0 ISO image, which hasn't been downloaded yet, and the container image hasn't been preloaded to each node; this will be an issue. Overall, if we choose to let the shim install the chart, we must also distribute the chart and the container image somewhere, separately from the ISO image we have been distributing. Also, users need to host the chart and images themselves, along with the ISO image. This would require them to set up an OCI registry in their environment. Only if we can convince ourselves and users that this does not complicate the upgrade procedure can we do it.
I am not convinced that setting up an OCI registry poses much of a hurdle to air-gapped users. In exchange, they get the benefits of Upgrade V2. It also becomes easier for users and maintainers to scrutinize, scan, test, and reproduce upgrade-related bugs. If breaking the existing ways yields more benefits and stability, maybe it's worth it. There are always ways to verify with existing users if we are open to this idea.
I guess, fundamentally, I am not sure how we decide the logical boundary between what stays in the shim and what stays in the upgrade manager. In the current proposal, can the steps "Checking the eligibility of the upgrade in terms of the to and from versions" and "Deploying Upgrade Repository" be moved to the upgrade manager?
> > [!NOTE]
> > Spegel cannot replace the Upgrade Repository entirely because, in the end, container images and other artifacts have to somehow appear on one of the nodes to be distributed among others through Spegel.
>
> #### Plan 1 - Static Pods with BackingImage Disk Files
Something for consideration is whether we can just bypass CSI and CDI altogether. The rationale is to reduce as many variables as possible. On a typical cluster, both CSI and CDI are heavily consumed by user workloads. We can't always count on their availability and stability.
One alternative is to start the upgrade-repo as a DaemonSet where each pod mounts a hostPath volume and downloads and writes the ISO directly to the node, or to start the upgrade-repo as a static pod running on a management node and rsync the downloaded ISO to the other nodes in the cluster.
That was one of the original plans. Setting up and tearing down static pods requires running node-level maintenance jobs. In this case, we usually leverage SUC plans. However, we need to take the following into account:
- Not breaking existing ways to initiate an upgrade
- Image availability to each node
A group of users relies on initiating the upgrade by uploading the ISO image directly from their computer. It would be hard to preserve the entrypoint if we change the implementation to the way you mentioned.
OTOH, storing a copy of the ISO image on every node is straightforward and most efficient, but we cannot assume that every node has enough disk space for it. Moreover, if the ISO image is corrupted or missing, the node still needs a way to access that data from other nodes. This makes the design more complicated. Lastly, the repository needs to survive node reboots, as some post-actions require it during upgrade completion. Though static pods survive reboots, mount points do not.
As a result, we stuck with the centralized approach: downloading the ISO image to a single location in a single shot, setting up the repository, and letting consumers get what they need on demand.
> Plan 2 does not mess with the host filesystem and is more robust even under the chaos of an upgrading cluster. The drawback is unnecessary compression/decompression back and forth overhead, resulting in a longer preparation time compared to the previous VM-based implementation. Also, the share manager pod of the RWX volume becomes the SPOF (Single Point of Failure) of the entire upgrade-repo architecture.
>
> ### Upgrade Manager Preloader
As suggested above, can we package the Upgrade Manager in a Helm chart and let the Upgrade Shim download and install the Helm chart?
This PR is stale because it has been open for 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days. Please review it as soon as possible. cc @harvester/dev
Hi @starbops, Issue: Our Helm upgrade/release flow doesn't have a clear timeout/escape hatch in some phases. PR #9934 improves the rollback behavior, which helps a lot, but the overall upgrade can still get stuck in certain edge cases (i.e., it neither completes nor cleanly rolls back). Do you think we should open an issue to track adding phase-level timeouts/guardrails (and documenting the recommended manual recovery steps)? Happy to help draft the issue with concrete scenarios/logs if needed. Thanks
@pohanhuang Thanks for the input. Given the nature of the Harvester upgrade, it is generally not good practice, and may not even be possible, to roll back every step or bail out of the process when it gets stuck. Our current approach is to retry (endlessly), which might sound trivial, but it at least does not push the cluster into a more complicated situation. This works because most steps are idempotent; retrying can overcome transient errors. Cluster stability and user workload availability are the top concerns when it comes to upgrades.
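A minimal sketch of that retry-until-it-sticks approach, assuming each step is idempotent and using apimachinery's polling helper; the step function is a placeholder:

```go
package upgrade

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// retryStep re-runs an idempotent upgrade step until it succeeds or the
// surrounding context is cancelled; errors are treated as transient and
// simply retried instead of failing the whole upgrade.
func retryStep(ctx context.Context, step func(context.Context) error) error {
	return wait.PollUntilContextCancel(ctx, 30*time.Second, true,
		func(ctx context.Context) (bool, error) {
			if err := step(ctx); err != nil {
				return false, nil // keep polling on transient errors
			}
			return true, nil
		})
}
```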
Problem:
Solution:
Related Issue(s):
#7101
#7112
Test plan:
Additional documentation or context