TELCODOCS#2230: Coordinating reboots for configuration changes #91723
Conversation
Suggested conditionals for trial as per #91723 (comment)
@@ -15,6 +15,8 @@ include::modules/accessing-an-example-cluster-node-tuning-operator-specification

include::modules/cluster-node-tuning-operator-default-profiles-set.adoc[leveloffset=+1]

include::modules/ztp-coordinating-reboots-for-config-changes.adoc[leveloffset=+1]
The reboot section should come after
defer-application-of-tuning-changes_node-tuning-operator
in this document.
I am not sure if it needs to be a section; it can be a link to the original section.
Sure! I added a link to the original section in this module.
You can use {cgu-operator-full} to coordinate reboots across a fleet of spoke clusters when configuration changes require a reboot. These configuration changes include updates to tuning profiles that modify kernel parameters or system behavior. {cgu-operator} ensures that only nodes with a degraded tuned profile, indicating a reboot is needed, are rebooted. Instead of rebooting nodes after each individual change, you can apply all configuration updates through policies and then trigger a single, coordinated reboot.

For more information about coordinated reboots, see xref:../../edge_computing/policygenerator_for_ztp/ztp-configuring-managed-clusters-policygenerator.adoc#ztp-coordinating-reboots-for-config-changes_ztp-configuring-managed-clusters-policygenerator[Coordinating reboots for configuration changes].
🤖 [error] OpenShiftAsciiDoc.NoXrefInModules: Do not include xrefs in modules, only assemblies.
@sr1kar99: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
<3> The `machineConfigLabels` field is used to target the `worker-cnf` role. Configure a `MachineConfigPool` resource to ensure the profile is applied only to the correct nodes.
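For context, a minimal sketch of the kind of `MachineConfigPool` this callout assumes. The pool name, labels, and selectors below are illustrative only and are not taken from this PR:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-cnf
  labels:
    machineconfiguration.openshift.io/role: worker-cnf   # label that the tuned profile's machineConfigLabels can match
spec:
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: [worker, worker-cnf]
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-cnf: ""   # only nodes with this role receive the rendered configuration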
You can use {cgu-operator-full} to coordinate reboots across a fleet of spoke clusters when configuration changes require a reboot. These configuration changes include updates to tuning profiles that modify kernel parameters or system behavior. {cgu-operator} ensures that only nodes with a degraded tuned profile, indicating a reboot is needed, are rebooted. Instead of rebooting nodes after each individual change, you can apply all configuration updates through policies and then trigger a single, coordinated reboot.
I don't think this explanation is needed here.
I was thinking more like:
Note: You can use TALM to perform a controlled reboot across a fleet of clusters to apply a deferred tuning change[link]
The following types of configuration changes typically require a reboot:

* Updates to tuning profiles that modify kernel parameters or system behavior.
* Kubelet configuration updates.
* Node-level changes delivered through `MachineConfig`, such as `sysctl` settings or system service changes.
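As a rough illustration of the last item, a minimal sketch of a `MachineConfig`-delivered `sysctl` change. The name, role label, and sysctl key are placeholders and are not content from this PR:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-cnf-example-sysctl   # hypothetical name
  labels:
    machineconfiguration.openshift.io/role: worker-cnf   # targets the worker-cnf pool
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/sysctl.d/99-example.conf
          mode: 0644
          overwrite: true
          contents:
            # URL-encoded "vm.swappiness=10"; rolling this out reboots each node in the pool
            source: data:,vm.swappiness%3D10%0A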
I am not sure how accurate this is. Also, I don't think this is necessary.
It can just be:
You can use {cgu-operator-full} to coordinate reboots across a fleet of spoke clusters when configuration changes require a reboot, such as deferred tuning changes.
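To make the "deferred tuning changes" wording concrete, a hedged sketch of what such a change could look like. The annotation and profile follow the NTO deferred-update pattern, but every name and value here is illustrative and should be checked against the deferral section:

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: example-deferred-sysctl   # hypothetical name
  namespace: openshift-cluster-node-tuning-operator
  annotations:
    tuned.openshift.io/deferred: "update"   # apply the change on the next node restart instead of immediately
spec:
  profile:
    - name: example-deferred-sysctl
      data: |
        [main]
        summary=Example deferred sysctl change
        include=openshift-node
        [sysctl]
        kernel.shmmni=8192
  recommend:
    - machineConfigLabels:
        machineconfiguration.openshift.io/role: worker-cnf   # should match the MCP targeted by the reboot policy
      priority: 20
      profile: example-deferred-sysctl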
* Kubelet configuration updates.
* Node-level changes delivered through `MachineConfig`, such as `sysctl` settings or system service changes.

{cgu-operator} ensures that only nodes with a degraded tuned profile, indicating a reboot is needed, are rebooted.
This is not true. TALM does not check for this. It reboots the nodes in the selected MCP on the selected clusters.
. Create the configuration policies and policy bindings for the tuning or configuration changes.

. Create a reboot policy on the hub cluster. This policy checks for degraded tuned profiles that indicate a reboot is needed.

. Create and apply the `ClusterGroupUpgrade` (CGU) custom resource (CR). In the `spec.managedPolicies` list, include all relevant configuration policies first, followed by the reboot policy and the optional `mcp-validator` policy.
The procedure should be mentioned:

- The policy can be generated using PolicyGenerator.
- Select `out/argocd/example/acmpolicygenerator/acm-example-sno-reboot` or `out/argocd/example/acmpolicygenerator/acm-example-multinode-reboot`.
- Change `policyDefaults.placement.labelSelector` to match the clusters that you want to reboot (see the sketch after the CGU example below).
- Modify other placeholder values according to your use case.
- In the case of rebooting to apply a deferred tuning change, make sure the MCP value matches `Tuned.spec.recommended`.
- Follow the procedure (Customizing a managed cluster with PolicyGenerator CRs)[link to section above in the same document].
- After ArgoCD syncs successfully, you can roll out the reboot using TALM by creating a CGU.

Example:
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: reboot
  namespace: default
spec:
  clusterLabelSelectors:
    - matchLabels:
        # select clusters that you want to reboot
  enable: true
  managedPolicies:
    # you add other configuration policies here, before the reboot policy
    - example-reboot
  remediationStrategy:
    timeout: 300 # timeout across all clusters; you should consider the worst case
    maxConcurrency: 10
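As a rough illustration of the `policyDefaults.placement.labelSelector` change mentioned in the steps above, a hedged fragment of what the edited PolicyGenerator could look like. The name, namespace, label values, and manifest path are placeholders rather than content taken from the example files:

apiVersion: policy.open-cluster-management.io/v1
kind: PolicyGenerator
metadata:
  name: example-reboot
policyDefaults:
  namespace: ztp-common   # placeholder namespace
  placement:
    labelSelector:
      matchLabels:
        # select the clusters that you want to reboot
        du-profile: example-reboot-group   # hypothetical cluster label
policies:
  - name: example-reboot
    manifests:
      - path: source-crs/...   # reboot source CR shipped with the example (placeholder path)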
* Confirm that all nodes in the `MachineConfigPool` have rebooted.
* Verify that the `MachineConfigPool` reaches the `Updated` state.
Verify the successful reboot on a specific cluster by checking the MCP:

oc get mcp master

NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-be5785c3b98eb7a1ec902fef2b81e865   True      False      False      3              3                   3                     0                      72d
After you apply the CGU custom resource, {cgu-operator} rolls out the configuration policies in order. When all policies are compliant, {cgu-operator} applies the reboot policy. This triggers a reboot of all nodes in the specified `MachineConfigPool`.

.Verification
You can monitor the rollout on the hub by watching the CGU's status. To verify a successful rollout of the reboot:

oc get cgu -A

NAMESPACE   NAME     AGE   STATE       DETAILS
default     reboot   1d    Completed   All clusters are compliant with all the managed policies
Version(s): 4.19
Issue: TELCODOCS-2230
Link to docs preview:
QE review: