TELCODOCS#2230: Coordinating reboots for configuration changes #91723

Open
wants to merge 1 commit into
base: main
from

Conversation

sr1kar99
Contributor

@sr1kar99 sr1kar99 commented Apr 7, 2025

Version(s):
4.19

Issue:
TELCODOCS-2230

Link to docs preview:

QE review:

  • QE has approved this change.

@openshift-ci openshift-ci bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 7, 2025
Contributor

@xenolinux xenolinux left a comment

Suggested conditionals for trial as per #91723 (comment)

@sr1kar99 sr1kar99 force-pushed the 2230-controlled-reboots branch 2 times, most recently from 00e1010 to d439288 Compare April 8, 2025 05:08
@sr1kar99 sr1kar99 force-pushed the 2230-controlled-reboots branch from d439288 to 735afd2 Compare April 8, 2025 12:13
@@ -15,6 +15,8 @@ include::modules/accessing-an-example-cluster-node-tuning-operator-specification

include::modules/cluster-node-tuning-operator-default-profiles-set.adoc[leveloffset=+1]

include::modules/ztp-coordinating-reboots-for-config-changes.adoc[leveloffset=+1]

The reboot section should come after
defer-application-of-tuning-changes_node-tuning-operator
in this document.
I am not sure it needs to be a section. It can be a link to the original section.

Contributor Author

Sure! I added a link to the original section in this module.

@sr1kar99 sr1kar99 force-pushed the 2230-controlled-reboots branch from 735afd2 to 5112fd9 Compare April 9, 2025 07:43
You can use {cgu-operator-full} to coordinate reboots across a fleet of spoke clusters when configuration changes require a reboot. These configuration changes include updates to tuning profiles that modify kernel parameters or system behavior. {cgu-operator} ensures that only nodes with a degraded tuned profile, indicating a reboot is needed, are rebooted. Instead of rebooting nodes after each individual change, you can apply all configuration updates through policies and then trigger a single, coordinated reboot.

For more information about coordinated reboots, see xref:../../edge_computing/policygenerator_for_ztp/ztp-configuring-managed-clusters-policygenerator.adoc#ztp-coordinating-reboots-for-config-changes_ztp-configuring-managed-clusters-policygenerator[Coordinating reboots for configuration changes]
Collaborator

🤖 [error] OpenShiftAsciiDoc.NoXrefInModules: Do not include xrefs in modules, only assemblies.

@sr1kar99 sr1kar99 force-pushed the 2230-controlled-reboots branch from 5112fd9 to 8563776 Compare April 9, 2025 08:21

openshift-ci bot commented Apr 9, 2025

@sr1kar99: all tests passed!


<3> The `machineConfigLabels` field is used to target the `worker-cnf` role. Configure a `MachineConfigPool` resource to ensure the profile is applied only to the correct nodes.

You can use {cgu-operator-full} to coordinate reboots across a fleet of spoke clusters when configuration changes require a reboot. These configuration changes include updates to tuning profiles that modify kernel parameters or system behavior. {cgu-operator} ensures that only nodes with a degraded tuned profile, indicating a reboot is needed, are rebooted. Instead of rebooting nodes after each individual change, you can apply all configuration updates through policies and then trigger a single, coordinated reboot.

I don't think this explanation is needed here.
I was thinking more like:

Note: You can use TALM to perform a controlled reboot across a fleet of clusters to apply a deferred tuning change [link].

Comment on lines +11 to +15
The following types of configuration changes typically require a reboot:

* Updates to tuning profiles that modify kernel parameters or system behavior.
* Kubelet configuration updates.
* Node-level changes delivered through `MachineConfig`, such as `sysctl` settings or system service changes.

I am not sure how accurate this is. Also, I don't think this is necessary.
It can be just:
You can use {cgu-operator-full} to coordinate reboots across a fleet of spoke clusters when configuration changes require a reboot, such as deferred tuning changes.

* Kubelet configuration updates.
* Node-level changes delivered through `MachineConfig`, such as `sysctl` settings or system service changes.

{cgu-operator} ensures that only nodes with a degraded tuned profile, indicating a reboot is needed, are rebooted.

This is not true. TALM does not check for this. It reboots the nodes in the selected MCP on the selected clusters.

Comment on lines +29 to +33
. Create the configuration policies and policy bindings for the tuning or configuration changes.

. Create a reboot policy on the hub cluster. This policy checks for degraded tuned profiles that indicate a reboot is needed.

. Create and apply the `ClusterGroupUpgrade` (CGU) custom resource (CR). In the `spec.managedPolicies` list, include all relevant configuration policies first, followed by the reboot policy and the optional `mcp-validator` policy.

The procedure should mention the following:

  • The policy can be generated using PolicyGenerator.
  • Select out/argocd/example/acmpolicygenerator/acm-example-sno-reboot or out/argocd/example/acmpolicygenerator/acm-example-multinode-reboot.
  • Change policyDefaults.placement.labelSelector to match the clusters that you want to reboot.
  • Modify the other placeholder values according to your use case.
  • If you are rebooting to apply a deferred tuning change, make sure the mcp value matches Tuned.spec.recommended.
  • Follow the procedure (Customizing a managed cluster with PolicyGenerator CRs)[link to section above in the same document].
  • After ArgoCD syncs successfully, you can roll out the reboot using TALM by creating a CGU.
    Example:
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: reboot
  namespace: default
spec:
  clusterLabelSelectors:
  - matchLabels:
      # select clusters that you want to reboot
  enable: true
  managedPolicies:
  # you add other configuration policies here, before the reboot policy
  - example-reboot
  remediationStrategy:
    timeout: 300 # timeout across all clusters; you should consider the worst case
    maxConcurrency: 10
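
One way to reason about the "worst case" comment on `timeout` is sketched below in Python. This is a rough estimate only, not TALM's own logic: it assumes that clusters are remediated in batches of at most `maxConcurrency` and that every batch takes the full worst-case time for one cluster. The function name and the sample numbers (25 clusters, 60 time units per cluster) are hypothetical; the result is in whatever unit the `timeout` field uses.

```python
import math


def estimate_cgu_timeout(num_clusters: int, max_concurrency: int,
                         worst_case_per_cluster: int) -> int:
    """Rough lower bound for remediationStrategy.timeout.

    Assumption: clusters are handled in batches of at most max_concurrency,
    and each batch needs the full worst-case time of a single cluster.
    """
    if num_clusters < 0 or max_concurrency < 1 or worst_case_per_cluster < 0:
        raise ValueError("inputs must be non-negative and max_concurrency >= 1")
    batches = math.ceil(num_clusters / max_concurrency)
    return batches * worst_case_per_cluster


# Hypothetical fleet: 25 clusters, maxConcurrency 10, 60 units per cluster.
print(estimate_cgu_timeout(25, 10, 60))  # 3 batches * 60 = 180
```

With these sample numbers the value of 300 in the example CGU would leave some headroom, but the real figure depends on how long a reboot and policy compliance check take in your environment.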

Comment on lines +39 to +40
* Confirm that all nodes in the `MachineConfigPool` have rebooted.
* Verify that the `MachineConfigPool` reaches the `Updated` state.

Verify a successful reboot of the nodes in a specific MCP:

oc get mcp master
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-be5785c3b98eb7a1ec902fef2b81e865   True      False      False      3              3                   3                     0                      72d
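
If this check needs to be scripted, the table output above could be parsed as in the following Python sketch. It is an illustration only: the helper names `parse_oc_table` and `mcp_is_updated` are hypothetical, the sample text is copied from the output above, and the notion of "updated" used here (UPDATED is True, nothing updating or degraded, all machine counts equal) is an assumption about what a healthy pool looks like, not an official definition.

```python
SAMPLE_OUTPUT = """\
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-be5785c3b98eb7a1ec902fef2b81e865   True      False      False      3              3                   3                     0                      72d
"""


def parse_oc_table(text: str) -> list[dict[str, str]]:
    """Parse whitespace-separated table output into one dict per row.

    Assumes no column value contains spaces and no column is empty.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def mcp_is_updated(row: dict[str, str]) -> bool:
    """Return True if the pool row looks fully updated and healthy."""
    return (
        row["UPDATED"] == "True"
        and row["UPDATING"] == "False"
        and row["DEGRADED"] == "False"
        and row["READYMACHINECOUNT"] == row["MACHINECOUNT"]
        and row["UPDATEDMACHINECOUNT"] == row["MACHINECOUNT"]
        and row["DEGRADEDMACHINECOUNT"] == "0"
    )


rows = parse_oc_table(SAMPLE_OUTPUT)
print(mcp_is_updated(rows[0]))  # True for the sample row above
```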

After you apply the CGU custom resource, {cgu-operator} rolls out the configuration policies in order. When all policies are compliant, {cgu-operator} applies the reboot policy. This triggers a reboot of all nodes in the specified `MachineConfigPool`.

.Verification

You can monitor the rollout on the hub by watching the CGU's status. To verify a successful rollout of the reboot:

oc get cgu -A
NAMESPACE   NAME     AGE   STATE       DETAILS
default     reboot   1d    Completed   All clusters are compliant with all the managed policies
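
The same status check could be automated along the lines of the Python sketch below. It is a minimal illustration that works on the text output shown above. The helper names are hypothetical, and it assumes that only the last column (DETAILS) contains spaces and that no column is empty.

```python
SAMPLE_OUTPUT = """\
NAMESPACE   NAME     AGE   STATE       DETAILS
default     reboot   1d    Completed   All clusters are compliant with all the managed policies
"""


def parse_cgu_table(text: str) -> list[dict[str, str]]:
    """Parse CGU table output into one dict per row."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        # DETAILS is free text, so split only as many times as needed
        # to separate the leading single-token columns.
        fields = line.split(None, len(header) - 1)
        rows.append(dict(zip(header, fields)))
    return rows


def incomplete_cgus(rows: list[dict[str, str]]) -> list[tuple[str, str]]:
    """Return (namespace, name) for every CGU not in the Completed state."""
    return [(r["NAMESPACE"], r["NAME"]) for r in rows if r["STATE"] != "Completed"]


rows = parse_cgu_table(SAMPLE_OUTPUT)
print(incomplete_cgus(rows))  # [] because the only CGU is Completed
```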

5 participants