Wait for worker MCP to be updated before running RHEL worker upgrade #65478
Conversation
|
/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-4.14-cpou-upgrade-from-4.12-azure-ipi-workers-rhel8-f28 |
|
@gpei: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/uncc jhou1 |
|
/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-4.14-cpou-upgrade-from-4.12-azure-ipi-workers-rhel8-f28 |
|
@gpei: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
834374f to
d344941
Compare
|
/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-4.14-cpou-upgrade-from-4.12-azure-ipi-workers-rhel8-f28 |
|
@gpei: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
@jiajliu @jianlinliu could you please review this change? It addresses the issue found in #65410 (comment) |
|
/lgtm |
From Before https://github.com/openshift/release/pull/65410/files#diff-440cd1be9ac67cd04d044f301ea04d31142584464112b57a6273baf49d6920b3L100, sounds like it need to skip @gpei do you mind to add |
|
/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-cpou-upgrade-from-4.14-azure-ipi-workers-rhel8-f28 |
|
@gpei: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-cpou-upgrade-from-4.14-azure-ipi-workers-rhel8-f28 periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-4.14-cpou-upgrade-from-4.12-azure-ipi-workers-rhel8-f28 |
|
/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-cpou-upgrade-from-4.14-azure-ipi-workers-rhel8-f28 |
|
@gpei: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
...ushift/upgrade/cpou/unpause-worker-mcp/cucushift-upgrade-cpou-unpause-worker-mcp-commands.sh
|
/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.18-amd64-nightly-4.18-cpou-upgrade-from-4.16-azure-ipi-workers-rhel8-f28 |
|
@gpei: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
@gpei: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.18-amd64-nightly-4.18-cpou-upgrade-from-4.16-azure-ipi-workers-rhel8-f28 |
|
@gpei: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
...ushift/upgrade/cpou/unpause-worker-mcp/cucushift-upgrade-cpou-unpause-worker-mcp-commands.sh
|
[REHEARSALNOTIFIER]
A total of 161 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs. A full list of affected jobs can be found here. Interacting with pj-rehearse. Comment: Once you are satisfied with the results of the rehearsals, comment: |
|
/lgtm |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: gpei, jiajliu, jianlinliu. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/pj-rehearse ack |
|
@gpei: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
…penshift#65478) * Wait for worker MCP to be updated before running RHEL worker upgrade * remove useless lines
In the previous CPOU upgrade workflow, after unpausing the worker MachineConfigPool (MCP), the process skipped waiting for the worker MCP rolling update if RHEL nodes were detected. As a result, RHEL nodes kept the original node MachineConfig after their RPM packages were upgraded, so the new kubelet version was configured with outdated launch parameters and failed to start.
Example failure: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/65410/rehearse-65410-periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-4.14-cpou-upgrade-from-4.12-azure-ipi-workers-rhel8-f28/1927589971233869824
When updating the RHEL worker from 4.12 to 4.14, the kubelet service failed to start, as follows:
The kubelet service start parameters on the affected RHEL node came from the 4.12 node MachineConfig.
The kubelet service log also reported that the --container-runtime flag is deprecated and would be removed in k8s 1.27 (OCP 4.14).
In freshly installed 4.13 and 4.14 clusters, the kubelet service is configured without this flag.
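A quick way to confirm whether a node still carries the stale flag is to inspect the kubelet start parameters for it. The following is a minimal sketch, not the check used in this PR: the function name `has_deprecated_container_runtime_flag` is hypothetical, and the sample argument strings are illustrative rather than copied from the failing node. The pattern is written so that the still-supported --container-runtime-endpoint flag is not mistaken for the removed one.

```shell
# Hypothetical helper: succeeds (exit 0) when the given kubelet argument
# string contains the deprecated --container-runtime flag.
# It must not match --container-runtime-endpoint, which is still valid.
has_deprecated_container_runtime_flag() {
    printf '%s\n' "$1" |
        grep -Eq -e '(^|[[:space:]])--container-runtime(=|[[:space:]]|$)'
}

# Illustrative argument strings (assumed values, not taken from the job logs).
old_args="--container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock"
new_args="--container-runtime-endpoint=/var/run/crio/crio.sock"

if has_deprecated_container_runtime_flag "$old_args"; then
    echo "old-style parameters: deprecated flag present"
fi
if ! has_deprecated_container_runtime_flag "$new_args"; then
    echo "new-style parameters: deprecated flag absent"
fi
```

On a real node the argument string would have to come from wherever the kubelet unit defines its command line; that lookup is left out here because the source does not show it.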
So we need to run the Ansible playbook that updates the RHEL nodes only after the new MachineConfig is fully deployed to all nodes, to keep the node configuration and the installed version consistent.
P.S. For scenarios that require upgrading RHEL nodes immediately, without waiting for the MCP update (e.g., specific version upgrade test jobs that work around MCO bugs), we can still use the existing UPGRADE_RHEL_WORKER_BEFOREHAND environment variable to bypass this workflow for potential edge cases in the future.
P.P.S. Added a workaround for the 4.14 to 4.16 cluster upgrade that installs the required cloud provider packages on the RHEL nodes, so that the step that unpauses the worker MCP can run successfully.
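The ordering described above, including the bypass, could be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: the function names and the default timeout are hypothetical, and treating the value "true" of UPGRADE_RHEL_WORKER_BEFOREHAND as "skip the wait" is an assumption about how the variable is interpreted. The only cluster call used is `oc wait` against the worker pool's Updated condition.

```shell
# Block until the worker MachineConfigPool reports the Updated condition.
# The 60m default timeout is an assumed value for illustration.
wait_for_worker_mcp_updated() {
    timeout="${1:-60m}"
    oc wait machineconfigpool/worker --for=condition=Updated --timeout="${timeout}"
}

# Gate for the RHEL worker upgrade: wait for the worker MCP unless the
# bypass variable is set (assumed here to be enabled by the value "true").
ready_for_rhel_upgrade() {
    if [ "${UPGRADE_RHEL_WORKER_BEFOREHAND:-false}" = "true" ]; then
        echo "UPGRADE_RHEL_WORKER_BEFOREHAND=true, skipping worker MCP wait"
        return 0
    fi
    wait_for_worker_mcp_updated "$@"
}
```

A step script would call `ready_for_rhel_upgrade` after unpausing the worker MCP and start the RHEL upgrade playbook only if it returns success. Note that a pool can still report Updated for a short time right after it is unpaused, so a real step may also need to confirm that the rollout has actually started before waiting on this condition.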