gpei commented May 29, 2025

In the previous CPOU upgrade workflow, after unpausing the worker MachineConfigPool (MCP), the process skipped waiting for the worker MCP rolling update if RHEL nodes were detected. As a result, RHEL nodes still had the original node MachineConfig when their RPM packages were upgraded, so the new kubelet version was started with outdated launch parameters and failed to start.

Example failure: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/65410/rehearse-65410-periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-4.14-cpou-upgrade-from-4.12-azure-ipi-workers-rhel8-f28/1927589971233869824

When updating the RHEL worker from 4.12 to 4.14, the kubelet service failed to start, as follows:

May 29 21:18:30 ci-op-wx29c5np-9c8a2-jj2qq-rhel-1 systemd[1]: Starting Kubernetes Kubelet...
May 29 21:18:30 ci-op-wx29c5np-9c8a2-jj2qq-rhel-1 kubenswrapper[20864]: E0529 21:18:30.754260   20864 run.go:74] "command failed" err="failed to parse kubelet flag: unknown flag: --container-runtime"
May 29 21:18:30 ci-op-wx29c5np-9c8a2-jj2qq-rhel-1 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
May 29 21:18:30 ci-op-wx29c5np-9c8a2-jj2qq-rhel-1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
May 29 21:18:30 ci-op-wx29c5np-9c8a2-jj2qq-rhel-1 systemd[1]: Failed to start Kubernetes Kubelet.

The kubelet service start parameters on the affected RHEL node still came from the 4.12 node MachineConfig:

ExecStart=/usr/local/bin/kubenswrapper \
    /usr/bin/kubelet \
      --config=/etc/kubernetes/kubelet.conf \
      --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \
      --kubeconfig=/var/lib/kubelet/kubeconfig \
      --container-runtime=remote \
      --container-runtime-endpoint=/var/run/crio/crio.sock \
...

The kubelet service log had also warned that the --container-runtime flag is deprecated and would be removed in Kubernetes 1.27 (OCP 4.14):

May 29 13:20:17 ci-op-wx29c5np-9c8a2-jj2qq-rhel-1 kubenswrapper[4656]: Flag --container-runtime has been deprecated, will be removed in 1.27 as the only valid value is 'remote'
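
As an illustration only (not part of this PR), warnings like the one above could be collected from the kubelet journal before an upgrade. The sketch below assumes the log lines have already been read into a list; the function name `deprecated_flags` and the regular expression are hypothetical and are based solely on the message format shown above.

```python
import re

# Matches kubelet warnings of the form seen in the log above:
#   Flag --<name> has been deprecated, <reason>
DEPRECATION_RE = re.compile(r"Flag (--[A-Za-z0-9-]+) has been deprecated, (.*)$")


def deprecated_flags(journal_lines):
    """Map each deprecated flag found in the log lines to its stated reason."""
    found = {}
    for line in journal_lines:
        match = DEPRECATION_RE.search(line)
        if match:
            # Keep the first reason seen for each flag.
            found.setdefault(match.group(1), match.group(2).strip())
    return found
```

Any flag reported this way is a candidate for breaking the kubelet once the RPMs move to a release where the flag has been removed.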

In freshly installed 4.13 and 4.14 clusters, the kubelet service is configured without this flag:

ExecStart=/usr/local/bin/kubenswrapper \
    /usr/bin/kubelet \
      --config=/etc/kubernetes/kubelet.conf \
      --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \
      --kubeconfig=/var/lib/kubelet/kubeconfig \
      --container-runtime-endpoint=/var/run/crio/crio.sock \
      --runtime-cgroups=/system.slice/crio.service \
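
The difference between the two unit files can be checked mechanically. The following is a minimal sketch, assuming the ExecStart value is available as text; `REMOVED_FLAGS` and `removed_flags_in` are hypothetical names, and only `--container-runtime` is confirmed as removed by the failure log above.

```python
# Flags treated as removed for this illustration; only --container-runtime
# is confirmed by the kubelet error shown in this PR.
REMOVED_FLAGS = {"--container-runtime"}


def removed_flags_in(exec_start, removed=REMOVED_FLAGS):
    """Return the removed flags that appear in a systemd ExecStart value."""
    # Drop line-continuation backslashes, then split on whitespace.
    tokens = exec_start.replace("\\", " ").split()
    found = []
    for token in tokens:
        name = token.split("=", 1)[0]  # "--flag=value" -> "--flag"
        # Exact match, so --container-runtime-endpoint is not flagged.
        if name in removed and name not in found:
            found.append(name)
    return found
```

Matching on the exact flag name matters here, because `--container-runtime-endpoint` is still present in the 4.13 and 4.14 unit files.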

So we need to run the Ansible playbook that updates the RHEL nodes only after the new MachineConfig has been fully rolled out to all nodes, so that the node configuration matches the kubelet version being installed.
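
The ordering requirement amounts to polling the worker MCP until it reports a completed rollout, and only then starting the RHEL upgrade. The sketch below is not the check_mcp code from the step registry; it is a hedged Python illustration that assumes the pool is read with `oc get machineconfigpool worker -o json` and uses the standard MCP status fields (`machineCount`, `updatedMachineCount`, and the `Updated`, `Updating`, `Degraded` conditions). The function names, timeout, and interval are assumptions.

```python
import json
import subprocess
import time
from typing import Callable


def mcp_fully_updated(mcp: dict) -> bool:
    """True when every machine in the pool runs the new MachineConfig."""
    status = mcp.get("status", {})
    conditions = {c["type"]: c["status"] for c in status.get("conditions", [])}
    return (
        conditions.get("Updated") == "True"
        and conditions.get("Updating") == "False"
        and conditions.get("Degraded", "False") != "True"
        and status.get("updatedMachineCount", 0) == status.get("machineCount", 0)
    )


def fetch_worker_mcp() -> dict:
    """Read the worker pool from the cluster (requires a logged-in oc client)."""
    out = subprocess.run(
        ["oc", "get", "machineconfigpool", "worker", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def wait_for_mcp(
    fetch: Callable[[], dict] = fetch_worker_mcp,
    timeout_s: float = 3600.0,   # assumed value, tune per cluster size
    interval_s: float = 30.0,    # assumed value
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the pool is fully updated; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if mcp_fully_updated(fetch()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A caller would run the RHEL upgrade playbook only when `wait_for_mcp()` returns True. The `fetch` and `sleep` parameters are injectable so the loop can be exercised without a cluster.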

P.S. For scenarios that require upgrading RHEL nodes immediately, without waiting for the MCP update (e.g., version-specific upgrade test jobs that need to work around MCO bugs), we can still use the existing UPGRADE_RHEL_WORKER_BEFOREHAND environment variable to bypass this workflow. This also covers potential edge cases in the future.
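
The bypass can be pictured as a small decision on the environment. This is only a sketch: the step names are hypothetical, and treating the value "true" as the enabling value of UPGRADE_RHEL_WORKER_BEFOREHAND is an assumption, not something stated in this PR.

```python
from typing import List, Mapping


def rhel_upgrade_steps(env: Mapping[str, str]) -> List[str]:
    """Return the ordered steps for the RHEL worker upgrade (hypothetical names)."""
    # Assumption: the variable enables the bypass when set to "true".
    beforehand = env.get("UPGRADE_RHEL_WORKER_BEFOREHAND", "false").strip().lower() == "true"
    if beforehand:
        # Bypass: upgrade RHEL workers without waiting for the MCP rollout.
        return ["upgrade_rhel_workers"]
    # Default after this PR: wait for the worker MCP first.
    return ["wait_for_worker_mcp", "upgrade_rhel_workers"]
```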

P.P.S. Added a workaround for the 4.14 to 4.16 cluster upgrade that installs the required cloud provider packages on the RHEL nodes, so that the step that unpauses the worker MCP can complete successfully.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 29, 2025
@openshift-ci openshift-ci bot requested review from jhou1 and jiajliu May 29, 2025 11:23
@openshift-ci-robot openshift-ci-robot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label May 29, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 29, 2025
@openshift-ci-robot openshift-ci-robot removed the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label May 29, 2025

gpei commented May 29, 2025

/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-4.14-cpou-upgrade-from-4.12-azure-ipi-workers-rhel8-f28

@openshift-ci-robot

@gpei: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

gpei commented May 29, 2025

/uncc jhou1
/uncc jiajliu

@openshift-ci openshift-ci bot removed request for jhou1 and jiajliu May 29, 2025 13:53

gpei commented May 30, 2025

/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-4.14-cpou-upgrade-from-4.12-azure-ipi-workers-rhel8-f28

@openshift-ci-robot

@gpei: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

gpei force-pushed the debug_0529 branch 2 times, most recently from 834374f to d344941, on May 31, 2025 00:32

gpei commented May 31, 2025

/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-4.14-cpou-upgrade-from-4.12-azure-ipi-workers-rhel8-f28

@openshift-ci-robot

@gpei: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

gpei changed the title from "debug RHEL upgrade cases" to "Wait for worker MCP to be updated before running RHEL worker upgrade" on May 31, 2025

gpei commented May 31, 2025

@jiajliu @jianlinliu could you please help review this change? It addresses the issue found in #65410 (comment).

jiajliu commented Jun 3, 2025

/lgtm

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 3, 2025

jianlinliu commented Jun 3, 2025

> For scenarios requiring immediate RHEL node upgrades without waiting for MCP updates (e.g., specific version upgrade test jobs to workaround MCO bugs), we still could use the existing UPGRADE_RHEL_WORKER_BEFOREHAND environment variable to bypass this workflow for potential edge cases in the future.

From openshift-upgrade-qe-test-cpou chain:

  - ref: cucushift-upgrade-cpou-pause-worker-mcp
  - ref: cucushift-chainupgrade-toimage
  - ref: cucushift-upgrade-cpou-unpause-worker-mcp
  - ref: cucushift-upgrade-rhel-worker

Before https://github.com/openshift/release/pull/65410/files#diff-440cd1be9ac67cd04d044f301ea04d31142584464112b57a6273baf49d6920b3L100, it sounds like check_mcp had to be skipped on purpose, whereas in this PR it has to run. Given the step order in the cpou chain above, I guess the process would then fail on check_mcp in cucushift-upgrade-cpou-unpause-worker-mcp and never reach cucushift-upgrade-rhel-worker.

@gpei would you mind adding azure-ipi-workers-rhel8-f28 to openshift-openshift-tests-private-release-4.16__amd64-nightly-4.16-cpou-upgrade-from-4.14.yaml and openshift-openshift-tests-private-release-4.18__amd64-nightly-4.18-cpou-upgrade-from-4.16.yaml to double-check?

@openshift-ci openshift-ci bot removed lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 3, 2025

gpei commented Jun 3, 2025

/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-cpou-upgrade-from-4.14-azure-ipi-workers-rhel8-f28

@openshift-ci-robot

@gpei: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

gpei commented Jun 3, 2025

/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-cpou-upgrade-from-4.14-azure-ipi-workers-rhel8-f28 periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-4.14-cpou-upgrade-from-4.12-azure-ipi-workers-rhel8-f28

gpei commented Jun 5, 2025

/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-cpou-upgrade-from-4.14-azure-ipi-workers-rhel8-f28

@openshift-ci-robot

@gpei: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

gpei commented Jun 5, 2025

/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.18-amd64-nightly-4.18-cpou-upgrade-from-4.16-azure-ipi-workers-rhel8-f28

@openshift-ci-robot

@gpei: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-ci bot commented Jun 5, 2025

@gpei: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/rehearse/periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-4.14-cpou-upgrade-from-4.12-azure-ipi-workers-rhel8-f28 | 0ea154b | link | unknown | /pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-4.14-cpou-upgrade-from-4.12-azure-ipi-workers-rhel8-f28 |

gpei commented Jun 5, 2025

/pj-rehearse periodic-ci-openshift-openshift-tests-private-release-4.18-amd64-nightly-4.18-cpou-upgrade-from-4.16-azure-ipi-workers-rhel8-f28

@openshift-ci-robot

@gpei: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 6, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 6, 2025
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 6, 2025

gpei commented Jun 6, 2025

@openshift-ci-robot

[REHEARSALNOTIFIER]
@gpei: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

| Test name | Repo | Type | Reason |
| --- | --- | --- | --- |
| pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.16-nightly-x86-cpou-loaded-upgrade-from-4.14-loaded-upgrade-414to416-24nodes | openshift-eng/ocp-qe-perfscale-ci | presubmit | Registry content changed |
| pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.16-nightly-x86-cpou-loaded-upgrade-from-4.14-loaded-upgrade-414to416-120nodes | openshift-eng/ocp-qe-perfscale-ci | presubmit | Registry content changed |
| pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.16-nightly-x86-cpou-loaded-upgrade-from-4.14-loaded-upgrade-ipsec-414to416-120nodes | openshift-eng/ocp-qe-perfscale-ci | presubmit | Registry content changed |
| pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.18-nightly-x86-cpou-loaded-upgrade-from-4.16-loaded-upgrade-416to418-24nodes | openshift-eng/ocp-qe-perfscale-ci | presubmit | Registry content changed |
| pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.18-nightly-x86-cpou-loaded-upgrade-from-4.16-loaded-upgrade-416to418-120nodes | openshift-eng/ocp-qe-perfscale-ci | presubmit | Registry content changed |
| pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.18-nightly-x86-cpou-loaded-upgrade-from-4.16-loaded-upgrade-ipsec-416to418-120nodes | openshift-eng/ocp-qe-perfscale-ci | presubmit | Registry content changed |
| pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-metal-4.18-nightly-multi-cpou-loaded-upgrade-from-4.16-loaded-upgrade-multi-416to418-3nodes | openshift-eng/ocp-qe-perfscale-ci | presubmit | Registry content changed |
| pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.14-nightly-x86-cpou-loaded-upgrade-from-4.12-loaded-upgrade-412to414-24nodes | openshift-eng/ocp-qe-perfscale-ci | presubmit | Registry content changed |
| pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.14-nightly-x86-cpou-loaded-upgrade-from-4.12-loaded-upgrade-412to414-120nodes | openshift-eng/ocp-qe-perfscale-ci | presubmit | Registry content changed |
| pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.14-nightly-x86-cpou-loaded-upgrade-from-4.12-loaded-upgrade-ipsec-412to414-120nodes | openshift-eng/ocp-qe-perfscale-ci | presubmit | Registry content changed |
| pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.19-nightly-x86-cpou-loaded-upgrade-from-4.17-loaded-upgrade-417to419-24nodes | openshift-eng/ocp-qe-perfscale-ci | presubmit | Registry content changed |
| pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.19-nightly-x86-cpou-loaded-upgrade-from-4.17-loaded-upgrade-417to419-120nodes | openshift-eng/ocp-qe-perfscale-ci | presubmit | Registry content changed |
| pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.19-nightly-x86-cpou-loaded-upgrade-from-4.17-loaded-upgrade-ipsec-417to419-120nodes | openshift-eng/ocp-qe-perfscale-ci | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-4.18-nightly-x86-cpou-upgrade-4.16-loaded-upgrade-416to418-aws | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-4.18-nightly-x86-cpou-upgrade-4.16-loaded-upgrade-416to418-gcp | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.19-nightly-x86-cpou-loaded-upgrade-from-4.17-loaded-upgrade-ipsec-417to419-24nodes | N/A | periodic | Registry content changed |
| periodic-ci-openshift-openshift-tests-private-release-4.16-multi-nightly-4.16-cpou-upgrade-from-4.14-azure-ipi-arm-mixarch-f28 | N/A | periodic | Registry content changed |
| periodic-ci-openshift-openshift-tests-private-release-4.18-amd64-rollback-nightly-ibmcloud-ipi-f28 | N/A | periodic | Registry content changed |
| periodic-ci-openshift-openshift-tests-private-release-4.18-amd64-nightly-4.18-upgrade-from-stable-4.17-baremetalds-ipi-ovn-ipv4-fips-canary-f28 | N/A | periodic | Registry content changed |
| periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-rollback-nightly-aws-ipi-byo-route53-f28 | N/A | periodic | Registry content changed |
| periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-rollback-nightly-ibmcloud-ipi-f28 | N/A | periodic | Registry content changed |
| periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-4.14-cpou-upgrade-from-4.12-ibmcloud-ipi-private-fips-f28 | N/A | periodic | Registry content changed |
| periodic-ci-openshift-openshift-tests-private-release-4.19-multi-nightly-4.19-upgrade-from-stable-4.15-azure-ipi-ovn-ipsec-arm-mixarch-f28 | N/A | periodic | Registry content changed |
| periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-rollback-nightly-aws-ipi-byo-route53-f28 | N/A | periodic | Registry content changed |
| periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-4.19-upgrade-from-stable-4.18-ibmcloud-ipi-proxy-private-rt-canary-f28 | N/A | periodic | Registry content changed |

A total of 161 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs.

A full list of affected jobs can be found here

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

jiajliu commented Jun 6, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 6, 2025
@jianlinliu

/lgtm

openshift-ci bot commented Jun 6, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gpei, jiajliu, jianlinliu

gpei commented Jun 6, 2025

/pj-rehearse ack

@openshift-ci-robot

@gpei: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci-robot openshift-ci-robot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Jun 6, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit f78bc50 into openshift:master Jun 6, 2025
10 checks passed
mehabhalodiya pushed a commit to mehabhalodiya/release that referenced this pull request Jun 12, 2025
…penshift#65478)

* Wait for worker MCP to be updated before running RHEL worker upgrade

* remove useless lines