Skip to content

OCPNODE-3029: WIP: handle required minimum kubelet version featuregate rollout #4929

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

haircommander
Copy link
Member

- What I did
When a cluster admin rolls out a minimum kubelet verison, the cluster-config-operator (eventually) will render
a new set of featuregates based on the new minimum version (enabling ones that are safe given the new minimum).

However, on a rollback, we need to make sure the MCS won't return a rendered config using the old features, in the very
odd case a cluster admin creates a newly old node and somehow overrides the osImage (unlikely). Thus, we use the similar condition
as MCS to serve the config--once one node has updated, we treat the MCP as using the new config. Then, we write the status to the
node config object so the node authorization plugin can allow older nodes that are now deemed safe.

- How to verify it

- Description for the changelog

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 19, 2025
@openshift-ci openshift-ci bot requested review from umohnani8 and yuqi-zhang March 19, 2025 20:43
Copy link
Contributor

openshift-ci bot commented Mar 19, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: haircommander
Once this PR has been reviewed and has the lgtm label, please assign djoshy for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@haircommander
Copy link
Member Author

some background openshift/enhancements#1766

@haircommander haircommander changed the title WIP: handle required minimum kubelet version featuregate rollout OCPNODE-3029: WIP: handle required minimum kubelet version featuregate rollout Mar 19, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 19, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 19, 2025
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Mar 19, 2025

@haircommander: This pull request references OCPNODE-3029 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.19.0" version, but no target version was set.

In response to this:

- What I did
When a cluster admin rolls out a minimum kubelet verison, the cluster-config-operator (eventually) will render
a new set of featuregates based on the new minimum version (enabling ones that are safe given the new minimum).

However, on a rollback, we need to make sure the MCS won't return a rendered config using the old features, in the very
odd case a cluster admin creates a newly old node and somehow overrides the osImage (unlikely). Thus, we use the similar condition
as MCS to serve the config--once one node has updated, we treat the MCP as using the new config. Then, we write the status to the
node config object so the node authorization plugin can allow older nodes that are now deemed safe.

- How to verify it

- Description for the changelog

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 21, 2025
}
go ctrl.writebackMinimumKubeletVersionIfAppropriate(updatedPools, renderedVersions, nodeConfig, func() ([]*mcfgv1.MachineConfigPool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any particular reason this is a separate goroutine?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it could be async in case it takes a while to have the mcps rollback and I didn't know what could be blocked by having it sync

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better solution might be to push this onto one of the work queues by doing something like this:

func (ctrl *Controller) syncKubeletConfig(key string) error {
	// Key lookup stuff above here.
	// Here, we detect that we need to do this for the current kubeletconfig, so we just kick that off.
	if ctrl.writeMinimumKubeletVersion[kubeletCfg.Name] {
		defer func() {
			delete(ctrl.writeMinimumKubeletVersion, kubeletCfg.Name)
		}()
		return ctrl.writebackMinimumKubeletVersionIfAppropriate(...)
	}

	//
	// Bulk of function above here.
	//

	if ctrl.isMinimumKubeletVersionWritebackNeeded(...) {
		// Here, we update our internal state to indicate that we need to perform this action
		// and enqueue the action.
		ctrl.writeMinimumKubeletVersion[kubeletcfg.Name] = true
		return ctrl.enqueue(kubeletCfg)
	}

	// End of function
}

We'll probably want to use a sync.Map since the work queue has multiple workers that could mutate the controller struct state at any given time.

if node.Spec.MinimumKubeletVersion == node.Status.MinimumKubeletVersion ||
node.Status.MinimumKubeletVersion == renderedKubeletVersion ||
node.Spec.MinimumKubeletVersion != renderedKubeletVersion {
klog.InfoS("Skipping writeback to nodes.config.Status.MinimumKubeletVersion because situation not correct",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: why are these conditions not correct? isn't this first one just saying that the spec and status match what we expect?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the status matches the spec, no need to do the update (though I probably should actually check renderedKubeletVersion there too)
for the second, if the status already matches the rendered then we've already done the write back for this rendered version
for the third: we're rendering a different version than what is set in the spec..
I think I have to rework this condition though

for _, mcp := range mcps {
if mcp.Status.UpdatedMachineCount > 0 {
poolsUpdated[mcp.Name] = true
} else if _, updated := poolsUpdated[mcp.Name]; !updated {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the logic here. The if condition checks if at least 1 node is updated, and if not, whether a previous loop had updated it? Wouldn't it be easier to check if the mcp itself is updated directly, and then break out of the loop if any pool doesn't meet either condition?

Note that we should be careful of pools with no nodes in them, which I think would be covered if we copy the server check

@haircommander haircommander force-pushed the required-minimum-kubelet-version branch from bdece5e to b6c210d Compare March 26, 2025 15:51
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 26, 2025
@haircommander haircommander force-pushed the required-minimum-kubelet-version branch from b6c210d to c0c5483 Compare March 26, 2025 19:51
}
go ctrl.writebackMinimumKubeletVersionIfAppropriate(updatedPools, renderedVersions, nodeConfig, func() ([]*mcfgv1.MachineConfigPool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better solution might be to push this onto one of the work queues by doing something like this:

func (ctrl *Controller) syncKubeletConfig(key string) error {
	// Key lookup stuff above here.
	// Here, we detect that we need to do this for the current kubeletconfig, so we just kick that off.
	if ctrl.writeMinimumKubeletVersion[kubeletCfg.Name] {
		defer func() {
			delete(ctrl.writeMinimumKubeletVersion, kubeletCfg.Name)
		}()
		return ctrl.writebackMinimumKubeletVersionIfAppropriate(...)
	}

	//
	// Bulk of function above here.
	//

	if ctrl.isMinimumKubeletVersionWritebackNeeded(...) {
		// Here, we update our internal state to indicate that we need to perform this action
		// and enqueue the action.
		ctrl.writeMinimumKubeletVersion[kubeletcfg.Name] = true
		return ctrl.enqueue(kubeletCfg)
	}

	// End of function
}

We'll probably want to use a sync.Map since the work queue has multiple workers that could mutate the controller struct state at any given time.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 1, 2025
@haircommander haircommander force-pushed the required-minimum-kubelet-version branch from c0c5483 to 9d06551 Compare April 10, 2025 18:47
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 10, 2025
@haircommander haircommander force-pushed the required-minimum-kubelet-version branch 2 times, most recently from 5c626cc to c1f6f9f Compare April 10, 2025 20:13
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 11, 2025
@haircommander haircommander force-pushed the required-minimum-kubelet-version branch from c1f6f9f to d827cc8 Compare April 11, 2025 17:31
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 11, 2025
@haircommander haircommander force-pushed the required-minimum-kubelet-version branch from d827cc8 to 9e14edf Compare April 11, 2025 20:27
When a cluster admin rolls out a minimum kubelet verison, the cluster-config-operator (eventually) will render
a new set of featuregates based on the new minimum version (enabling ones that are safe given the new minimum).

However, on a rollback, we need to make sure the MCS won't return a rendered config using the old features, in the very
odd case a cluster admin creates a newly old node and somehow overrides the osImage (unlikely). Thus, we use the similar condition
as MCS to serve the config--once one node has updated, we treat the MCP as using the new config. Then, we write the status to the
node config object so the node authorization plugin can allow older nodes that are now deemed safe.

Signed-off-by: Peter Hunt <[email protected]>
@haircommander haircommander force-pushed the required-minimum-kubelet-version branch from 9e14edf to 1a7271d Compare April 15, 2025 14:40
Copy link
Contributor

openshift-ci bot commented Apr 15, 2025

@haircommander: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/images 1a7271d link true /test images
ci/prow/okd-scos-e2e-aws-ovn 1a7271d link false /test okd-scos-e2e-aws-ovn
ci/prow/unit 1a7271d link true /test unit
ci/prow/bootstrap-unit 1a7271d link false /test bootstrap-unit
ci/prow/e2e-azure-ovn-upgrade-out-of-change 1a7271d link false /test e2e-azure-ovn-upgrade-out-of-change
ci/prow/e2e-gcp-op 1a7271d link true /test e2e-gcp-op
ci/prow/e2e-hypershift 1a7271d link true /test e2e-hypershift
ci/prow/e2e-aws-ovn 1a7271d link true /test e2e-aws-ovn
ci/prow/verify 1a7271d link true /test verify
ci/prow/e2e-aws-ovn-upgrade-out-of-change 1a7271d link false /test e2e-aws-ovn-upgrade-out-of-change
ci/prow/e2e-aws-ovn-upgrade 1a7271d link true /test e2e-aws-ovn-upgrade
ci/prow/e2e-agent-compact-ipv4 1a7271d link false /test e2e-agent-compact-ipv4
ci/prow/e2e-gcp-op-single-node 1a7271d link true /test e2e-gcp-op-single-node
ci/prow/e2e-gcp-op-techpreview 1a7271d link false /test e2e-gcp-op-techpreview
ci/prow/e2e-gcp-op-ocl 1a7271d link false /test e2e-gcp-op-ocl

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants