🐛 Perform draining and volume detachment once until completion #11590
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
/area bootstrap
Force-pushed from df2f637 to 76d0706 (compare)
Force-pushed from 76d0706 to f472e1a (compare)
@@ -657,14 +657,26 @@ func (r *Reconciler) isNodeDrainAllowed(m *clusterv1.Machine) bool {
		return false
	}

	if conditions.IsTrue(m, clusterv1.DrainingSucceededCondition) {
We should not start to rely on this condition being set, because it's going to be deprecated/removed in v1beta2.
Also, this is probably a change in behavior for all VMs (previously a following reconciliation could also run through drain again).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I unified the check under the PreTerminateDeleteHookSucceededCondition
, which allows for both methods to run until completion. It is not listed in the proposal as a part of deprecation/removal AFAICT
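The unified check described above can be sketched as follows. This is a minimal, self-contained illustration: the `ConditionType` values mirror the names discussed in this thread, but the `Machine` struct and the `isTrue` helper are simplified stand-ins for the real Cluster API types, not the actual implementation.

```go
package main

import "fmt"

// ConditionType stands in for clusterv1 condition types; the constant
// names follow this PR discussion.
type ConditionType string

const (
	DrainingSucceededCondition               ConditionType = "DrainingSucceeded"
	PreTerminateDeleteHookSucceededCondition ConditionType = "PreTerminateDeleteHookSucceeded"
)

// Machine is a simplified stand-in that only tracks which conditions are true.
type Machine struct {
	TrueConditions map[ConditionType]bool
}

// isTrue is an illustrative replacement for conditions.IsTrue.
func isTrue(m *Machine, t ConditionType) bool {
	return m.TrueConditions[t]
}

// isNodeDrainAllowed sketches the PR's approach: the "already done" check
// is unified under the pre-terminate hook condition, so drain runs once
// until completion and is never re-entered afterwards.
func isNodeDrainAllowed(m *Machine) bool {
	if isTrue(m, PreTerminateDeleteHookSucceededCondition) {
		return false
	}
	return true
}

func main() {
	m := &Machine{TrueConditions: map[ConditionType]bool{}}
	fmt.Println(isNodeDrainAllowed(m)) // drain still needed

	m.TrueConditions[PreTerminateDeleteHookSucceededCondition] = true
	fmt.Println(isNodeDrainAllowed(m)) // skipped: hook already succeeded
}
```

Because the condition only flips to true after the hook phase completes, a later reconcile never re-attempts drain against a cluster whose API server may already be unreachable.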
It is not listed in the proposal as a part of deprecation/removal AFAICT
@Danil-Grigorev I think that was just not explicitly called out.
The proposal states the full list of new Machine conditions and then below
To better evaluate proposed changes, below you can find the list of current Machine's conditions:
Ready, InfrastructureReady, BootstrapReady, NodeHealthy, PreDrainDeleteHookSucceeded, VolumeDetachSucceeded, DrainingSucceeded.
(We mentioned the PreDrainDeleteHookSucceeded condition in the list of current conditions, but forgot to mention PreTerminateDeleteHookSucceeded)
Anyway. I talked to @fabriziopandini and want to propose an alternative:
- We already have .status.deletion.{nodeDrainStartTime,waitForNodeVolumeDetachStartTime}
- We would propose to also add .status.deletion.{waitForPreDrainHookStartTime,waitForPreTerminateHookStartTime}
- They should be set the same way as the other timeouts
- Then we can modify isNodeDrainAllowed & isNodeVolumeDetachingAllowed to skip drain / wait for volume detach if waitForPreTerminateHookStartTime is set
- Additionally it would be nice to skip drain / wait for volume detach if the InfraMachine either doesn't exist anymore or has a deletionTimestamp set

WDYT?
Signed-off-by: Danil-Grigorev <[email protected]>
Force-pushed from f472e1a to 1021f8e (compare)
Thanks for review @chrischdi 👍🏼
/lgtm
A backport would be very much appreciated 🙏
LGTM label has been added. Git tree hash: 0bcc510758d20e84eb7b2d56a889fcf26def4356
/area machine
Is this a control plane Node? Wouldn't this scenario make any other upcoming node deletion fail to query through the remote client as well?
Thanks @enxebre, I think in this case the deletion eventually passes. Once the etcd member is drained, the API server healthz check on the node fails, which causes exclusion from the LB. It is important to get the code to remove the infra machine, so it completes the deletion on the provider side. The Node is removed once API server connectivity is restored, and then the Machine follows. Thanks to @chrischdi's suggestion the behavior is replicated in our CI by setting 2 annotations #11591 (comment), and it works well as a temporary solution. But other providers may hit the same issue.
What this PR does / why we need it:
The draining logic continues to access the cluster even after the hook is removed or completed, causing a deadlock on machine removal in a non-KCP-backed control plane provider implementation.
In a cluster with kubelet local mode, completed draining and etcd membership removal make the API server unreachable externally, so the operation continues to error out even after success. This prevents the underlying machine from being removed.
The solution makes the PreTerminateHook agnostic to the provider, and ensures that completion of the operation does not lead to further attempts to access the cluster.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #11591
/area machine