🐛 Perform draining and volume detachment once until completion #11590
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
/area bootstrap
Force-pushed from df2f637 to 76d0706 (compare)
Force-pushed from 76d0706 to f472e1a (compare)
@@ -657,14 +657,26 @@ func (r *Reconciler) isNodeDrainAllowed(m *clusterv1.Machine) bool {
		return false
	}

	if conditions.IsTrue(m, clusterv1.DrainingSucceededCondition) {
We should not start to rely on this condition being set, because it's going to be deprecated/removed in v1beta2.
Also, this is probably a change in behavior for all VMs (previously a following reconciliation could also run through drain again).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I unified the check under the PreTerminateDeleteHookSucceededCondition
, which allows for both methods to run until completion. It is not listed in the proposal as a part of deprecation/removal AFAICT
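The unified check described above can be sketched as follows. This is a minimal, self-contained illustration: the `ConditionType` values mirror the names discussed in this thread, but the `Machine` struct and the `isTrue` helper are simplified stand-ins for the real Cluster API types, not the actual implementation.

```go
package main

import "fmt"

// ConditionType stands in for clusterv1 condition types; the constant
// names follow this PR discussion.
type ConditionType string

const (
	DrainingSucceededCondition               ConditionType = "DrainingSucceeded"
	PreTerminateDeleteHookSucceededCondition ConditionType = "PreTerminateDeleteHookSucceeded"
)

// Machine is a simplified stand-in that only tracks which conditions are true.
type Machine struct {
	TrueConditions map[ConditionType]bool
}

// isTrue is an illustrative replacement for conditions.IsTrue.
func isTrue(m *Machine, t ConditionType) bool {
	return m.TrueConditions[t]
}

// isNodeDrainAllowed sketches the PR's approach: the "already done" check
// is unified under the pre-terminate hook condition, so drain runs once
// until completion and is never re-entered afterwards.
func isNodeDrainAllowed(m *Machine) bool {
	if isTrue(m, PreTerminateDeleteHookSucceededCondition) {
		return false
	}
	return true
}

func main() {
	m := &Machine{TrueConditions: map[ConditionType]bool{}}
	fmt.Println(isNodeDrainAllowed(m)) // drain still needed

	m.TrueConditions[PreTerminateDeleteHookSucceededCondition] = true
	fmt.Println(isNodeDrainAllowed(m)) // skipped: hook already succeeded
}
```

Because the condition only flips to true after the hook phase completes, a later reconcile never re-attempts drain against a cluster whose API server may already be unreachable.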
It is not listed in the proposal as a part of deprecation/removal AFAICT
@Danil-Grigorev I think that was just not explicitly called out.
The proposal states the full list of new Machine conditions and then below
To better evaluate proposed changes, below you can find the list of current Machine's conditions:
Ready, InfrastructureReady, BootstrapReady, NodeHealthy, PreDrainDeleteHookSucceeded, VolumeDetachSucceeded, DrainingSucceeded.
(We mentioned the PreDrainDeleteHookSucceeded condition in the list of current conditions, but forgot to mention PreTerminateDeleteHookSucceeded)
Anyway. I talked to @fabriziopandini and want to propose an alternative:
- We already have .status.deletion.{nodeDrainStartTime,waitForNodeVolumeDetachStartTime}
- We would propose to also add .status.deletion.{waitForPreDrainHookStartTime,waitForPreTerminateHookStartTime}
- They should be set the same way as the other timeouts
- Then we can modify isNodeDrainAllowed & isNodeVolumeDetachingAllowed to skip drain / wait for volume detach if waitForPreTerminateHookStartTime is set
- Additionally it would be nice to skip drain / wait for volume detach if the InfraMachine either doesn't exist anymore or has a deletionTimestamp set

WDYT?
Signed-off-by: Danil-Grigorev <[email protected]>
Force-pushed from f472e1a to 1021f8e (compare)
Thanks for review @chrischdi 👍🏼
/lgtm
A backport would be very much appreciated 🙏
LGTM label has been added. Git tree hash: 0bcc510758d20e84eb7b2d56a889fcf26def4356
/area machine
Is this a control plane Node? Wouldn't this scenario make any other upcoming node deletion fail to query through the remote client as well?
Thanks @enxebre, I think in this case the deletion eventually passes. Once the etcd member is drained, the API server healthz check on the node fails, which causes exclusion from the LB. It is important to get the code to remove the infra machine, so it completes the deletion on the provider side. The Node is removed once API server connectivity is restored, and then the Machine follows. Thanks to @chrischdi's suggestion the behavior is replicated in our CI by setting 2 annotations #11591 (comment), and it works well as a temporary solution. But other providers may hit the same issue.
What this PR does / why we need it:
The draining logic continues to access the cluster even after the hook is removed or completed, causing a deadlock on machine removal in a non-KCP-backed control plane provider implementation.
In a cluster with kubelet local mode, completed draining and etcd membership removal make the API server unreachable externally, so the operation continues to error out even after success. This prevents the underlying machine from being removed.
The solution makes the PreTerminateHook agnostic to the provider, and ensures that completion of the operation does not lead to further attempts to access the cluster.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #11591
/area machine