Bug description
I have deployed Capsule with a single replica and noticed that if that single replica goes down[1], it brings the Kubernetes cluster down.
After reviewing existing issues, it seems the nodes.capsule.clastix.io webhook causes the issue when Capsule is unreachable. As per this comment, I set its failurePolicy to Ignore. Subsequently, the worker nodes recovered[2], but the master nodes moved to Ready,SchedulingDisabled status. From the logs[3], I observed that the issue persisted because Capsule was still down. To fix this, I had to set failurePolicy to Ignore for the owner.namespace.capsule.clastix.io mutating webhook as well and uncordon the master nodes[4].
Can anyone help me understand whether the behavior I encountered is expected when Capsule goes down? If so, how can it be avoided? Also, what Capsule functionality will be impacted by setting failurePolicy to Ignore for the owner.namespace.capsule.clastix.io mutating webhook?
Thanks in advance.
Steps to reproduce:
- Deploy Capsule with a single replica.
- Scale it to 0 or cause an OOM by reducing the resources of the pod.
- Eventually, the pods are evicted and the nodes go NotReady, because the kubelet can no longer update the node status through the kube-apiserver.
Workaround:
- Set failurePolicy of the nodes.capsule.clastix.io and owner.namespace.capsule.clastix.io webhooks to Ignore.
- Uncordon the nodes if required.
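For anyone who wants to script the workaround, here is a minimal sketch. It looks up the position of a webhook inside its configuration object and applies a JSON Patch to failurePolicy. The configuration object names in the example calls are assumptions, as is the placement of nodes.capsule.clastix.io in the validating configuration; the helper function names are made up for this sketch.

```shell
#!/bin/sh
# Sketch only: object names in the examples are assumptions. List the real ones with
#   kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
KUBECTL="${KUBECTL:-kubectl}"

# Read webhook names (one per line) on stdin and print the zero-based
# position of the name given as $1; print nothing if it is absent.
webhook_index() {
  awk -v want="$1" '$0 == want { print NR - 1; exit }'
}

# Print a JSON Patch that sets failurePolicy to $2 on the webhook at index $1.
failure_policy_patch() {
  printf '[{"op":"replace","path":"/webhooks/%s/failurePolicy","value":"%s"}]' "$1" "$2"
}

# Usage: set_failure_policy KIND CONFIG_NAME WEBHOOK_NAME POLICY
set_failure_policy() {
  idx=$("$KUBECTL" get "$1" "$2" \
    -o jsonpath='{range .webhooks[*]}{.name}{"\n"}{end}' | webhook_index "$3")
  if [ -z "$idx" ]; then
    echo "webhook $3 not found in $1/$2" >&2
    return 1
  fi
  "$KUBECTL" patch "$1" "$2" --type=json -p "$(failure_policy_patch "$idx" "$4")"
}

# Example calls (configuration object names are hypothetical):
#   set_failure_policy validatingwebhookconfiguration \
#     capsule-validating-webhook-configuration nodes.capsule.clastix.io Ignore
#   set_failure_policy mutatingwebhookconfiguration \
#     capsule-mutating-webhook-configuration owner.namespace.capsule.clastix.io Ignore
#   kubectl uncordon 10.239.0.121
```

A later Helm upgrade will probably re-render these objects and may revert the patch, so it would need to be reapplied or set through the chart.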
Expected behavior
- The Kubernetes cluster shouldn't be impacted if Capsule goes down.
Additional context
- Capsule version: 0.3.3
- Helm Chart version: capsule-0.4.5
- Kubernetes version: v1.25.15
[1]
{"L":"ERROR","T":"2024-07-16T05:55:47.125Z","C":"kubeutils/kube_utils.go:330","M":"failed to update node with newly added labels [failed try 1] [retrying in 20 seconds] : Internal error occurred: failed calling webhook \"nodes.capsule.clastix.io\": failed to call webhook: Post \"https://capsule-webhook-service.capsule-system.svc:443/nodes?timeout=30s\": dial tcp 10.11.113.114:443: connect: connection refused"}
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
10.239.0.121 Ready master 9d v1.25.15
10.239.0.122 Ready master 9d v1.25.15
10.239.0.123 Ready,SchedulingDisabled master 9d v1.25.15
10.239.0.124 NotReady worker 9d v1.25.15
10.239.0.125 NotReady worker 9d v1.25.15
[2]
# kubectl get nodes -w
NAME STATUS ROLES AGE VERSION
10.239.0.121 Ready,SchedulingDisabled master 9d v1.25.15
10.239.0.122 Ready,SchedulingDisabled master 9d v1.25.15
10.239.0.123 Ready,SchedulingDisabled master 9d v1.25.15
10.239.0.124 Ready worker 9d v1.25.15 <==
10.239.0.125 Ready worker 9d v1.25.15 <==
[3]
024-07-17 16:22:53] Name: \"kubernetes-dashboard\", Namespace: \"\" [2024-07-17 16:22:53] for: \"STDIN\": error when patching \"STDIN\": Internal error occurred: failed calling webhook \"owner.namespace.capsule.clastix.io\": failed to call webhook: Post \"https://capsule-webhook-service.capsule-system.svc:443/namespace-owner-reference?timeout=30s\": dial tcp 10.11.179.188:443: connect: connection refused],}"}
[4]
# kubectl get nodes -w
NAME STATUS ROLES AGE VERSION
10.239.0.121 Ready master 9d v1.25.15
10.239.0.122 Ready master 9d v1.25.15
10.239.0.123 Ready master 9d v1.25.15
10.239.0.124 Ready worker 9d v1.25.15
10.239.0.125 Ready worker 9d v1.25.15