Kubernetes cluster goes down if Capsule is down #1135

@pratik705

Bug description

I have deployed Capsule with a single replica and noticed that if that replica goes down [1], it brings the Kubernetes cluster down with it.

After reviewing existing issues, it seems the nodes.capsule.clastix.io webhook causes the problem when Capsule is unreachable. As per this comment, I set its failurePolicy to Ignore. Subsequently, the worker nodes recovered [2], but the master nodes moved to Ready,SchedulingDisabled status. From the logs [3], I observed that the issue persisted because Capsule was still down. To fix this, I also had to set failurePolicy to Ignore for the owner.namespace.capsule.clastix.io mutating webhook and uncordon the master nodes [4].

Can anyone help me understand whether the behavior I encountered is expected when Capsule goes down? If so, how can it be avoided? Also, which Capsule functionality is affected by setting failurePolicy to Ignore for the owner.namespace.capsule.clastix.io mutating webhook?

Thanks in advance.

Steps to reproduce:

  • Deploy Capsule with a single replica.
  • Scale it to 0 or cause an OOM by reducing the resources of the pod.
  • Eventually, the pods are evicted and the nodes go down, because the kubelet fails to update the node status through the Kubernetes API server.

Workaround:

  • Set failurePolicy to Ignore for the nodes.capsule.clastix.io and owner.namespace.capsule.clastix.io webhooks.
  • Uncordon the nodes if required.
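
The first workaround step can be sketched as a small helper that builds a JSON Patch for a webhook configuration object. This is only an illustration under stated assumptions: the input is a webhook configuration already loaded as a dictionary (for example from `kubectl get ... -o json`), the function name `build_failure_policy_patch` is hypothetical, and the configuration name in the example is a placeholder rather than the name the Helm chart actually uses.

```python
import json

# Webhook names taken from this issue.
TARGET_WEBHOOKS = {"nodes.capsule.clastix.io", "owner.namespace.capsule.clastix.io"}


def build_failure_policy_patch(config, targets=TARGET_WEBHOOKS, policy="Ignore"):
    """Return JSON Patch operations that set failurePolicy on the targeted webhooks.

    `config` is a Validating/MutatingWebhookConfiguration as a dict.
    Webhooks that already have the requested policy are skipped.
    """
    ops = []
    for index, webhook in enumerate(config.get("webhooks", [])):
        if webhook.get("name") in targets and webhook.get("failurePolicy") != policy:
            # "add" on an existing object member replaces its value (RFC 6902),
            # so this works whether or not failurePolicy is already set.
            ops.append({
                "op": "add",
                "path": f"/webhooks/{index}/failurePolicy",
                "value": policy,
            })
    return ops


if __name__ == "__main__":
    example = {
        "kind": "MutatingWebhookConfiguration",
        "metadata": {"name": "example-webhook-configuration"},  # placeholder name
        "webhooks": [
            {"name": "other.example.io", "failurePolicy": "Fail"},
            {"name": "owner.namespace.capsule.clastix.io", "failurePolicy": "Fail"},
        ],
    }
    print(json.dumps(build_failure_policy_patch(example)))
```

The printed patch could then be applied with `kubectl patch mutatingwebhookconfiguration <name> --type=json -p '<patch>'`, where `<name>` is whatever configuration object holds the webhook in the cluster. Note that a Helm upgrade may reset a manually patched failurePolicy.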

Expected behavior

  • The Kubernetes cluster should not be impacted if Capsule goes down.

Additional context

  • Capsule version: 0.3.3
  • Helm Chart version: capsule-0.4.5
  • Kubernetes version: v1.25.15

[1]

{"L":"ERROR","T":"2024-07-16T05:55:47.125Z","C":"kubeutils/kube_utils.go:330","M":"failed to update node with newly added labels [failed try 1] [retrying in 20 seconds] : Internal error occurred: failed calling webhook \"nodes.capsule.clastix.io\": failed to call webhook: Post \"https://capsule-webhook-service.capsule-system.svc:443/nodes?timeout=30s\": dial tcp 10.11.113.114:443: connect: connection refused"}

# kubectl get nodes
NAME           STATUS                     ROLES    AGE   VERSION
10.239.0.121   Ready                      master   9d    v1.25.15
10.239.0.122   Ready                      master   9d    v1.25.15
10.239.0.123   Ready,SchedulingDisabled   master   9d    v1.25.15
10.239.0.124   NotReady                   worker   9d    v1.25.15
10.239.0.125   NotReady                   worker   9d    v1.25.15

[2]

# kubectl get nodes -w
NAME           STATUS                     ROLES    AGE   VERSION
10.239.0.121   Ready,SchedulingDisabled   master   9d    v1.25.15
10.239.0.122   Ready,SchedulingDisabled   master   9d    v1.25.15
10.239.0.123   Ready,SchedulingDisabled   master   9d    v1.25.15
10.239.0.124   Ready                      worker   9d    v1.25.15 <==
10.239.0.125   Ready                      worker   9d    v1.25.15 <==

[3]

024-07-17 16:22:53] Name: \"kubernetes-dashboard\", Namespace: \"\" [2024-07-17 16:22:53] for: \"STDIN\": error when patching \"STDIN\": Internal error occurred: failed calling webhook \"owner.namespace.capsule.clastix.io\": failed to call webhook: Post \"https://capsule-webhook-service.capsule-system.svc:443/namespace-owner-reference?timeout=30s\": dial tcp 10.11.179.188:443: connect: connection refused],}"}

[4]

# kubectl  get nodes -w
NAME           STATUS   ROLES    AGE   VERSION
10.239.0.121   Ready    master   9d    v1.25.15
10.239.0.122   Ready    master   9d    v1.25.15
10.239.0.123   Ready    master   9d    v1.25.15
10.239.0.124   Ready    worker   9d    v1.25.15
10.239.0.125   Ready    worker   9d    v1.25.15
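
To tell at a glance which nodes still need attention (not ready, or cordoned and waiting for an uncordon), the `kubectl get nodes` listings above can be parsed with a few lines of Python. This is a rough sketch that assumes the default column layout shown in [1], [2], and [4]; the function name `nodes_needing_attention` is made up for illustration.

```python
def nodes_needing_attention(kubectl_output):
    """Map node name -> list of problems parsed from `kubectl get nodes` text output."""
    problems = {}
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and pasted shell prompts like "# kubectl ...".
        if len(fields) < 2 or fields[0] == "NAME" or fields[0].startswith("#"):
            continue
        name, status = fields[0], fields[1]
        conditions = status.split(",")  # e.g. "Ready,SchedulingDisabled"
        issues = []
        if "Ready" not in conditions:
            issues.append("not ready")
        if "SchedulingDisabled" in conditions:
            issues.append("cordoned")
        if issues:
            problems[name] = issues
    return problems


if __name__ == "__main__":
    # Listing [1] from this issue.
    sample = """
# kubectl get nodes
NAME           STATUS                     ROLES    AGE   VERSION
10.239.0.121   Ready                      master   9d    v1.25.15
10.239.0.122   Ready                      master   9d    v1.25.15
10.239.0.123   Ready,SchedulingDisabled   master   9d    v1.25.15
10.239.0.124   NotReady                   worker   9d    v1.25.15
10.239.0.125   NotReady                   worker   9d    v1.25.15
"""
    for node, issues in nodes_needing_attention(sample).items():
        print(node, ", ".join(issues))
```

Nodes reported as cordoned are the candidates for `kubectl uncordon <node>` once the webhooks are reachable again or set to Ignore.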

Labels

  • blocked-needs-validation: Issue need triage and validation
  • bug: Something isn't working