Description
Validation Checklist
- I confirm that this is a Kubeflow-related issue.
- I am reporting this in the appropriate repository.
- I have followed the Kubeflow installation guidelines.
- The issue report is detailed and includes version numbers where applicable.
- I have considered adding my company to the adopters page to support Kubeflow and help the community, since I expect help from the community for my issue (see 1. and 2.).
- This issue pertains to Kubeflow development.
- I am available to work on this issue.
- You can join the CNCF Slack and access our meetings at the Kubeflow Community website. Our channel on the CNCF Slack is #kubeflow-platform.
Version
master
Detailed Description
Description:
We've identified an issue where Pod Disruption Budgets (PDBs) for critical Knative core components, specifically:
- knative-serving/activator-pdb
- knative-serving/webhook-pdb
- knative-eventing/eventing-webhook
are configured with minAvailable: 80%. This default causes node drain operations to fail, because the corresponding Deployments typically run with replicas: 1.
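For reference, the current budgets can be listed directly with kubectl; on a default install they should show MIN AVAILABLE 80% and ALLOWED DISRUPTIONS 0 (exact output depends on the installed manifests):
kubectl get pdb -n knative-serving
kubectl get pdb -n knative-eventing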
Problem Diagnosis:
PDB Configuration: minAvailable: 80% (or any percentage that rounds up to 1 pod for a single-replica Deployment).
Deployment Configuration: replicas: 1 for the affected core components.
Conflict: 80% of 1 replica is 0.8, and Kubernetes rounds a percentage-based minAvailable up to the next whole pod, so the effective requirement is 1 pod. The PDB therefore never allows the single running pod to be evicted, and node drain operations get stuck.
This behavior impacts cluster maintenance, making it difficult to gracefully drain nodes without manual intervention or forcibly evicting critical Knative components.
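To make the arithmetic concrete, a PDB of the following shape is enough to block eviction of a single-replica Deployment. This is a minimal sketch; the label selector (app: activator) is an assumption for illustration, not a quote from the shipped manifests.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: activator-pdb
  namespace: knative-serving
spec:
  minAvailable: 80%        # ceil(0.8 * 1 replica) = 1, so disruptionsAllowed stays at 0
  selector:
    matchLabels:
      app: activator       # assumed label; matches the single activator pod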
Proposed Solutions:
We propose two primary approaches to ensure proper functionality and maintainability of Knative core components:
Change PDB minAvailable Default to 0 for these single-replica core components:
If minAvailable for these specific core-component PDBs were 0, the single replica could be evicted during a drain. This acknowledges that, while these components are critical, a brief unavailability during a node drain (especially for the webhook components, which are often stateless or quick to restart) is preferable to a stuck drain. It would simplify cluster operations significantly.
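Since the manifests are applied with kustomize, option 1 could be expressed as patches of roughly this shape. This is a sketch only: the PDB names come from the list above, but where such patches (or the PDB defaults themselves) would live in the repo is left open.
patches:
- target:
    kind: PodDisruptionBudget
    name: activator-pdb
    namespace: knative-serving
  patch: |-
    - op: replace
      path: /spec/minAvailable
      value: 0
- target:
    kind: PodDisruptionBudget
    name: webhook-pdb
    namespace: knative-serving
  patch: |-
    - op: replace
      path: /spec/minAvailable
      value: 0
- target:
    kind: PodDisruptionBudget
    name: eventing-webhook
    namespace: knative-eventing
  patch: |-
    - op: replace
      path: /spec/minAvailable
      value: 0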
Set PDB minAvailable Default to 1 AND ensure replicas >= 2 for the targeted Deployments:
If minAvailable is 1 for these PDBs, then their corresponding Deployments (activator, webhook, eventing-webhook) should be explicitly configured to run with replicas >= 2 (e.g., replicas: 2). This would allow one replica to be drained while ensuring another remains available, thus satisfying the PDB. This option would require a slight increase in resource consumption for core components but would ensure higher availability during drains.
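Option 2 would pair minAvailable: 1 with replicas: 2 on the matching Deployment. A sketch for the activator only, under the same caveat that the patch location is assumed:
patches:
- target:
    kind: Deployment
    name: activator
    namespace: knative-serving
  patch: |-
    - op: replace
      path: /spec/replicas
      value: 2
- target:
    kind: PodDisruptionBudget
    name: activator-pdb
    namespace: knative-serving
  patch: |-
    - op: replace
      path: /spec/minAvailable
      value: 1
The same pair of patches would apply to webhook/webhook-pdb and eventing-webhook; if an HPA also manages any of these Deployments, its minReplicas would need the same bump.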
Rationale for Not Using replicas: 5 with minAvailable: 80%:
While technically possible to satisfy the PDB by increasing replicas (e.g., replicas: 5 with minAvailable: 80% would allow 1 replica to be drained), we believe this is overly redundant for core Knative components. Deploying 5 replicas when 1 or 2 would suffice for stability and maintenance adds unnecessary resource overhead, which is contrary to efficient cluster management. Therefore, we do not advocate for such a high default replica count.
Expected Outcome:
Implementing one of the proposed solutions (or a suitable combination) would enable seamless node drain operations in Knative clusters, improving the overall maintainability and reliability of the platform. It would remove the current bottleneck caused by the PDB configuration for these core components without introducing excessive resource requirements.
Steps to Reproduce
- create cluster
cat <<EOF | kind create cluster --name=kubeflow --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:v1.34.0@sha256:7416a61b42b1662ca6ca89f02028ac133a309a2a30ba309614e8ec94d976dc5a
  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    apiServer:
      extraArgs:
        "service-account-issuer": "https://kubernetes.default.svc"
        "service-account-signing-key-file": "/etc/kubernetes/pki/sa.key"
EOF
- install kubeflow
while ! kustomize build example | kubectl apply --server-side --force-conflicts -f -; do echo "Retrying to apply resources"; sleep 20; done
- drain node
kubectl drain <worker-node>
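With the Knative PDBs in place, the drain hangs; the cause can be confirmed with standard kubectl commands (the error text below is the generic eviction-API message, quoted from memory rather than captured from this exact run):
# kubectl drain repeatedly logs something like:
#   error when evicting pods/"activator-..." -n "knative-serving": Cannot evict pod as it would violate the pod's disruption budget.
kubectl get pdb activator-pdb -n knative-serving -o jsonpath='{.status.disruptionsAllowed}{"\n"}'   # prints 0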
Screenshots or Videos (Optional)
No response