Description
What happened:
We continue to hit the (max) timeout on our validation webhook when applying ingress manifests.
failed to call webhook: Post "https://ingress-nginx-controller-admission.ingress.svc:443/networking/v1/ingresses?timeout=30s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
It is consistently high, around the 20s mark, and general load or several ingress applies in quick succession can push it to 30s, where our deploy pipelines start to fail.
The image above shows a graph of validation time, a metric exposed by the controller itself, over 24 hours earlier this week.
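For reference, a rough sketch of how these numbers can also be pulled ad hoc. The metrics port 10254 is taken from the service listed further down; the exact admission metric names may vary between controller versions, hence the broad grep:

# Port-forward the metrics endpoint of one controller pod
kubectl -n ingress port-forward deployment/nginx-ingress-controller 10254:10254 &
# Scrape it and keep only the admission-related metrics
curl -s http://localhost:10254/metrics | grep -i admission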
This is me adding a label, to illustrate one simple update:
torvald@surdeig ~ $ time kubectl patch ing <ingress> --type='json' -p='[{"op": "add", "path": "/metadata/labels/testing", "value": "testing"}]'
ingress.networking.k8s.io/<ingress> patched
real 0m17.724s
user 0m0.396s
sys 0m0.057s
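As a side note: assuming the webhook declares sideEffects: None (the upstream default), the same latency should be reproducible without persisting anything by doing a server-side dry run of the patch:

# A server-side dry run still goes through the admission chain, so it exercises
# the validation webhook without actually changing the ingress
time kubectl patch ing <ingress> --dry-run=server --type='json' -p='[{"op": "add", "path": "/metadata/labels/testing", "value": "testing"}]'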
This is in a medium-sized cluster:
- ~130 nodes
- 270 ingresses
- 3 nginx pods, each with 8GB RAM (request/limit) and 5 CPUs (request)
- ~1000 rps at peak (see graph below)
- 9.9 MB nginx config file (296k lines, 187 server_names, 4778 locations)
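The config-size figures above can be reproduced, roughly, from the rendered config inside one of the controller pods. Pod and container names are taken from the kubectl get output further down, and the grep patterns are approximate:

# Count bytes/lines, server_name directives and location blocks in the rendered config
kubectl -n ingress exec nginx-ingress-controller-5d66477fb7-jttwl -c nginx-ingress-controller -- \
  sh -c 'wc -cl /etc/nginx/nginx.conf; grep -c "server_name " /etc/nginx/nginx.conf; grep -c "location " /etc/nginx/nginx.conf'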
Request rate
Over the same time period as above.
Performance of pods
To comment on this: it looks and feels quite bearable. Spikes in CPU are assumed to be nginx reloads and validation runs. Over the same time period as above.
90-day trends:
The image above shows the number of ingresses over the last 90 days.
The image above shows the validation webhook duration over the last 90 days. This mostly supports organic growth of sorts, except for the quick change marked in the picture above; that has been tracked down to 10 ingresses (serving the same host) that went from 1 host to 3, so the collection of ~60 paths over 1 host became ~180 over 3 hosts.
See an example of such an ingress after the change.
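The actual manifest is attached; for a cluster-wide overview of how hosts and paths are spread over ingresses, something along these lines (assuming jq is available) prints host and path counts per ingress:

# namespace, name, number of hosts, number of paths
kubectl get ingress -A -o json \
  | jq -r '.items[] | [.metadata.namespace, .metadata.name, ([.spec.rules[]?.host] | length), ([.spec.rules[]?.http.paths[]?] | length)] | @tsv'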
What you expected to happen:
I've seen people mention far better performance than 20-30s on their validation webhook in other issues around here, and that with larger clusters and larger nginx config files. So my expectation would be in the 1-5s range.
This PR will probably help us in cases where multiple ingresses get applied at the same time - but one or a few single applies should probably not take 20s?
NGINX Ingress controller version
nginx/1.21.6, release v1.9.5
torvald@surdeig ~ $ kubectl exec -it nginx-ingress-controller-5d66477fb7-jttwl -- /nginx-ingress-controller --version
Defaulted container "nginx-ingress-controller" out of: nginx-ingress-controller, opentelemetry (init), sysctl (init), geoip-database-download (init)
-------------------------------------------------------------------------------
NGINX Ingress controller
Release: v1.9.5
Build: f503c4bb5fa7d857ad29e94970eb550c2bc00b7c
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.21.6
-------------------------------------------------------------------------------
Kubernetes version (use kubectl version):
torvald@surdeig ~ $ kubectl version --short
Client Version: v1.25.0
Kustomize Version: v4.5.7
Server Version: v1.27.10-gke.1055000
Environment:
- Cloud provider or hardware configuration: GCP, managed GKE; e2-custom-16-32768
- OS (e.g. from /etc/os-release): Container-Optimized OS with containerd (cos_containerd)
- Kernel (e.g. uname -a): 5.15.133+
- How was the ingress-nginx-controller installed: It probably originated from a helm chart once, but everything has evolved in our own git repo since then. I'll attach the relevant files.
- cat nginx-ingress-deployment-controller.yaml
- cat configmaps.yaml
- cat validatingwebhookconfiguration.yaml
- kubectl get -n ingress all -o wide:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/nginx-ingress-controller-5d66477fb7-8qtfs 1/1 Running 0 20h 10.4.117.224 gke-k8s-prod-k8s-prod-standard-v8-611652ec-5lt5 <none> <none>
pod/nginx-ingress-controller-5d66477fb7-jttwl 1/1 Running 0 3h9m 10.4.71.52 gke-k8s-prod-k8s-prod-standard-v8-68b36906-wgdl <none> <none>
pod/nginx-ingress-controller-5d66477fb7-wlw6t 1/1 Running 0 20h 10.4.1.143 gke-k8s-prod-k8s-prod-standard-v8-7c2e0d29-s7vz <none> <none>
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/ingress-nginx-controller-admission ClusterIP 10.6.12.178 <none> 443/TCP 2y198d app=nginx-ingress,component=controller
service/ingress-nginx-controller-collector-metrics ClusterIP 10.6.4.251 <none> 8888/TCP 574d app=nginx-ingress,component=controller
service/ingress-nginx-controller-metrics ClusterIP 10.6.6.244 <none> 10254/TCP 2y198d app=nginx-ingress,component=controller
service/nginx-ingress-controller LoadBalancer 10.6.8.95 <redacted> 80:31151/TCP,443:30321/TCP 2y198d app=nginx-ingress,component=controller
NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR
deployment.apps/nginx-ingress-controller 3/3 3 3 2y198d nginx-ingress-controller registry.k8s.io/ingress-nginx/controller:v1.9.5 app=nginx-ingress,component=controller
NAME DESIRED CURRENT READY AGE CONTAINERS IMAGES SELECTOR
replicaset.apps/nginx-ingress-controller-5d66477fb7 3 3 3 20h nginx-ingress-controller registry.k8s.io/ingress-nginx/controller:v1.9.5 app=nginx-ingress,component=controller,pod-template-hash=5d66477fb7
- Current state of ingress object, if applicable:
See an example of an ingress, the same as mentioned above in the «What happened» section.
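For completeness: the admission timeout (the 30s in the error above) and the failure policy are set on the ValidatingWebhookConfiguration and can be read back directly. The object name below is a guess based on the attached file, adjust as needed:

# Print name, timeoutSeconds and failurePolicy for each webhook in the configuration
kubectl get validatingwebhookconfiguration ingress-nginx-admission \
  -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.timeoutSeconds}{"\t"}{.failurePolicy}{"\n"}{end}'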
How to reproduce this issue:
I think reproducing this outside our environment would be unfeasible, but I'm happy to assist with more details.