Client.Timeout exceeded (30s) on validation webhooks when updating Ingress objects #11255




What happened:

We continue to hit the (max) timeout on our validation webhook when applying ingress manifests.

failed to call webhook: Post "https://ingress-nginx-controller-admission.ingress.svc:443/networking/v1/ingresses?timeout=30s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Validation time is consistently high, around the 20s mark. General load, or several ingress applies in quick succession, can push it to 30s, at which point deploy pipelines start to fail.

The image above graphs the validation time, a metric exposed by nginx itself, over 24 hours earlier this week.

This is me adding a label, to illustrate one simple update:

torvald@surdeig ~ $ time kubectl patch ing <ingress> --type='json' -p='[{"op": "add", "path": "/metadata/labels/testing", "value": "testing"}]'
<ingress> patched

real	0m17.724s
user	0m0.396s
sys	0m0.057s

This is in a medium-sized cluster:

  • ~130 nodes
  • 270 ingresses
  • 3 nginx pods, each with 8 GB RAM (request/limit) and 5 CPUs (request)
  • ~1000 rps at peak (see graph below)
  • 9.9 MB nginx config file (296k lines, 187 server_names, 4778 locations)
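
In case it helps others compare against their own clusters, figures like the config size above can be counted with standard tools. This is only a rough sketch and not necessarily how the numbers in this report were produced. It assumes the rendered config has already been copied out of a controller pod to a local file (for example with kubectl exec <pod> -- cat /etc/nginx/nginx.conf, assuming the default config path), and the function name nginx_conf_stats is made up for illustration. The patterns are simple line-start matches, so the counts may differ slightly from what a real nginx config parser would report.

```shell
# Hypothetical helper: print line count, number of server_name directives,
# and number of location blocks found in an nginx config file.
nginx_conf_stats() {
  conf="$1"
  # wc may pad its output with spaces on some platforms, so strip them.
  lines=$(wc -l < "$conf" | tr -d ' ')
  # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
  # keeps the function usable in scripts that run with "set -e".
  server_names=$(grep -cE '^[[:space:]]*server_name[[:space:]]' "$conf" || true)
  locations=$(grep -cE '^[[:space:]]*location[[:space:]]' "$conf" || true)
  printf 'lines=%s server_names=%s locations=%s\n' \
    "$lines" "$server_names" "$locations"
}
```

Usage would be something like nginx_conf_stats nginx.conf, where nginx.conf is a placeholder for wherever the config was saved. Note that this counts server_name directives, not unique host names.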

Request rate
Over the same time period as above.

Performance of pods
Overall this looks and feels quite bearable. The spikes in CPU usage are assumed to be nginx reloads and validation runs. Over the same time period as above.

90 days trends:
The image above shows the number of ingresses over the last 90 days.

The image above shows the validation webhook duration over the last 90 days. This mostly supports organic growth of sorts, except for the quick change marked in the picture above. We tracked that down to 10 ingresses (serving the same host) that changed from 1 host to 3, so the collection of ~60 paths over 1 host became ~180 over 3 hosts.

See an example of such an ingress after the change.

What you expected to happen:

I've seen people mention far better performance than 20-30s on their validation webhook in other issues around here, and that with larger clusters and larger nginx config files. So my expectation would be somewhere around the 1-5s mark.

This PR will probably help us in cases where multiple ingresses get applied at the same time, but a single apply, or a few of them, should probably not take 20s?

NGINX Ingress controller version

nginx/1.21.6, release v1.9.5
torvald@surdeig ~ $ kubectl exec -it nginx-ingress-controller-5d66477fb7-jttwl -- /nginx-ingress-controller --version  
Defaulted container "nginx-ingress-controller" out of: nginx-ingress-controller, opentelemetry (init), sysctl (init), geoip-database-download (init)
NGINX Ingress controller
  Release:       v1.9.5
  Build:         f503c4bb5fa7d857ad29e94970eb550c2bc00b7c
  nginx version: nginx/1.21.6

Kubernetes version (use kubectl version):

torvald@surdeig ~ $ kubectl version --short
Client Version: v1.25.0
Kustomize Version: v4.5.7
Server Version: v1.27.10-gke.1055000


  • Cloud provider or hardware configuration: GCP, managed GKE; e2-custom-16-32768
  • OS (e.g. from /etc/os-release): Container-Optimized OS with containerd (cos_containerd)
  • Kernel (e.g. uname -a): 5.15.133+
  • How was the ingress-nginx-controller installed: It probably originated from a Helm chart at some point, but everything has since evolved in our own git repo. I'll attach the relevant files.

NAME                                                           READY   STATUS    RESTARTS   AGE     IP             NODE                                              NOMINATED NODE   READINESS GATES
pod/nginx-ingress-controller-5d66477fb7-8qtfs                  1/1     Running   0          20h   gke-k8s-prod-k8s-prod-standard-v8-611652ec-5lt5   <none>           <none>
pod/nginx-ingress-controller-5d66477fb7-jttwl                  1/1     Running   0          3h9m     gke-k8s-prod-k8s-prod-standard-v8-68b36906-wgdl   <none>           <none>
pod/nginx-ingress-controller-5d66477fb7-wlw6t                  1/1     Running   0          20h     gke-k8s-prod-k8s-prod-standard-v8-7c2e0d29-s7vz   <none>           <none>

NAME                                                               TYPE           CLUSTER-IP    EXTERNAL-IP           PORT(S)                      AGE      SELECTOR
service/ingress-nginx-controller-admission                         ClusterIP   <none>                443/TCP                      2y198d   app=nginx-ingress,component=controller
service/ingress-nginx-controller-collector-metrics                 ClusterIP    <none>                8888/TCP                     574d     app=nginx-ingress,component=controller
service/ingress-nginx-controller-metrics                           ClusterIP    <none>                10254/TCP                    2y198d   app=nginx-ingress,component=controller
service/nginx-ingress-controller                                   LoadBalancer     <redacted>          80:31151/TCP,443:30321/TCP   2y198d   app=nginx-ingress,component=controller

NAME                                            READY   UP-TO-DATE   AVAILABLE   AGE      CONTAINERS                      IMAGES                                                                                      SELECTOR
deployment.apps/nginx-ingress-controller        3/3     3            3           2y198d   nginx-ingress-controller                                             app=nginx-ingress,component=controller

NAME                                                       DESIRED   CURRENT   READY   AGE      CONTAINERS                        IMAGES                                                                                                                         SELECTOR
replicaset.apps/nginx-ingress-controller-5d66477fb7        3         3         3       20h      nginx-ingress-controller                                                                                app=nginx-ingress,component=controller,pod-template-hash=5d66477fb7
  • Current state of ingress object, if applicable:

See an example of an ingress, the same as mentioned above in the «What happened» section.

How to reproduce this issue:

I think this would be unfeasible, but I'm happy to assist with more details.




