
Client.Timeout exceeded (30s) on validation webhooks when updating Ingress objects #11255

Description

@torvald

What happened:

We keep hitting the (max) 30s timeout on our validation webhook when applying ingress manifests:

failed to call webhook: Post "https://ingress-nginx-controller-admission.ingress.svc:443/networking/v1/ingresses?timeout=30s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Validation time is consistently high, around the 20s mark, and general load or ingresses applied in quick succession can push it to 30s, at which point deploy pipelines start to fail.
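For reference, the 30s in the error above is the webhook's configured timeoutSeconds, and 30 is also the maximum the API server allows, so there is no headroom to buy by raising it. A minimal way to check the configured value; the configuration name below (ingress-nginx-admission) is assumed from the standard chart and may differ in our setup:

# Print the admission webhook timeout (configuration name assumed from the standard chart).
kubectl get validatingwebhookconfiguration ingress-nginx-admission \
  -o jsonpath='{.webhooks[0].timeoutSeconds}{"\n"}'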

[Image: graph of validation time, from a metric exposed by nginx itself, over 24 hours earlier this week.]
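The controller exposes admission timings on its metrics port; a sketch of how to pull the raw series behind a graph like this, assuming the metric name prefix used by the controller's admission collector (worth verifying against our scrape config):

# Dump admission metrics from one controller pod via the metrics port (10254, per the service below).
kubectl port-forward nginx-ingress-controller-5d66477fb7-jttwl 10254:10254 &
curl -s localhost:10254/metrics | grep nginx_ingress_controller_admission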

This is me adding a label, to illustrate one simple update:

torvald@surdeig ~ $ time kubectl patch ing <ingress> --type='json' -p='[{"op": "add", "path": "/metadata/labels/testing", "value": "testing"}]'
ingress.networking.k8s.io/<ingress> patched

real	0m17.724s
user	0m0.396s
sys	0m0.057s
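To take apply-side noise out of the measurement, a server-side dry run should show roughly the same latency: dry-run still invokes validating webhooks that declare sideEffects: None, which the ingress-nginx admission webhook does by default (an assumption worth double-checking against our webhook config):

# Same patch as above, but nothing is persisted; the webhook is still called.
time kubectl patch ing <ingress> --type='json' --dry-run=server -p='[{"op": "add", "path": "/metadata/labels/testing", "value": "testing"}]'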

This is in a medium-sized cluster:

  • ~130 nodes
  • 270 ingresses
  • 3 nginx pods, each with 8 GB RAM (request/limit) and 5 CPUs (request)
  • ~1000 rps at peak (see graph below)
  • 9.9 MB nginx config file (296k lines, 187 server_names, 4778 locations; see the snippet after this list for one way to count these)
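One way to reproduce the config-file numbers; /etc/nginx/nginx.conf is the standard path inside the controller image:

# Bytes/lines of the rendered config, plus rough server_name and location counts.
kubectl exec -it nginx-ingress-controller-5d66477fb7-jttwl -- sh -c \
  'wc -cl /etc/nginx/nginx.conf; grep -c "server_name " /etc/nginx/nginx.conf; grep -c "location " /etc/nginx/nginx.conf'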

Request rate
[Image: request rate over the same time period as above.]

Performance of pods
[Image: controller pod resource usage over the same time period as above.]
To comment on this: it looks and feels quite bearable. The spikes in CPU are assumed to be nginx reloads and validation runs.
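For what it's worth, the reload spikes can be cross-checked against the controller logs; the controller logs a line on each successful backend reload (exact log text assumed from recent ingress-nginx releases):

# Count successful backend reloads over the last 24 hours in one controller pod.
kubectl logs nginx-ingress-controller-5d66477fb7-jttwl --since=24h | grep -c "successfully reloaded"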

90-day trends:
[Image: number of ingresses over the last 90 days.]

[Image: validation webhook duration over the last 90 days.]
This mostly supports organic growth of sorts, except for the quick change marked in the picture above; that has been tracked down to 10 ingresses (serving the same host) that went from 1 host to 3, so a collection of ~60 paths on 1 host became ~180 paths across 3 hosts.

See an example of such an ingress after the change.
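For comparison, the host/path fan-out of a given ingress can be counted like this (assuming jq is available):

# Count hosts and total paths in one ingress (networking.k8s.io/v1 layout).
kubectl get ing <ingress> -o json | jq '{hosts: ([.spec.rules[].host] | length), paths: ([.spec.rules[].http.paths[]] | length)}'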

What you expected to happen:

I've seen people mention far better performance than 20-30s on their validation webhook in other issues around here, and that with larger clusters and larger nginx config files. So my expectation would be in the 1-5s range.

This PR will probably help us in the cases where multiple ingresses are applied at the same time, but one or a few single applies should probably not take 20s?

NGINX Ingress controller version

nginx/1.21.6, release v1.9.5
torvald@surdeig ~ $ kubectl exec -it nginx-ingress-controller-5d66477fb7-jttwl -- /nginx-ingress-controller --version  
Defaulted container "nginx-ingress-controller" out of: nginx-ingress-controller, opentelemetry (init), sysctl (init), geoip-database-download (init)
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.9.5
  Build:         f503c4bb5fa7d857ad29e94970eb550c2bc00b7c
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.21.6
-------------------------------------------------------------------------------

Kubernetes version (use kubectl version):

torvald@surdeig ~ $ kubectl version --short
Client Version: v1.25.0
Kustomize Version: v4.5.7
Server Version: v1.27.10-gke.1055000

Environment:

  • Cloud provider or hardware configuration: GCP, managed GKE; e2-custom-16-32768
  • OS (e.g. from /etc/os-release): Container-Optimized OS with containerd (cos_containerd)
  • Kernel (e.g. uname -a): 5.15.133+
  • How was the ingress-nginx-controller installed: It probably originated from a helm chart once, but everything has evolved in our own git repo since then. I'll attach the relevant files.
NAME                                                           READY   STATUS    RESTARTS   AGE     IP             NODE                                              NOMINATED NODE   READINESS GATES
pod/nginx-ingress-controller-5d66477fb7-8qtfs                  1/1     Running   0          20h     10.4.117.224   gke-k8s-prod-k8s-prod-standard-v8-611652ec-5lt5   <none>           <none>
pod/nginx-ingress-controller-5d66477fb7-jttwl                  1/1     Running   0          3h9m    10.4.71.52     gke-k8s-prod-k8s-prod-standard-v8-68b36906-wgdl   <none>           <none>
pod/nginx-ingress-controller-5d66477fb7-wlw6t                  1/1     Running   0          20h     10.4.1.143     gke-k8s-prod-k8s-prod-standard-v8-7c2e0d29-s7vz   <none>           <none>

NAME                                                               TYPE           CLUSTER-IP    EXTERNAL-IP           PORT(S)                      AGE      SELECTOR
service/ingress-nginx-controller-admission                         ClusterIP      10.6.12.178   <none>                443/TCP                      2y198d   app=nginx-ingress,component=controller
service/ingress-nginx-controller-collector-metrics                 ClusterIP      10.6.4.251    <none>                8888/TCP                     574d     app=nginx-ingress,component=controller
service/ingress-nginx-controller-metrics                           ClusterIP      10.6.6.244    <none>                10254/TCP                    2y198d   app=nginx-ingress,component=controller
service/nginx-ingress-controller                                   LoadBalancer   10.6.8.95     <redacted>          80:31151/TCP,443:30321/TCP   2y198d   app=nginx-ingress,component=controller

NAME                                            READY   UP-TO-DATE   AVAILABLE   AGE      CONTAINERS                      IMAGES                                                                                      SELECTOR
deployment.apps/nginx-ingress-controller        3/3     3            3           2y198d   nginx-ingress-controller        registry.k8s.io/ingress-nginx/controller:v1.9.5                                             app=nginx-ingress,component=controller

NAME                                                       DESIRED   CURRENT   READY   AGE      CONTAINERS                        IMAGES                                                                                                                         SELECTOR
replicaset.apps/nginx-ingress-controller-5d66477fb7        3         3         3       20h      nginx-ingress-controller          registry.k8s.io/ingress-nginx/controller:v1.9.5                                                                                app=nginx-ingress,component=controller,pod-template-hash=5d66477fb7
  • Current state of ingress object, if applicable:

See an example of an ingress, the same one mentioned above in the «What happened» section.

How to reproduce this issue:

I think a self-contained reproduction would be infeasible, but I'm happy to assist with more details; a rough sketch of the bulk-apply scenario is below.
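Purely as a hypothetical sketch of the «applies in quick succession» case (names and host are made up; this is not a verified reproducer):

# Create 50 throwaway ingresses concurrently; each create goes through the validation webhook.
for i in $(seq 1 50); do
  kubectl create ingress "test-$i" --rule="test-$i.example.com/*=some-service:80" &
done
wait
# Cleanup afterwards:
# for i in $(seq 1 50); do kubectl delete ingress "test-$i"; done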

Metadata

Assignees

No one assigned

Labels

lifecycle/frozen, needs-kind, needs-priority, needs-triage, triage/needs-information
