
proxy-next-upstream (including default on error and timeout) does not always pick a different upstream depending on load balancer and concurrent requests #11852

Open
@marvin-roesch

Description


What happened:
When one backend pod fails under a condition covered by proxy_next_upstream (e.g. http_404 for easy testing) and there is a large volume of concurrent requests, a single request may be retried against the same backend for all of its tries instead of actually moving to the "next" backend. This reproducibly happens with the default round-robin balancer and most likely affects all balancer implementations.

What you expected to happen:
If a backend request fails due to one of the proxy_next_upstream conditions, it should be retried with at least one of the other available backends, regardless of the configured load balancer or any concurrent requests.

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): 1.11.2

Kubernetes version (use kubectl version): 1.28.10

Environment:

  • Cloud provider or hardware configuration: MacBook Pro with Apple M2

  • OS (e.g. from /etc/os-release): Ubuntu 22.04.4 via Multipass on macOS 14.5

  • Kernel (e.g. uname -a): 5.15.0-119-generic

  • Install tools:

    • microk8s
  • Basic cluster related info:

    • kubectl version
      Client Version: v1.31.0
      Kustomize Version: v5.4.2
      Server Version: v1.28.10
      
    • kubectl get nodes -o wide
      NAME          STATUS   ROLES    AGE   VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
      microk8s-vm   Ready    <none>   24h   v1.28.10   192.168.64.9   <none>        Ubuntu 22.04.4 LTS   5.15.0-119-generic   containerd://1.6.28
      
  • How was the ingress-nginx-controller installed:

    • microk8s enable ingress
  • Current State of the controller:

    • kubectl describe ingressclasses
      Name:         public
      Labels:       <none>
      Annotations:  ingressclass.kubernetes.io/is-default-class: true
      Controller:   k8s.io/ingress-nginx
      Events:       <none>
      
      Name:         nginx
      Labels:       <none>
      Annotations:  <none>
      Controller:   k8s.io/ingress-nginx
      Events:       <none>
      
    • kubectl -n ingress get all -o wide
      NAME                                          READY   STATUS    RESTARTS   AGE   IP            NODE          NOMINATED NODE   READINESS GATES
      pod/nginx-ingress-microk8s-controller-4hrss   1/1     Running   0          85m   10.1.254.88   microk8s-vm   <none>           <none>
      
      NAME                                               DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE   CONTAINERS               IMAGES                                             SELECTOR
      daemonset.apps/nginx-ingress-microk8s-controller   1         1         1       1            1           <none>          25h   nginx-ingress-microk8s   registry.k8s.io/ingress-nginx/controller:v1.11.2   name=nginx-ingress-microk8s
      
    • kubectl -n ingress describe po nginx-ingress-microk8s-controller-4hrss
      Name:             nginx-ingress-microk8s-controller-4hrss
      Namespace:        ingress
      Priority:         0
      Service Account:  nginx-ingress-microk8s-serviceaccount
      Node:             microk8s-vm/192.168.64.9
      Start Time:       Fri, 23 Aug 2024 09:16:22 +0200
      Labels:           controller-revision-hash=5489ccb55d
                        name=nginx-ingress-microk8s
                        pod-template-generation=3
      Annotations:      cni.projectcalico.org/containerID: 94904e61580ee1449befe245d5c84ce11f0b93fb3cda52f9a2a74e26ea81d17b
                        cni.projectcalico.org/podIP: 10.1.254.88/32
                        cni.projectcalico.org/podIPs: 10.1.254.88/32
      Status:           Running
      IP:               10.1.254.88
      IPs:
        IP:           10.1.254.88
      Controlled By:  DaemonSet/nginx-ingress-microk8s-controller
      Containers:
        nginx-ingress-microk8s:
          Container ID:  containerd://56f41296d707602f46a6f5429eb834e401fa9a884ed91106644c5e71f48c73aa
          Image:         registry.k8s.io/ingress-nginx/controller:v1.11.2
          Image ID:      registry.k8s.io/ingress-nginx/controller@sha256:d5f8217feeac4887cb1ed21f27c2674e58be06bd8f5184cacea2a69abaf78dce
          Ports:         80/TCP, 443/TCP, 10254/TCP
          Host Ports:    80/TCP, 443/TCP, 10254/TCP
          Args:
            /nginx-ingress-controller
            --configmap=$(POD_NAMESPACE)/nginx-load-balancer-microk8s-conf
            --tcp-services-configmap=$(POD_NAMESPACE)/nginx-ingress-tcp-microk8s-conf
            --udp-services-configmap=$(POD_NAMESPACE)/nginx-ingress-udp-microk8s-conf
            --ingress-class=public
             
            --publish-status-address=127.0.0.1
          State:          Running
            Started:      Fri, 23 Aug 2024 09:16:36 +0200
          Ready:          True
          Restart Count:  0
          Liveness:       http-get http://:10254/healthz delay=10s timeout=5s period=10s #success=1 #failure=3
          Readiness:      http-get http://:10254/healthz delay=0s timeout=5s period=10s #success=1 #failure=3
          Environment:
            POD_NAME:       nginx-ingress-microk8s-controller-4hrss (v1:metadata.name)
            POD_NAMESPACE:  ingress (v1:metadata.namespace)
          Mounts:
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rjbf6 (ro)
      Conditions:
        Type              Status
        Initialized       True 
        Ready             True 
        ContainersReady   True 
        PodScheduled      True 
      Volumes:
        lb-override:
          Type:      ConfigMap (a volume populated by a ConfigMap)
          Name:      nginx-load-balancer-override
          Optional:  false
        kube-api-access-rjbf6:
          Type:                    Projected (a volume that contains injected data from multiple sources)
          TokenExpirationSeconds:  3607
          ConfigMapName:           kube-root-ca.crt
          ConfigMapOptional:       <nil>
          DownwardAPI:             true
      QoS Class:                   BestEffort
      Node-Selectors:              <none>
      Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/not-ready:NoExecute op=Exists
                                   node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/unreachable:NoExecute op=Exists
                                   node.kubernetes.io/unschedulable:NoSchedule op=Exists
      Events:                      <none>
      
  • Current state of ingress object, if applicable:

    • kubectl -n default get all,ing -o wide
      NAME                                       READY   STATUS    RESTARTS   AGE   IP            NODE          NOMINATED NODE   READINESS GATES
      pod/next-upstream-repro-5d9bb8d6cc-zsckn   1/1     Running   0          84m   10.1.254.90   microk8s-vm   <none>           <none>
      pod/next-upstream-repro-5d9bb8d6cc-5bn7k   1/1     Running   0          84m   10.1.254.91   microk8s-vm   <none>           <none>
      
      NAME                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE   SELECTOR
      service/next-upstream-repro   ClusterIP   10.152.183.21   <none>        80/TCP    86m   app=next-upstream-repro
      
      NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES   SELECTOR
      deployment.apps/next-upstream-repro   2/2     2            2           86m   nginx        nginx    app=next-upstream-repro
      
      NAME                                             DESIRED   CURRENT   READY   AGE   CONTAINERS   IMAGES   SELECTOR
      replicaset.apps/next-upstream-repro-5d9bb8d6cc   2         2         2       84m   nginx        nginx    app=next-upstream-repro,pod-template-hash=5d9bb8d6cc
      
      NAME                                            CLASS   HOSTS     ADDRESS     PORTS   AGE
      ingress.networking.k8s.io/next-upstream-repro   nginx   foo.bar   127.0.0.1   80      86m
      
    • kubectl -n <appnamespace> describe ing <ingressname>
      Name:             next-upstream-repro
      Labels:           <none>
      Namespace:        default
      Address:          127.0.0.1
      Ingress Class:    nginx
      Default backend:  <default>
      Rules:
        Host        Path  Backends
        ----        ----  --------
        foo.bar     
                    /   next-upstream-repro:http (10.1.254.90:80,10.1.254.91:80)
      Annotations:  nginx.ingress.kubernetes.io/proxy-next-upstream: error http_404 timeout
      Events:       <none>
      
  • Others:

    • The backend service is configured to always respond with status 404

How to reproduce this issue:

Install minikube/kind

Install the ingress controller

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/baremetal/deploy.yaml

Install an application with at least 2 pods that will always respond with status 404

echo '
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: next-upstream-repro
    namespace: default
  spec:
    replicas: 2
    selector:
      matchLabels:
        app: next-upstream-repro
    template:
      metadata:
        labels:
          app: next-upstream-repro
      spec:
        containers:
        - image: nginx
          imagePullPolicy: IfNotPresent
          name: nginx
          ports:
          - containerPort: 80
          volumeMounts:
            - name: conf
              mountPath: /etc/nginx/conf.d
        volumes:
          - name: conf
            configMap:
              name: next-upstream-repro
  ---
  apiVersion: v1
  kind: Service
  metadata:
    name: next-upstream-repro
    namespace: default
  spec:
    ports:
      - name: http
        port: 80
        targetPort: 80
        protocol: TCP
    type: ClusterIP
    selector:
      app: next-upstream-repro
  ---
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: next-upstream-repro
    namespace: default
  data:
    default.conf: |
      server {
        listen       80;
        server_name  localhost;

        location = / {
          return 404 "$hostname\n";
        }
      }
' | kubectl apply -f -

Create an ingress which tries next upstream on 404

echo "
  apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    name: next-upstream-repro
    annotations:
      nginx.ingress.kubernetes.io/proxy-next-upstream: 'error http_404 timeout'
  spec:
    ingressClassName: nginx
    rules:
    - host: foo.bar
      http:
        paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: next-upstream-repro
              port:
                name: http
" | kubectl apply -f -

Make many requests in parallel

POD_NAME=$(kubectl get pods -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx -o name)
kubectl exec -it -n ingress-nginx $POD_NAME -- bash -c "seq 1 200 | xargs -I{} -n1 -P10 curl -H 'Host: foo.bar' localhost"

Observe in the ingress controller's access logs (kubectl logs -n ingress-nginx $POD_NAME) that many requests will have the same upstream in succession in $upstream_addr, e.g.

::1 - - [23/Aug/2024:08:49:42 +0000] "GET / HTTP/1.1" 404 1 "-" "curl/8.9.0" 70 0.000 [default-next-upstream-repro-http] [] 10.1.254.92:80, 10.1.254.92:80, 10.1.254.93:80 0, 0, 1 0.000, 0.000, 0.000 404, 404, 404 afa1e1e8964286bd7d1b7664f606bb2f
::1 - - [23/Aug/2024:08:53:21 +0000] "GET / HTTP/1.1" 404 1 "-" "curl/8.9.0" 70 0.001 [default-next-upstream-repro-http] [] 10.1.254.93:80, 10.1.254.93:80, 10.1.254.93:80 0, 0, 1 0.000, 0.000, 0.000 404, 404, 404 b753b1828cc200d3c95d6ecbc6ba80e6
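Lines like the second one above can be flagged programmatically. The following sketch (plain Python; the regex assumes the default ingress-nginx access log format shown above, where $upstream_addr follows the empty alternative-backend field "[]") checks whether a request reused an upstream across its tries:

```python
import re

# One access log line in the default ingress-nginx format (taken from the
# reproduction above).
line = ('::1 - - [23/Aug/2024:08:53:21 +0000] "GET / HTTP/1.1" 404 1 "-" '
        '"curl/8.9.0" 70 0.001 [default-next-upstream-repro-http] [] '
        '10.1.254.93:80, 10.1.254.93:80, 10.1.254.93:80 0, 0, 1 '
        '0.000, 0.000, 0.000 404, 404, 404 b753b1828cc200d3c95d6ecbc6ba80e6')

# $upstream_addr is the comma-separated address list right after the "[] "
# field; each entry is one try of the same request.
match = re.search(r'\[\] ((?:\d+\.\d+\.\d+\.\d+:\d+(?:, )?)+)', line)
addrs = match.group(1).split(', ')

# True when a retry hit an upstream that was already tried for this request.
repeated = len(set(addrs)) < len(addrs)
print(addrs, repeated)
```

Running this over the full log output of the reproduction shows many lines where all three tries hit the same endpoint.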

Anything else we need to know:
The problem is exacerbated when few backend pods (2 in the repro case) are hit by a large request volume concurrently. At its core, this is a conflict between global load balancing behaviour and per-request retries: for the default round-robin load balancer, for example, the balancer instance is shared by all requests (on a given nginx worker) for a particular backend.

Assuming a system with 2 backend endpoints for the sake of simplicity, the flow of information can be as follows:

  1. Request 1 reaches ingress nginx, gets routed to endpoint A by round robin balancer, waits for response from backend
  2. Round robin balancer state: Next endpoint is endpoint B
  3. Request 2 reaches ingress nginx, gets routed to endpoint B by round robin balancer, waits for response from backend
  4. Round robin balancer state: Next endpoint is endpoint A
  5. Response from endpoint A fails for request 1, proxy_next_upstream config requests another endpoint from the load balancing system, it gets routed to endpoint A by round robin balancer
  6. Round robin balancer state: Next endpoint is endpoint B
  7. Request 3 reaches ingress nginx, gets routed to endpoint B by round robin balancer, waits for response from backend
  8. Round robin balancer state: Next endpoint is endpoint A
  9. Response from endpoint B fails for request 2, proxy_next_upstream config requests another endpoint from the load balancing system, it gets routed to endpoint A by round robin balancer
  10. Responses from all endpoints for request 1, 2, and 3 succeed

As you can see, request 1 is handled only by endpoint A despite the proxy_next_upstream directive. Depending on the actual rate and ordering of requests, request 2 could have suffered the same fate, but request 3 came in before request 2's initial response failed, so the retry happens to land on the other endpoint in that case.
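The interleaving above can be sketched with a minimal simulation of a shared round-robin balancer (plain Python; the class and endpoint names are illustrative, not the actual ingress-nginx Lua implementation):

```python
class RoundRobin:
    """A single rotation cursor shared by all requests, as on an nginx worker."""

    def __init__(self, endpoints):
        self.endpoints = endpoints
        self.index = 0

    def pick(self):
        endpoint = self.endpoints[self.index]
        self.index = (self.index + 1) % len(self.endpoints)
        return endpoint


balancer = RoundRobin(["A", "B"])  # one shared instance for the backend

r1_first = balancer.pick()  # step 1: request 1 -> A
r2_first = balancer.pick()  # step 3: request 2 -> B
r1_retry = balancer.pick()  # step 5: request 1 retries and gets A again
r3_first = balancer.pick()  # step 7: request 3 -> B

print(r1_first, r1_retry)  # A A: the "next" upstream is the one that just failed
```

Because concurrent requests advance the same cursor between a request's first try and its retry, the retry can land back on the endpoint that just failed.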

This makes proxy-next-upstream extremely unreliable and its behaviour hard to predict. One approach to fixing it would be to make the Lua-based load balancing aware of which endpoints a request has already tried. The exact semantics are hard to nail down, however, since this might break the guarantees that some load balancing strategies aim to provide. On the other hand, having the next-upstream choice work reliably at all is invaluable for bridging requests over a failure scenario: a backend endpoint might become unreachable, and it should eventually be removed from load balancing once probes catch up to that fact. In the meantime, the default error timeout strategy should try the "next" available upstream for any request hitting that endpoint, but if everything aligns just wrong, the load balancer keeps returning the same endpoint, resulting in a 502 even though the system at large is perfectly capable of handling the request.
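One possible shape for a retry-aware balancer is sketched below (plain Python, hypothetical API, not the ingress-nginx Lua code): the pick method accepts the set of endpoints a request has already tried and skips them, falling back to plain rotation once everything has been tried.

```python
class RetryAwareRoundRobin:
    """Round robin whose pick() can skip endpoints a request already tried.

    Hypothetical sketch of the proposed fix, not the actual ingress-nginx API.
    """

    def __init__(self, endpoints):
        self.endpoints = endpoints
        self.index = 0

    def pick(self, tried=frozenset()):
        # Advance at most one full rotation looking for an untried endpoint.
        for _ in range(len(self.endpoints)):
            endpoint = self.endpoints[self.index]
            self.index = (self.index + 1) % len(self.endpoints)
            if endpoint not in tried:
                return endpoint
        # Every endpoint was already tried: fall back to plain rotation.
        endpoint = self.endpoints[self.index]
        self.index = (self.index + 1) % len(self.endpoints)
        return endpoint


balancer = RetryAwareRoundRobin(["A", "B"])
r1 = balancer.pick()                   # request 1 -> A
r2 = balancer.pick()                   # request 2 -> B
r1_retry = balancer.pick(tried={r1})   # retry skips A and reaches B
print(r1, r1_retry)
```

Even with the interleaving from the flow above, request 1's retry now reaches endpoint B, at the cost of perturbing the global rotation slightly; that trade-off is exactly the semantic question raised above.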

Labels: kind/bug, needs-priority, triage/accepted

    Type

    No type

    Projects

    • Status

      No status

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions