
proxy-next-upstream (including default on error and timeout) does not always pick a different upstream depending on load balancer and concurrent requests #11852

Open
@marvin-roesch

Description


What happened:
When one backend pod fails under a condition covered by proxy_next_upstream (e.g. http_404 for easy testing) and there is a large volume of concurrent requests, a single request may be retried against the same backend for all of its tries instead of actually moving to the "next" backend. This reproducibly happens with the default round-robin balancer and most likely affects all balancer implementations.

What you expected to happen:
If a backend request fails due to one of the proxy_next_upstream conditions, it should be retried with at least one of the other available backends, regardless of the configured load balancer or any concurrent requests.

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): 1.11.2

Kubernetes version (use kubectl version): 1.28.10

Environment:

  • Cloud provider or hardware configuration: MacBook Pro with Apple M2

  • OS (e.g. from /etc/os-release): Ubuntu 22.04.4 via Multipass on macOS 14.5

  • Kernel (e.g. uname -a): 5.15.0-119-generic

  • Install tools:

    • microk8s
  • Basic cluster related info:

    • kubectl version
      Client Version: v1.31.0
      Kustomize Version: v5.4.2
      Server Version: v1.28.10
      
    • kubectl get nodes -o wide
      NAME          STATUS   ROLES    AGE   VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
      microk8s-vm   Ready    <none>   24h   v1.28.10   192.168.64.9   <none>        Ubuntu 22.04.4 LTS   5.15.0-119-generic   containerd://1.6.28
      
  • How was the ingress-nginx-controller installed:

    • microk8s enable ingress
  • Current State of the controller:

    • kubectl describe ingressclasses
      Name:         public
      Labels:       <none>
      Annotations:  ingressclass.kubernetes.io/is-default-class: true
      Controller:   k8s.io/ingress-nginx
      Events:       <none>
      
      Name:         nginx
      Labels:       <none>
      Annotations:  <none>
      Controller:   k8s.io/ingress-nginx
      Events:       <none>
      
    • kubectl -n ingress get all -o wide
      NAME                                          READY   STATUS    RESTARTS   AGE   IP            NODE          NOMINATED NODE   READINESS GATES
      pod/nginx-ingress-microk8s-controller-4hrss   1/1     Running   0          85m   10.1.254.88   microk8s-vm   <none>           <none>
      
      NAME                                               DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE   CONTAINERS               IMAGES                                             SELECTOR
      daemonset.apps/nginx-ingress-microk8s-controller   1         1         1       1            1           <none>          25h   nginx-ingress-microk8s   registry.k8s.io/ingress-nginx/controller:v1.11.2   name=nginx-ingress-microk8s
      
    • kubectl -n ingress describe po nginx-ingress-microk8s-controller-4hrss
      Name:             nginx-ingress-microk8s-controller-4hrss
      Namespace:        ingress
      Priority:         0
      Service Account:  nginx-ingress-microk8s-serviceaccount
      Node:             microk8s-vm/192.168.64.9
      Start Time:       Fri, 23 Aug 2024 09:16:22 +0200
      Labels:           controller-revision-hash=5489ccb55d
                        name=nginx-ingress-microk8s
                        pod-template-generation=3
      Annotations:      cni.projectcalico.org/containerID: 94904e61580ee1449befe245d5c84ce11f0b93fb3cda52f9a2a74e26ea81d17b
                        cni.projectcalico.org/podIP: 10.1.254.88/32
                        cni.projectcalico.org/podIPs: 10.1.254.88/32
      Status:           Running
      IP:               10.1.254.88
      IPs:
        IP:           10.1.254.88
      Controlled By:  DaemonSet/nginx-ingress-microk8s-controller
      Containers:
        nginx-ingress-microk8s:
          Container ID:  containerd://56f41296d707602f46a6f5429eb834e401fa9a884ed91106644c5e71f48c73aa
          Image:         registry.k8s.io/ingress-nginx/controller:v1.11.2
          Image ID:      registry.k8s.io/ingress-nginx/controller@sha256:d5f8217feeac4887cb1ed21f27c2674e58be06bd8f5184cacea2a69abaf78dce
          Ports:         80/TCP, 443/TCP, 10254/TCP
          Host Ports:    80/TCP, 443/TCP, 10254/TCP
          Args:
            /nginx-ingress-controller
            --configmap=$(POD_NAMESPACE)/nginx-load-balancer-microk8s-conf
            --tcp-services-configmap=$(POD_NAMESPACE)/nginx-ingress-tcp-microk8s-conf
            --udp-services-configmap=$(POD_NAMESPACE)/nginx-ingress-udp-microk8s-conf
            --ingress-class=public
             
            --publish-status-address=127.0.0.1
          State:          Running
            Started:      Fri, 23 Aug 2024 09:16:36 +0200
          Ready:          True
          Restart Count:  0
          Liveness:       http-get http://:10254/healthz delay=10s timeout=5s period=10s #success=1 #failure=3
          Readiness:      http-get http://:10254/healthz delay=0s timeout=5s period=10s #success=1 #failure=3
          Environment:
            POD_NAME:       nginx-ingress-microk8s-controller-4hrss (v1:metadata.name)
            POD_NAMESPACE:  ingress (v1:metadata.namespace)
          Mounts:
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rjbf6 (ro)
      Conditions:
        Type              Status
        Initialized       True 
        Ready             True 
        ContainersReady   True 
        PodScheduled      True 
      Volumes:
        lb-override:
          Type:      ConfigMap (a volume populated by a ConfigMap)
          Name:      nginx-load-balancer-override
          Optional:  false
        kube-api-access-rjbf6:
          Type:                    Projected (a volume that contains injected data from multiple sources)
          TokenExpirationSeconds:  3607
          ConfigMapName:           kube-root-ca.crt
          ConfigMapOptional:       <nil>
          DownwardAPI:             true
      QoS Class:                   BestEffort
      Node-Selectors:              <none>
      Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/not-ready:NoExecute op=Exists
                                   node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/unreachable:NoExecute op=Exists
                                   node.kubernetes.io/unschedulable:NoSchedule op=Exists
      Events:                      <none>
      
  • Current state of ingress object, if applicable:

    • kubectl -n default get all,ing -o wide
      NAME                                       READY   STATUS    RESTARTS   AGE   IP            NODE          NOMINATED NODE   READINESS GATES
      pod/next-upstream-repro-5d9bb8d6cc-zsckn   1/1     Running   0          84m   10.1.254.90   microk8s-vm   <none>           <none>
      pod/next-upstream-repro-5d9bb8d6cc-5bn7k   1/1     Running   0          84m   10.1.254.91   microk8s-vm   <none>           <none>
      
      NAME                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE   SELECTOR
      service/next-upstream-repro   ClusterIP   10.152.183.21   <none>        80/TCP    86m   app=next-upstream-repro
      
      NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES   SELECTOR
      deployment.apps/next-upstream-repro   2/2     2            2           86m   nginx        nginx    app=next-upstream-repro
      
      NAME                                             DESIRED   CURRENT   READY   AGE   CONTAINERS   IMAGES   SELECTOR
      replicaset.apps/next-upstream-repro-5d9bb8d6cc   2         2         2       84m   nginx        nginx    app=next-upstream-repro,pod-template-hash=5d9bb8d6cc
      
      NAME                                            CLASS   HOSTS     ADDRESS     PORTS   AGE
      ingress.networking.k8s.io/next-upstream-repro   nginx   foo.bar   127.0.0.1   80      86m
      
    • kubectl -n <appnamespace> describe ing <ingressname>
      Name:             next-upstream-repro
      Labels:           <none>
      Namespace:        default
      Address:          127.0.0.1
      Ingress Class:    nginx
      Default backend:  <default>
      Rules:
        Host        Path  Backends
        ----        ----  --------
        foo.bar     
                    /   next-upstream-repro:http (10.1.254.90:80,10.1.254.91:80)
      Annotations:  nginx.ingress.kubernetes.io/proxy-next-upstream: error http_404 timeout
      Events:       <none>
      
  • Others:

    • The backend service is configured to always respond with status 404

How to reproduce this issue:

Install minikube/kind

Install the ingress controller

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/baremetal/deploy.yaml

Install an application with at least 2 pods that will always respond with status 404

echo '
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: next-upstream-repro
    namespace: default
  spec:
    replicas: 2
    selector:
      matchLabels:
        app: next-upstream-repro
    template:
      metadata:
        labels:
          app: next-upstream-repro
      spec:
        containers:
        - image: nginx
          imagePullPolicy: IfNotPresent
          name: nginx
          ports:
          - containerPort: 80
          volumeMounts:
            - name: conf
              mountPath: /etc/nginx/conf.d
        volumes:
          - name: conf
            configMap:
              name: next-upstream-repro
  ---
  apiVersion: v1
  kind: Service
  metadata:
    name: next-upstream-repro
    namespace: default
  spec:
    ports:
      - name: http
        port: 80
        targetPort: 80
        protocol: TCP
    type: ClusterIP
    selector:
      app: next-upstream-repro
  ---
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: next-upstream-repro
    namespace: default
  data:
    default.conf: |
      server {
        listen       80;
        server_name  localhost;

        location = / {
          return 404 "$hostname\n";
        }
      }
' | kubectl apply -f -

Create an ingress which tries next upstream on 404

echo "
  apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    name: next-upstream-repro
    annotations:
      nginx.ingress.kubernetes.io/proxy-next-upstream: 'error http_404 timeout'
  spec:
    ingressClassName: nginx
    rules:
    - host: foo.bar
      http:
        paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: next-upstream-repro
              port:
                name: http
" | kubectl apply -f -

Make many requests in parallel

POD_NAME=$(kubectl get pods -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx -o name)
kubectl exec -it -n ingress-nginx $POD_NAME -- bash -c "seq 1 200 | xargs -I{} -n1 -P10 curl -H 'Host: foo.bar' localhost"

Observe in the ingress controller's access logs (kubectl logs -n ingress-nginx $POD_NAME) that many requests will have the same upstream in succession in $upstream_addr, e.g.

::1 - - [23/Aug/2024:08:49:42 +0000] "GET / HTTP/1.1" 404 1 "-" "curl/8.9.0" 70 0.000 [default-next-upstream-repro-http] [] 10.1.254.92:80, 10.1.254.92:80, 10.1.254.93:80 0, 0, 1 0.000, 0.000, 0.000 404, 404, 404 afa1e1e8964286bd7d1b7664f606bb2f
::1 - - [23/Aug/2024:08:53:21 +0000] "GET / HTTP/1.1" 404 1 "-" "curl/8.9.0" 70 0.001 [default-next-upstream-repro-http] [] 10.1.254.93:80, 10.1.254.93:80, 10.1.254.93:80 0, 0, 1 0.000, 0.000, 0.000 404, 404, 404 b753b1828cc200d3c95d6ecbc6ba80e6
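Lines like the second one above can be flagged programmatically. The following sketch (plain Python; the regex assumes the default ingress-nginx access log format shown above, where $upstream_addr follows the empty alternative-backend field "[]") checks whether a request reused an upstream across its tries:

```python
import re

# One access log line in the default ingress-nginx format (taken from the
# reproduction above).
line = ('::1 - - [23/Aug/2024:08:53:21 +0000] "GET / HTTP/1.1" 404 1 "-" '
        '"curl/8.9.0" 70 0.001 [default-next-upstream-repro-http] [] '
        '10.1.254.93:80, 10.1.254.93:80, 10.1.254.93:80 0, 0, 1 '
        '0.000, 0.000, 0.000 404, 404, 404 b753b1828cc200d3c95d6ecbc6ba80e6')

# $upstream_addr is the comma-separated address list right after the "[] "
# field; each entry is one try of the same request.
match = re.search(r'\[\] ((?:\d+\.\d+\.\d+\.\d+:\d+(?:, )?)+)', line)
addrs = match.group(1).split(', ')

# True when a retry hit an upstream that was already tried for this request.
repeated = len(set(addrs)) < len(addrs)
print(addrs, repeated)
```

Running this over the full log output of the reproduction shows many lines where all three tries hit the same endpoint.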

Anything else we need to know:
The problem is exacerbated when few backend pods (2 in the repro case) are hit by a large request volume concurrently. At its core, this is a conflict between global load balancing behaviour and per-request retries: for the default round-robin load balancer, for example, the balancer instance is shared by all requests (on a given nginx worker) for a particular backend.

Assuming a system with 2 backend endpoints for the sake of simplicity, the flow of information can be as follows:

  1. Request 1 reaches ingress nginx, gets routed to endpoint A by round robin balancer, waits for response from backend
  2. Round robin balancer state: Next endpoint is endpoint B
  3. Request 2 reaches ingress nginx, gets routed to endpoint B by round robin balancer, waits for response from backend
  4. Round robin balancer state: Next endpoint is endpoint A
  5. Response from endpoint A fails for request 1, proxy_next_upstream config requests another endpoint from the load balancing system, it gets routed to endpoint A by round robin balancer
  6. Round robin balancer state: Next endpoint is endpoint B
  7. Request 3 reaches ingress nginx, gets routed to endpoint B by round robin balancer, waits for response from backend
  8. Round robin balancer state: Next endpoint is endpoint A
  9. Response from endpoint B fails for request 2, proxy_next_upstream config requests another endpoint from the load balancing system, it gets routed to endpoint A by round robin balancer
  10. Responses from all endpoints for request 1, 2, and 3 succeed

As you can see, request 1 is handled only by endpoint A despite the proxy_next_upstream directive. Depending on the actual rate and ordering of requests, request 2 could have suffered the same fate, but request 3 came in before request 2's initial response failed, so the retry happens to land on the other endpoint in that case.
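The interleaving above can be sketched with a minimal simulation of a shared round-robin balancer (plain Python; the class and endpoint names are illustrative, not the actual ingress-nginx Lua implementation):

```python
class RoundRobin:
    """A single rotation cursor shared by all requests, as on an nginx worker."""

    def __init__(self, endpoints):
        self.endpoints = endpoints
        self.index = 0

    def pick(self):
        endpoint = self.endpoints[self.index]
        self.index = (self.index + 1) % len(self.endpoints)
        return endpoint


balancer = RoundRobin(["A", "B"])  # one shared instance for the backend

r1_first = balancer.pick()  # step 1: request 1 -> A
r2_first = balancer.pick()  # step 3: request 2 -> B
r1_retry = balancer.pick()  # step 5: request 1 retries and gets A again
r3_first = balancer.pick()  # step 7: request 3 -> B

print(r1_first, r1_retry)  # A A: the "next" upstream is the one that just failed
```

Because concurrent requests advance the same cursor between a request's first try and its retry, the retry can land back on the endpoint that just failed.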

This makes proxy-next-upstream extremely unreliable and its behaviour hard to predict. One approach to fixing it would be to make the Lua-based load balancing aware of which endpoints a request has already tried. The exact semantics are hard to nail down, however, since this might break the guarantees that some load balancing strategies aim to provide. On the other hand, having the next-upstream choice work reliably at all is invaluable for bridging requests over a failure scenario: a backend endpoint might become unreachable, and it should eventually be removed from load balancing once probes catch up to that fact. In the meantime, the default error timeout strategy should try the "next" available upstream for any request hitting that endpoint, but if everything aligns just wrong, the load balancer keeps returning the same endpoint, resulting in a 502 even though the system at large is perfectly capable of handling the request.
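One possible shape for a retry-aware balancer is sketched below (plain Python, hypothetical API, not the ingress-nginx Lua code): the pick method accepts the set of endpoints a request has already tried and skips them, falling back to plain rotation once everything has been tried.

```python
class RetryAwareRoundRobin:
    """Round robin whose pick() can skip endpoints a request already tried.

    Hypothetical sketch of the proposed fix, not the actual ingress-nginx API.
    """

    def __init__(self, endpoints):
        self.endpoints = endpoints
        self.index = 0

    def pick(self, tried=frozenset()):
        # Advance at most one full rotation looking for an untried endpoint.
        for _ in range(len(self.endpoints)):
            endpoint = self.endpoints[self.index]
            self.index = (self.index + 1) % len(self.endpoints)
            if endpoint not in tried:
                return endpoint
        # Every endpoint was already tried: fall back to plain rotation.
        endpoint = self.endpoints[self.index]
        self.index = (self.index + 1) % len(self.endpoints)
        return endpoint


balancer = RetryAwareRoundRobin(["A", "B"])
r1 = balancer.pick()                   # request 1 -> A
r2 = balancer.pick()                   # request 2 -> B
r1_retry = balancer.pick(tried={r1})   # retry skips A and reaches B
print(r1, r1_retry)
```

Even with the interleaving from the flow above, request 1's retry now reaches endpoint B, at the cost of perturbing the global rotation slightly; that trade-off is exactly the semantic question raised above.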

Labels: kind/bug, needs-priority, triage/accepted

    Type

    No type

    Projects

    • Status

      No status

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions