Description
What happened:
When one backend pod fails under a condition covered by `proxy_next_upstream` (e.g. `http_404` for easy testing), and there is a large volume of requests, any one request may reuse the same backend for all tries rather than actually using the "next" backend. This definitely happens with the default round-robin balancer, and most likely with all balancer implementations.
What you expected to happen:
If a backend request fails due to one of the `proxy_next_upstream` conditions, it should be retried with at least one of the other available backends, regardless of the configured load balancer or any concurrent requests.
NGINX Ingress controller version (exec into the pod and run `nginx-ingress-controller --version`): 1.11.2
Kubernetes version (use `kubectl version`): 1.28.10
Environment:

- Cloud provider or hardware configuration: MacBook Pro with Apple M2
- OS (e.g. from /etc/os-release): Ubuntu 22.04.4 via Multipass on macOS 14.5
- Kernel (e.g. `uname -a`): 5.15.0-119-generic
- Install tools:
  - microk8s
- Basic cluster related info:

  `kubectl version`:

  ```
  Client Version: v1.31.0
  Kustomize Version: v5.4.2
  Server Version: v1.28.10
  ```

  `kubectl get nodes -o wide`:

  ```
  NAME          STATUS   ROLES    AGE   VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
  microk8s-vm   Ready    <none>   24h   v1.28.10   192.168.64.9   <none>        Ubuntu 22.04.4 LTS   5.15.0-119-generic   containerd://1.6.28
  ```
- How was the ingress-nginx-controller installed: `microk8s enable ingress`
- Current State of the controller:

  `kubectl describe ingressclasses`:

  ```
  Name:         public
  Labels:       <none>
  Annotations:  ingressclass.kubernetes.io/is-default-class: true
  Controller:   k8s.io/ingress-nginx
  Events:       <none>

  Name:         nginx
  Labels:       <none>
  Annotations:  <none>
  Controller:   k8s.io/ingress-nginx
  Events:       <none>
  ```

  `kubectl -n ingress get all -o wide`:

  ```
  NAME                                          READY   STATUS    RESTARTS   AGE   IP            NODE          NOMINATED NODE   READINESS GATES
  pod/nginx-ingress-microk8s-controller-4hrss   1/1     Running   0          85m   10.1.254.88   microk8s-vm   <none>           <none>

  NAME                                               DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE   CONTAINERS               IMAGES                                             SELECTOR
  daemonset.apps/nginx-ingress-microk8s-controller   1         1         1       1            1           <none>          25h   nginx-ingress-microk8s   registry.k8s.io/ingress-nginx/controller:v1.11.2   name=nginx-ingress-microk8s
  ```

  `kubectl -n ingress describe po nginx-ingress-microk8s-controller-4hrss`:

  ```
  Name:             nginx-ingress-microk8s-controller-4hrss
  Namespace:        ingress
  Priority:         0
  Service Account:  nginx-ingress-microk8s-serviceaccount
  Node:             microk8s-vm/192.168.64.9
  Start Time:       Fri, 23 Aug 2024 09:16:22 +0200
  Labels:           controller-revision-hash=5489ccb55d
                    name=nginx-ingress-microk8s
                    pod-template-generation=3
  Annotations:      cni.projectcalico.org/containerID: 94904e61580ee1449befe245d5c84ce11f0b93fb3cda52f9a2a74e26ea81d17b
                    cni.projectcalico.org/podIP: 10.1.254.88/32
                    cni.projectcalico.org/podIPs: 10.1.254.88/32
  Status:           Running
  IP:               10.1.254.88
  IPs:
    IP:           10.1.254.88
  Controlled By:  DaemonSet/nginx-ingress-microk8s-controller
  Containers:
    nginx-ingress-microk8s:
      Container ID:  containerd://56f41296d707602f46a6f5429eb834e401fa9a884ed91106644c5e71f48c73aa
      Image:         registry.k8s.io/ingress-nginx/controller:v1.11.2
      Image ID:      registry.k8s.io/ingress-nginx/controller@sha256:d5f8217feeac4887cb1ed21f27c2674e58be06bd8f5184cacea2a69abaf78dce
      Ports:         80/TCP, 443/TCP, 10254/TCP
      Host Ports:    80/TCP, 443/TCP, 10254/TCP
      Args:
        /nginx-ingress-controller
        --configmap=$(POD_NAMESPACE)/nginx-load-balancer-microk8s-conf
        --tcp-services-configmap=$(POD_NAMESPACE)/nginx-ingress-tcp-microk8s-conf
        --udp-services-configmap=$(POD_NAMESPACE)/nginx-ingress-udp-microk8s-conf
        --ingress-class=public
        --publish-status-address=127.0.0.1
      State:          Running
        Started:      Fri, 23 Aug 2024 09:16:36 +0200
      Ready:          True
      Restart Count:  0
      Liveness:       http-get http://:10254/healthz delay=10s timeout=5s period=10s #success=1 #failure=3
      Readiness:      http-get http://:10254/healthz delay=0s timeout=5s period=10s #success=1 #failure=3
      Environment:
        POD_NAME:       nginx-ingress-microk8s-controller-4hrss (v1:metadata.name)
        POD_NAMESPACE:  ingress (v1:metadata.namespace)
      Mounts:
        /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rjbf6 (ro)
  Conditions:
    Type              Status
    Initialized       True
    Ready             True
    ContainersReady   True
    PodScheduled      True
  Volumes:
    lb-override:
      Type:      ConfigMap (a volume populated by a ConfigMap)
      Name:      nginx-load-balancer-override
      Optional:  false
    kube-api-access-rjbf6:
      Type:                    Projected (a volume that contains injected data from multiple sources)
      TokenExpirationSeconds:  3607
      ConfigMapName:           kube-root-ca.crt
      ConfigMapOptional:       <nil>
      DownwardAPI:             true
  QoS Class:                   BestEffort
  Node-Selectors:              <none>
  Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                               node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                               node.kubernetes.io/not-ready:NoExecute op=Exists
                               node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                               node.kubernetes.io/unreachable:NoExecute op=Exists
                               node.kubernetes.io/unschedulable:NoSchedule op=Exists
  Events:                      <none>
  ```
- Current state of ingress object, if applicable:

  `kubectl -n default get all,ing -o wide`:

  ```
  NAME                                       READY   STATUS    RESTARTS   AGE   IP            NODE          NOMINATED NODE   READINESS GATES
  pod/next-upstream-repro-5d9bb8d6cc-zsckn   1/1     Running   0          84m   10.1.254.90   microk8s-vm   <none>           <none>
  pod/next-upstream-repro-5d9bb8d6cc-5bn7k   1/1     Running   0          84m   10.1.254.91   microk8s-vm   <none>           <none>

  NAME                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE   SELECTOR
  service/next-upstream-repro   ClusterIP   10.152.183.21   <none>        80/TCP    86m   app=next-upstream-repro

  NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES   SELECTOR
  deployment.apps/next-upstream-repro   2/2     2            2           86m   nginx        nginx    app=next-upstream-repro

  NAME                                             DESIRED   CURRENT   READY   AGE   CONTAINERS   IMAGES   SELECTOR
  replicaset.apps/next-upstream-repro-5d9bb8d6cc   2         2         2       84m   nginx        nginx    app=next-upstream-repro,pod-template-hash=5d9bb8d6cc

  NAME                                            CLASS   HOSTS     ADDRESS     PORTS   AGE
  ingress.networking.k8s.io/next-upstream-repro   nginx   foo.bar   127.0.0.1   80      86m
  ```

  `kubectl -n <appnamespace> describe ing <ingressname>`:

  ```
  Name:             next-upstream-repro
  Labels:           <none>
  Namespace:        default
  Address:          127.0.0.1
  Ingress Class:    nginx
  Default backend:  <default>
  Rules:
    Host        Path  Backends
    ----        ----  --------
    foo.bar
                /   next-upstream-repro:http (10.1.254.90:80,10.1.254.91:80)
  Annotations:  nginx.ingress.kubernetes.io/proxy-next-upstream: error http_404 timeout
  Events:       <none>
  ```
- Others:
  - The backend service is configured to always respond with status 404
How to reproduce this issue:

Install minikube/kind:

- Minikube: https://minikube.sigs.k8s.io/docs/start/
- Kind: https://kind.sigs.k8s.io/docs/user/quick-start/

Install the ingress controller:

```shell
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/baremetal/deploy.yaml
```
Install an application with at least 2 pods that will always respond with status 404
```shell
echo '
apiVersion: apps/v1
kind: Deployment
metadata:
  name: next-upstream-repro
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: next-upstream-repro
  template:
    metadata:
      labels:
        app: next-upstream-repro
    spec:
      containers:
      - image: nginx
        imagePullPolicy: IfNotPresent
        name: nginx
        ports:
        - containerPort: 80
        volumeMounts:
        - name: conf
          mountPath: /etc/nginx/conf.d
      volumes:
      - name: conf
        configMap:
          name: next-upstream-repro
---
apiVersion: v1
kind: Service
metadata:
  name: next-upstream-repro
  namespace: default
spec:
  ports:
  - name: http
    port: 80
    targetPort: 80
    protocol: TCP
  type: ClusterIP
  selector:
    app: next-upstream-repro
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: next-upstream-repro
  namespace: default
data:
  default.conf: |
    server {
      listen 80;
      server_name localhost;
      location = / {
        return 404 "$hostname\n";
      }
    }
' | kubectl apply -f -
```
Create an ingress which tries next upstream on 404
echo "
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: next-upstream-repro
annotations:
nginx.ingress.kubernetes.io/proxy-next-upstream: 'error http_404 timeout'
spec:
ingressClassName: nginx
rules:
- host: foo.bar
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: next-upstream-repro
port:
name: http
" | kubectl apply -f -
Make many requests in parallel
```shell
POD_NAME=$(kubectl get pods -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx -o name)
kubectl exec -it -n ingress-nginx $POD_NAME -- bash -c "seq 1 200 | xargs -I{} -n1 -P10 curl -H 'Host: foo.bar' localhost"
```
Observe in the ingress controller's access logs (`kubectl logs -n ingress-nginx $POD_NAME`) that many requests have the same upstream repeated in succession in `$upstream_addr` (each comma-separated address is one attempt), e.g.

```
::1 - - [23/Aug/2024:08:49:42 +0000] "GET / HTTP/1.1" 404 1 "-" "curl/8.9.0" 70 0.000 [default-next-upstream-repro-http] [] 10.1.254.92:80, 10.1.254.92:80, 10.1.254.93:80 0, 0, 1 0.000, 0.000, 0.000 404, 404, 404 afa1e1e8964286bd7d1b7664f606bb2f
::1 - - [23/Aug/2024:08:53:21 +0000] "GET / HTTP/1.1" 404 1 "-" "curl/8.9.0" 70 0.001 [default-next-upstream-repro-http] [] 10.1.254.93:80, 10.1.254.93:80, 10.1.254.93:80 0, 0, 1 0.000, 0.000, 0.000 404, 404, 404 b753b1828cc200d3c95d6ecbc6ba80e6
```
Anything else we need to know:

The problem is exacerbated when few backend pods (2 in the repro case) are hit by a large request volume concurrently. There is essentially a conflict between global load-balancing behaviour and per-request retries at play here: for the default round-robin load balancer, for example, the balancer instance is necessarily shared by all requests (on an nginx worker) for a particular backend.
Assuming a system with 2 backend endpoints for the sake of simplicity, the flow of information can be as follows (modeled in the sketch after the walkthrough):

- Request 1 reaches ingress-nginx, gets routed to endpoint A by the round-robin balancer, and waits for a response from the backend
  - Round-robin balancer state: next endpoint is endpoint B
- Request 2 reaches ingress-nginx, gets routed to endpoint B by the round-robin balancer, and waits for a response from the backend
  - Round-robin balancer state: next endpoint is endpoint A
- The response from endpoint A fails for request 1; the `proxy_next_upstream` config requests another endpoint from the load-balancing system, and the retry gets routed to endpoint A by the round-robin balancer
  - Round-robin balancer state: next endpoint is endpoint B
- Request 3 reaches ingress-nginx, gets routed to endpoint B by the round-robin balancer, and waits for a response from the backend
  - Round-robin balancer state: next endpoint is endpoint A
- The response from endpoint B fails for request 2; the `proxy_next_upstream` config requests another endpoint from the load-balancing system, and the retry gets routed to endpoint A by the round-robin balancer
- The responses to all remaining attempts for requests 1, 2, and 3 succeed

As you can see, request 1 is only ever handled by endpoint A despite the `proxy_next_upstream` directive. Depending on the actual rate and order of requests, request 2 could have faced a similar fate, but request 3 came in before the initial response failed, so the retry happens to work out in that case.
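To make the interleaving concrete, here is a minimal standalone Lua sketch of a round-robin cursor shared by all in-flight requests on a worker. It is a toy model under stated assumptions (the `Balancer` table, endpoint names, and print trace are illustrative), not the controller's actual balancer code:

```lua
-- Toy model of the walkthrough above; plain Lua, runnable standalone.
local Balancer = {}
Balancer.__index = Balancer

function Balancer.new(endpoints)
  return setmetatable({ endpoints = endpoints, idx = 0 }, Balancer)
end

-- Classic round robin: a single cursor shared by every request
-- handled by this worker, whether it is a first attempt or a retry.
function Balancer:pick()
  self.idx = self.idx % #self.endpoints + 1
  return self.endpoints[self.idx]
end

local lb = Balancer.new({ "A", "B" })

print("request 1, try 1 ->", lb:pick()) -- A
print("request 2, try 1 ->", lb:pick()) -- B; cursor now points back at A
-- Request 1's first attempt fails; proxy_next_upstream asks the balancer
-- for the "next" endpoint, but request 2 already advanced the shared cursor:
print("request 1, try 2 ->", lb:pick()) -- A, the endpoint that just failed
```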
This makes `proxy-next-upstream` extremely unreliable and causes it to behave in unexpected ways. One approach to fixing this would be to make the Lua-based load balancing aware of which endpoints have already been tried. The semantics are hard to nail down exactly, however, since this might break the guarantees that some load-balancing strategies aim to provide. On the other hand, having the next-upstream choice work reliably at all is invaluable for bridging over requests in a failure scenario: a backend endpoint might become unreachable, which should eventually result in it being removed from the load balancing once probes have caught up to that fact. In the meantime, the default `error timeout` strategy would try the "next" available upstream for any requests hitting that endpoint, but if everything aligns just right, the load balancer will keep returning the same endpoint, resulting in a 502 despite the system at large being perfectly capable of handling the request.
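As a sketch of what "aware of which endpoints have already been tried" could look like, here is the same toy balancer extended with a hypothetical `pick_untried` helper that remembers a per-request tried set; in the real controller that state would presumably live in something like ngx.ctx. Again a sketch under assumptions, not the actual implementation:

```lua
-- Toy model of the fix idea; plain Lua, runnable standalone. The `tried`
-- set is passed in explicitly here so the example is self-contained; in
-- the controller it would be per-request state (e.g. ngx.ctx), not a
-- real API of ingress-nginx.
local Balancer = { endpoints = { "A", "B" }, idx = 0 }

function Balancer:pick()
  self.idx = self.idx % #self.endpoints + 1
  return self.endpoints[self.idx]
end

-- Advance the shared cursor as usual, but reject endpoints this request
-- has seen before; fall back to plain round robin once all were tried.
function Balancer:pick_untried(tried)
  for _ = 1, #self.endpoints do
    local ep = self:pick()
    if not tried[ep] then
      tried[ep] = true
      return ep
    end
  end
  return self:pick()
end

local tried_by_request_1 = {}
print("try 1 ->", Balancer:pick_untried(tried_by_request_1))
-- Even if concurrent requests advance the cursor in between, the retry
-- can never be handed the endpoint that already failed for this request:
print("retry ->", Balancer:pick_untried(tried_by_request_1))
```

This trades strict round-robin ordering on retries for the guarantee that a retry never lands on an endpoint that already failed for the same request, which matches the expectation stated at the top of this issue.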