Description
What happened:
The setup:
- gRPC client/server with long-lived HTTP/2 connections
- `worker-shutdown-timeout: 30m`
- nginx reloads, starting the shutdown of the old workers
- a new version of the gRPC server is released, causing all of its pods to turn over
What you expected to happen:
I would expect the old workers to gracefully drain the long-lived connections, and to continue routing to valid targets in the meantime.
Instead, I find that gRPC clients start to get UNAVAILABLE or UNIMPLEMENTED errors, which I presume comes from traffic being routed by the old workers to pod IPs that no longer exist (or have been reassigned to different pods).
NGINX Ingress controller version (exec into the pod and run `nginx-ingress-controller --version`): v1.8.1

Kubernetes version (use `kubectl version`): 1.27
Environment:

- Cloud provider or hardware configuration: AWS EKS
- OS (e.g. from /etc/os-release): Bottlerocket
- Kernel (e.g. `uname -a`): 5.15.160
- Install tools: AWS EKS
- Basic cluster related info: v1.27
- How was the ingress-nginx-controller installed: Argo + Helm, chart version 4.7.1
```yaml
# ingress-nginx configuration
# https://github.com/kubernetes/ingress-nginx/blob/main/charts/ingress-nginx/values.yaml
ingress-nginx:
  controller:
    image:
      registry: registry.k8s.io
    priorityClassName: cluster-core-services
    kind: Deployment
    autoscaling:
      minReplicas: 9
      maxReplicas: 35
      targetCPUUtilizationPercentage: 65
    maxUnavailable: 2
    resources:
      requests:
        cpu: 3000m
        memory: 4500Mi
    # the reload pod needs to see the nginx pid in the controller pod, so enable sharing of pids
    shareProcessNamespace: true
    # Name of the ingress class to route through this controller
    ingressClassResource:
      name: https-internal
      default: true
      controllerValue: k8s.io/ingress-https-internal
    # Process IngressClass per name
    ingressClassByName: true
    # Use the dedicated ingress nodes
    nodeSelector:
      dedicated: ingress
    # Tolerate taints on dedicated ingress nodes.
    tolerations:
      - key: dedicated
        value: ingress
        operator: Equal
        effect: NoSchedule
    service:
      annotations:
        # Use a (TCP) Network Load Balancer.
        # https://docs.aws.amazon.com/eks/latest/userguide/network-load-balancing.html
        service.beta.kubernetes.io/aws-load-balancer-type: external
        service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
        service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: preserve_client_ip.enabled=true,deregistration_delay.connection_termination.enabled=true,deregistration_delay.timeout_seconds=120
      # "Internal" load balancing is enabled by default.
      externalTrafficPolicy: Local
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
        namespace: ingress-https-internal
        scrapeInterval: 15s
    ## Additional command line arguments to pass to nginx-ingress-controller
    ## E.g. to specify the default SSL certificate you can use
    ## extraArgs:
    ##   default-ssl-certificate: "<namespace>/<secret_name>"
    extraArgs:
      ingress-class: https-internal
      # Instructs the controller to wait this many seconds before sending the quit signal to nginx.
      # After receiving the quit signal, nginx will attempt to complete any requests before reaching the
      # terminating grace period.
      shutdown-grace-period: 360
    autoscaling:
      enabled: true
      minReplicas: 6
      maxReplicas: 25
      targetCPUUtilizationPercentage: 85
      targetMemoryUtilizationPercentage: null
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
            - type: Pods
              value: 1
              periodSeconds: 180
        scaleUp:
          stabilizationWindowSeconds: 300
          policies:
            - type: Pods
              value: 2
              periodSeconds: 60
    # see per-env values for resource configuration
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app.kubernetes.io/instance: ingress-https-internal
    maxUnavailable: 1
    terminationGracePeriodSeconds: 2100 # 35 mins
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/instance
                    operator: In
                    values:
                      - ingress-https-internal
              topologyKey: kubernetes.io/hostname
    allowSnippetAnnotations: true
    config:
      use-forwarded-headers: true
      worker-shutdown-timeout: 30m
      enable-underscores-in-headers: true
    admissionWebhooks:
      patch:
        image:
          registry: registry.k8s.io
```
How to reproduce this issue:
Still working on an easy way to reproduce this. A sketch of one possible approach is below.
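A hypothetical minimal setup (untested sketch; plain OpenResty rather than the actual controller) to check whether a per-worker timer stops ticking in an old worker after a reload:

```nginx
# untested sketch: mimic the balancer's periodic sync with a logging timer
worker_processes 1;
worker_shutdown_timeout 30m;  # analogous to the worker-shutdown-timeout config
error_log /dev/stderr notice;

events {}

http {
  init_worker_by_lua_block {
    ngx.timer.every(5, function(premature)
      if premature then
        return  -- defensive: nothing to clean up
      end
      ngx.log(ngx.NOTICE, "worker ", ngx.worker.pid(), ": sync tick")
    end)
  }

  server {
    listen 8080;

    location / {
      # hold a request open so the old worker stays alive across a reload
      content_by_lua_block {
        ngx.sleep(600)
        ngx.say("done")
      }
    }
  }
}
```

Start OpenResty with this config, run `curl localhost:8080/` to pin a request in the old worker, then send `nginx -s reload`; if the hypothesis holds, the old worker's "sync tick" lines stop while the held request keeps being served.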
Anything else we need to know:
https://github.com/kubernetes/ingress-nginx/blob/main/rootfs/etc/nginx/lua/balancer.lua#L297-L317

This seems to sync the list of endpoints periodically, based on the configuration handled by the Lua HTTP server (https://github.com/kubernetes/ingress-nginx/blob/main/rootfs/etc/nginx/lua/configuration.lua#L245-L248). That value is sent from the Go controller here: https://github.com/kubernetes/ingress-nginx/blob/main/internal/ingress/controller/nginx.go#L944-L987
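Condensed, the relevant wiring in balancer.lua looks roughly like this (a paraphrased sketch, not the verbatim source; `sync_backends`, `get_backends_data`, and `BACKENDS_SYNC_INTERVAL` match names in the repo, but the bodies here are abbreviated):

```lua
-- paraphrased sketch of balancer.lua's worker init (abbreviated, not verbatim)
local configuration = require("configuration")

local BACKENDS_SYNC_INTERVAL = 1  -- seconds

local backends = {}  -- per-worker cache of endpoints, keyed by backend name

local function sync_backends()
  -- read the JSON the Go controller POSTed to the Lua configuration
  -- endpoint and rebuild this worker's endpoint cache from it
  local backends_data = configuration.get_backends_data()
  if not backends_data then
    return
  end
  -- ... decode backends_data and repopulate `backends` ...
end

local function init_worker()
  sync_backends()  -- one immediate sync when the worker starts
  -- periodic refresh; per the ngx.timer.every docs quoted below, no new
  -- timers are scheduled once this worker starts exiting
  local _, err = ngx.timer.every(BACKENDS_SYNC_INTERVAL, sync_backends)
  if err then
    ngx.log(ngx.ERR, "error when setting up timer.every for sync_backends: ", err)
  end
end
```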
The timer set up in the worker init, though, uses ngx.timer.every (https://github.com/openresty/lua-nginx-module?tab=readme-ov-file#ngxtimerevery), whose documentation says that a

> timer will be created every delay seconds until the current Nginx worker process starts exiting

and that premature

> timer expiration happens when the Nginx worker process is trying to shut down, as in an Nginx configuration reload triggered by the HUP signal or in an Nginx server shutdown
Therefore I believe that if:
- worker-shutdown-timeout is long,
- an nginx reload is triggered,
- the backing pod IPs change during the worker-shutdown-timeout window (so the endpoints change), and
- a client is holding a connection open and sending requests,

then those requests will be routed to invalid targets, because the exiting worker no longer receives an updated list of endpoints (see the sketch below).
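To make that last step concrete, here is an illustrative sketch (hypothetical names and data, not actual ingress-nginx code) of what routing from a stale per-worker cache looks like:

```lua
-- illustrative only: a per-worker endpoint cache frozen at reload time
local backends = {
  ["default-grpc-server-50051"] = {
    endpoints = { { address = "10.0.1.5", port = "50051" } },
  },
}

-- With the sync timer no longer firing in this exiting worker, balance()
-- keeps picking 10.0.1.5 even after the deployment rolls; that IP may now
-- be unallocated or assigned to an unrelated pod, which would explain the
-- UNAVAILABLE/UNIMPLEMENTED errors seen by gRPC clients.
local function balance(backend_name)
  local backend = backends[backend_name]
  local peer = backend.endpoints[1]
  return peer.address .. ":" .. peer.port
end

print(balance("default-grpc-server-50051"))  -- always "10.0.1.5:50051"
```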