
Long worker-shutdown-timeout causes long-lived connections to fail #11515

Open
@James-Quigley

Description


What happened:

The setup:

  • gRPC client/server with long-lived HTTP/2 connections
  • worker-shutdown-timeout: 30m
  • nginx reloads, starting the shutdown of old workers
  • a new version of the gRPC server is released, causing all the pods to turn over

What you expected to happen:
I would expect the workers to gracefully try to terminate the long-lived connections, and to continue to route to valid targets in the meantime.

Instead, I find that gRPC clients start to get UNAVAILABLE or UNIMPLEMENTED errors, which I presume comes from traffic being routed by the old workers to IPs that no longer exist (or are now assigned to different pods).

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):
v1.8.1

Kubernetes version (use kubectl version): 1.27

Environment: EKS

  • Cloud provider or hardware configuration: AWS EKS

  • OS (e.g. from /etc/os-release): Bottlerocket

  • Kernel (e.g. uname -a): 5.15.160

  • Install tools: AWS EKS

  • Basic cluster related info: v1.27

  • How was the ingress-nginx-controller installed:

    • Argo + Helm. Chart version 4.7.1
# ingress-nginx configuration
# https://github.com/kubernetes/ingress-nginx/blob/main/charts/ingress-nginx/values.yaml

ingress-nginx:
  controller:
    image:
      registry: registry.k8s.io
    priorityClassName: cluster-core-services
    kind: Deployment

    autoscaling:
      minReplicas: 9
      maxReplicas: 35
      targetCPUUtilizationPercentage: 65

    maxUnavailable: 2
    resources:
      requests:
        cpu: 3000m
        memory: 4500Mi

    # the reload pod needs to see the nginx pid in the controller pod, so enable sharing of pids
    shareProcessNamespace: true

    # Name of the ingress class to route through this controller
    ingressClassResource:
      name: https-internal
      default: true
      controllerValue: k8s.io/ingress-https-internal

    # Process IngressClass per name
    ingressClassByName: true

    # Use the dedicated ingress nodes
    nodeSelector:
      dedicated: ingress

    # Tolerate taints on dedicated ingress nodes.
    tolerations:
      - key: dedicated
        value: ingress
        operator: Equal
        effect: NoSchedule

    service:
      annotations:
        # Use a (TCP) Network Load Balancer.
        # https://docs.aws.amazon.com/eks/latest/userguide/network-load-balancing.html
        service.beta.kubernetes.io/aws-load-balancer-type: external
        service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
        service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: preserve_client_ip.enabled=true,deregistration_delay.connection_termination.enabled=true,deregistration_delay.timeout_seconds=120
        # "Internal" load balancing is enabled by default.

      externalTrafficPolicy: Local

    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
        namespace: ingress-https-internal
        scrapeInterval: 15s

    ## Additional command line arguments to pass to nginx-ingress-controller
    ## E.g. to specify the default SSL certificate you can use
    ## extraArgs:
    ##   default-ssl-certificate: "<namespace>/<secret_name>"
    extraArgs:
      ingress-class: https-internal
      # Instructs the controller to wait this many seconds before sending the quit signal to nginx.
      # After receiving the quit signal, nginx will attempt to complete any requests before reaching the
      # terminating grace period.
      shutdown-grace-period: 360

    autoscaling:
      enabled: true
      minReplicas: 6
      maxReplicas: 25
      targetCPUUtilizationPercentage: 85
      targetMemoryUtilizationPercentage: null
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
            - type: Pods
              value: 1
              periodSeconds: 180
        scaleUp:
          stabilizationWindowSeconds: 300
          policies:
            - type: Pods
              value: 2
              periodSeconds: 60

    # see per-env values for resource configuration

    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app.kubernetes.io/instance: ingress-https-internal

    maxUnavailable: 1
    terminationGracePeriodSeconds: 2100  # 35 mins

    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/instance
                    operator: In
                    values:
                      - ingress-https-internal
              topologyKey: kubernetes.io/hostname

    allowSnippetAnnotations: true
    config:
      use-forwarded-headers: true
      worker-shutdown-timeout: 30m
      enable-underscores-in-headers: true

  admissionWebhooks:
    patch:
      image:
        registry: registry.k8s.io

How to reproduce this issue:
Still working on an easy way to reproduce this

Anything else we need to know:
https://github.com/kubernetes/ingress-nginx/blob/main/rootfs/etc/nginx/lua/balancer.lua#L297-L317
This seems to sync the list of endpoints periodically, based on the configuration handled by the Lua HTTP server (https://github.com/kubernetes/ingress-nginx/blob/main/rootfs/etc/nginx/lua/configuration.lua#L245-L248).
That configuration is sent from the Go controller here: https://github.com/kubernetes/ingress-nginx/blob/main/internal/ingress/controller/nginx.go#L944-L987
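
In rough outline, that sync path has the following shape (a simplified sketch for illustration only, not the exact project code; names like BACKENDS_SYNC_INTERVAL and configuration.get_backends_data are approximations of what the linked files do):

-- Simplified sketch of the endpoint sync loop registered by balancer.lua.
local _M = {}

local configuration = require("configuration")
local cjson = require("cjson.safe")

local BACKENDS_SYNC_INTERVAL = 1 -- seconds

local function sync_backends()
  -- Read the JSON payload that the Go controller POSTed to the Lua HTTP endpoint.
  local backends_data = configuration.get_backends_data()
  if not backends_data then
    return
  end

  local new_backends, err = cjson.decode(backends_data)
  if not new_backends then
    ngx.log(ngx.ERR, "could not parse backends data: ", err)
    return
  end

  -- ... rebuild the per-backend load balancers from new_backends ...
end

function _M.init_worker()
  sync_backends() -- initial sync when the worker starts

  -- Periodic re-sync. Per the lua-nginx-module docs quoted below, ngx.timer.every
  -- stops scheduling new runs once the worker process starts exiting, e.g. after
  -- a HUP-triggered reload.
  local ok, err = ngx.timer.every(BACKENDS_SYNC_INTERVAL, sync_backends)
  if not ok then
    ngx.log(ngx.ERR, "error setting up timer for syncing backends: ", err)
  end
end

return _M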

The timer set up in worker init, though, uses ngx.timer.every:
https://github.com/openresty/lua-nginx-module?tab=readme-ov-file#ngxtimerevery

"timer will be created every delay seconds until the current Nginx worker process starts exiting"
"timer expiration happens when the Nginx worker process is trying to shut down, as in an Nginx configuration reload triggered by the HUP signal or in an Nginx server shutdown"

Therefore I believe that if:

  • worker-shutdown-timeout is long,
  • an nginx reload is triggered,
  • the backing pod IPs change during the worker-shutdown-timeout window (and therefore the endpoints change),
  • and a client is holding a connection open and sending requests,

then those requests will be routed to invalid targets, because the exiting worker no longer receives an updated list of endpoints.
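
One way to check this hypothesis (a hypothetical diagnostic snippet, not anything ingress-nginx ships): register an extra ngx.timer.every heartbeat at worker init. After a reload, the heartbeat from the old, exiting workers should stop appearing in the log for the whole worker-shutdown-timeout window, even though those workers are still serving requests on established connections.

-- Hypothetical diagnostic, not part of ingress-nginx: a heartbeat timer that
-- should go silent in workers that are draining after a reload.
local _M = {}

local function heartbeat(premature)
  if premature then
    ngx.log(ngx.WARN, "heartbeat fired prematurely; worker ",
            ngx.worker.pid(), " is shutting down")
    return
  end
  ngx.log(ngx.INFO, "heartbeat from worker ", ngx.worker.pid(),
          ", exiting=", tostring(ngx.worker.exiting()))
end

function _M.init_worker()
  -- After a HUP reload, old workers keep serving established connections for up
  -- to worker-shutdown-timeout, but this timer should no longer fire in them.
  local ok, err = ngx.timer.every(5, heartbeat)
  if not ok then
    ngx.log(ngx.ERR, "failed to create heartbeat timer: ", err)
  end
end

return _M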

    Labels

    • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
    • needs-kind: Indicates a PR lacks a `kind/foo` label and requires one.
    • needs-priority
    • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
    • triage/needs-information: Indicates an issue needs more information in order to work on it.
