
Long worker-shutdown-timeout causes long-lived connections to fail #11515

Open
@James-Quigley

Description


What happened:

The setup:

  • gRPC client/server with long-lived HTTP/2 connections
  • worker-shutdown-timeout: 30m
  • nginx reloads, starting the shutdown of old workers
  • a new version of the gRPC server is released, causing all the pods to turn over

What you expected to happen:
I would expect the workers to gracefully try to terminate the long-lived connections, and to continue to route to valid targets in the meantime.

Instead, I find that gRPC clients start to get UNAVAILABLE or UNIMPLEMENTED errors, which I presume comes from traffic being routed by the old workers to IPs that no longer exist (or are now assigned to different pods).

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):
v1.8.1

Kubernetes version (use kubectl version): 1.27

Environment: EKS

  • Cloud provider or hardware configuration: AWS EKS

  • OS (e.g. from /etc/os-release): Bottlerocket

  • Kernel (e.g. uname -a): 5.15.160

  • Install tools: AWS EKS

  • Basic cluster related info: v1.27

  • How was the ingress-nginx-controller installed:

    • Argo + Helm. Chart version 4.7.1
# ingress-nginx configuration
# https://github.com/kubernetes/ingress-nginx/blob/main/charts/ingress-nginx/values.yaml

ingress-nginx:
  controller:
    image:
      registry: registry.k8s.io
    priorityClassName: cluster-core-services
    kind: Deployment

    autoscaling:
      minReplicas: 9
      maxReplicas: 35
      targetCPUUtilizationPercentage: 65

    maxUnavailable: 2
    resources:
      requests:
        cpu: 3000m
        memory: 4500Mi

    # the reload pod needs to see the nginx pid in the controller pod, so enable sharing of pids
    shareProcessNamespace: true

    # Name of the ingress class to route through this controller
    ingressClassResource:
      name: https-internal
      default: true
      controllerValue: k8s.io/ingress-https-internal

    # Process IngressClass per name
    ingressClassByName: true

    # Use the dedicated ingress nodes
    nodeSelector:
      dedicated: ingress

    # Tolerate taints on dedicated ingress nodes.
    tolerations:
      - key: dedicated
        value: ingress
        operator: Equal
        effect: NoSchedule

    service:
      annotations:
        # Use a (TCP) Network Load Balancer.
        # https://docs.aws.amazon.com/eks/latest/userguide/network-load-balancing.html
        service.beta.kubernetes.io/aws-load-balancer-type: external
        service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
        service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: preserve_client_ip.enabled=true,deregistration_delay.connection_termination.enabled=true,deregistration_delay.timeout_seconds=120
        # "Internal" load balancing is enabled by default.

      externalTrafficPolicy: Local

    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
        namespace: ingress-https-internal
        scrapeInterval: 15s

    ## Additional command line arguments to pass to nginx-ingress-controller
    ## E.g. to specify the default SSL certificate you can use
    ## extraArgs:
    ##   default-ssl-certificate: "<namespace>/<secret_name>"
    extraArgs:
      ingress-class: https-internal
      # Instructs the controller to wait this many seconds before sending the quit signal to nginx.
      # After receiving the quit signal, nginx will attempt to complete any requests before reaching the
      # terminating grace period.
      shutdown-grace-period: 360

    autoscaling:
      enabled: true
      minReplicas: 6
      maxReplicas: 25
      targetCPUUtilizationPercentage: 85
      targetMemoryUtilizationPercentage: null
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
            - type: Pods
              value: 1
              periodSeconds: 180
        scaleUp:
          stabilizationWindowSeconds: 300
          policies:
            - type: Pods
              value: 2
              periodSeconds: 60

    # see per-env values for resource configuration

    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app.kubernetes.io/instance: ingress-https-internal

    maxUnavailable: 1
    terminationGracePeriodSeconds: 2100  # 35 mins

    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/instance
                    operator: In
                    values:
                      - ingress-https-internal
              topologyKey: kubernetes.io/hostname

    allowSnippetAnnotations: true
    config:
      use-forwarded-headers: true
      worker-shutdown-timeout: 30m
      enable-underscores-in-headers: true

  admissionWebhooks:
    patch:
      image:
        registry: registry.k8s.io

How to reproduce this issue:
Still working on an easy way to reproduce this

Anything else we need to know:
https://github.com/kubernetes/ingress-nginx/blob/main/rootfs/etc/nginx/lua/balancer.lua#L297-L317
This seems to sync the list of endpoints periodically, based on the configuration handled by the Lua HTTP server (https://github.com/kubernetes/ingress-nginx/blob/main/rootfs/etc/nginx/lua/configuration.lua#L245-L248).
That configuration is sent from the Go controller here: https://github.com/kubernetes/ingress-nginx/blob/main/internal/ingress/controller/nginx.go#L944-L987
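
In rough outline, that sync path has the following shape (a simplified sketch for illustration only, not the exact project code; names like BACKENDS_SYNC_INTERVAL and configuration.get_backends_data are approximations of what the linked files do):

-- Simplified sketch of the endpoint sync loop registered by balancer.lua.
local _M = {}

local configuration = require("configuration")
local cjson = require("cjson.safe")

local BACKENDS_SYNC_INTERVAL = 1 -- seconds

local function sync_backends()
  -- Read the JSON payload that the Go controller POSTed to the Lua HTTP endpoint.
  local backends_data = configuration.get_backends_data()
  if not backends_data then
    return
  end

  local new_backends, err = cjson.decode(backends_data)
  if not new_backends then
    ngx.log(ngx.ERR, "could not parse backends data: ", err)
    return
  end

  -- ... rebuild the per-backend load balancers from new_backends ...
end

function _M.init_worker()
  sync_backends() -- initial sync when the worker starts

  -- Periodic re-sync. Per the lua-nginx-module docs quoted below, ngx.timer.every
  -- stops scheduling new runs once the worker process starts exiting, e.g. after
  -- a HUP-triggered reload.
  local ok, err = ngx.timer.every(BACKENDS_SYNC_INTERVAL, sync_backends)
  if not ok then
    ngx.log(ngx.ERR, "error setting up timer for syncing backends: ", err)
  end
end

return _M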

The timer set up in worker init, though, uses ngx.timer.every:
https://github.com/openresty/lua-nginx-module?tab=readme-ov-file#ngxtimerevery

"timer will be created every delay seconds until the current Nginx worker process starts exiting"
"timer expiration happens when the Nginx worker process is trying to shut down, as in an Nginx configuration reload triggered by the HUP signal or in an Nginx server shutdown"

Therefore I believe that if:

  • worker-shutdown-timeout is long,
  • an nginx reload is triggered,
  • the backing pod IPs change during the worker-shutdown-timeout window (and therefore the endpoints change),
  • and a client is holding a connection open and sending requests,

then those requests will be routed to invalid targets, because the exiting worker no longer receives an updated list of endpoints.
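
One way to check this hypothesis (a hypothetical diagnostic snippet, not anything ingress-nginx ships): register an extra ngx.timer.every heartbeat at worker init. After a reload, the heartbeat from the old, exiting workers should stop appearing in the log for the whole worker-shutdown-timeout window, even though those workers are still serving requests on established connections.

-- Hypothetical diagnostic, not part of ingress-nginx: a heartbeat timer that
-- should go silent in workers that are draining after a reload.
local _M = {}

local function heartbeat(premature)
  if premature then
    ngx.log(ngx.WARN, "heartbeat fired prematurely; worker ",
            ngx.worker.pid(), " is shutting down")
    return
  end
  ngx.log(ngx.INFO, "heartbeat from worker ", ngx.worker.pid(),
          ", exiting=", tostring(ngx.worker.exiting()))
end

function _M.init_worker()
  -- After a HUP reload, old workers keep serving established connections for up
  -- to worker-shutdown-timeout, but this timer should no longer fire in them.
  local ok, err = ngx.timer.every(5, heartbeat)
  if not ok then
    ngx.log(ngx.ERR, "failed to create heartbeat timer: ", err)
  end
end

return _M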

    Labels

    • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
    • needs-kind: Indicates a PR lacks a `kind/foo` label and requires one.
    • needs-priority
    • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
    • triage/needs-information: Indicates an issue needs more information in order to work on it.
