
Downtime after upgrading to 1.12.0 - Open "/tmp/nginx/nginx.pid" failed  #12645

Open
@thomaspeitz

Description

Keeping this open while our investigation is running; we cannot explain it yet.
We will add more details as soon as we understand it better.
Since it only broke a few environments, it is harder to debug.
In the meantime, take this as a warning to check your logs during the upgrade for lines mentioning:

/tmp/nginx/nginx.pid
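
For example, one quick way to check whether your controller pods logged this error after the upgrade (namespace and label selector are placeholders for your own installation, not a confirmed repro step):

```shell
# Hypothetical namespace/selector; adjust to your setup.
kubectl -n ingress-nginx logs -l app.kubernetes.io/name=ingress-nginx --tail=500 \
  | grep 'nginx.pid'
```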

What happened:
Upgraded our ingress controller via Helm from

version: 4.11.3

to

version: 4.12.0

This caused a major outage on 4 out of 10 clusters. We cannot yet understand why.
Kubernetes version 1.31.x
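
For context, the upgrade was essentially a Helm chart version bump along these lines (release name, namespace, and values file are placeholders, not our exact commands):

```shell
# Placeholder release/namespace/values; shown only to illustrate the 4.11.3 -> 4.12.0 bump.
helm repo update
helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --version 4.12.0 \
  -f values.yaml
```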

2025-01-09 07:45:35.583  nginx: [error] open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory)
2025-01-09 07:45:35.583  2025/01/09 07:45:35 [error] 215#215: open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory)
2025-01-09 07:45:35.583  nginx: [error] open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory)
2025-01-09 07:45:35.583  2025/01/09 07:45:35 [error] 215#215: open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory)
2025-01-09 07:45:35.583  nginx: [error] open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory)
2025-01-09 07:45:35.583  2025/01/09 07:45:35 [error] 215#215: open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory)
2025-01-09 07:45:35.000  name=ingress-nginx-general-r6-controller-565c5966f7-8p4rq kind=Pod objectAPIversion=v1 objectRV=2931225444 eventRV=2931226571 reportingcontroller=nginx-ingress-controller sourcecomponent=nginx-ingress-controller reason=RELOAD type=Warning count=1 msg="Error reloading NGINX: exit status 1\n2025/01/09 07:45:35 [notice] 215#215: signal process started\n2025/01/09 07:45:35 [error] 215#215: open() \"/tmp/nginx/nginx.pid\" failed (2: No such file or directory)\nnginx: [error] open() \"/tmp/nginx/nginx.pid\" failed (2: No such file or directory)\n"

What you expected to happen:
The ingress controller continues to work.

I am not sure yet what the root cause is; I am keeping this open while we investigate further.

Kubernetes version (use kubectl version):
v1.31.3-eks-59bf375

Environment:
AWS / EKS

  • Cloud provider or hardware configuration:
    AWS
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
    • Please mention how/where was the cluster created like kubeadm/kops/minikube/kind etc.
  • Basic cluster related info:
    • kubectl version
    • kubectl get nodes -o wide

More data will follow once we have done a full breakdown.

How to reproduce this issue:
Hard to reproduce, as it currently only happens on nodes that we cannot test against again.

Update 10.01, 00:10: Tested a deployment of the faulty version again. On some domains the controller served the Kubernetes Fake Certificate, while the old version served the real Let's Encrypt certificates. Looks like a TLS issue after the upgrade.
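
A quick way to check which certificate a given host is actually serving (the hostname below is a placeholder for one of the affected domains):

```shell
# Replace example.com with an affected host.
# The issuer/subject show whether the real Let's Encrypt certificate or the
# "Kubernetes Ingress Controller Fake Certificate" is being served.
echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null \
  | openssl x509 -noout -issuer -subject
```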
