Downtime after upgrading to 1.12.0 - Open "/tmp/nginx/nginx.pid" failed #12645
Description
Keeping this open while our investigation is running; we cannot explain it yet.
We will add more details as soon as we understand it better.
Since it only broke a few environments, it is harder to debug.
But consider this a warning to check your log lines during the upgrade for:
/tmp/nginx/nginx.pid
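For reference, a minimal way to scan the controller logs for this error during or right after an upgrade. The namespace and label selector are assumptions based on a default ingress-nginx Helm install; adjust them to your deployment:

```bash
# Assumed namespace/labels for a default ingress-nginx install; adjust as needed.
kubectl logs -n ingress-nginx \
  -l app.kubernetes.io/name=ingress-nginx \
  --since=30m --tail=-1 | grep -i "nginx.pid"
```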
What happened:
Upgraded our ingress-controller via helm from
version: 4.11.3
to
version: 4.12.0
This caused a major outage on 4 of 10 clusters. We cannot yet understand why.
Kubernetes version 1.31.x
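The exact upgrade command and values are not included here; as a rough sketch, the upgrade was of this shape (release name, namespace, and --reuse-values are assumptions, not our exact invocation):

```bash
# Sketch of the chart upgrade; release name and namespace are placeholders.
helm repo update
helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --version 4.12.0 \
  --reuse-values
```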
2025-01-09 07:45:35.583  nginx: [error] open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory)
2025-01-09 07:45:35.583  2025/01/09 07:45:35 [error] 215#215: open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory)
2025-01-09 07:45:35.583  nginx: [error] open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory)
2025-01-09 07:45:35.583  2025/01/09 07:45:35 [error] 215#215: open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory)
2025-01-09 07:45:35.583  nginx: [error] open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory)
2025-01-09 07:45:35.583  2025/01/09 07:45:35 [error] 215#215: open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory)
2025-01-09 07:45:35.000  name=ingress-nginx-general-r6-controller-565c5966f7-8p4rq kind=Pod objectAPIversion=v1 objectRV=2931225444 eventRV=2931226571 reportingcontroller=nginx-ingress-controller sourcecomponent=nginx-ingress-controller reason=RELOAD type=Warning count=1 msg="Error reloading NGINX: exit status 1\n2025/01/09 07:45:35 [notice] 215#215: signal process started\n2025/01/09 07:45:35 [error] 215#215: open() \"/tmp/nginx/nginx.pid\" failed (2: No such file or directory)\nnginx: [error] open() \"/tmp/nginx/nginx.pid\" failed (2: No such file or directory)\n"
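One thing we still want to verify is whether the PID path the reload is looking for actually exists inside the running controller container. A hypothetical check (the pod name is a placeholder; look for the pid directive in the printed config):

```bash
# Pod name is a placeholder; list the directory the reload expects the PID file in,
# then print the rendered nginx.conf and look for its "pid" directive.
kubectl exec -n ingress-nginx ingress-nginx-controller-xxxxx -- ls -l /tmp/nginx/
kubectl exec -n ingress-nginx ingress-nginx-controller-xxxxx -- cat /etc/nginx/nginx.conf
```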
What you expected to happen:
The ingress controller continues to work.
I am not sure yet what else to expect; I am keeping this open while we investigate further.
Kubernetes version (use kubectl version):
v1.31.3-eks-59bf375
Environment: AWS / EKS
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools (how/where the cluster was created, e.g. kubeadm/kops/minikube/kind):
- Basic cluster related info: kubectl version, kubectl get nodes -o wide
More data will follow once we have done a full breakdown.
How to reproduce this issue:
Hard to reproduce, as it is currently happening on nodes that we cannot test against again.
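In the meantime, one step that can be repeated is diffing the rendered manifests of the two chart versions to see which controller flags or config defaults changed between 4.11.3 and 4.12.0. This is only a sketch; the release name and (omitted) values files are placeholders:

```bash
# Render both chart versions with the same (placeholder) values and diff the output.
helm template ingress-nginx ingress-nginx/ingress-nginx --version 4.11.3 > /tmp/chart-4.11.3.yaml
helm template ingress-nginx ingress-nginx/ingress-nginx --version 4.12.0 > /tmp/chart-4.12.0.yaml
diff -u /tmp/chart-4.11.3.yaml /tmp/chart-4.12.0.yaml | less
```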
Update 10.01 - 00:10 - Tested a deployment of the faulty version again. SSL certs were being served as the K8s fake certificate on some domains, while the old version was serving the real Let's Encrypt certs. Looks like a TLS issue after the upgrade.
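A quick way to confirm the fake-cert symptom from outside the cluster (example.com is a placeholder for one of the affected domains):

```bash
# The controller's default fallback certificate shows up with
# "Kubernetes Ingress Controller Fake Certificate" in the subject/issuer.
echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null \
  | openssl x509 -noout -subject -issuer
```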