Description
Hi everybody👋
The csi-controller of the deployment restarts when we upgrade vCenter. This makes sense, as the controller maintains a connection to vCenter:
{"level":"error","time":"2025-04-17T12:56:01.270266778Z","caller":"volume/listview.go:258","msg":"WaitForUpdates returned err: destroy property filter failed with Post \"URL/sdk\": read tcp IP->IP: read: connection reset by peer after failing to wait for updates: Post \"https: │
│ {"level":"error","time":"2025-04-17T12:56:01.27559112Z","caller":"vsphere/virtualcenter.go:321","msg":"failed to obtain user session with err: Post \"URL/sdk\": dial tcp IP: connect: connection refused","TraceId":"1bc50271-a80e-46e3-bc2b-1a7a44ff4290","stacktrace":"sigs.k8s.io/vsphere-csi-dri │
│ {"level":"error","time":"2025-04-17T12:56:01.275647877Z","caller":"vsphere/virtualcenter.go:270","msg":"Cannot connect to vCenter with err: Post \"URL/sdk\": dial tcp IP: connect: connection refused","TraceId":"1bc50271-a80e-46e3-bc2b-1a7a44ff4290","stacktrace":"sigs.k8s.io/vsphere-csi-driver │
│ {"level":"error","time":"2025-04-17T12:56:01.280137357Z","caller":"vsphere/virtualcenter.go:277","msg":"Could not logout of VC session. Error: Post \"URL/sdk\": dial tcp IP: connect: connection refused","TraceId":"1bc50271-a80e-46e3-bc2b-1a7a44ff4290","stacktrace":"sigs.k8s.io/vsphere-csi-dri │
│ {"level":"error","time":"2025-04-17T12:56:01.280174099Z","caller":"volume/listview.go:218","msg":"failed to connect to vCenter. err: Post \"URL/sdk\": dial tcp IP: connect: connection refused","TraceId":"1bc50271-a80e-46e3-bc2b-1a7a44ff4290","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg │
...
Issue Summary
Upgrading vCenter can take anywhere from 30 to 60 minutes. During this time, the csi-controller pod restarts repeatedly, which leads to several operational challenges:
Observed Problems
- False Alerts During Planned Downtime
Our monitoring setup triggers alerts on repeated pod restarts, so we receive alerts during this planned outage. The only workaround we have today is to silence the alerts manually for the duration of the upgrade, which is neither ideal nor scalable for our operations (one way of automating the silence is sketched after this list).
- Pod Rescheduling and Volume Attachment Failure
A more critical issue occurs if a pod is evicted or killed (e.g., by the chaos monkey) and rescheduled to another node. Since vCenter is down, the volume cannot be reattached and the pod cannot start. We assume it will remain stuck until the upgrade completes and vCenter becomes available again - is that correct? (The second sketch after this list is how we currently check for stuck attachments.)
- Liveness Probe Limitations
Adjusting liveness probes to prevent restarts doesn't seem practical. The vCenter downtime is too long, and loosening the probes would reduce their usefulness in detecting real failures.
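For the first problem, our current idea is to create the silence programmatically as part of the upgrade runbook instead of by hand. This is only a minimal sketch against the Alertmanager v2 silences API; the alert name (`KubePodCrashLooping`, a kube-prometheus-stack default), the Alertmanager URL, and the `vmware-system-csi` namespace are assumptions that would need to match your own monitoring setup:
```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Matcher and silence mirror the fields expected by Alertmanager's
// POST /api/v2/silences endpoint.
type matcher struct {
	Name    string `json:"name"`
	Value   string `json:"value"`
	IsRegex bool   `json:"isRegex"`
	IsEqual bool   `json:"isEqual"`
}

type silence struct {
	Matchers  []matcher `json:"matchers"`
	StartsAt  time.Time `json:"startsAt"`
	EndsAt    time.Time `json:"endsAt"`
	CreatedBy string    `json:"createdBy"`
	Comment   string    `json:"comment"`
}

func main() {
	// Assumption: Alertmanager reachable via this in-cluster service.
	const amURL = "http://alertmanager.monitoring:9093"

	s := silence{
		Matchers: []matcher{
			// Assumptions: alert name and namespace; adjust to your rules.
			{Name: "alertname", Value: "KubePodCrashLooping", IsEqual: true},
			{Name: "namespace", Value: "vmware-system-csi", IsEqual: true},
		},
		StartsAt:  time.Now(),
		EndsAt:    time.Now().Add(90 * time.Minute), // upgrade window plus margin
		CreatedBy: "vcenter-upgrade-runbook",
		Comment:   "Planned vCenter upgrade: csi-controller restarts are expected",
	}

	body, err := json.Marshal(s)
	if err != nil {
		panic(err)
	}

	resp, err := http.Post(amURL+"/api/v2/silences", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("silence created, status:", resp.Status)
}
```
This still feels like papering over the symptom, which is why we are asking whether there is a better pattern for planned vCenter maintenance.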
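For the second problem, this is roughly how we check whether attachments are actually stuck while vCenter is down, rather than guessing from pod events. It is a read-only sketch using client-go; the kubeconfig path is a placeholder, and it makes no attempt to remediate anything:
```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// List VolumeAttachments that the CSI attacher has not attached yet,
// e.g. for pods rescheduled to another node while vCenter is unreachable.
func main() {
	// Assumption: kubeconfig path; rest.InClusterConfig() would work the same way.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	vas, err := client.StorageV1().VolumeAttachments().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, va := range vas.Items {
		if va.Status.Attached {
			continue
		}
		pv := "<unknown>"
		if va.Spec.Source.PersistentVolumeName != nil {
			pv = *va.Spec.Source.PersistentVolumeName
		}
		msg := ""
		if va.Status.AttachError != nil {
			msg = va.Status.AttachError.Message
		}
		fmt.Printf("PV %s not attached to node %s (attacher %s): %s\n",
			pv, va.Spec.NodeName, va.Spec.Attacher, msg)
	}
}
```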
Questions
- Has anyone else encountered this behavior during vCenter upgrades?
- Are there best practices or recommended workarounds to mitigate these issues?
- Any suggestions on how to handle volume reattachment failures more gracefully?
Setup
- K8s 1.31
- vSphere Client version 8.0.3.00400
- vSphere CSI driver: v3.3.1