Description
TL;DR
The CCM gets stuck after a while if the API / Load Balancer is not available to accept requests. A manual restart of the pod fixed this instantly.
Expected behavior
The CCM should retry sending the request. A backoff mechanism is fine, but no manual restart should be required.
Observed behavior
Today (23.09.2025, around 15:18 CEST), some of our customers' Load Balancers (deployed via the CCM) were unavailable. We could see them in the UI and were able to ping them. The Health Status showed 16/16 checks as unhealthy. Some Load Balancers recovered on their own.
One specific cluster had a Load Balancer that was offline, from the cluster's point of view, for around two hours. We then restarted the CCM pod and all services were reachable again instantly.
It seems that some kind of backoff mechanism is implemented that caches or pauses requests to the API. The API itself was available the whole time. A backoff is fine, but it shouldn't require a manual restart. The downtime for our other customers was around 30 minutes. From our perspective, this mechanism caused about 1.5 hours of unnecessary downtime.
Minimal working example
No response
Log output
Unfortunately, no log output is available because the pod was restarted.
Additional information
No response