Check reconcile while API / Load Balancer is partially not available #1028

@JonnyBDev

TL;DR

CCM gets stuck after a while if the API / Load Balancer is not available to accept requests. A manual restart of the pod fixed this instantly.

Expected behavior

CCM should retry sending the request. A backoff mechanism is fine, but no manual restart should be required.

Observed behavior

Today (23.09.2025, around 15:18 CEST) some of our customers' Load Balancers (deployed via CCM) were unavailable. We could see them in the UI and were able to ping them. The Health Status showed that 16/16 checks were unhealthy. Some Load Balancers recovered on their own.

One specific cluster had a Load Balancer that stayed offline for that cluster for around two hours. We then restarted the CCM pod and all services were reachable again instantly.

It seems that some kind of backoff mechanism is implemented that caches / pauses requests to the API. The API itself was available the whole time. A backoff is fine, but it shouldn't require a manual restart. The downtime of our other customers was around 30 minutes. From our perspective, it looks like we had an unnecessary downtime of 1.5 hours due to this mechanism.

Minimal working example

No response

Log output

Unfortunately no log output available due to the restart.

Additional information

No response

Labels

bug (Something isn't working)
