Description
Environment:
- Kubernetes: v1.19.2
- OS: Flatcar Container Linux by Kinvolk 2605.6.0
- Kernel: 5.4.67-flatcar
- Container runtime: docker://19.3.12
- cloudflared version >= 2020.7.0
We have a Kubernetes Deployment with 3 replicas that all run `cloudflared tunnel` as a sidecar with the following arguments:
- --no-autoupdate
- --url=http://127.0.0.1:80
- --hostname=<my-hostname>
- --lb-pool=<my-lb-pool>
- --origincert=/etc/cloudflared/cert.pem
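
For context, the sidecar is wired up roughly like this (a trimmed sketch, not our real manifest — the container name, image tag, and volume name are placeholders):

```yaml
# Sketch of the cloudflared sidecar container spec (names/tags are placeholders)
- name: cloudflared
  image: cloudflare/cloudflared:2020.10.0
  args:
    - tunnel
    - --no-autoupdate
    - --url=http://127.0.0.1:80
    - --hostname=<my-hostname>
    - --lb-pool=<my-lb-pool>
    - --origincert=/etc/cloudflared/cert.pem
  volumeMounts:
    - name: cloudflared-cert   # Secret holding cert.pem
      mountPath: /etc/cloudflared
      readOnly: true
```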
where `hostname` is the name of a load balancer created under our Cloudflare account and `lb-pool` is the name of a pool defined there.
That way containers are able to join the pool's origins and the load balancer sends traffic down to our pods successfully.
The issue:
When rolling pods we observe that the respective origins enter a Critical
state for some time until Cloudflare cleans them up. From the cloudflared sidecar's point of view we see the following logs:
time="2020-10-14T09:01:34Z" level=error msg="Register tunnel error from server side" connectionID=0 error="Server error: the origin list length must be in range [1, 5]: validation failed"
time="2020-10-14T09:01:34Z" level=info msg="Retrying in 1s seconds" connectionID=0
meaning that because we are using the full capacity of allowed origins, the new pod cannot register until the cleanup happens.
It looks like this was first introduced with commit 8cc69f2, and one could work around it by setting the flag:
--use-quick-reconnects=false
This worked until version 2020.6.6.
The behaviour changed again in 2a3d486, which implemented named tunnels. As far as I can understand, we cannot currently use named tunnels with a load balancer(?)
Changing our Deployment to a StatefulSet, in order to be able to set a constant --name
flag per pod to test the above, results in pods crashing after logging:
INFO[2020-10-14T10:36:51Z] Tunnel already created with ID 817524d6-4c5e-4faa-8cbb-dd1e98d5d5fc
INFO[2020-10-14T10:36:52Z] Load balancer <my-hostname> already uses pool <my-lb-pool> which has this tunnel as an origin
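
For reference, we derived the per-pod --name from the StatefulSet's stable pod identity, roughly like this (a sketch showing only the relevant fields; the image tag is a placeholder):

```yaml
# Sketch: stable per-pod --name via the downward API in a StatefulSet
- name: cloudflared
  image: cloudflare/cloudflared:2020.10.0
  env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name   # e.g. my-app-0, my-app-1, ...
  command: ["/bin/sh", "-c"]
  args:
    - >-
      exec cloudflared tunnel --no-autoupdate
      --url=http://127.0.0.1:80
      --hostname=<my-hostname>
      --lb-pool=<my-lb-pool>
      --origincert=/etc/cloudflared/cert.pem
      --name "$POD_NAME"
```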
Is there a way to use the newer cloudflared images with a load balancer and preserve the old behaviour (quickly removing critical
origins so we can successfully roll our Deployment's pods)?