Description
Describe the bug
We use a DigitalOcean Kubernetes cluster. The Dkron v4 server is deployed using the dkron helm chart.
Each application namespace runs its own dkron agent (or several) to execute jobs selected by tags.
Everything works well until we hit what looks like a scaling threshold (~80-85 agent pods), at which point serf becomes unstable.
Agents can no longer fully join the cluster, or rather fail to set their tags. The same setup that worked when the cluster was smaller now mostly fails, although if left alone an agent sometimes manages to join and keeps working until it is restarted.
Relevant logs from an agent:
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: app1-dkron-worker-dfcd587f7-57wkn 10.244.27.220"
dkron time="2025-04-09T13:31:56Z" level=info msg="agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere" node=app1-dkron-worker-dfcd587f7-57wkn
dkron time="2025-04-09T13:31:56Z" level=info msg="agent: Joining cluster..." cluster=LAN node=app1-dkron-worker-dfcd587f7-57wkn
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [DEBUG] memberlist: Initiating push/pull sync with: 10.244.12.71:8946"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: app2-service-dkron-worker-545cbd6958-hs7wk 10.244.7.245"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: app3-dkron-worker-8578d5d74b-7b88w 10.244.6.221"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: dkron4-server-1 10.244.8.31"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: dkron4-agent-fd99df5cd-9k7fr 10.244.6.82"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: dkron4-server-0 10.244.24.233"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: dkron4-server-2 10.244.12.71"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: dkron4-server-3 10.244.27.229"
...
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [DEBUG] memberlist: Initiating push/pull sync with: 10.244.24.233:8946"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [DEBUG] memberlist: Initiating push/pull sync with: 10.244.8.31:8946"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [DEBUG] memberlist: Initiating push/pull sync with: 10.244.27.229:8946"
dkron time="2025-04-09T13:31:56Z" level=info msg="agent: Join LAN completed. Synced with 4 initial agents" node=app1-dkron-worker-dfcd587f7-57wkn
...
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberUpdate: app1-dkron-worker-dfcd587f7-57wkn"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [DEBUG] serf: messageJoinType: app1-dkron-worker-dfcd587f7-57wkn"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [DEBUG] serf: messageJoinType: app1-dkron-worker-dfcd587f7-57wkn"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [DEBUG] serf: messageJoinType: app1-dkron-worker-dfcd587f7-57wkn"
dkron time="2025-04-09T13:31:57Z" level=info msg="2025/04/09 13:31:57 [DEBUG] serf: messageJoinType: app1-dkron-worker-dfcd587f7-57wkn"
dkron time="2025-04-09T13:31:57Z" level=info msg="2025/04/09 13:31:57 [DEBUG] serf: messageJoinType: app1-dkron-worker-dfcd587f7-57wkn"
...
dkron Error: agent: Error setting tags: timeout waiting for update broadcast
dkron Usage:
dkron dkron agent [flags]
After this the pod goes into a crash loop and keeps retrying until it eventually joins, or, more often, does not.
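For context on where that error seems to come from: my reading of the upstream libraries (hashicorp/serf and hashicorp/memberlist, not dkron itself) is that setting tags re-advertises the local node and then waits a bounded time for the update to be gossiped out; if the wait expires, the "timeout waiting for update broadcast" error above is returned. A minimal Go sketch of that wait pattern, with illustrative names rather than the real library identifiers:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForBroadcast illustrates the pattern I believe produces the error:
// the node update is queued for gossip, and the caller blocks until either
// the broadcast is confirmed (notifyCh is closed) or the timeout fires.
// Names are illustrative; the real logic lives in hashicorp/memberlist.
func waitForBroadcast(notifyCh <-chan struct{}, timeout time.Duration) error {
	var timeoutCh <-chan time.Time
	if timeout > 0 {
		timeoutCh = time.After(timeout)
	}
	select {
	case <-notifyCh:
		return nil // update was broadcast in time
	case <-timeoutCh:
		// With ~85 agents gossiping, I suspect this branch is hit because
		// the broadcast queue cannot drain within the allowed window.
		return errors.New("timeout waiting for update broadcast")
	}
}

func main() {
	notifyCh := make(chan struct{}) // never closed: simulates a slow broadcast
	if err := waitForBroadcast(notifyCh, 2*time.Second); err != nil {
		fmt.Println("agent: Error setting tags:", err)
	}
}
```

If that is roughly right, a fixed timeout that is comfortable for a handful of agents could easily be too short once ~85 nodes are gossiping.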
To Reproduce
Steps to reproduce the behavior:
- Set up the dkron cluster using the official helm chart (we have 4 server nodes in raft).
- Create an application pod with a dkron agent. It simply runs the "dkron agent" command and connects using the following config:
dkron.yml: |-
  server: false
  retry-join: ["dkron.dkron.svc.cluster.local"]
  encrypt: custom-hash-value
  profile: wan
  statsd-addr: "127.0.0.1:9125"
  tags:
    app1: cron
- Add some jobs (we have 187); they are simple jobs, nothing too fancy. An example job definition (a sketch for registering such a job via the REST API follows this list):
{
  "id": "test_job_app",
  "name": "test_job_app",
  "displayname": "",
  "timezone": "",
  "schedule": "@every 1m",
  "owner": "Gitlab",
  "owner_email": "[email protected]",
  "disabled": false,
  "tags": {
    "app1": "cron:1"
  },
  "metadata": null,
  "retries": 0,
  "dependent_jobs": null,
  "parent_job": "",
  "processors": {
    "fluent": {
      "fluent": "",
      "project": "test_job"
    }
  },
  "concurrency": "forbid",
  "executor": "shell",
  "executor_config": {
    "allowed_exitcodes": "0, 199, 255",
    "command": "echo OK",
    "cwd": "/www/default/cron",
    "mem_limit_kb": "inf",
    "project": "test_job",
    "shell": "true",
    "timeout": "1m"
  },
  "status": "failed",
  "next": "2025-04-09T09:30:26Z",
  "ephemeral": false,
  "expires_at": null
}
- Then keep adding more agents (we have ~85 agent pods) until, at some point, new agents become unstable and refuse to connect.
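For reference, a minimal Go sketch of registering such a job programmatically against the dkron REST API (POST /v1/jobs). The service URL and port below are assumptions for our cluster; adjust as needed:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Hypothetical in-cluster service URL; /v1/jobs is the documented
	// dkron REST endpoint for creating or updating a job.
	url := "http://dkron.dkron.svc.cluster.local:8080/v1/jobs"

	// Trimmed-down job definition matching the example above.
	job := []byte(`{
		"name": "test_job_app",
		"schedule": "@every 1m",
		"tags": {"app1": "cron:1"},
		"concurrency": "forbid",
		"executor": "shell",
		"executor_config": {"command": "echo OK", "shell": "true", "timeout": "1m"}
	}`)

	resp, err := http.Post(url, "application/json", bytes.NewReader(job))
	if err != nil {
		fmt.Fprintln(os.Stderr, "request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	fmt.Println("dkron responded with:", resp.Status)
}
```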
Expected behavior
I expect agents to continue connecting and taking jobs as before.
Specifications:
- OS: alpine:latest
- Version: dkron server 4.0.4; agent latest Devel
- Executor: shell
Additional context
We also have an older dkron v3 deployment (version 3.1.10) running in parallel, so for this new deployment we added the encrypt setting to prevent the two clusters from merging.
I suspect that at this scale serf simply takes longer to broadcast tag changes than the timeout allows, which is why everything becomes unstable, only works intermittently, and runs fine at smaller scale. I hope you can advise something; I have tried changing the available timeout-related settings such as serf-reconnect-timeout and raft-multiplier, but that did not lead to any noticeable change.
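If that suspicion is right, the relevant knob would be serf's broadcast timeout rather than the reconnect or raft settings. As far as I can tell dkron does not expose it as a flag (I may be missing one), but in the serf library it is a plain config field. A minimal Go sketch at the library level, purely to show which setting I mean (the 30s value is just an illustration):

```go
package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/serf/serf"
)

func main() {
	// serf.DefaultConfig() ships with a BroadcastTimeout of a few seconds;
	// that is the window an operation such as a tag update has to be
	// broadcast before it fails with "timeout waiting for update broadcast".
	conf := serf.DefaultConfig()
	fmt.Println("default broadcast timeout:", conf.BroadcastTimeout)

	// Raising it (illustrative value) would give a large cluster more time
	// to gossip tag updates. Whether dkron can be configured this way, or
	// should change its default, is exactly my question here.
	conf.BroadcastTimeout = 30 * time.Second
	fmt.Println("proposed broadcast timeout:", conf.BroadcastTimeout)
}
```

Whether raising this (or related gossip tuning) would actually help at ~85 agents is the part I cannot verify myself.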